Supporting Cross-Platform Research in Media Cloud

Credits

By Aashka Dave (with contributions from Rahul Bhargava, Dennis Jen, Emily Ndulue and the rest of the Media Cloud team)

Published

July 21, 2020

We’re excited to share our initial phase of support for cross-platform analysis within Media Cloud’s Topic Mapper tool. Starting today, you’ll notice new features and capabilities within Topic Mapper, namely the ability to ingest content and discover links shared on the open web, on Reddit, in Google search results, and in a self-uploaded CSV (which can also contain social media posts).

This post summarizes the tech changes and new data Media Cloud now provides and discusses how to use our enhanced version of Topic Mapper.

Since Media Cloud started in 2011, our tools have evolved to reflect new research questions, methodologies and technical abilities. So far, our research and technology have focused on news on the open web, tracking stories using RSS feeds and link networks. This latest update to Topic Mapper helps us further Media Cloud’s work in that space while also growing our tools to find stories in different ways and from different places.

For many years now, mainstream news media has not been the only place where individuals can find information on the internet. This is especially relevant to the domain of research on mis/disinformation and hate speech, where we have been conducting more work lately. To that effect, we have been working for some time now on enhancing Media Cloud so it can incorporate links shared beyond the open web.

The need for this capability was underscored for us at the most recent Exploring Media Ecosystems Conference, hosted in early 2020 by Media Cloud’s non-profit arm, the Media Ecosystems Analysis Group. Though there are many researchers working on cross-platform questions, most technology in the space (including our own) is still built for one-off projects.

We took this first pass at cross-platform topics to begin addressing those concerns and to provide a solid foundation upon which we, and others, could build. Leveraging our pre-existing strength in open web analysis, we decided to start by ingesting content from multiple platforms to extract the URLs people are sharing online (in other words, we’re not studying the content itself). To build out cross-platform support, we used a plugin-based approach to discovering links shared on different platforms.

Topic Mapper already provides three distinct advantages: (1) increasing a subject’s corpus of study by spidering to find more content, (2) letting you slice and dice along various parameters to analyze subtopics, and (3) incorporating Facebook share counts and media inlink counts to support further analysis by source influence.

Those features still exist in the cross-platform version of Topic Mapper, but they have been augmented by the ability to increase your corpus through new platforms and data sources, and by the ability to evaluate additional influence metrics (see the table further in the post). The cross-platform version of Topic Mapper also allows for an increased number of research questions studying how links are shared across platforms and related patterns and trends.

HOW TO RUN A CROSS-PLATFORM TOPIC

This tutorial video, presented at the 2020 ICWSM Conference, takes Media Cloud users through the process of creating their own cross-platform topics:

If you’d prefer a written walkthrough of creating a cross-platform topic, keep reading. If you’re more interested in key features and our next steps, skip this section.

To create a topic, click on the “create a topic” button in Topic Mapper.

You’ll then be asked to enter basic information about your topic: its name, a short description and a date range. This is the date range that will be used in your entire topic unless you create a new version later on.

The “advanced settings” tab will let you select the number of rounds of spidering you wish to use for your topic. This can range from zero rounds of spidering (which means you’ll just use stories Media Cloud has already found for a given source or collection) to fifteen rounds of spidering (which means we’ll follow links within stories 15 times to see if any additional stories match your search parameters). As before, public users of Media Cloud are limited to 100,000 stories per topic in order to avoid overloading the system. If you have questions about this limit or about structuring your search query, email support@mediacloud.org.
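To make the idea of spidering rounds concrete, here is a minimal sketch of round-limited link following over a toy link graph. It is an illustration only, not Media Cloud's implementation: the function `spider`, its parameters, the toy `LINKS` graph, and the choice to stop as soon as the story cap is reached are all assumptions made for the example.

```python
def spider(seed_urls, get_links, matches_query, max_rounds, max_stories=100_000):
    """Follow links outward from seed stories for at most max_rounds rounds.

    All names here are hypothetical; this is not Media Cloud's code.
    """
    corpus = set(seed_urls)
    frontier = set(seed_urls)
    for _ in range(max_rounds):
        next_frontier = set()
        for url in sorted(frontier):
            for linked in get_links(url):
                # Skip stories we already have or that do not match the query.
                if linked in corpus or not matches_query(linked):
                    continue
                if len(corpus) >= max_stories:
                    return corpus  # assumed behaviour: stop once the cap is hit
                corpus.add(linked)
                next_frontier.add(linked)
        if not next_frontier:
            break  # nothing new was found, so further rounds would be no-ops
        frontier = next_frontier
    return corpus


# Toy link graph: a -> b -> c -> d, plus an off-topic page x.
LINKS = {"a": ["b", "x"], "b": ["c"], "c": ["d"], "d": []}


def get_links(url):
    return LINKS.get(url, [])


def on_topic(url):
    return url != "x"


zero_rounds = spider(["a"], get_links, on_topic, max_rounds=0)
two_rounds = spider(["a"], get_links, on_topic, max_rounds=2)
fifteen_rounds = spider(["a"], get_links, on_topic, max_rounds=15)
```

With zero rounds the corpus is just the seed story, while each additional round reaches one more hop of links until no new matching stories turn up.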

At this point, you’ll be able to add platforms to your topic. Every topic must have an open web component, which is essentially the legacy functionality of Topic Mapper; topics do not have to use additional platforms as data sources. After you have added open web search parameters to your topic, you can use the following interface to add other platforms to your topic.

The initial platforms supported in Topic Mapper are listed in the table below:

After you have added open web search parameters (and any other platforms you’d like to include) to your topic, you’ll be able to generate the first version of your topic.

At this point, your topic will start running. Your topic will take more or less time to run depending on your search parameters, including the timespan of discovery, number of collections and platforms used, and the complexity of your query.

Your topic may also take more time to run if there is high user interest in Media Cloud at any given time (this usually happens when particularly newsworthy events are taking place). To allow enough time for your topic to run, we suggest budgeting anywhere from two days to two weeks, depending on the complexity of your topic parameters. If your topic hasn’t finished running after two weeks, we recommend emailing support@mediacloud.org.

KEY FEATURES AND NOTES

Links found through any of the platforms mentioned above are checked against your query terms and added to the corpus for spidering. Stories linked from Reddit are automatically put into a Reddit subtopic. In other words, if Media Cloud finds URL A and URL B in Reddit submissions, and URL C and URL D on the open web, all four URLs will be available on the home screen for your topic, but URLs A and B will also be in a “Reddit” subtopic.
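The URL A through D example can be sketched in a few lines of code. This is a simplified illustration under stated assumptions, not Media Cloud's code: the function `build_corpus`, the platform labels `"reddit"` and `"open_web"`, and the choice to give the open web no subtopic of its own are all hypothetical.

```python
def build_corpus(found_links, matches_query):
    """Collect matching URLs and group non-open-web ones by platform.

    found_links is an iterable of (platform, url) pairs. Names and labels
    are hypothetical and chosen only for this illustration.
    """
    corpus = set()
    subtopics = {}
    for platform, url in found_links:
        if not matches_query(url):
            continue  # links that fail the topic query are dropped
        corpus.add(url)
        if platform != "open_web":
            # Assumption: each added platform gets its own subtopic.
            subtopics.setdefault(platform, set()).add(url)
    return corpus, subtopics


found = [
    ("reddit", "A"),
    ("reddit", "B"),
    ("open_web", "C"),
    ("open_web", "D"),
]

# For simplicity, every link in this toy example matches the query.
corpus, subtopics = build_corpus(found, matches_query=lambda url: True)
```

Here `corpus` holds all four URLs, and `subtopics["reddit"]` holds only A and B, mirroring the description above.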

For any platform incorporated into a topic, metrics are calculated and aggregated within the corpus, which means you get a number for “relevant” shares rather than total shares within your platform’s subtopic. For instance, if the Centers for Disease Control and Prevention (CDC) is a major source of URLs within your topic, you might see a lot of Facebook shares for the CDC, but not all of those shares are relevant to your topic specifically. Instead of giving you a Facebook share count for everything from the CDC, Media Cloud returns the number of times CDC URLs were shared within your topic corpus.
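One way to picture the difference between “relevant” and total shares is the sketch below, which sums share counts per source only for URLs that are in the topic corpus. The function name `relevant_shares_by_source`, the example URLs, and the share numbers are all made up for illustration; this is an assumption about the general idea, not a description of how Media Cloud computes its metrics.

```python
from collections import Counter
from urllib.parse import urlparse


def relevant_shares_by_source(share_counts, corpus_urls):
    """Sum share counts per source domain, counting only URLs in the corpus."""
    totals = Counter()
    for url, count in share_counts.items():
        if url in corpus_urls:
            totals[urlparse(url).netloc] += count
    return totals


# Made-up share counts for three hypothetical URLs.
share_counts = {
    "https://www.cdc.gov/a": 120,
    "https://www.cdc.gov/b": 900,  # widely shared, but not in the topic corpus
    "https://example.org/c": 15,
}
corpus_urls = {"https://www.cdc.gov/a", "https://example.org/c"}

relevant = relevant_shares_by_source(share_counts, corpus_urls)
```

In this toy data the CDC has 1,020 shares overall, but only the 120 shares attached to the URL inside the corpus count toward the topic.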

Subtopics are a way for you to slice and dice your data within a topic for further analysis and evaluation. Subtopics are created for any platform added to a topic, but they can also be run along the lines of source partisanship in the United States, keyword searches within stories, and by media type, theme, or top countries within coverage. Keep in mind that searches by media type are dependent on the quality of metadata available for a given source or collection within Media Cloud (which you can evaluate using our Source Manager tool), and that subtopics based on themes and top countries are applicable only to English-language stories.

Versions are a way for you to track progress in your research and expand your corpus as your work on a given topic continues. When you create a new version in Topic Mapper, you will see three options, allowing you to expand your date range or increase the number of rounds of spidering in your topic, add or modify subtopics as detailed in the previous paragraph, and add or modify platforms, which is also how you would create platform-specific subtopics. Keep in mind that you can only create versions that expand your topic. You can’t remove data from a topic, though you can filter your topic’s data using subtopics, or revisit the data stored in a previous version.
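The expand-only rule for versions can be expressed as a simple check. The sketch below is hypothetical: the `TopicVersion` class, its fields, and the `is_expansion` function are names invented for this example, and it covers only the date range and spidering rounds, not subtopics or platforms.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TopicVersion:
    """Hypothetical, simplified view of a topic version's settings."""
    start: date
    end: date
    spider_rounds: int


def is_expansion(old, new):
    """Return True if the new version covers at least what the old one did."""
    covers_dates = new.start <= old.start and new.end >= old.end
    keeps_rounds = new.spider_rounds >= old.spider_rounds
    return covers_dates and keeps_rounds


v1 = TopicVersion(date(2020, 1, 1), date(2020, 3, 31), spider_rounds=1)
v2 = TopicVersion(date(2020, 1, 1), date(2020, 6, 30), spider_rounds=2)
v3 = TopicVersion(date(2020, 2, 1), date(2020, 6, 30), spider_rounds=2)

wider_is_ok = is_expansion(v1, v2)     # longer date range, more rounds
narrower_is_ok = is_expansion(v1, v3)  # start date moved later, so data would be lost
```

In this sketch the second version passes because it only widens the first, while the third fails because it drops January from the date range.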

To better understand how data is treated within a topic on Media Cloud, we recommend studying this topic dataflow diagram. (You can find a high-res PDF here.)

CONCLUSION AND NEXT STEPS

Cross-platform topics have been over a year in the making, thanks to extensive coding work from our technical team and careful thought from our research team to guide how to best support more research in Media Cloud. This work would not have been possible without funding from the Knight Foundation, and we are grateful for their support.

Going forward, we plan to add more platform plugins (as their APIs allow). We are specifically planning to add data from verified Twitter accounts using PushShift.io and YouTube data via their API. We welcome input about our work so far, and hope that you find cross-platform topics as exciting as we do!

1. Media Cloud Query Guide

2. PushShift Query Guide

3. Standard Date Formats
