Late last year, we started having some technical problems with our databases and front-end tools, which coincided with some key staff departures. We’ve been working hard since then to fix the bugs and staff up, but as many of you who have tried to use Explorer or our API have learned, there are still some ongoing issues. While we still have people (including partners to whom we are grateful) working towards solutions, we want to be as clear as possible with our broader community of journalists, academics, and educators so you can set research expectations accordingly. This blog post aims to outline how to best use our system in reliable and effective ways right now.
In our March 2022 blog post, we noted that some people were having trouble logging in. We believe this issue has been resolved, although the password change feature still does not work. If you are having trouble logging in, drop us an email.
Media Cloud uses two databases. When you conduct a query, it searches our Solr database, which outputs “attention” in the form of story counts. Solr experienced one major period of downtime, from December 25, 2021 to January 19, 2022. That means you will not get reliable story counts for that period. There are other gaps from roughly April 13, 2022 to May 5, 2022 (especially from April 29 to May 5). Queries that include these periods will have incomplete or missing data. We are also investigating reports of duplicate stories in Solr, including outside of these time frames.
The other database, Postgres, is used for nearly everything else (sample stories, top words, entities, URL downloads, etc.). It holds all the metadata for the 1.5+ billion stories, and has grown too large for our small team to manage effectively. Postgres is the database that failed in November 2021, and it experienced other technical problems through May 5, 2022. Practically speaking, most of the data in Postgres from the period September 15, 2021 to May 5, 2022 is incomplete. We have most of this data, but it is stored in corrupted databases. Recovering this data and merging it back into one place is one of the big projects we’re working on, but it is unclear whether we will be able to do so, or on what timeframe. This means that even on days between September 15, 2021 and May 5, 2022 where you get accurate story counts through Solr, the data used for everything other than counts, including the story download, will be incomplete.
The database problems described above mean that features like top words, sample stories, themes, and entities are likely based on incomplete data for any query overlapping that period (September 2021 to May 2022). Querying this period may also generate errors in the web-based tools.
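For anyone scripting their research, the date ranges above could be encoded in a small helper that flags risky queries before they are run. The following is only a sketch: the function and constant names are hypothetical and not part of any Media Cloud tool, and it assumes the ranges listed in this post are inclusive on both ends.

```python
from datetime import date

# Date ranges taken from this post; both ends are treated as inclusive
# (an assumption). Names here are hypothetical, not a Media Cloud API.
SOLR_GAPS = [
    (date(2021, 12, 25), date(2022, 1, 19)),  # major Solr downtime
    (date(2022, 4, 13), date(2022, 5, 5)),    # later, approximate gaps
]
POSTGRES_INCOMPLETE = (date(2021, 9, 15), date(2022, 5, 5))


def overlaps(start, end, window):
    """Return True if the inclusive range [start, end] intersects window."""
    window_start, window_end = window
    return start <= window_end and window_start <= end


def check_query_range(start, end):
    """Return a list of warnings for a query covering [start, end]."""
    if start > end:
        raise ValueError("start must not be after end")
    warnings = []
    for gap in SOLR_GAPS:
        if overlaps(start, end, gap):
            warnings.append(
                f"story counts unreliable from {gap[0]} to {gap[1]}"
            )
    if overlaps(start, end, POSTGRES_INCOMPLETE):
        warnings.append(
            "top words, sample stories, entities, and story downloads "
            f"incomplete from {POSTGRES_INCOMPLETE[0]} "
            f"to {POSTGRES_INCOMPLETE[1]}"
        )
    return warnings


if __name__ == "__main__":
    for message in check_query_range(date(2022, 1, 1), date(2022, 1, 31)):
        print(message)
```

An empty list means the query falls entirely outside the known problem periods. It does not rule out the duplicate stories in Solr that are still under investigation.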
In Explorer, the entities list sometimes generates errors. This is especially common for queries with large numbers of stories. Reducing the scope of the query may yield better results. Themes frequently fail to load, regardless of query parameters.
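One way to reduce the scope of a query is to narrow its date range and run several smaller queries instead of one large one. The sketch below splits a date range into consecutive windows. The function name and the default window length are illustrative assumptions, and there is no guarantee that smaller windows will avoid the entities errors in every case.

```python
from datetime import date, timedelta


def split_date_range(start, end, days=7):
    """Split the inclusive range [start, end] into consecutive windows.

    Each window covers at most `days` days; the last may be shorter.
    This is a hypothetical helper, not part of Media Cloud's tools.
    """
    if days < 1:
        raise ValueError("days must be at least 1")
    if start > end:
        raise ValueError("start must not be after end")
    windows = []
    cursor = start
    while cursor <= end:
        window_end = min(cursor + timedelta(days=days - 1), end)
        windows.append((cursor, window_end))
        cursor = window_end + timedelta(days=1)
    return windows


if __name__ == "__main__":
    for window_start, window_end in split_date_range(
        date(2022, 6, 1), date(2022, 6, 20)
    ):
        print(window_start, "to", window_end)
```

Each window can then be entered as the date range of a separate query.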
There are some minor issues with Source Manager’s interface. Very large collections may not load the source list properly or may time out. The search function of Source Manager does not work for sources with names of three characters or fewer, like CNN or RT. To view information about these sources, you can click through from a collection that contains them or use the source id number. None of the interface issues with Source Manager affect the collections themselves or the queries you conduct with them.
Finally, as described in a June 2022 update, Topic Mapper has been turned off.
In this blog post, we have tried to summarize ongoing issues with Media Cloud. Any research that involves the period between September 15, 2021 and May 5, 2022 risks using incomplete data. We are confident in the quality of data after May 5, 2022. For story counts before September 15, 2021, the counts shown may include duplicates, so we recommend downloading the list of URLs and using the number of stories you see there instead. We know these limitations affect the way many people want to use Media Cloud, and for that we are sorry.
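If you work with the downloaded URL list in a script, a quick check for repeated URLs might look like the sketch below. It assumes the download is a CSV file with a column named `url`, which is an assumption you should verify against your own file. It only catches exact repeats and does not normalize URLs.

```python
import csv
import io


def count_unique_urls(csv_file, url_column="url"):
    """Return (total_rows_with_url, unique_urls) for an open CSV file.

    The column name `url` is an assumption about the download format.
    URLs are compared as exact strings after trimming whitespace.
    """
    reader = csv.DictReader(csv_file)
    seen = set()
    total = 0
    for row in reader:
        url = (row.get(url_column) or "").strip()
        if not url:
            continue  # skip rows with no URL value
        total += 1
        seen.add(url)
    return total, len(seen)


if __name__ == "__main__":
    # Made-up sample data standing in for a real download.
    sample = io.StringIO(
        "stories_id,url\n"
        "1,https://example.com/a\n"
        "2,https://example.com/b\n"
        "3,https://example.com/a\n"
    )
    total, unique = count_unique_urls(sample)
    print(f"{total} rows, {unique} unique URLs")
```

For a real file, open it with `open(path, newline="", encoding="utf-8")` and pass the file object to the function. A gap between the two numbers indicates repeated URLs in the list.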
Media Cloud is a small team working hard to build easy ways to monitor online news, but our main funds have never been earmarked for building and maintaining a public resource. We continue to do so because it is the best way to do our own work, and because we believe in the collective need. Going forward, we are rethinking parts of our technical architecture to create a more maintainable system, and rebuilding our organizational structure as something more akin to a coalition that guides and contributes to the technology and methods we all use. Stay tuned for some exciting end-of-the-year updates about new tools, resources, and partners.
We will continue working to improve our system, and you are welcome to contact us with any questions (email@example.com).