Overton Blog

Open COVID-19 policy dataset

The COVID-19 policy dataset is a zip file of policy documents from 2020 relating to the COVID-19 pandemic (the search strategy is documented below). The documents are in PDF format and have an accompanying JSON file explaining their provenance: where they were collected from, their publication date, original location etc.

A policy document in this context is something written mostly for or by policymakers - so guidance, policies, white papers, strategies and so on. Overton looks at think tanks, policy orientated NGOs (e.g. groups like Amnesty International or Greenpeace) and IGOs (e.g. the World Health Organization or the World Bank) as well as national governments, so you may find a mix of official documents and think tank policy briefs.

Where can I get it?

You can find the 31st March snapshot on figshare at https://dx.doi.org/10.6084/m9.figshare.12055860

Give me some example documents, please

(Just click the "Read full text..." button at the top left of the Overton page to get to the actual PDF. The full text PDFs are included in the dataset file.)

Here's an example of the JSON metadata for the first WHO document listed above.

Some things to watch out for

The majority of documents are in English, but there are other languages in there too (especially French, Dutch and German).

They are from a variety of different dates in 2020. The same document may appear multiple times in the dataset in different forms.

They are heterogeneous in structure, length and language use - some are aimed at the public, some at key workers, some at policy makers, some at academics.

You might want to filter out non-governmental sources. These have "think tank" in the type field in their associated JSON metadata. Don't just include "government" as this will leave out intergovernmental sources like the WHO, IMF and World Bank.
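As a rough illustration of that filter, here is a minimal Python sketch. The post only says that a "type" field exists in the metadata, so the key names used here (`source` and `type`) and the example type values are assumptions, not the documented schema. Check them against a real JSON file from the dataset before relying on this.

```python
# Hypothetical key names: adjust after inspecting a real metadata file.
SOURCE_KEY = "source"
TYPE_KEY = "type"


def is_think_tank(metadata):
    """Return True if the record's source type mentions "think tank"."""
    source = metadata.get(SOURCE_KEY)
    if not isinstance(source, dict):
        return False
    return "think tank" in str(source.get(TYPE_KEY, "")).lower()


def drop_think_tanks(records):
    """Keep everything that is not a think tank, so that intergovernmental
    sources are retained alongside national governments."""
    return [r for r in records if not is_think_tank(r)]
```

Excluding "think tank" rather than including only "government" follows the advice above: an include-list would silently drop sources like the WHO, IMF and World Bank.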

This is an incomplete picture of global coronavirus policy: it's drawn from the list of sources that Overton indexes, which is driven by availability and what has been useful to research administrators and funders in the past - that doesn't always include documents aimed at the general public.

Directory structure and JSON metadata schema

When unzipped you'll have a set of PDFs named like so:

<source id>-<document id>-<pdf id>.pdf

And a set of JSON documents named like so:

<source id>-<document id>.json

One document may have multiple PDFs associated with it. For example, a Congressional hearing report may have a separate PDF for each witness testimony, or some WHO guidelines may have a separate PDF for the same document in different languages.
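Because one JSON file can correspond to several PDFs, it can help to group the PDFs by document before processing them. The sketch below does this purely from the filenames. It assumes the `<pdf id>` part contains no hyphens, so that everything before the last hyphen is the `<source id>-<document id>` prefix. The filenames in the usage example are made up.

```python
from collections import defaultdict
from pathlib import Path


def group_pdfs_by_document(filenames):
    """Map each expected JSON filename to the PDF filenames that belong to it."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        path = Path(name)
        if path.suffix.lower() != ".pdf":
            continue
        # Split off the trailing "<pdf id>" (assumed to contain no hyphen).
        doc_key, sep, _pdf_id = path.stem.rpartition("-")
        if not sep:
            continue  # Name does not follow the expected pattern.
        groups[doc_key + ".json"].append(path.name)
    return dict(groups)


if __name__ == "__main__":
    # Made-up filenames, for illustration only.
    example = ["who-123-1.pdf", "who-123-2.pdf", "who-123.json"]
    print(group_pdfs_by_document(example))
```

On an unzipped copy of the dataset you could pass in something like `[p.name for p in Path("dataset_dir").iterdir()]`, where `dataset_dir` is wherever you extracted the zip file.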

The JSON for each document contains the title, a translated title (if the document title isn't in English), a snippet containing a brief description of the document, a publication date and a landing page (the page Overton originally collected the document from).

The "source info" object contains information about the source of the document e.g. the World Health Organization.

The "pdfs" object lists the PDFs and their filenames associated with this document. It also includes a list of DOIs that are cited by the PDF, along with the page number, paragraph number and text snippet where Overton thinks that reference was found.

Search strategy

In Overton we searched the policy tab with 2020 selected as the year and the query:

"2019-nCoV" OR "covid-19" OR "SARS-CoV-2" OR "coronavirus" OR "acute respiratory"

We then filtered out any documents that didn't have at least one of those keywords in the title or extracted snippet.
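If you want to apply a similar keyword check to the metadata yourself, something like the following Python sketch would do it. This is not Overton's own code: the field names `title` and `snippet` are assumed from the description of the JSON above, and case-insensitive substring matching is our guess at a reasonable approximation of the search.

```python
KEYWORDS = (
    "2019-ncov",
    "covid-19",
    "sars-cov-2",
    "coronavirus",
    "acute respiratory",
)


def mentions_keyword(metadata, fields=("title", "snippet")):
    """Return True if any keyword appears in the given metadata fields.

    The field names are assumptions; missing or null fields are treated
    as empty strings.
    """
    text = " ".join(str(metadata.get(f) or "") for f in fields).lower()
    return any(keyword in text for keyword in KEYWORDS)
```

For example, `mentions_keyword({"title": "COVID-19 guidance for schools"})` is true, while a record whose title and snippet mention none of the keywords is rejected.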

I've still got questions

Email us! support@overton.io

What is Overton?

We help universities, think tanks and publishers understand the reach and influence of their research.

The Overton platform is the world’s largest searchable policy database, with almost 5 million documents from 29k organisations.

We track everything from white papers to think tank policy briefs to national clinical guidelines, and automatically find the references to scholarly research, academics and other outputs.