The COVID-19 policy dataset is a zip file of policy documents from 2020 relating to the COVID-19 pandemic (the search strategy is documented below). The documents are in PDF format and have an accompanying JSON file explaining their provenance: where they were collected from, their publication date, original location etc.
A policy document in this context is something written mostly for or by policymakers – so guidance, policies, white papers, strategies and so on. Overton looks at think tanks, policy-oriented NGOs (e.g. groups like Amnesty International or Greenpeace) and IGOs (e.g. the World Health Organization or the World Bank) as well as national governments, so you may find a mix of official documents and think tank policy briefs.
Where can I get it?
You can find the 31st March snapshot on figshare at https://dx.doi.org/10.6084/m9.figshare.12055860
Give me some example documents, please
Click the “Read full text…” button at the top left of each Overton page to get to the actual PDF. The full text PDFs are also included in the dataset file.
- WHO’s “Modes of transmission of virus causing COVID-19: implications for IPC precaution recommendations” (27th March)
- WHO’s “Infection prevention and control for the safe management of a dead body in the context of COVID-19” guidance (24th March)
- the “Australian Health Sector Emergency Response Plan for Novel Coronavirus” (23rd March)
- Interpol’s “COVID-19 Pandemic – Guidelines for Law Enforcement” (26th March)
- Australian Dept of Health Advice for organizing public gatherings (15th March)
- the UK government’s Coronavirus Action Plan (3rd March)
- US Congress Oversight Committee’s hearing on Coronavirus Preparedness and Response (12th March)
Here’s an example of the JSON metadata for the first WHO document listed above.
Some things to watch out for
The majority of documents are in English, but other languages appear too (especially French, Dutch and German).
They are from a variety of different dates in 2020. The same document may appear multiple times in the dataset in different forms.
They are heterogeneous in structure, length and language use – some are aimed at the public, some at key workers, some at policy makers, some at academics.
You might want to filter out non-governmental sources. These have “think tank” in the type field of their associated JSON metadata. Exclude on that value rather than keeping only “government”, as the latter would leave out intergovernmental sources like the WHO, IMF and World Bank.
This is an incomplete picture of global coronavirus policy: it’s drawn from the list of sources that Overton indexes, which is driven by availability and what has been useful to research administrators and funders in the past – that doesn’t always include documents aimed at the general public.
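The think tank filtering advice above could be sketched in Python roughly as follows. This is only an illustration: the "type" key at the top level of each record, and the sample values other than “think tank” and “government”, are assumptions, so check them against the real JSON before relying on it.

```python
def is_think_tank(source_type: str) -> bool:
    """Return True when the type value marks a think tank source."""
    return "think tank" in source_type.lower()


def drop_think_tanks(records: list[dict]) -> list[dict]:
    # Exclude think tanks instead of keeping only "government", so that
    # intergovernmental sources (WHO, IMF, World Bank) are retained.
    # ASSUMPTION: each record exposes the type under a "type" key.
    return [r for r in records if not is_think_tank(r.get("type", ""))]


# Made-up records for illustration only; real names and values may differ.
sample = [
    {"name": "Example ministry", "type": "government"},
    {"name": "Example IGO", "type": "igo"},
    {"name": "Example institute", "type": "think tank"},
]
kept = drop_think_tanks(sample)
print([r["name"] for r in kept])
```

Filtering by exclusion keeps anything that isn’t explicitly a think tank, which is the safer default if the type field has values you haven’t seen yet.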
Directory structure and JSON metadata schema
When unzipped you’ll have a set of PDFs named like so:
<source id>-<document id>-<pdf id>.pdf
And a set of JSON documents named like so:
<source id>-<document id>.json
One document may have multiple PDFs associated with it. For example, a Congressional hearing report may have a separate PDF for each witness testimony, or some WHO guidelines may have a separate PDF for the same document in different languages.
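Given the naming scheme above, one way to pair each PDF with its JSON file is to strip the trailing pdf id from the filename. The sketch below assumes the pdf id itself never contains a hyphen, which the naming scheme doesn’t guarantee; the function names and sample filenames are made up for illustration.

```python
from collections import defaultdict
from pathlib import Path


def document_key(pdf_name: str) -> str:
    """Strip the trailing '-<pdf id>.pdf' to get '<source id>-<document id>'."""
    stem = Path(pdf_name).stem
    # ASSUMPTION: the pdf id contains no hyphen, so the last hyphen
    # separates it from the rest of the name.
    key, _, _pdf_id = stem.rpartition("-")
    return key


def group_pdfs(filenames: list[str]) -> dict[str, list[str]]:
    """Map each expected JSON filename to the PDFs that belong to it."""
    grouped = defaultdict(list)
    for name in filenames:
        if name.lower().endswith(".pdf"):
            grouped[document_key(name) + ".json"].append(name)
    return dict(grouped)


# Hypothetical filenames following the pattern described above.
files = ["src1-doc1-a.pdf", "src1-doc1-b.pdf", "src2-doc9-a.pdf", "src1-doc1.json"]
groups = group_pdfs(files)
print(groups)
```

In practice you would pass in the names from the unzipped directory, for example `[p.name for p in Path("dataset").iterdir()]`, where "dataset" stands in for wherever you extracted the zip.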
The JSON for each document contains the title, translated title (if the document title isn’t English), a snippet containing a brief description of the document, a published on date and a landing page (the page Overton originally collected the document from).
The “source info” object contains information about the source of the document e.g. the World Health Organization.
The “pdfs” object lists the PDFs and their filenames associated with this document. It also includes a list of DOIs that are cited by the PDF, along with the page number, paragraph number and text snippet where Overton thinks that reference was found.
Search strategy: in Overton we searched the policy tab, with 2020 selected as the year, using the query:
“2019-nCoV” OR “covid-19” OR “SARS-CoV-2” OR “coronavirus” OR “acute respiratory”
We then filtered out any documents that didn’t have at least one of those keywords in the title or extracted snippet.
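If you want to apply a similar keyword check yourself, a rough Python approximation might look like this. It assumes case-insensitive substring matching, which may not be exactly how Overton matched keywords, and the function name is made up for illustration.

```python
# Keywords taken from the search query above.
KEYWORDS = ["2019-nCoV", "covid-19", "SARS-CoV-2", "coronavirus", "acute respiratory"]


def mentions_keyword(title: str, snippet: str) -> bool:
    """Return True if any keyword appears in the title or the snippet."""
    # ASSUMPTION: case-insensitive substring match; the original filter
    # may have used different matching rules.
    text = f"{title} {snippet}".lower()
    return any(keyword.lower() in text for keyword in KEYWORDS)


print(mentions_keyword("Coronavirus Action Plan", ""))
print(mentions_keyword("Annual budget", "Spending review for 2020"))
```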
I’ve still got questions
Email us! email@example.com