We’ve tried to make it easy to use Overton as a university from the first login - and one way we do that is to automatically show you the policy documents that cite people or papers from your institution.
We can do that because the scholarly papers that are cited in policy have authors attached to them, and those authors have institutional affiliations - the ones displayed in the article PDF, or on the article page on the journal website.
That’s the theory anyway - in practice, getting this data in a form that’s usable is tricky. Scholarly publishers share metadata about their papers via an organization called Crossref, which makes its database open for anybody to access. Crossref is an incredibly important piece of scholarly infrastructure, and the team there do a lot of really great work helping and educating publishers and ensuring the data that they hold is useful and accessible.
Ultimately, though, they can only work with the data that publishers give them, and the quality of that data has historically been pretty spotty. While basic metadata like titles and author names is usually easy to get, other things like publication dates and affiliations are much harder.
Affiliations in particular weren’t considered “standard” metadata that should be deposited in Crossref until fairly recently, so lots of older papers are missing this data, and lots of papers from smaller publishers are still missing it.
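To make the gap concrete, here is a minimal sketch of reading affiliations out of a Crossref-style work record. It assumes the record has the shape the Crossref REST API returns for a work (an `author` list whose entries may carry an `affiliation` list of `{"name": ...}` objects). The sample record and the helper name `author_affiliations` are made up for illustration, and this is not our production code.

```python
from typing import Any


def author_affiliations(work: dict[str, Any]) -> list[tuple[str, list[str]]]:
    """Return (author name, affiliation strings) pairs from a Crossref-style work."""
    pairs = []
    for author in work.get("author", []):
        parts = (author.get("given"), author.get("family"))
        name = " ".join(part for part in parts if part)
        # "affiliation" is often an empty list, or missing altogether.
        affiliations = [
            aff["name"] for aff in author.get("affiliation", []) if aff.get("name")
        ]
        pairs.append((name, affiliations))
    return pairs


# Hypothetical record, hand-written for this example (not real Crossref data).
sample_work = {
    "title": ["An example paper"],
    "author": [
        {
            "given": "Ada",
            "family": "Example",
            "affiliation": [{"name": "University Collage London"}],
        },
        {"given": "Ben", "family": "Sample", "affiliation": []},
    ],
}

pairs = author_affiliations(sample_work)
# Authors for whom the publisher deposited no affiliation at all.
missing = [name for name, affiliations in pairs if not affiliations]
```

Even when an affiliation is present it is free text, typos included, which is the next problem.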
Furthermore, until recently affiliations were just snippets of text taken straight from author submission systems, potentially containing typos. How do you know that:
- University College London
- University Collage London
- Universität Hamburg
- University of Hamburg
… refer to just two institutions, not four?
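As a rough illustration of what matching these variants involves, here is a small sketch that normalizes names (strips accents, lowercases, drops filler words, translates a known foreign-language token) and then falls back to fuzzy matching for typos. The institution IDs, the alias table and the 0.85 cutoff are all assumptions picked for this example. Real systems are far more involved, and this is not a description of how any particular service does it.

```python
import difflib
import re
import unicodedata

# Hypothetical lookup tables, just large enough for the four names above.
TOKEN_ALIASES = {"universitat": "university"}
STOPWORDS = {"of", "the"}
CANONICAL = {
    "University College London": "inst:ucl",      # made-up ID
    "University of Hamburg": "inst:hamburg",      # made-up ID
}


def normalize(name: str) -> str:
    """Strip accents, lowercase, drop filler words and apply token aliases."""
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = re.findall(r"[a-z]+", unaccented.casefold())
    tokens = [TOKEN_ALIASES.get(t, t) for t in tokens if t not in STOPWORDS]
    return " ".join(tokens)


INDEX = {normalize(name): inst_id for name, inst_id in CANONICAL.items()}


def match_institution(raw: str, cutoff: float = 0.85):
    """Return an institution ID for a raw affiliation string, or None."""
    key = normalize(raw)
    if key in INDEX:
        return INDEX[key]
    # Fall back to fuzzy matching to absorb typos such as "Collage".
    close = difflib.get_close_matches(key, list(INDEX), n=1, cutoff=cutoff)
    return INDEX[close[0]] if close else None


for raw in [
    "University College London",
    "University Collage London",
    "Universität Hamburg",
    "University of Hamburg",
]:
    print(raw, "->", match_institution(raw))
```

The alias table is the weak point: every language and abbreviation needs its own entries, which is part of why doing this well at scale is hard.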
The solution for us until recently was to use Microsoft Academic Graph (MAG). MAG pulled in affiliation data from Crossref but supplemented this with data from other sources, including publisher websites - because Microsoft also makes Bing, they had the full text of many journal article pages. They then wrote algorithms to map different versions of institution names to standard IDs, one ID per university.
Without MAG we could find affiliation data for roughly 40% of the articles we saw cited in policy. With MAG this jumped to around 80%, which is about as good as it gets without involving researchers directly and at length: the data just isn’t available for lots of papers.
Unfortunately MAG announced that they were closing down at the start of 2021, so we were left with the prospect of having to pull in affiliation data for any given scholarly article ourselves. In fact, we built a system to do so, but it was complicated and high maintenance.
Then OpenAlex got released.
OpenAlex is an open replacement for MAG and I really can’t say enough good things about it. It’s new and so unsurprisingly has some teething problems - which the team are open about on Twitter - but it’s obvious that it’s written by people who care deeply about both scholarly metadata and making that data open and immediately usable, through APIs and by making the entire database downloadable for free.
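If you want to poke at the data yourself, something like the sketch below should work. It assumes the OpenAlex works endpoint accepts a DOI URL as the identifier and returns a work with an `authorships` list, each entry carrying an `institutions` list with `id` and `display_name` fields. The API may have changed since this was written, so check their documentation. The helper names and the sample record are invented for this example.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

OPENALEX_WORKS = "https://api.openalex.org/works/"


def work_url(doi: str) -> str:
    """Build the lookup URL for a DOI (assumed URL shape, see note above)."""
    return OPENALEX_WORKS + "https://doi.org/" + quote(doi, safe="/")


def fetch_work(doi: str) -> dict:
    """Fetch one work record over the network. Not called in this example."""
    with urlopen(work_url(doi), timeout=30) as response:
        return json.load(response)


def institutions_for(work: dict) -> dict[str, str]:
    """Map institution ID -> display name across all authors of a work."""
    found = {}
    for authorship in work.get("authorships", []):
        for inst in authorship.get("institutions", []):
            if inst.get("id"):
                found[inst["id"]] = inst.get("display_name") or ""
    return found


# Hand-written record in the assumed shape, with made-up IDs.
sample_work = {
    "authorships": [
        {
            "author": {"display_name": "Ada Example"},
            "institutions": [
                {"id": "https://openalex.org/I0000000001",
                 "display_name": "University College London"}
            ],
        },
        {"author": {"display_name": "Ben Sample"}, "institutions": []},
    ]
}

institutions = institutions_for(sample_work)
```

Because each institution has one stable ID, "which policy documents cite papers from my university" becomes a lookup on that ID rather than a text-matching exercise.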
We started using it in November and it has allowed us to drop a lot of code that integrated with various other systems to get affiliation data and to focus on what we do best - collecting and analyzing policy documents.
Best of all, because it’s open, it’s transparent: you can see exactly the same data as we do, and hopefully - because it’s a shared resource that lots of apps and services draw on - publishers and universities will have an incentive to fix any errors in one place.
Let us know if you’re curious about any implementation details - you can also take a look at our help page on the subject - and there’s more detail on OpenAlex on their website.