A search of some major chemistry publishers for FAIR data records.

In recent years, findable data has become ever more important (the F in FAIR). Here I test that F using the DataCite search service.

First, an introduction to this service. It searches a database of metadata about datasets and other research objects. One of the metadata properties is relatedIdentifier, which records other identifiers associated with the dataset, such as the DOI of any published article associated with the data, although it could equally be a pointer to a related dataset.

One can query thus:

1. https://search.datacite.org/works?query=relatedIdentifiers.relatedIdentifier:*
which retrieves a very healthy looking 6,179,287 works.
2. One can restrict this to a specific publisher by the DOI prefix assigned to that publisher:
?query=relatedIdentifiers.relatedIdentifier:10.1021*
which returns a respectable 210,240 works.
3. It turns out that the major contributors to FAIR are currently crystal structures from the CCDC. One can remove them from the search to see what is left over:
?query=(relatedIdentifiers.relatedIdentifier:10.1021*)+NOT+(identifier:*10.5517*)
and one is down to 14,213 works, many of which nevertheless still appear to be crystal structures. These may be links to other crystal datasets.
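
The three queries differ only in the DOI prefix and in whether the CCDC prefix is excluded, so they are easy to generate for any publisher. The following is a minimal sketch that only assembles the URLs shown above; the function name build_queries and its parameters are my own illustrative choices and not part of any DataCite API.

```python
# The base URL and field names are copied from the queries in the text.
# build_queries and its parameter names are hypothetical helpers.
BASE = "https://search.datacite.org/works?query="
FIELD = "relatedIdentifiers.relatedIdentifier"


def build_queries(publisher_prefix, excluded_prefix="10.5517"):
    """Return the three search URLs for one publisher DOI prefix."""
    related = f"{FIELD}:{publisher_prefix}*"
    excluded = f"identifier:*{excluded_prefix}*"
    return {
        # Any work that declares at least one related identifier.
        "search_1": f"{BASE}{FIELD}:*",
        # Works whose related identifier starts with the publisher prefix.
        "search_2": f"{BASE}{related}",
        # As search 2, but dropping works whose own identifier contains
        # the excluded prefix (10.5517 in the text).
        "search_3": f"{BASE}({related})+NOT+({excluded})",
    }


for name, url in build_queries("10.1021").items():
    print(name, url)
```

Calling build_queries with another publisher's DOI prefix gives the corresponding pair of searches for that publisher.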

I have performed searches 2 and 3 (the publisher-restricted query, without and with the CCDC entries excluded) for some popular publishers of chemistry (the same set that was analysed here).

Publisher	Search 2	Search 3
ACS	210,240	14,213
RSC	138,147	1,279
Elsevier	185,351	56,373
Nature	12,316	8,104
Wiley	135,874	9,283
Science	3,384	2,343
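
To make the table easier to compare across publishers, one can express search 3 as a percentage of search 2, i.e. the share of works that remain once the 10.5517 identifiers are excluded. This is only a sketch of that arithmetic using the counts in the table; the variable names are my own.

```python
# Counts copied from the table: (search 2, search 3) per publisher.
counts = {
    "ACS": (210_240, 14_213),
    "RSC": (138_147, 1_279),
    "Elsevier": (185_351, 56_373),
    "Nature": (12_316, 8_104),
    "Wiley": (135_874, 9_283),
    "Science": (3_384, 2_343),
}

# Percentage of each publisher's works that survive the exclusion.
remaining_share = {
    publisher: 100 * search_3 / search_2
    for publisher, (search_2, search_3) in counts.items()
}

# Print from the smallest remaining share to the largest.
for publisher, share in sorted(remaining_share.items(), key=lambda kv: kv[1]):
    print(f"{publisher:10s}{share:5.1f}%")
```

On these figures the remaining share ranges from under 1% for RSC to roughly 69% for Science.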

These publishers all have significant numbers of associated datasets which at least accord with the F of FAIR. Many datasets may lack metadata that points back to a published article, since this link can often be added only when the DOI of that article appears, in other words AFTER the publication of the dataset. So these numbers are probably underestimates rather than overestimates.

How about the other way around? Rather than datasets that have a journal article as a related identifier, could we search for articles that have a dataset as a related identifier?

?query=(identifier:*10.1039*)+AND+(relatedIdentifiers.relatedIdentifier:*)
which, rather mysteriously, returns nothing. It may be that there is no mapping of this search between the CrossRef and DataCite metadata schemas.
And just to show the searches are behaving as expected:
?query=(relatedIdentifiers.relatedIdentifier:10.1021*)+AND+(identifier:*10.5517*)
returns 196,027 works.
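
The check is that the ACS works excluded by the NOT search and those selected by the AND search should together account for every work in the publisher-restricted search. A short sketch of that arithmetic, using the three ACS counts quoted in this post (the variable names are my own):

```python
# ACS counts quoted in the text.
search_2 = 210_240      # related identifier starts with 10.1021
search_3 = 14_213       # as above, NOT identifier containing 10.5517
ccdc_linked = 196_027   # as above, AND identifier containing 10.5517

# The NOT and AND subsets should partition the search 2 result.
assert search_3 + ccdc_linked == search_2

print(f"CCDC share of ACS-linked works: {ccdc_linked / search_2:.1%}")
```

The two subsets do sum exactly to 210,240, and the CCDC entries make up about 93% of the ACS-linked works.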

It will also be of interest to show how these numbers change over time. Is there an exponential increase? We shall see.

Finally, we have not really explored adherence to, e.g., the AIR of FAIR. That is for another post.

Author

Henry Rzepa

Henry Rzepa is Emeritus Professor of Computational Chemistry at Imperial College London.

Tags: Academic publishing, DataCite, Digital Object Identifier, Digital technology, Elsevier, Findability, Identifiers, Information, Information architecture, Information science, Knowledge, Knowledge representation, search service, Web design

This entry was posted on Friday, April 12th, 2019 at 5:18 pm and is filed under Chemical IT.

One Response to “A search of some major chemistry publishers for FAIR data records.”

Henry Rzepa says:

April 26, 2019 at 7:15 am

I noted above the asymmetry between pointers from data to related identifiers such as articles and pointers in the reverse direction, from articles to data.

Ian Bruno has kindly sent me three links which highlight or start to address this issue:
1. https://blog.datacite.org/citation-analysis-scholix-rda/ (Glad You Asked: A Snapshot of the Current State of Data Citation)
2. https://doi.org/10.1038/sdata.2018.259 (A data citation roadmap for scientific publishers)
3. https://dliservice.research-infrastructures.eu/#/ (Search and browse in 620.000 literature objects, 2.600.000 datasets, 18.000.0000 bi-directional scholix links, from 1000 publishers, 10 data centers, CrossRef, DataCite, and OpenAIRE.)
It seems lots is starting to happen!


Henry Rzepa's Blog