In recent years, findable data has become ever more important (the F in FAIR). Here I test that F using the DataCite search service.
First, an introduction to this service. It is a metadata database describing datasets and other research objects. One of the properties is relatedIdentifier, which records other identifiers associated with the dataset, such as the DOI of any published article associated with the data, although it can also point to related datasets.
One can query thus:
- https://search.datacite.org/works?query=relatedIdentifiers.relatedIdentifier:*
  which retrieves the very healthy looking 6,179,287 works.
- One can restrict this to a specific publisher by the DOI prefix assigned to that publisher (here 10.1021, the ACS):
  ?query=relatedIdentifiers.relatedIdentifier:10.1021*
  which returns a respectable 210,240 works.
- It turns out that the major contributors to FAIR currently are crystal structures from the CCDC (DOI prefix 10.5517). One can remove them from the search to see what is left over:
  ?query=(relatedIdentifiers.relatedIdentifier:10.1021*)+NOT+(identifier:*10.5517*)
  and one is down to 14,213 works, of which many nevertheless still appear to be crystal structures. These may be links to other crystal datasets.
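The three query strings above differ only in the DOI prefix and the exclusion clause, so they can be composed programmatically. The following is a minimal sketch that only builds the strings as they appear in this post; it does not contact the service. The helper names (`related_to_prefix`, `excluding_ccdc`, `search_url`) are my own illustrative choices, not part of any DataCite library.

```python
# Sketch: compose the query strings used in this post.
# All function names here are hypothetical helpers, not a DataCite API.

BASE_URL = "https://search.datacite.org/works"
CCDC_PREFIX = "10.5517"  # DOI prefix of the CCDC, as used in the post


def related_to_prefix(prefix: str) -> str:
    """Works whose relatedIdentifier starts with the given DOI prefix."""
    return f"relatedIdentifiers.relatedIdentifier:{prefix}*"


def excluding_ccdc(prefix: str) -> str:
    """Same as related_to_prefix, but dropping works with a CCDC DOI."""
    return f"({related_to_prefix(prefix)})+NOT+(identifier:*{CCDC_PREFIX}*)"


def search_url(query: str) -> str:
    """Attach a query string to the search endpoint quoted in the post."""
    return f"{BASE_URL}?query={query}"


if __name__ == "__main__":
    acs = "10.1021"
    print(search_url(related_to_prefix("")))   # any related identifier
    print(search_url(related_to_prefix(acs)))  # restricted to one publisher
    print(search_url(excluding_ccdc(acs)))     # publisher, CCDC removed
```

Swapping the prefix passed to these helpers gives the equivalent queries for any other publisher.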
I have performed searches 2 and 3 (the publisher-restricted query, and the same query with CCDC entries excluded) for some popular publishers of chemistry (the same set that was analysed here).
Publisher | Search 2 (related works) | Search 3 (excluding CCDC) |
---|---|---|
ACS | 210,240 | 14,213 |
RSC | 138,147 | 1,279 |
Elsevier | 185,351 | 56,373 |
Nature | 12,316 | 8,104 |
Wiley | 135,874 | 9,283 |
Science | 3,384 | 2,343 |
These publishers all have significant numbers of datasets which at least accord with the F of FAIR. Many datasets may not have metadata that points back to a published article, since that link can only be added once the DOI of the article exists, in other words AFTER the publication of the dataset. So these numbers are probably underestimates.
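To make the table easier to compare across publishers, one can express search 3 as a fraction of search 2, that is, the share of related works that remain once CCDC entries are excluded. This is a small sketch using only the counts tabulated above; the function name `non_ccdc_share` is my own.

```python
# Counts copied from the table: (search 2, search 3) per publisher.
counts = {
    "ACS": (210_240, 14_213),
    "RSC": (138_147, 1_279),
    "Elsevier": (185_351, 56_373),
    "Nature": (12_316, 8_104),
    "Wiley": (135_874, 9_283),
    "Science": (3_384, 2_343),
}


def non_ccdc_share(search2: int, search3: int) -> float:
    """Fraction of a publisher's related works left after removing CCDC."""
    return search3 / search2


if __name__ == "__main__":
    for publisher, (s2, s3) in counts.items():
        print(f"{publisher:<9} {non_ccdc_share(s2, s3):6.1%}")
```

On these figures the share is smallest for the RSC and largest for Science, which suggests that the publishers with the largest totals owe most of them to crystal structures.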
How about the other way around? Rather than datasets that have a journal article as a related identifier, we could search for articles that have a dataset as a related identifier.
- ?query=(identifier:*10.1039*)+AND+(relatedIdentifiers.relatedIdentifier:*)
  returns, rather mysteriously, nothing. It might be that there is no mapping of this search between the CrossRef and DataCite metadata schemas.
- And just to show the searches are behaving as expected:
  ?query=(relatedIdentifiers.relatedIdentifier:10.1021*)+AND+(identifier:*10.5517*)
  returns 196,027 works.
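The reason this last result shows the searches behaving as expected is that the AND and NOT variants should split the publisher-restricted set into two non-overlapping parts. A quick check with the ACS figures quoted in this post (the function name `is_partition` is my own):

```python
def is_partition(total: int, without_ccdc: int, with_ccdc: int) -> bool:
    """True if the NOT and AND counts add up to the unrestricted count."""
    return without_ccdc + with_ccdc == total


if __name__ == "__main__":
    # ACS: search 2 total, search 3 (NOT CCDC), and the AND CCDC search.
    print(is_partition(210_240, 14_213, 196_027))  # prints True
```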
It will also be of interest to show how these numbers change over time. Is there an exponential increase? We shall see.
Finally, we have not really explored adherence to the rest of FAIR, i.e. the AIR (accessible, interoperable, reusable). That is for another post.
I noted above the asymmetry between pointers from data to related identifiers such as articles compared to the reverse direction of pointers from articles to data.
Ian Bruno has kindly sent me three links which highlight or start to address this issue:
It seems lots is starting to happen!