In a welcome move, one of the American Chemical Society journals has published an encouragement to submit what is called FAIR data to the journal.[cite]10.1021/acs.orglett.0c00383[/cite] A reminder that FAIR data is data that can be Found (F), Accessed (A), Interoperated (I) and Re-used (R). I thought I might try to explore this new tool here.
You start at the ACS Research Data Center with the tag line Submit your NMR Data. By this they mean the primary or “raw” NMR data as it emerges from a spectrometer. At this point I would note that primary data is not necessarily FAIR data yet. It is, however, a great deal more easily interoperated and re-used than, say, the more conventional form of such data, which is a visual spectrum stored as a PDF file. If you did want to re-analyse the data, the primary data is the place to start, not the PDF spectrum!
The tool next asks you to drop your FID file into the upload area. Depending on the spectrometer type, this can take the form of a ZIP archive of various instrument files (typical of Bruker spectrometers) or just a single file (JDF, typical of Jeol spectrometers). The next request is for some “metadata” such as Title, Funder and Author(s), with an additional request to provide an ORCID for the latter. All these are easily provided. It was at the next step that my exploration on this occasion had to stop, since the next button takes you to the Manuscript submission page, which can only be followed if you have a manuscript to complete!
What would I expect to happen next? Well, this metadata has to be augmented with molecule metadata, such as for example an InChI of the molecule. This is what would turn our primary data into fully FAIR data. To complete the process, the data and its now completed metadata descriptors would need to be Registered, in order to facilitate its discovery and hence enable the F of FAIR. This is normally completed with the DataCite registration agency, and in exchange you get a DOI corresponding to the registered metadata. You can then infer a link of the type https://data.datacite.org/application/vnd.datacite.datacite+xml/…your-allocated…DOI which allows you to inspect the metadata and search for it (see e.g. DOI: 10.14469/hpc/5920 for examples of such searches). Currently I do not know if this happens with this ACS tool. I would certainly like to inspect the collected metadata before I could comment on whether the title of this post is accurate, i.e. the encouragement of FAIR data. It would also be interesting to see what (if any) procedures are used to generate an InChI for the molecule and its NMR data, and exactly how that is also included in the metadata.
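As a small illustration of the link pattern just described, here is a minimal Python sketch that builds the metadata URL from a DOI. The function name `datacite_metadata_url` is my own invention for this example, and the sketch only constructs the address; it does not check whether DataCite still serves metadata at it.

```python
from urllib.parse import quote

# URL prefix exactly as quoted in the text above.
DATACITE_XML_BASE = "https://data.datacite.org/application/vnd.datacite.datacite+xml/"


def datacite_metadata_url(doi: str) -> str:
    """Return the DataCite XML metadata link for a DOI (hypothetical helper)."""
    doi = doi.strip()
    # Accept DOIs pasted as full resolver links or with a "doi:" prefix.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Percent-encode unusual characters, but keep "/" so the
    # prefix/suffix structure of the DOI stays intact.
    return DATACITE_XML_BASE + quote(doi, safe="/")


print(datacite_metadata_url("10.14469/hpc/5920"))
```

Pasting the printed address into a browser is then enough to look at whatever metadata was registered for that DOI.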
I would also note one other crucial aspect of this process: how to enable the A of FAIR. Primary or raw NMR data is entirely opaque (the files themselves are often binary encoded) and you do need a tool to transform this data into visual or spectral form. So you will need to acquire such a tool, most often in the form of software such as MestReNova or TopSpin. This can be a complex process, and may well involve paying the vendors money. In this context, I would note the Mpublish tool,[cite]10.1021/acsomega.8b03005[/cite] which allows a single, free-to-use license to be generated, allowing e.g. MestReNova to be freely used for that dataset only. Some form of suitable Access to a FAIR dataset is an essential (if often unmentioned) component of the process.
At this stage, therefore, there are quite a few questions about this new ACS system which I cannot provide answers to. On these answers will depend whether the process can be truly described as the submission of FAIR data. If anyone reading this manages to complete the process above, do please describe the subsequent experiences. I fancy there will have to be a future follow-up to this post! Meanwhile, if you do have a manuscript you are ready to submit, give it a go and perchance report your experiences here!
Here is some analysis of the current procedures used by J. Org. Chem. to associate data with an article, a system which has probably been in operation for around two years. Take, for example, DOI: https://doi.org/10.1021/acs.joc.9b02631 with the title “Design, Synthesis, and Study of Lactam and Ring-Expanded Analogues of Teixobactin”. The same title occurs in Figshare (a data repository) with the URL https://figshare.com/articles/Design_Synthesis_and_Study_of_Lactam_and_Ring-Expanded_Analogues_of_Teixobactin/11342669 and one of the datasets found there is indicated as being citable as https://doi.org/10.1021/acs.joc.9b02631.s002, which looks like a regular DOI, but with a suffix appended so as to be distinct from the article itself.
There is, however, something rather odd about this apparent DOI. It fails to resolve (try for yourself), and a corollary is that if it is not registered with a resolver, it cannot have any registered metadata associated with it either. As I noted in the post above, FAIR data can be defined in one sense by the richness of the registered metadata describing it. A lack of metadata also carries the implication that the data is not FAIR.
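For anyone who would rather script the "try for yourself" test, here is a hedged Python sketch. It assumes the handle lookup service at https://doi.org/api/handles/ behaves as I understand it, returning JSON with `responseCode` equal to 1 for a registered DOI and an HTTP 404 error otherwise; the function name `doi_resolves` and the injectable `opener` argument are my own choices for illustration.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

# Assumed lookup endpoint of the DOI proxy (not taken from the post itself).
HANDLE_API = "https://doi.org/api/handles/"


def doi_resolves(doi, opener=urllib.request.urlopen):
    """Return True if the DOI proxy reports the DOI as registered."""
    request = urllib.request.Request(HANDLE_API + urllib.parse.quote(doi, safe="/"))
    try:
        with opener(request, timeout=10) as response:
            # Assumption: responseCode 1 means the handle was found.
            return json.load(response).get("responseCode") == 1
    except urllib.error.HTTPError:
        # The proxy answered with an error status: treat as unregistered.
        return False
    # Network failures (urllib.error.URLError) are deliberately not caught,
    # so "could not ask" is never confused with "does not resolve".
```

Calling `doi_resolves("10.1021/acs.joc.9b02631.s002")` and `doi_resolves("10.1021/acs.joc.9b02631")` would then allow the dataset identifier to be compared with the DOI of the article itself. Asking the proxy directly, rather than following the DOI to its landing page, avoids being misled by a publisher site that refuses automated requests.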
So the question now is whether Encouraging Submission of FAIR Data at The Journal will result in a different model from the one currently employed by the journal. I have written to the Data Editor of this journal to find out, and am awaiting a reply. When it comes, I hope to share it here.
I pondered above what might happen to the metadata gathered for a raw NMR dataset (FID). One option is to register it with DataCite, in which case it would have to be cast to correspond to their Schema. Another schema is schema.org, with the declared mission statement being “Schema.org is a collaborative, community activity with a mission to create, maintain, and promote schemas for structured data on the Internet, on web pages, in email messages, and beyond.” It was founded by Google, Microsoft, Yahoo and Yandex, presumably to enhance the quality of their own indexing. In Google’s case, it probably feeds e.g. https://datasetsearch.research.google.com and with Schema.org markup installed on your site, their crawler can then richly harvest your metadata.
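To give a feel for what such markup might look like, here is a minimal Python sketch that assembles a schema.org `Dataset` description as JSON-LD. It is purely illustrative and not the ACS implementation: the DOI is a made-up placeholder, the author is the fictitious Josiah Carberry with the sample ORCID used in ORCID's own documentation, the InChI is that of ethanol, and recording the InChI as a generic `PropertyValue` under `about` is my own assumption about one way it could be done.

```python
import json


def dataset_jsonld(title, doi, author_name, orcid, inchi):
    """Build a schema.org Dataset record as a JSON-LD dictionary (sketch)."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": title,
        "identifier": "https://doi.org/" + doi,
        "creator": {
            "@type": "Person",
            "name": author_name,
            "identifier": "https://orcid.org/" + orcid,
        },
        # Assumption: the molecule is described by a generic PropertyValue
        # holding its InChI; other encodings are equally possible.
        "about": {
            "@type": "PropertyValue",
            "propertyID": "InChI",
            "value": inchi,
        },
    }


record = dataset_jsonld(
    title="Raw 1H NMR data (FID) for ethanol",   # illustrative title
    doi="10.5555/example-nmr-fid",               # placeholder, not a real DOI
    author_name="Josiah Carberry",               # fictitious example author
    orcid="0000-0002-1825-0097",                 # ORCID's documented sample iD
    inchi="InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",   # ethanol
)
print(json.dumps(record, indent=2))
```

The printed JSON would typically be embedded in the landing page of the dataset inside a `script` element of type `application/ld+json`, which is where a crawler would look for it.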
I understand that this latter option is what the J. Org. Chem./Org. Lett. project will aim for in the first instance. At the moment, I have no examples of searches using Google that might benefit from this procedure. When they become available, it would be interesting to see how such searches compare with those that can be enabled at DataCite, or, for the chemistry community, with those offered by commercial indexing organisations such as Scifinder or Reaxys.