Harnessing FAIR data: A suggested useful persistent identifier (PID) for quantum chemical calculations.

Harnessing FAIR data is an event being held in London on September 3rd; no doubt all the speakers will espouse its virtues and speculate about how to realize its potential. Admirable aspirations indeed. Capturing hearts and minds also needs lots of real-life applications! Whilst assembling a forthcoming post on this blog, I realized I might have one nice application which also pushes the envelope a bit further, in a manner that I describe below.

The post I refer to above is about using quantum chemical calculations to chart possible mechanistic pathways for the reaction between a carboxylic acid and an amine to form an amide. The FAIR data for the entire project is collected at DOI: 10.14469/hpc/4598. Part of what makes it FAIR is that metadata about this data is not only collected but also formally registered with the DataCite agency. Registration in turn enables Finding; it is this aspect I want to demonstrate here.

The metadata for the above DOI includes information such as;

  1. The ORCID persistent identifier (PID) for the creator of the data (in this instance myself)
  2. Date stamps for the original creation date and subsequent modifications.
  3. A rights declaration, in this case the CC0 license which describes how the data can be re-used.
  4. Related identifiers, in this case describing members of this collection.

The data itself is held in the members of the collection, each of which is described by a more specific set of metadata in addition to the more general types in the above list (e.g. 10.14469/hpc/4606).

  1. One important additional metadata descriptor is the ORE locator (Object Re-use and Exchange, itself almost a synonym for FAIR). This allows a machine to deduce a direct path to the data file itself, and hence to retrieve it automatically if desired. It is important to note that the DOI itself (i.e. 10.14469/hpc/4606) points only to the “landing page” for the dataset, and does not necessarily describe the direct path to any specific file in the dataset. The ORE path can be used with e.g. software such as JSmol to directly load a molecule based only on its DOI. You can see an example of this here.
  2. Each molecule-based dataset contains additional specific metadata relating to the molecule itself. For example this is how the InChIKey, an identifier specific to that molecule, is expressed in metadata;
    <subject subjectScheme="inchikey" schemeURI="http://www.inchi-trust.org/">PVXKWVPAMVWJSQ-UHFFFAOYSA-N</subject>
    The advantage of expressing the metadata in this way is that a general search of the type:
    https://search.datacite.org/works?query=subjects.subjectScheme:inchikey+AND+subjects.subject:CZABGBRSHXZJCF-UHFFFAOYSA-N
    can be used to track down any molecule with metadata corresponding to the above InChIKey.
  3. Here is more metadata, introduced in this blog. It relates to the (computed) value of the Gibbs energy (in Hartree), as returned by the Gaussian program;
    <subject subjectScheme="Gibbs_Energy" schemeURI="https://goldbook.iupac.org/html/G/G02629.html" valueURI="http://gaussian.com/thermo/">-649.732417</subject>
    I argue here that it represents a unique identifier for a calculation on a molecule using the quantum mechanical procedures implemented in e.g. Gaussian. This identifier differs from the InChIKey in that it can be truncated to provide different levels of information.

    • At the coarsest level, a search of the type
      https://search.datacite.org/works?query=subjects.subjectScheme:Gibbs_Energy+AND+subjects.subject:\-649.*
      should reveal all molecules with the same number of atoms and electrons whose Gibbs energy has been calculated, but not necessarily with the same InChI (i.e. they may be isomers, transition states, etc.). This level might be useful for revealing most (not necessarily all) molecules involved in, say, a reaction mechanism. It should also be insensitive to the program system used, since most quantum codes will return Gibbs energies that agree to probably 0.01 Hartree if the same procedures have been used (i.e. DFT method, basis set, solvation model and dispersion correction).
    • The top level of precision however is high enough to almost certainly relate to a specific molecule and probably using a specific program;
      https://search.datacite.org/works?query=subjects.subjectScheme:Gibbs_Energy+AND+subjects.subject:\-649.732417
    • The searcher can experiment with different levels of precision to narrow or broaden the search.
    • I would also address the issue (before someone asks) of why I have used the Gibbs energy rather than the Total energy. Put simply, the Gibbs energy is far more useful in a chemical context. It can be used to relate the relative Gibbs energies of different isomers of the same molecule to e.g. the equilibrium constant that might be measured. Or the difference in Gibbs energies between a reactant and a transition state can be used to derive the free energy activation barrier for a reaction. The total energy is not so useful in such contexts, although of course it too could be added as a subject in the metadata above if a real use for it is found.
  4. The searcher can also use Boolean combinations of metadata, such as specifying both the InChIKey and the Gibbs Energy, along with say the ORCID of the person who may have published the data;
    https://search.datacite.org/works?query=subjects.subjectScheme:Gibbs_Energy+AND+subjects.subject:\-649.*+AND+subjects.subjectScheme:inchikey+AND+subjects.subject:CZABGBRSHXZJCF-UHFFFAOYSA-N+AND+contributors.nameIdentifiers.nameIdentifier:*0000-0002-8635-8390
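The search URLs in the list above all follow the same pattern, so they can be assembled programmatically. What follows is only a minimal sketch: the helper names (`escape_value`, `truncate_energy`, `subject_clause`, `orcid_clause`, `build_query`) are invented for illustration, and it simply reproduces the URL form quoted in this post without checking that the search service still accepts it.

```python
SEARCH_BASE = "https://search.datacite.org/works?query="


def escape_value(value: str) -> str:
    """Prefix a leading minus sign with a backslash, as in the URLs above."""
    return "\\" + value if value.startswith("-") else value


def truncate_energy(energy: str, decimals: int) -> str:
    """Keep `decimals` digits after the point (no rounding) and add a wildcard."""
    whole, _, fraction = energy.partition(".")
    if decimals >= len(fraction):
        return energy  # full precision requested, so no wildcard is needed
    return f"{whole}.{fraction[:decimals]}*"


def subject_clause(scheme: str, value: str) -> str:
    """Pair a subject scheme with a subject value."""
    return (f"subjects.subjectScheme:{scheme}"
            f"+AND+subjects.subject:{escape_value(value)}")


def orcid_clause(orcid: str) -> str:
    """The leading * lets the full https://orcid.org/ form match as well."""
    return f"contributors.nameIdentifiers.nameIdentifier:*{orcid}"


def build_query(clauses: list[str]) -> str:
    """Join the clauses with the Boolean AND used in the post."""
    return SEARCH_BASE + "+AND+".join(clauses)


url = build_query([
    subject_clause("Gibbs_Energy", truncate_energy("-649.732417", 0)),
    subject_clause("inchikey", "CZABGBRSHXZJCF-UHFFFAOYSA-N"),
    orcid_clause("0000-0002-8635-8390"),
])
print(url)
```

Changing the second argument of `truncate_energy` from 0 to, say, 2 or 6 narrows the search in the way described for the different levels of precision.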

I have tried to show above how FAIR data implies some form of rich (registered) metadata, and how that metadata can be used to Find (the F in FAIR) data with very specific properties, thus Harnessing FAIR data.
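As one illustration of putting found values to use, the relation mentioned earlier between the relative Gibbs energies of two isomers and an equilibrium constant can be sketched in a few lines. This is a sketch under stated assumptions: the function names are invented here, the temperature defaults to 298.15 K, and the second energy in the usage example is a made-up number, not a value from the dataset.

```python
import math

HARTREE_TO_KJ_PER_MOL = 2625.4996   # conversion factor, Hartree to kJ/mol
R_KJ_PER_MOL_K = 8.314462618e-3     # gas constant in kJ/(mol K)


def relative_gibbs_kj(g_a_hartree: float, g_b_hartree: float) -> float:
    """Gibbs energy of B relative to A, converted from Hartree to kJ/mol."""
    return (g_b_hartree - g_a_hartree) * HARTREE_TO_KJ_PER_MOL


def equilibrium_constant(delta_g_kj: float, temperature: float = 298.15) -> float:
    """K = exp(-dG / RT) for the equilibrium A <=> B."""
    return math.exp(-delta_g_kj / (R_KJ_PER_MOL_K * temperature))


# -649.732417 is the value quoted in the post; -649.730000 is a
# hypothetical isomer energy used purely for illustration.
delta_g = relative_gibbs_kj(-649.732417, -649.730000)
print(f"dG = {delta_g:.2f} kJ/mol, K = {equilibrium_constant(delta_g):.3g}")
```

The same difference taken between a reactant and a transition state gives the free energy of activation instead.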


It is a current limitation of the V4.1 DataCite schema that there appears no way to specify the data type of the subject, including any units.

In theory, a range query of the type:
https://search.datacite.org/works?query=subjects.subjectScheme:Gibbs_energy+AND+subjects.subject:[\-649.1 TO \-649.8] should be more specific, but I have not yet gotten it to work, probably because the lack of data-typing means it is not recognised as a range of numeric values.
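One possible workaround is to run the broader wildcard search and then apply the numeric range on the client side. A minimal sketch, assuming the subject values have already been retrieved as plain strings; the function name and the sample list are invented for illustration.

```python
def in_energy_range(values: list[str], low: float, high: float) -> list[float]:
    """Return the numeric subject values lying between low and high (inclusive)."""
    lower, upper = sorted((low, high))  # accept the bounds in either order
    selected = []
    for text in values:
        try:
            number = float(text)
        except ValueError:
            continue  # subjects are untyped strings, so skip non-numeric ones
        if lower <= number <= upper:
            selected.append(number)
    return selected


# Illustrative values only; a real list would come from the search results.
sample = ["-649.732417", "-649.05", "Minerals", "-650.2"]
print(in_energy_range(sample, -649.1, -649.8))
```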

Implicit in this search is the grouping
https://search.datacite.org/works?query=(subjects.subjectScheme:Gibbs_Energy+AND+subjects.subject:\-649.*) + (subjects.subjectScheme:inchikey+AND+subjects.subject:CZABGBRSHXZJCF-UHFFFAOYSA-N)+AND+contributors.nameIdentifiers.nameIdentifier:*0000-0002-8635-8390
Currently however DataCite do not correctly honour this form of grouping.

Video of the speakers and the panel session at the end is now available.


9 Responses to “Harnessing FAIR data: A suggested useful persistent identifier (PID) for quantum chemical calculations.”

  1. I agree with the post. DataCite has a lot of great features for annotating data and creating persistent identifiers – whether on a university repository, group website, or a service like FigShare.

    But the cost… We minted 100k DOIs for almost nothing to create the Pitt Quantum Repository: https://pqr.pitt.edu/ – that was through the EZID service which has now been discontinued for DataCite.

    We’re being told that the whole university can only mint ~10k identifiers per year, which makes it very hard to annotate individual calculation sets.

    Considering I’m part of an NSF-funded effort to create a standardized computational chemistry repository, I’m hoping we can iron out the pricing issue, though.

  2. Henry Rzepa says:

    The cost seems to derive from the DataCite agent used. Thus in the UK, we use the British Library as agent. Although the cost is “hidden” because it is borne by e.g. the Imperial central library, we have never been informed of any limitation to the number of DOIs that can be minted. In one project, we minted ~200,000 over a period of about one week, with no issue. So any restrictions do seem to be associated with the agent, and no doubt can be subjected to negotiations.

  3. Henry Rzepa says:

    Geoff,

    I have had a look at the metadata associated with entries at https://pqr.pitt.edu/ using e.g.
    https://data.datacite.org/application/vnd.datacite.datacite+xml/10.17614/Q45H7C63J
    to retrieve the metadata for one entry (anisole).

    It is pretty bare bones,
    <resource xmlns="http://datacite.org/schema/kernel-3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://datacite.org/schema/kernel-3
    http://schema.datacite.org/meta/kernel-3/metadata.xsd">
    <identifier identifierType="DOI">10.17614/Q45H7C63J</identifier>
    <creators>
    <creator>
    <creatorName>Pitt Quantum Repository</creatorName>
    </creator>
    </creators>
    <titles>
    <title>anisole</title>
    </titles>
    <publisher>University of Pittsburgh</publisher>
    <publicationYear>2015</publicationYear>
    <resourceType resourceTypeGeneral="Dataset"></resourceType>
    </resource>

    Have you thought about enriching it, along the lines of including a rights declaration (CC0?), a full date stamp (not just the year), an ORE resource map (useful for attaching a toolkit to the data, such as Avogadro or JSmol), an InChI and perhaps some form of energy? This would allow such fields to be searched for. It also opens up the possibility of discovering molecules with the same values of these metadata fields in other repositories.
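A quick way to see which of these enrichments a record already carries is to parse its XML and look for the corresponding elements. This is a minimal sketch, assuming the DataCite kernel-3 element names `rightsList`, `dates` and `subjects`; the function name is invented, the embedded record is an abbreviated copy of the one above, and the ORE resource map is left out because how it would appear in the record is not shown here.

```python
import xml.etree.ElementTree as ET

NS = {"dc": "http://datacite.org/schema/kernel-3"}

# Abbreviated copy of the anisole record quoted above.
RECORD = """<resource xmlns="http://datacite.org/schema/kernel-3">
  <identifier identifierType="DOI">10.17614/Q45H7C63J</identifier>
  <titles><title>anisole</title></titles>
  <publisher>University of Pittsburgh</publisher>
</resource>"""

# Label for each enrichment, and the element path that would carry it.
WANTED = {
    "rights declaration": "dc:rightsList/dc:rights",
    "date stamp": "dc:dates/dc:date",
    "subject terms": "dc:subjects/dc:subject",
}


def missing_enrichments(xml_text: str) -> list[str]:
    """List the enrichments for which the record has no element."""
    root = ET.fromstring(xml_text)
    return [label for label, path in WANTED.items()
            if root.find(path, NS) is None]


print(missing_enrichments(RECORD))
```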

  4. Henry Rzepa says:

    Re the NSF computational repository project: have you considered e.g. joining

    https://sites.google.com/view/digchem/ and in particular https://sites.google.com/view/digchem/datacite-recommendations

    where exploitation of e.g. the DataCite schema for chemistry will be discussed. It is vital that the community uses a common dictionary so that searching is properly facilitated.

  5. Henry Rzepa says:

    The expression for the Gibbs_Energy metadata was given above as

    <subject subjectScheme="Gibbs_Energy" schemeURI="https://goldbook.iupac.org/html/G/G02629.html" valueURI="http://gaussian.com/thermo/">-649.732417</subject>

    A better, more robust expression could be

    <subject subjectScheme="Gibbs_Energy" schemeURI="https://doi.org/10.1351/goldbook.G02629" valueURI="http://gaussian.com/thermo/">-649.732417</subject>

    where the parochial path to the Gold Book term is now expressed by a more persistent DOI identifier. We have changed the metadata generator on our repository to use this new form, which is now recommended.
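For illustration only, a subject element of this recommended form could be produced with a few lines of code. This sketch is not the repository's actual metadata generator; the function name is invented and the element is built without any namespace handling.

```python
import xml.etree.ElementTree as ET


def gibbs_subject(energy_hartree: str) -> ET.Element:
    """Build a <subject> element carrying a computed Gibbs energy."""
    subject = ET.Element("subject", {
        "subjectScheme": "Gibbs_Energy",
        # DOI form of the Gold Book term, as recommended above
        "schemeURI": "https://doi.org/10.1351/goldbook.G02629",
        "valueURI": "http://gaussian.com/thermo/",
    })
    # Kept as a string so the printed precision is preserved exactly.
    subject.text = energy_hartree
    return subject


print(ET.tostring(gibbs_subject("-649.732417"), encoding="unicode"))
```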

    I am also investigating whether the valueURI term could also be expressed as a DOI.

  6. Henry Rzepa says:

    The concept that metadata is key to the “rapid discovery of data” is also adopted by commercial data-management solutions. One example is http://www.arcitecta.com/Products/Mediaflux, where apparently metadata is the key, along with automated workflows. There are examples of its use in healthcare, genome research and oceanography, but not chemistry.

  7. Henry Rzepa says:

    Here is a comment from a long thread about data (or its absence) in a recent article on a claimed room temperature superconductor; http://blogs.sciencemag.org/pipeline/archives/2018/08/13/a-room-temperature-superconductor-well#comment-295123

    “In particular final recipes were corrected in final submissions or even as late as galley proofs before final publication.” This is a strong argument for making date-stamped FAIR data available at the time of submission/refereeing. Date-stamping in turn should be to the nearest second, not just the year. Date-stamps help to protect authors from suspected unethical behaviour by referees, and are usefully and significantly separate from article submission/publication dates.

  8. Henry Rzepa says:

    Here is an interesting search;
    https://search.datacite.org/works?query=subjects.subjectScheme:*

    which reveals that 2,138,090 works have a subject term specified. This looked quite encouraging until I inspected how they might be defined. Thus one example was

    Minerals

    which in fact is semantically void and not particularly useful. The term “parameter” is useless unless it is defined further and has some context placed upon it. But we are slowly getting there!

  9. Henry Rzepa says:

    At PIDapalooza 2019, a new engine powering the DataCite search queries was explained, with the old Solr base being replaced by ElasticSearch. Unfortunately, the new interface is not backwards compatible and all the search syntax has changed.

    I have refactored the content above into the new syntax. The changes include

    1. Each element has to have the full hierarchy reflected in the syntax. Thus subjectScheme is replaced by subjects.subjectScheme.

    2. The old ORCID shortcut now formally reflects the schema, thus contributors.nameIdentifiers.nameIdentifier:*0000-0002-8635-8390

    Note the prefix wildcard in the ORCID, which allows for the full proper ORCID, which is prefixed with https://orcid.org/ to be found.

    3. The “+” character, which represented a Boolean AND, is now replaced by +AND+

    4. The “-” of a negative floating point number has to be escaped, thus subject:\-649.* where \ allows the minus sign to be recognised and the wildcard * matches any digits following the decimal point.

    5. The range query now seems to work, i.e. [\-649.1 TO \-649.8] now returns a value of -649.7

    6. The issue relating to grouping noted in the post is now registered; https://github.com/datacite/lupo/issues/189
