Archive for the ‘Chemical IT’ Category
Raw data and the evolution of crystallographic FAIR data. Journals, processed and raw structure data.
Monday, March 28th, 2022
In my previous post on the topic, I introduced the concept that data can come in several forms, most commonly as “raw” or primary data and as a “processed” version of this data that has added value. In crystallography, the chemist is interested in this processed version, carried by a CIF file. However, on the rare occasions when a query arises about the processed component, this can, in principle at least, be resolved by taking a look at the original raw data, expressed as diffraction images. I established, with much appreciated help from CCDC, that since 2016 around 65 datasets in the CSD (Cambridge Structural Database) have appeared with such associated raw data. The problem is reconciling the two sets of data easily (the raw data is not stored on CSD), and one way of doing this is via the metadata associated with the datasets. In turn, if this metadata is suitably registered, one can query the metadata store for such associations, as was illustrated in the previous post on the topic. Here I explore the metadata records for five of these 65 sets to find out their properties, selected to illustrate the five data repositories that thus far host such data for compounds in the CSD.
Raw data: the evolution of FAIR data and crystallography.
Tuesday, March 1st, 2022
Scientific data in chemistry has come a long way in the last few decades. Originally entangled into scientific articles in the form of tables of numbers or diagrams, it was (partially) disentangled into supporting information when journals became electronic in the late 1990s.[cite]10.1021/acs.orglett.5b01700[/cite] The next phase was the introduction of data repositories in the early noughties. Now associated with innovative commercial companies such as Figshare and later the non-commercial Zenodo, such repositories have also spread to institutional forms, such as the earlier SPECTRa project of 2006,[cite]10.1021/ci7004737[/cite] which is still evolving.[cite]10.1186/s13321-017-0190-6[/cite] Perhaps the best known, and certainly one of the oldest, examples of curated structural data in chemistry is the CCDC (Cambridge Crystallographic Data Centre) CSD (Cambridge Structural Database), which has been operating for more than 55 years now, since before the online era! Curation here is the important context, since there you will find crystal diffraction data which has been refined into a structural model, firstly by the authors reporting the structure and then by the CSD, who, amongst other operations, validate the associated data using a utility called CheckCIF.[cite]10.1107/s090744490804362x[/cite] What is perhaps not realised by most users of this data source is that the original or “raw” data, as obtained from an X-ray diffractometer and from which the CSD data is derived, is not actually available from the CSD. This primary form of crystallographic data is the topic of this post.
Data base or Data repository? – A brief and very selective history of data management in chemistry.
Wednesday, January 26th, 2022
Way back in the late 1980s or so, research groups in chemistry started to replace the filing of their paper-based research data by storing it in an easily retrievable digital form. This required a computer database, and initially these were accessible only on specific dedicated computers in the laboratory. From the 1990s onwards they gradually became accessible online, so that more than one person could use them in different locations. At least where I worked, the infrastructures‡ to set up such databases were mostly not then available as part of the standard research provisions and so had to be installed and maintained by the group itself. The database software took many different forms, and it was not uncommon for each group in a department to come up with a different solution that suited its needs best. The result was a proliferation of largely non-interoperable solutions which did not communicate with each other. Each database had to be searched locally, and there could be ten or more such resources in a department. The knowledge of how the system operated also often resided in just one person, and tended to evaporate when this guru left the group.
Quantum chemistry interoperability (library): another step towards FAIR data.
Saturday, January 1st, 2022
To be FAIR, data has to be not only Findable and Accessible, but straightforwardly Interoperable. One of the best examples of interoperability in chemistry comes from the domain of quantum chemistry. This strives to describe a molecule by its electron density distribution, from which many interesting properties can then be computed. The process is split into two parts:
First came Molnupiravir – now there is Paxlovid as a SARS-CoV-2 protease inhibitor. An NCI analysis of the ligand.
Saturday, November 13th, 2021
Earlier this year, Molnupiravir hit the headlines as a promising antiviral drug. This is now followed by Paxlovid, which is the first small molecule to be aimed by design at the SARS-CoV-2 protein and which is reported as greatly reducing the risk of hospitalization or death when given within three days of symptoms appearing in high risk patients.
A comparison of searches based on metadata records from three (update: five) research repositories.
Tuesday, September 28th, 2021
In the previous blog post, I looked at the metadata records registered with DataCite for some chemical computational modelling files as published in three different repositories. Here I take it one stage further, by looking at how searches of the DataCite metadata store for three particular values of the metadata associated with this dataset compare.
A comparison of descriptive metadata across different data repositories.
Tuesday, September 28th, 2021
The number of repositories which accept research data across a wide spectrum of disciplines is on the up. Here I report the results of an experiment in which chemical modelling data was deposited in six such repositories, comparing the richness of the metadata describing the essential properties of the six depositions.
HPC Access and Metadata Portal (CHAMP).
Monday, September 13th, 2021
You might have noticed, if you have read any of my posts here, that many of them have been accompanied since 2006 by supporting calculations, normally based on density functional theory (DFT), and that these calculations are accompanied by a persistent identifier pointer‡ to a data repository publication. I have hitherto not gone into the detail here of the infrastructures required to do this sort of thing, but recently one of the two components has been updated to V2, after being at V1 for some fourteen years,[cite]10.1021/ci500302p[/cite] and this provides a timely opportunity to describe the system a little more.
Octopus publishing: dis-assembling the research article into eight components.
Friday, August 13th, 2021
In 2011, I suggested that the standard monolith that is the conventional scientific article could be broken down into two separate but interlinked components, being the story or narrative of the article and the data on which the story is based. Later, in 2018, the bibliography in the form of open citations was added as a distinct third component.[cite]10.1038/d41586-018-00104-7[/cite] Here I discuss an approach that has taken this even further, breaking the article down into as many as eight components and described as “Octopus publishing” for obvious reasons. These are:
Room-temperature superconductivity in a carbonaceous sulfur hydride!
Saturday, October 17th, 2020
The title of this post indicates the exciting prospect that a method of producing a room temperature superconductor has finally been achieved[cite]10.1038/s41586-020-2801-z[/cite]. This is only possible at enormous pressures, however: >267 gigapascals (GPa), or roughly 2.6 million atmospheres.