I do go on rather a lot about enabling or hyper-activating[cite]10.1039/P29950000007[/cite] data. So do others[cite]10.1038/nj7461-243a[/cite]. Why is sharing data important?
- Reproducibility is a cornerstone of science.
- To achieve this, it is important that scientific research be open and transparent.
- Openly available research data is central to achieving this. It is estimated that less than 20% of the data collected in chemistry is made available in any open manner.
- RCUK (the UK research councils) wish to see increased transparency of publicly funded research and greater availability of its outputs.‡
But it’s not all hot air, honestly. Peter Murray-Rust and I had started out on a journey to improve reproducibility, openness and transparency in (inter alia) scientific publishing in 1994. In 2001 we published an example of a data-rich article[cite]10.1039/b008780g[/cite] based on CML, and by 2004 the concept had evolved into something Peter termed a datument[cite]10.1186/1758-2946-5-6[/cite]. Some forty such have now been crafted.[cite]10.6084/m9.figshare.797481[/cite]
In 2009, the journal Nature Chemistry was starting up, and I approached them with the idea of an interactive data exploratorium on the premise that a new journal might be receptive to new ways of presenting science. It was accepted and published[cite]10.1038/nchem.373[/cite] and was followed in 2010 by a second variation.[cite]10.1038/NCHEM.596[/cite] In both cases, these activated-figures were sent to the journal as part of the submission process, and hosted by them (they still are). You can even access them without a subscription to the journal!
Move on to 2012. David Scheschkewitz had some very exciting silicon chemistry to report, we collaborated on some computational modelling, and we sent the resulting article to Nature Chemistry for publication. This included the usual interactive table reporting the modelling and its data. However, it transpired that the production workflows for Nature Chemistry had been streamlined and I was informed that interactive tables could no longer be accepted. This time, we (i.e. the authors) would have to solve the issue of how to host and present the data ourselves.
I was very keen that this table be treated with equal weight to the article itself (citable in its own right) and that it not be downgraded to supporting information (ESI). My objection to ESI is that it is often poorly structured by authors, i.e. it is not prepared in a form which allows the data to be re-used, either by a perceptive human, or a logical machine. As a result it is often given little attention by referees (although bloggers seem to do a far better job) and furthermore can end up being lost behind a pay wall (the two Nature Chem interactive objects noted above can be openly accessed, but only if you know that they exist). So I determined that:
- The table should be immediately accessible to non-experts, and not through any convoluted process of downloading a file, expanding it and finding the correct document within the resulting fileset to view in the correct program, which is how normal ESI is handled.
- The table and the data contained within it should be capable of acting as a scientific tool, forming what could be the starting point for a new investigation if appropriate.
To solve this issue, some lateral and quick thinking was needed. The solution was a two-component model in which the original article is treated as a “narrative”, intertwingled with a second, but nevertheless distinct component, the “data”. This data would follow the principles of the Amsterdam Manifesto; it would itself be citable. The two components would become symbiotes (a datument). The narrative[cite]10.1038/nchem.1751[/cite] could cite this data and the data could back-link to the narrative. The data would inherit trust (i.e. peer review) from that applied to the narrative and the latter would inherit a date stamp and integrity from the data host (in this case Figshare[cite]10.6084/m9.figshare.744825[/cite]).*
The data itself can have two layers. The first is a presentation layer[cite]10.6084/m9.figshare.744825[/cite]¶, built with a combination of software (Jmol or JSmol for chemistry), which is used to invoke the second layer, the “raw” data. That raw data is itself citable[cite]http://dx.doi.org/10042/20409[/cite] (this is just a single example, resident as it happens on a different repository). The reader can choose to use just the presentation layer or the underlying data.
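As a rough illustration of the two-component model, the sketch below treats the narrative and the data as separate citable records that point at each other. The class, field and function names are invented for this example and do not correspond to any Figshare or publisher schema; only the two DOIs are taken from the citations above.

```python
from dataclasses import dataclass, field


@dataclass
class CitableObject:
    """Anything with a DOI: a narrative article or a data deposition."""
    doi: str
    kind: str  # "narrative" or "data" (labels invented for this sketch)
    cites: list = field(default_factory=list)

    def url(self) -> str:
        # DOIs resolve through the public doi.org proxy.
        return f"https://doi.org/{self.doi}"


def link(narrative: CitableObject, data: CitableObject) -> None:
    """Make the narrative cite the data and the data back-link to the narrative."""
    if data.doi not in narrative.cites:
        narrative.cites.append(data.doi)
    if narrative.doi not in data.cites:
        data.cites.append(narrative.doi)


narrative = CitableObject("10.1038/nchem.1751", "narrative")
data = CitableObject("10.6084/m9.figshare.744825", "data")
link(narrative, data)
print(narrative.url(), "cites", narrative.cites)
print(data.url(), "cites", data.cites)
```

Keeping the two records separate is the point: each can live with a different publisher, and the link between them is nothing more than a pair of DOIs.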
The data object can be embedded in other pages; here it is below. The data sources for this table are themselves citable[cite]10.6084/m9.figshare.96410[/cite].
What are the advantages of such an approach? (the “what’s in it for me” question often asked by research students and their supervisors)
- Each of the components is held in an environment optimised for it and so can be presented to full advantage.
- The conventional narrative publisher does not necessarily also have to develop their own infrastructures for handling the data. They can choose to devolve that task to a “data publisher”.
- The data publisher (Figshare in this case) makes the data open. One does not need an institutional subscription to access it.
- “Added value” for each component can be done separately. Thus most narrative publishers would not necessarily wish to develop infrastructures for validating the data or subsequently mining such “big data”. Indeed data mining of journals is prohibited by many publishers; it is either simply not possible or rendered so administratively difficult as to be impractical.
- Whilst a narrative article must clearly exist as a single instance (otherwise the authors would be accused of plagiarism), data can have multiple instances. Indeed, there exist protocols (SWORD) for moving data from one repository to another as the need arises. Publishing the same data in two or more locations is not currently considered plagiarism!
- The data component can be published as part of an article or say as part of a PhD thesis. This way, the creator of the data gets the advantages not of a date stamp associated with a narrative citation but of a much earlier stamp associated more closely with the actual creation of the data. That could easily and usefully resolve many disputes about who discovered what first, leaving the other issue of who interpreted what first to the narrative. I should mention that it is perfectly possible to “embargo” the data deposition so that it only becomes public when the narrative does (although you may choose not to do this).
- A data deposition cannot be modified, but a new version (which bidirectionally links back to the old one) can be published if say more data is collected at a future date.
- A whole infrastructure devoted just to enhancing the cited data can evolve; one that is unlikely to do so if the narrative publishers are the only stakeholders. For example, synthetic procedural data can be tagged using the excellent chemical tagger.
- It is relatively simple (=cheap) to build a pre-processor for publishing data, which for a research student can act as an electronic laboratory notebook, holding meta-data about the deposited/published data and the handles (doi) associated with each deposition. I have been using such an environment now for about seven years as the e-notebook for this blog for example. Thus the task of preparing figures and tables for a publication (or a blog post) is greatly facilitated. The same system is also used by research students and undergraduates for their lab work.
- I have noted previously how e.g. Google Scholar identifies data citations along with article citations in constructing an individual research profile. A researcher could become known for their published data as well as their published narratives. Indeed, it seems likely that the person who acquires and publishes the data, i.e. the research student, would then get accolades directly rather than them all accruing to their supervisor.
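The point in the list above about versioning (a deposition cannot be modified, but a new version can link back to the old one) can be sketched as a small data structure. This is only a toy model under my own assumptions: the `Deposition` class, its fields and the identifier are hypothetical and say nothing about how Figshare or DSpace store versions internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Deposition:
    """One published version of a data deposition (hypothetical model)."""
    identifier: str
    version: int
    files: tuple  # a tuple, so the published file list cannot be edited in place
    previous: Optional["Deposition"] = None
    next: Optional["Deposition"] = None


def publish_new_version(old: Deposition, extra_files: tuple) -> Deposition:
    """Leave the old version's files untouched and link old and new both ways."""
    new = Deposition(old.identifier, old.version + 1,
                     old.files + extra_files, previous=old)
    old.next = new  # only the forward link is added to the old version
    return new


v1 = Deposition("example-deposition", 1, ("run1.cml",))
v2 = publish_new_version(v1, ("run2.cml",))
print(v1.version, v1.files)
print(v2.version, v2.files)
```

The earlier version keeps its own file list and date stamp, which is what allows it to serve as evidence of when the data were first created.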
But what can you, gentle reader of this blog, do to help? Well, ask if your institution already has, or plans to create, a data repository. It can be local (we use DSpace) or “in-the-cloud” (e.g. Figshare). If not, ask why not! And if you are planning to submit an article for publication in the near future, ponder how you might better share its data.
‡ As first circulated on 28 April, 2011. See
http://www.epsrc.ac.uk/about/standards/researchdata/Pages/policyframework.aspx
†The example given at the start of this post[cite]10.1038/nchem.1751[/cite] contains only one table processed in this manner; the actual synthetic procedures are still held in more conventional SI.
*This blog uses the excellent Kcite plugin to manage citations.
¶The good folks at Figshare were extremely helpful in converting this deposition into an interactive presentation. Thanks guys!
Tags: chemical tagger, data mining, datument, David Scheschkewitz, e-notebook, Google, opendata, Peter Murray-Rust, pre-processor, researcher, scientific tool, supervisor, United Kingdom
News releases on this topic can now be seen at
http://www3.imperial.ac.uk/newsandeventspggrp/imperialcollege/newssummary/news_13-9-2013-19-9-5
http://www.uni-saarland.de/nc/aktuelles/artikel/nr/8985.html
http://figshare.com/blog/Interactive_content_in_scholarly_publishing/99
I include here a shot of the Figshare page for the data noted above, illustrating how the data back-cites the narrative.
Hi Henry
Can you elaborate a bit on how the data are loaded?
Is Figshare hosting the molecule files and the Jmol applet files?
Angel,
There are two ways of uploading data.
1. Using Figshare’s own app (and here you can drag-n-drop multiple files if you wish). These files are not subjected to any analysis or validation; they are just flat files, and they can be in any format. The user then has to supply all the tags/metadata to go with these files, but again there is no check that the metadata actually is relevant to the files.
2. Using the Figshare API (which in part was produced because users like us really pressed for it), one can build a front-end (in our case using PHP) which can perform some of the logic missing from mechanism 1 above. This front end for us in fact started as a system to submit jobs to an HPC queue, and to collect the outputs. These outputs are then scripted to run through e.g. OpenBabel (and we are also thinking of putting them through the CDK), an operation which automatically generates lots of directly relevant metadata and validation. It also creates CML files. This front end then collects all the various outputs and metadata and produces a so-called fileset, which using the Figshare (or indeed DSpace) APIs, is injected into the repository. This last operation therefore boils down to a single click of a publish button (initially as a private deposition, viewable only by enrolled collaborators, and then after a second click, it is made public to be viewable by all).
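To make mechanism 2 a little more concrete, here is a minimal sketch of the "collect outputs into a fileset, publish privately, then make public" sequence. Our actual front end is written in PHP and talks to the repository APIs; this Python version deliberately makes no API calls at all, and the function names, the manifest layout and the `status` field are assumptions made purely for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def build_fileset(directory: Path, metadata: dict) -> dict:
    """Collect every file in `directory` into a manifest with checksums."""
    files = []
    for path in sorted(directory.iterdir()):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            files.append({"name": path.name, "sha256": digest})
    # New filesets start private, mirroring the first click described above.
    return {"metadata": metadata, "files": files, "status": "private"}


def make_public(fileset: dict) -> dict:
    """Second click: return a copy of the fileset marked as public."""
    return {**fileset, "status": "public"}


with tempfile.TemporaryDirectory() as tmp:
    job_dir = Path(tmp)
    # Placeholder files standing in for real calculation outputs.
    (job_dir / "output.log").write_text("placeholder calculation output\n")
    (job_dir / "molecule.cml").write_text("<molecule/>\n")
    fileset = build_fileset(job_dir, {"title": "example job"})
    published = make_public(fileset)
    print(json.dumps(published, indent=2))
```

In a real front end the manifest would be handed to the repository's API at each of the two steps; that part is omitted here rather than guessed at.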
We then branched out to create a slightly separate PHP-scripted environment which does not initially submit to an HPC queue. This is now used to upload other types of data, such as spectroscopy and crystallography. Here, metadata has to be generated by trawling through all the uploaded files looking for anything that OpenBabel can convert into an InChI identifier. Minimally, that source can be e.g. a ChemDraw file.
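The trawling step might look something like the sketch below. The conversion itself is passed in as a function, so that nothing here pretends to be the OpenBabel interface: `fake_converter` is a stand-in I have made up for the example, and in a real system it would be replaced by a wrapper around whatever converter is installed.

```python
from pathlib import Path
from typing import Callable, Optional


def collect_identifiers(paths, to_inchi: Callable[[Path], Optional[str]]) -> dict:
    """Try every uploaded file and keep the identifiers that could be generated."""
    found = {}
    for path in paths:
        try:
            inchi = to_inchi(Path(path))
        except Exception:
            inchi = None  # unreadable or non-chemical file: skip it
        if inchi:
            found[str(path)] = inchi
    return found


def fake_converter(path: Path) -> Optional[str]:
    """Stand-in for a real converter: treats any .mol file as methane."""
    if path.suffix == ".mol":
        return "InChI=1S/CH4/h1H4"  # the InChI for methane
    return None


uploads = ["spectrum.jdx", "methane.mol", "notes.txt"]
identifiers = collect_identifiers(uploads, fake_converter)
print(identifiers)
```

Files that yield no identifier are simply skipped, so a fileset needs only one convertible file (such as a ChemDraw file) to pick up chemically meaningful metadata.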
The product of those operations is still a flat fileset at Figshare. Creating something which involves Jmol or JSmol is still a “human endeavour”. The presentation is hand-crafted as an HTML file which invokes Jmol or JSmol as appropriate, and then it and all the associated script and local data files are zipped up and uploaded using mechanism 1 and the Figshare uploader. There, the good folks at Figshare change the attributes of the files (most particularly those of e.g. the root document index.html) so that Figshare marks it as “view in browser” rather than “download”. This is still very much an upon-request operation, and very much limited to pilot-collaborators; it is NOT yet a routine operation (but demand from users may accelerate it becoming so. Hint!)
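The zipping-up step is easy to script. The sketch below, which uses only the Python standard library, packs a presentation directory so that index.html sits at the root of the archive; the directory layout and file names are placeholders, and the upload to Figshare itself is not shown.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_presentation(source_dir: Path, archive_path: Path) -> list:
    """Zip a hand-crafted presentation so that index.html sits at the archive root."""
    if not (source_dir / "index.html").is_file():
        raise FileNotFoundError("expected an index.html root document")
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to source_dir, so index.html is at the top level.
                archive.write(path, path.relative_to(source_dir).as_posix())
        return archive.namelist()


with tempfile.TemporaryDirectory() as tmp:
    site = Path(tmp) / "site"
    (site / "data").mkdir(parents=True)
    # Placeholder content standing in for the real HTML and data files.
    (site / "index.html").write_text("<html><body>placeholder</body></html>\n")
    (site / "data" / "molecule.cml").write_text("<molecule/>\n")
    names = zip_presentation(site, Path(tmp) / "presentation.zip")
    print(names)
```

Checking for index.html before zipping catches the most likely mistake early, since it is that root document which has to be marked as “view in browser”.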
Some more examples of this can be found in my talk on the topic.
We are about to try a new way of doing it, whereby in essence a simple JSmol template is uploaded, and it is then populated by directly referencing the Figshare doi. It is our hope that in the “load file” JSmol command, it can in fact be simply “load doi?identifier” where the doi suffix ?identifier resolves to a local path on the Figshare repository, and retrieves the file directly from there. This has the exciting prospect that this template can then be directly produced by e.g. JSmol itself using the “export to web” feature, and that depositions such as http://doi.org/10.6084/m9.figshare.744825 could be largely automated.
Any help that the “Jmol community” could give to this last project would be of course enormously appreciated. I hope that we might have a demonstrator ready in a week or so, but it will still be far from a “finished product”.
As you can see, it is my hope that once the above is made easy to achieve, the standard PhD thesis or research article can simply transclude (re-use) such objects as required, and then focus only on the narrative.
The Royal Society of Chemistry has issued a press release in which they announce their intention to offer, inter alia, a data repository of their own.
The momentum is certainly gathering pace!