There are many motivations for performing scientific research. One of these is the desire to advance public scientific knowledge.
This is a difficult concept to quantify or even qualitatively assess. One can try to use literature citations and impact factors, but these capture only a small fraction of the true scientific impact. For example, one formal citation of our solubility dataset doesn’t represent the 100,000 anonymous solubility queries made directly to our database. And for each of these queries the actual impact will depend on exactly how the information was used. Egon Willighagen has identified this as a problem for the Chemistry Development Kit (CDK) as well: many more people use the CDK than is reflected by the number of citations to the original paper.
There are a few of us who believe that curating chemistry data is a high impact activity. Antony Williams spends a considerable amount of time on this activity and frequently uncovers very serious errors from a number of data sources. Andrew Lang and I have put in a similar effort in collecting and curating solubility measurements openly – and recently (with Antony) we have been doing the same for melting points.
Although attempting to estimate the total impact of the curation activity isn’t really practical, we can look at a specific and representative example to capture the scope.
I recently exposed the situation with the melting point measurements of 4-benzyltoluene. In brief, the literature provided contradictory information that could not be resolved without performing an experiment. Although an exact measurement was not found, a limit was determined that ruled out all measurements except for one.
Ironically it turns out that the melting point of this compound is its most important property for industrial use! Derivatives of diphenylmethane were sought out to replace PCBs as electrical insulating oils for capacitors because of toxicity concerns. As described in this patent (US5134761), for this application one requires the oil to remain liquid down to -50 C. Another key requirement is the ability to absorb hydrogen gas liberated at the electrode surface (a solubility property). Since this is optimal for smaller alkyl groups on the rings, it places benzyltoluene isomers at the focal point of research for this application.
The patent states: “According to references, the melting points of the position isomers of benzyltoluenes are as follows…” but does not make a specific reference. However, by comparing the numbers with other sources we can presume that the reference is the Lemneck1954 paper I discussed previously.
The patent then uses these melting points to calculate the melting behavior of mixtures of these isomers, as obtained without further purification from a Friedel-Crafts reaction.
If our results are correct and the melting point of 4-benzyltoluene is not +4.6 C but well below -15 C, then the calculated properties in the patent may be significantly in error as well. With the information available thus far from our experiments (UC-EXP266), we think it is unlikely that the +4.6 C value can be correct because we observed no solidification after 2 days at -15 C. The patent reports that solidification of some viscous mixtures took up to a full week but we did not observe an appreciable increase in viscosity for 4-benzyltoluene at -15 C. But in order to be sure we will first freeze the sample again below -40 C and let it warm up to -15 C in the freezer and confirm that it melts completely.
It is in light of this analysis that I make the case that open curation of melting point data is likely to be a high impact activity relative to the amount of time required to perform it. The problem is that errors such as these cascade through the scientific record and likely retard scientific progress by causing confusion and wasted effort. Consider the total cost in terms of research and legal fees for just one patent. As I discussed previously, consider the effect of compromised and contradictory data now known to exist within training sets on the pace of developing reliable melting point models (cascading down to solubility models dependent upon melting point predictions or measurements – and ultimately cascading to the efficiency of drug design).
It is important to note that the benefits of curation would be greatly diminished without the component of transparency. We are not claiming to provide a “trusted source” of melting point data. There is no such thing – and operating under the illusion of the trusted source model has resulted in the mess we are in now – with multiple melting point values for the same compound cascading and multiplying to different databases (a good and still unresolved example is benzylamine).
What we are doing is reporting all the sources we can find and marking some sources as DONOTUSE, with an explanation, so they are not included in the calculation of the average. We never delete data, so users can make informed choices and are not in the position of having to trust our judgement. If someone does not agree with me that failure to freeze after 2 days at -15 C rules out the +4.6 C value for the melting point of 4-benzyltoluene, then they are free to use it.
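To make the marking scheme concrete, here is a minimal Python sketch of how flagged records might be kept alongside the others while being left out of the average. This is not our actual spreadsheet or code: the `Measurement` class, the `curated_average` function, the source names and all of the numeric values are invented purely for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class Measurement:
    """One reported melting point (hypothetical record layout)."""
    source: str
    value_c: float                 # melting point in degrees Celsius
    do_not_use: bool = False       # DONOTUSE flag
    reason: Optional[str] = None   # explanation recorded with the flag


def curated_average(measurements: List[Measurement]) -> Optional[float]:
    """Average only the values that are not flagged DONOTUSE.

    Flagged records stay in the list; they are skipped, not deleted.
    Returns None when no usable value remains.
    """
    usable = [m.value_c for m in measurements if not m.do_not_use]
    return mean(usable) if usable else None


# Invented values for an unnamed compound, for illustration only.
records = [
    Measurement("Source A", 10.0, do_not_use=True,
                reason="contradicted by a freezing experiment"),
    Measurement("Source B", -30.0),
    Measurement("Source C", -28.0),
]

print(curated_average(records))   # average over Source B and Source C only
print(len(records))               # all three records are still present
```

A reader who disagrees with a flag can simply clear `do_not_use` on that record and recompute, which is the point of keeping every value together with its explanation.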
Under a trusted source model, all values within a collection are treated as equally valid. In the transparency model not all values are equal: we are justifiably more confident in a melting point value near -114 C for ethanol than in a melting point with a single source (like this compound).
And finally, an important factor for having an impact on science is discoverability. It is likely that someone doing research involving the melting behavior of 4-benzyltoluene would perform at least a quick Google search. What they are likely to find is not just a simple number without provenance but rather a collection of results capturing the full subtlety of the situation under discussion. This is a natural outcome of working transparently.
Recommendation and review posted by G. Smith