

Category Archives: Chemistry

General Transparent Solubility Prediction using Abraham Descriptors

Making solubility estimates for most organic compounds in a wide range of solvents freely available has long been a central objective of the Open Notebook Science Solubility Challenge. With current expertise and technology, obtaining a solubility estimate should be as easy as getting driving directions off the web.

Obviously this won't be attained purely by exhaustive measurements, although we have been focused on strategic measurements over the past two years. In parallel, we have been constantly evaluating the various solubility models out there for suitability.
Although there are several solubility models available for non-aqueous solvents, our additional requirement for transparent model building has proved surprisingly difficult to satisfy.
From this search, the Abraham solubility model [Abraham2009] floated to the top, an important factor being that Abraham has made available extensive compilations of descriptors for solutes and solvents. In addition, the algorithms used to convert solubility measurements to Abraham descriptors (requiring a minimum of 5 different solvents per solute) have allowed us to generate our own Abraham descriptors automatically, simply by recording new measurements in our SolSum Google Spreadsheet. These can be obtained in real time as well.
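Since the rest of the post leans on the Abraham model, a minimal sketch of its linear solvation-energy form may be useful. Note that the descriptor and coefficient values below are illustrative stand-ins, not Abraham's published numbers; real values come from his compilations.

```python
# Sketch of the Abraham solvation equation: log10(S) is a linear combination
# of five solute descriptors (E, S, A, B, V) weighted by solvent-specific
# coefficients (c, e, s, a, b, v).

def abraham_log_solubility(solute, solvent):
    """log10(S) = c + e*E + s*S + a*A + b*B + v*V for one solute/solvent pair."""
    return (solvent["c"]
            + solvent["e"] * solute["E"]
            + solvent["s"] * solute["S"]
            + solvent["a"] * solute["A"]
            + solvent["b"] * solute["B"]
            + solvent["v"] * solute["V"])

# Hypothetical numbers for illustration only:
benzoic_acid = {"E": 0.73, "S": 0.90, "A": 0.59, "B": 0.40, "V": 0.932}
methanol = {"c": 0.28, "e": -0.06, "s": -0.01, "a": 0.08, "b": -3.39, "v": 3.70}

print(round(abraham_log_solubility(benzoic_acid, methanol), 3))  # → 2.367
```

Because the equation is linear in the descriptors, fitting a solute's five descriptors only requires measurements in a handful of solvents with known coefficients - hence the minimum of 5 solvents per solute mentioned above.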
This approach permitted us to provide predictions for a limited number of solutes in a wide range of solvents, and we have included these predictions in the past two editions (2nd and 3rd) of the ONS Challenge Solubility Book.
Coming at the problem from a different approach, Andrew Lang has also been trying to predict solubility using only open molecular descriptors, mainly relying on the CDK. Since our most commonly used solvent has been methanol, Andy recently generated a web service to predict solubility in that solvent.
By combining these two approaches, Andy has now created a modeling system that can not only generally predict solubility in a wide range (70+) of solvents - but it can also provide related data that can be used for modeling other phenomena such as intestinal absorption of a drug or crossing the blood-brain barrier.[Stovall 2007]
The idea is to use a Random Forest approach with freely available descriptors to predict the Abraham descriptors for any solute. A separate service then generates predicted solubilities for a wide range of solvents based on these Abraham descriptors. I'm using the term "freely available" rather than "open" because - although the CDK descriptors and VCCLab services are open - the model requires 2 descriptors that are only available from ChemSpider (ultimately from ACD/Labs).
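A toy sketch of the descriptor-prediction step, assuming scikit-learn's RandomForestRegressor as the Random Forest implementation and random numbers as stand-ins for the CDK/VCCLab molecular descriptors (the actual model and dataset are linked later in the post):

```python
# Map open molecular descriptors -> the five Abraham solute descriptors
# with one multi-output Random Forest. The data here are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((100, 8))        # open molecular descriptors per solute
Y = X @ rng.random((8, 5))      # 5 Abraham descriptors: E, S, A, B, V

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, Y)

new_solute = rng.random((1, 8))
E, S, A, B, V = model.predict(new_solute)[0]
print([round(v, 2) for v in (E, S, A, B, V)])
```

The predicted E, S, A, B, V values would then be fed to the second service, which combines them with solvent coefficients to produce solubility estimates.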
Here is an example with benzoic acid. As long as the common name resolves to a single entry on ChemSpider, it is enough to enter it and it automatically populates the rest of the fields, which are then used by the service to generate the Abraham descriptors.
Hitting the prediction link above will automatically populate the second service and generate predicted solubilities for over 70 solvents.
This approach of allowing people to access these components separately can be useful. It can be instructive to manually play with the Abraham descriptors directly to see how predicted solubilities are affected. There are also situations where one has experimentally determined Abraham descriptors for a solute and wants to bypass the descriptor prediction step.
However, for those who prefer to cut to the chase, a convenient web service is available where the common name (or SMILES) of the solute is entered and the list of available solvents appears as a drop down menu.
Now here is where I think the real payoff comes for accelerating science with openness. Andy has also created a web service that returns the predicted solubility (in molar) as a number, given common names (or SMILES) for the solute and solvent in the URL. For example click this for benzoic acid in methanol. The advantage here is that solubility prediction can be easily integrated as a web service call from intuitive interfaces such as a Google Spreadsheet to enable even non-programmers to make use of the data. Notice that the web service provided in the fourth column for the average of measured solubility values enables an easy way to explore the accuracy of specific predictions.
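The pattern is simple enough to sketch. The host and parameter names below are illustrative only - the post doesn't spell out the actual endpoint - but the idea of building a parameterized URL and letting a spreadsheet fetch it is the same:

```python
# Build a hypothetical solubility-prediction request URL from common names.
from urllib.parse import urlencode

def solubility_request_url(solute, solvent,
                           base="http://example.org/solubility"):
    """Return a URL whose response would be a single predicted molarity."""
    return base + "?" + urlencode({"solute": solute, "solvent": solvent})

url = solubility_request_url("benzoic acid", "methanol")
print(url)  # → http://example.org/solubility?solute=benzoic+acid&solvent=methanol
```

A Google Spreadsheet cell could then wrap such a URL in a fetch function (e.g. IMPORTDATA) to pull the predicted value as a plain number, with no programming required.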
Such web services could also be integrated with data from ChemSpider or custom systems. If those who use these services feed back their processed data to the open web, it could take us a step closer to automated reaction design. For example consider the custom application to select solvents for the Ugi reaction. Model builders could also use the web services for predicted and measured solubility directly.
A while back we explored using Taverna for MyExperiment to create virtual libraries of SMILES. Unfortunately we ran into issues with getting the applications developed on Macs to run on our PCs. This might be worth revisiting as a means of filtering virtual libraries through different thresholds of predicted solubility.
Andy has described his model in detail in a fully transparent way - the model itself, how it was generated and the entire dataset can be found here. We would welcome improvements of the model as well as completely new models based on our dataset using only freely available tools.
It should be noted that when I use the term "general" it refers to the ability of the model to generate a number for most compounds listed in ChemSpider. Obviously, compounds that most closely resemble the training set are more likely to generate better estimates. Because of our synthetic objectives using the Ugi reaction, we have mainly focused on collecting solubility data for carboxylic acids, aldehydes and amides, either from new measurements or from the literature.
Another important point concerns the main intended application of the model: organic synthesis. Generally the range of interest for such applications is about 0.01 - 3M. This might be very different for other applications - such as the aqueous solubility of a drug, where distinctions between much lower solubilities may be important.
For a typical organic synthesis, a solubility of 0.001M or 0.005M will probably translate to effectively insoluble. This might be a desired property for a product intended to be isolated by filtration. On the other end of the scale, knowing that a solubility is either 4M or 6M will not usually have an impact on reaction design. It is enough to know that a reactant will have good solubility in a particular solvent.
Given the above considerations for intended applications and the likelihood that the current model is far from optimized, the predictions should be used cautiously. We suggest that the model is best used as a "flagging device". For example, if a reaction is to be carried out at 0.5M, one may place a threshold at 0.4M for the predicted values of reactants during solvent selection, with the recognition that a predicted 0.4M may be an actual 0.55M. A similar threshold approach can be used for the product, where in this case the lowest solubility is desired. A practical example of this is the shortlisting of solvent candidates for the Ugi reaction.
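The "flagging device" idea can be sketched in a few lines; the solvent names and solubility numbers below are made up for illustration:

```python
# Shortlist solvents where every reactant clears a predicted-solubility
# threshold (in molar). Predictions are keyed by solvent, then by solute.

def shortlist_solvents(predictions, reactants, threshold=0.4):
    """Keep solvents where all reactants' predicted solubilities >= threshold."""
    return sorted(
        solvent for solvent, by_solute in predictions.items()
        if all(by_solute.get(r, 0.0) >= threshold for r in reactants)
    )

predictions = {
    "methanol": {"amine": 1.2, "aldehyde": 0.9, "acid": 0.5},
    "toluene":  {"amine": 0.8, "aldehyde": 0.3, "acid": 0.1},
    "ethanol":  {"amine": 1.0, "aldehyde": 0.6, "acid": 0.45},
}
print(shortlist_solvents(predictions, ["amine", "aldehyde", "acid"]))
# → ['ethanol', 'methanol']
```

The same helper could be run with a low threshold against the product's predicted solubility to flag solvents that favor isolation by precipitation.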
Another example of flagging involves identifying the outliers in the model. These can be inspected for experimental errors and possibly remeasured. Alternatively outliers may shed light on the limitations of the model. For example we have found that the solubility of solutes with melting points near room temperature can be greatly underestimated by the current model. This may be an opportunity to develop other models which incorporate melting point or enthalpy of fusion.[Rohani 2008]
Although it is possible that better models and more data will improve the accuracy of the predictions, this can be true only if the training set is accurate enough. Based on conversations I've had with researchers who deal with solubility, reading modeling papers and our own experience with the ONS Challenge, I am starting to suspect that much of the available data just isn't accurate enough for high precision modeling. Models using data from the literature are especially vulnerable, I think. Take a look at this unsettling comparison between new measurements and literature values (not to mention the model) for common compounds.[Loftsson 2006] Here is a subset:
I have also made the point in detail for the aqueous solubility of EGCG. Could this be the reason that so many different solubility models using different physical chemistry principles have evolved and continue to co-exist?
The situation reminds me a lot of the discussions taking place in the molecular docking community.[Bissantz 2010] The differences in calculated binding energies are often small in comparison with the uncertainties involved. But docking can still be used as one tool among others to find drug candidates by flagging a collection of compounds above a certain threshold binding energy.

Resveratrol Thesis on Reaction Attempts

A few days ago Andrew Lang suggested to Dustin Sprouse that he submit his thesis to the Reaction Attempts database. Like many undergraduates, Dustin put in a lot of time and effort doing experiments and writing up his results but didn't have quite enough time to obtain all the results that would have been required for a traditional publication.

A thesis is an unusual document within the context of scientific communication. Unlike a peer reviewed paper, it may contain a large number of "failed experiments" and a substantial amount of speculation. Although it is not quite as detailed as a lab notebook, there is often plenty of raw data and detail about how failed or ambiguous experiments proceeded.
In Dustin's case we felt that there was enough information provided to include his thesis in Reaction Attempts. In addition, his thesis was accepted by Nature Precedings, thus providing a convenient means of citation.
The first component of the Reaction Attempts project is to quickly abstract the most basic information from synthetic organic chemistry reactions. This includes the ChemSpiderIDs and SMILES of the reactants and target products, and brief notes about conditions and outcomes. We are especially interested in failed or ambiguous experiments because these have almost no chance of being communicated and indexed in the traditional systems. When attempting to carry out a reaction, it can be just as useful to know what doesn't work - and more specifically how it doesn't work.
The second component of the project is dissemination. Because the information is encoded semantically, it can be automatically converted to both human and machine readable formats.
One human-readable interface is a PDF book (also available as a hard copy), with the option of selecting reactions by listing the CSIDs of reactants in the URL. For example Dustin's reactions can be presented selectively here. We also have a Reaction Explorer, where reactants or products can be selected from a dropdown menu or via a substructure search.
We also provide live XML feeds so that others can create applications easily from machine readable data. For example, one could automatically assemble reaction chains, which becomes possible whenever we enter reactions from multi-step syntheses like Dustin's - based on the synthesis of resveratrol.
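The reaction-chaining idea is straightforward once the data are machine readable. The XML fragment below is an invented stand-in - the post doesn't show the actual feed schema - but the chaining logic (link reactions where one step's product is the next step's reactant) would be the same:

```python
# Chain reactions by following product -> reactant links in a feed.
import xml.etree.ElementTree as ET

feed = """<reactions>
  <reaction id="1"><reactant>CSID100</reactant><product>CSID200</product></reaction>
  <reaction id="2"><reactant>CSID200</reactant><product>CSID300</product></reaction>
</reactions>"""

root = ET.fromstring(feed)
by_reactant = {r.findtext("reactant"): r for r in root}

chain, node = [], "CSID100"
while node in by_reactant:
    r = by_reactant[node]
    chain.append(r.get("id"))
    node = r.findtext("product")  # follow the product into the next step
print(chain)  # → ['1', '2']
```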
I know that Peter Murray-Rust has been very active in automatically abstracting information from chemistry theses. It would be interesting to see how that approach would work for this thesis, especially with the failed experiments. Reducing a page or two of text into only the most salient bits of information manually required a level of judgement that I imagine would be tricky to do automatically.

Secrecy in Astronomy and the Open Science Ratchet

Probably because of the visibility of the GalaxyZoo project, several of my colleagues and I have been under the impression that astronomy is a somewhat more open field than chemistry or molecular biology. It was easy to rationalize such a position because patents are not an issue, as they clearly are in fields which rely more on invention than discovery. However, after reading "The Case for Pluto" by Alan Boyle, I am left with a much different impression.

This book does an excellent job of covering the recent debate over Pluto's designation as a true planet. A key trigger for this debate has been the discovery of dwarf planets with sizes very close to that of Pluto. However, these discoveries did not occur without controversy.

The story of the controversy regarding the discovery of Haumea is a particularly good example (starts on p. 108 of the book - a good summary also on Wikipedia). Starting in December 2004 Michael Brown at Caltech discovered a series of new dwarf planets. Instead of immediately reporting his team's discoveries, he worked in secrecy until July 20, 2005 when he posted an online abstract indicating the discoveries would be announced at a conference that September. However, on July 27, 2005 a Spanish team led by José Luis Ortiz Moreno filed a claim with the Minor Planet Center for priority in discovering one of these dwarf planets. This forced Brown's hand in disclosing his team's other discoveries within days - much sooner than he had anticipated.

Apparently this stirred up a great controversy in the community, and officially no discoverer was credited, although the Spanish team's telescope at Sierra Nevada Observatory was recognized as the location of the discovery. However, Brown was allowed to select the name Haumea for the dwarf planet.

Even though the Minor Planet Center accepted Moreno's submission, most reports seem to side with Brown. The main argument amounts to an accusation of academic fraud on Moreno's part, because he accessed public telescope logs and found some of Brown's data. It was as simple as Googling the identifier that Brown inserted in his public abstract.
If Moreno had hacked into a private computer from Brown's team I can understand fraud. But is it fraud to access public databases? We chemists do that all the time - reading abstracts from upcoming conferences to try to glean what our competitors are up to. That hasn't stopped anyone from submitting a paper or patent.
Secrecy only works if everyone competing follows the same rules. If there were a rule that planet discoveries must be made at conferences or by formal publication, this could not have happened. Moreno's submission to the Minor Planet Center should have been rejected if such a rule existed. If there is a rule that telescope logs should not be accessed, then why make them public and indexed on Google?
Now there may exist field specific conventions. I don't know what they are in the case of discoveries such as these but here is an interesting quote from Michael Brown's Wikipedia page:

When asked about this online activity, Ortiz responded with an email to Brown that suggested Brown was at fault for "hiding objects," and said that "the only reason why we are now exchanging e-mail is because you did not report your object."[3] Brown says that this statement by Ortiz contradicts the accepted scientific practice of analyzing one's research until one is satisfied that it is accurate, then submitting it to peer review prior to any public announcement. However, the MPC only needs precise enough orbit determination on the object in order to provide discovery credit, and Ortiz et al. not only provided the orbit, but "precovery" images of the body in 1957 plates.

It seems to me that there is a clash over what the conventions in the field are. Certainly the Minor Planet Center did not recognize the convention of peer review before public disclosure. They only required sufficient proof of the discovery.

One way to look at this story is that Moreno acted more openly than Brown by disclosing information before peer review. This action forced Brown to disclose scientific results much more quickly than he had anticipated.

In a sense this is a type of Open Science Ratchet. The actions of scientists that are most open set the pace for everyone else working on that particular project, regardless of their views on how secretive science should be.

Imagine how the scenario would have played out if one of the groups had used an Open Notebook. On December 28, 2004 everyone with a stake in the search for planets would have had the opportunity to know that a very significant find had been made. There were still details to work out - and the Brown group might not have been the first to do all the calculations to completely characterize the discovery. Certainly it would have affected what other researchers did - even if they were completely opposed to the concept of Open Science.

Essentially secrecy in this context is an all-or-nothing gamble. Everyone is free to not disclose their work until after peer reviewed publication. In some cases the discoverer will get full credit for the discovery and the complete analysis. But in other cases another group working in parallel will publish first and leave nothing to claim.
As scientists become more open, it is likely that their ability to claim sole priority for all aspects of a discovery will be reduced. However, they will retain priority for the observations and calculations that they made first.
The more open the science, the faster it happens. And because of the Open Science Ratchet, a few Open Scientists scattered across various fields could have a larger hand than expected in speeding up science.

Methanol Solubility Prediction Model 4 for Ugi reactions in the literature

Since non-aqueous solubility measurements have not become part of the standard characterization of organic compounds, it is not surprising that all the data we have for Ugi products originate from measurements that we made on our own compounds.

Since methanol is our most common solvent, Andrew Lang has combined our measurements with values from the literature for a range of compounds, including our Ugi products, to generate a web service that returns a predicted solubility for a submitted SMILES string. The model (Model 4) was derived from a Random Forest algorithm, using molecular descriptors supplied by the CDK and VCC.

It would be nice to be able to test the model's ability to predict what will happen if a Ugi reaction is carried out in methanol. Although the actual solubility of Ugi products is typically not reported in the literature, reading the experimental sections in papers can still provide some validation of the model.

For example, consider the following Ugi products synthesized recently by Lezinska (Tetrahedron 2010).


Note that these images represent the azide group in a form that does not follow the octet rule. It is necessary to represent the structures as SMILES without charges because the CDK and VCC web services used by the model do not process charges correctly. Stereochemistry also cannot be used, and it can be removed from the SMILES simply by deleting the slashes. Thus, for the two molecules above, the SMILES to be submitted to the prediction web service are:

O=C(NC1CCCCC1)C(Cc2ccc(C)cc2)N(c4ccccc4C(=O)c3ccccc3)C(=O)C(Cc5ccccc5)N=N#N
AND
O=C(NC1CCCCC1)C(C(=O)c2ccccc2)N(Cc3ccc(C)cc3)C(=O)C(C)CCN=N#N
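The slash-deletion step described above is a simple text operation; here is a small sketch (the example cinnamamide-like SMILES is just for illustration, and note that this naive approach doesn't touch charges or @ chirality marks):

```python
# Remove SMILES directional bond symbols (/ and \) to discard E/Z
# stereochemistry before submitting to the prediction web service.

def strip_stereo(smiles):
    return smiles.replace("/", "").replace("\\", "")

print(strip_stereo("O=C(NC1CCCCC1)/C=C/c1ccccc1"))
# → O=C(NC1CCCCC1)C=Cc1ccccc1
```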

The predicted methanol solubilities are respectively 0.004 M and 0.03 M.

Now if we look at the details in the experimental section, both of these Ugi products were synthesized in methanol at a limiting reactant concentration of about 0.1 M. Even though this is much more dilute than the usual 0.5-2.0 M generally recommended for Ugi reactions (Domling 2000), the products still precipitate and can be filtered off. This is consistent with the predicted solubilities above and the model would have suggested ahead of time that methanol might be a good solvent for isolation of the products by precipitation.

So far these are just anecdotal results, but they do illustrate that solubility models can be evaluated without explicit determinations of solubility in the literature.


Reaction Attempts Explorer

Two months ago I reported on the Reaction Attempts project and the availability of the summary as a physical or electronic (PDF) book. The basic idea behind the project is to collect organic chemistry reaction attempts reported in Open Notebooks. This would include not only successful experiments but also those which could be categorized as failed, ambiguous, in progress, etc.

The book was organized with reactants listed alphabetically. In this way one could browse through summaries of the types of reactions being attempted by different researchers on a reactant of interest. There might be information there (what to do or what to avoid) of some use for a planned reaction. At the very least one could contact the researcher to initiate a discussion about work that had not yet been published in the traditional system.

Andrew Lang has just created a web-based tool to explore the Reaction Attempts database in much more sophisticated ways.

Here are some scenarios of how one could use it. On the left-hand side of the page is a dropdown menu containing an alphabetically sorted list of all the reactants and products in the database. Let's select furfurylamine.


This immediately informs us that there are 230 reactions involving furfurylamine, and it lists the schemes for all these reactions upon scrolling down. That's still a bit hard to process, so a second dropdown menu appears, populated with a list of other reactants or products involved with furfurylamine.

We now select boc-glycine and that narrows our search to 145 reactions.

Selecting benzaldehyde from the third dropdown menu narrows the search further to 61 reactions.

The final dropdown menu contains a short list of only isocyanides and thus all represent attempted Ugi reactions. Selecting t-butyl isocyanide gives us 56 reactions.

That means that these same 4 components were reacted together 56 times. Looking at the various reaction summaries shows that some of these are duplicates for reproducibility, while others vary the concentration and solvent, with the effect on yield included. This particular reaction was in fact the subject of a paper on the optimization of a Ugi reaction using an automated liquid handler.
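The progressive narrowing the Explorer performs amounts to repeated filtering on reaction components. The records below are invented stand-ins for the Reaction Attempts data, just to show the shape of the operation:

```python
# Each dropdown selection filters the remaining reactions to those
# containing the chosen component.

def narrow(reactions, component):
    return [r for r in reactions if component in r["components"]]

reactions = [
    {"id": 1, "components": {"furfurylamine", "boc-glycine",
                             "benzaldehyde", "t-butyl isocyanide"}},
    {"id": 2, "components": {"furfurylamine", "boc-glycine",
                             "benzaldehyde", "n-butyl isocyanide"}},
    {"id": 3, "components": {"furfurylamine", "boc-glycine",
                             "piperonal", "t-butyl isocyanide"}},
]

hits = reactions
for pick in ["furfurylamine", "boc-glycine", "benzaldehyde",
             "t-butyl isocyanide"]:
    hits = narrow(hits, pick)
print([r["id"] for r in hits])  # → [1]
```

Swapping the final pick (say, "n-butyl isocyanide" for "t-butyl isocyanide") corresponds to changing the last dropdown in the Explorer.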

Now here is where the design of the Explorer comes in handy. We might want to ask if the reaction proceeds as well with the other isocyanides. All we have to do is switch the final dropdown menu to ask what happens when we go from t-butyl to n-butyl isocyanide. There is a single attempt of this reaction and it "failed" in the sense that no precipitate was obtained from the reaction mixture. This doesn't mean that the reaction didn't take place - it might be that the Ugi product was too soluble. We can quickly check that the concentration and solvent are in line with conditions that allowed precipitation of the t-butyl derivative.

OK, let's see what happens with n-pentyl isocyanide.

It looks like it behaves just like n-butyl isocyanide: another single non-precipitation event. What about benzyl isocyanide?

This time we do get the Ugi product from a single attempt. Note the lower yield compared to the t-butyl isocyanide under similar conditions.

What about with cyclohexyl isocyanide?

This time we hit an experiment in progress. A precipitate was obtained but it was not characterized. We can click on the link to the lab notebook page (EXP232) to learn more about how long it took for the precipitate to appear, but there are not enough data to draw a definite conclusion about the success of the reaction. However, based on the results from the other precipitates in this series it is probably encouraging enough to repeat and characterize the product.

There are other sources of information here. Clicking on the image of the Ugi product takes us to its ChemSpider entry. In this case the only associated data relates to this reaction attempt.

Let's look at another scenario: reactions involving aminoacetaldehyde dimethyl acetal.

In this case we find the intersection of two Open Notebooks. The first reaction comes from Michael Wolfle from the Todd group.

The second comes from Khalid Mirza from the Bradley group.

In order to learn more about the nature of the overlap we can use the substructure search capabilities of the Reaction Explorer. Simply click on the image of the acetal and the ChemSpider entry pops up. Now click on the copy button next to the SMILES for the compound.

Paste the SMILES into the SMARTS box of the Reaction Explorer.

We get 13 reaction attempts for this query - the two we found earlier and the rest corresponding to attempts by Michael Wolfle to synthesize praziquanamine.

We learn that one connection between these two notebooks involves different attempts at synthesizing praziquantel.

Hopefully this demonstrates the value of abstracting organic chemistry reaction attempts from Open Notebooks into a machine readable format. Contributions to the database require only the ChemSpider IDs of the reactants and product and a link to the relevant lab notebook page. Reaction schemes are automatically generated by the system. More on the Reaction Attempts project here.


IGERT NSF panel on Digital Science

On May 24, 2010 I was part of a panel in Washington for the NSF IGERT annual meeting. As I mentioned previously, it is encouraging to find that funding agencies are paying more attention to the role of new forms of scholarship and dissemination of scientific information.

My co-panelists included Janet Stemwedel, who talked about the role of blogging in an academic career, Moshe Pritzker, who made a case for using video to communicate protocols in life sciences and Chris Impey, who demonstrated applications of clickers and Second Life in the classroom.

We only had 10 minutes each to speak so the presentations were basically highlights of what is possible. Still, it was enough to stimulate a vigorous discussion with the audience. There was a bit of controversy about the examples I used to demonstrate the limitations of peer review in chemistry. People can misinterpret what we are trying to do with ONS - it certainly doesn't include bringing down the peer review system (not that we could anyway). But we have to face the situation that peer review does not validate all the data and statements in a paper. It operates at a much higher level of abstraction. Providing transparency to the raw data should work in a synergistic way with the existing system.

My favorite part of the conference was easily Seth Shulman's talk on the "Telephone Gambit". Ever since reading his book, I have been using the story of how carefully reading Bell's lab notebook has forced us to revise the generally accepted notion of how the telephone was invented. Seth's presentation was truly captivating because he explained not only what was done but also what motives were at work to deceive and obfuscate. This cautionary tale is still very much relevant to science and invention today - and highlights how transparency can help guard against this type of outcome.
