Collaboration using Open Notebook Science in Academia book chapter

I am very pleased to report that the book chapter that I co-wrote with Andrew Lang, Steve Koch and Cameron Neylon is now available online: Collaboration using Open Notebook Science in Academia. This is the 25th chapter of Collaborative Computational Technologies for Biomedical Research, edited by Sean Ekins, Maggie Hupcey, Antony Williams and Alpheus Bingham.

Our chapter provides some fairly detailed examples of how Open Notebook Science can be used to enhance collaboration between researchers, whether their fields are closely related or far apart. It also suggests some paths towards machine/human collaboration in science. Hopefully it will encourage researchers with an interest in Open Science to experiment with some of the tools and strategies mentioned.

I am also grateful to Wiley for choosing our chapter as the free online sample for the book!

This book discusses state-of-the-art collaborative and computing techniques for the pharmaceutical industry, and their present and future implications and opportunities for advancing healthcare research. The book tackles problems thoroughly, from both the human collaborative side and the data and informatics side, and is very relevant to the day-to-day activities of running a laboratory or a collaborative R&D project. It can be applied to help organizations make critical decisions about managing drug discovery and development partnerships. The book follows a “man-methods-machine” format with sections on how to get people to collaborate, collaborative methods, and computational tools for collaboration. It offers the reader a “getting started guide” or instruction on “how to collaborate” for new laboratories, new companies, and new partnerships, as well as a user manual for how to troubleshoot existing collaborations.

ACS and ACRL presentations on web services and trust in science

On March 30 and 31, 2011 I presented two related talks - the first remotely for the American Chemical Society (ACS) Meeting and the second in Philadelphia at the meeting for the Association of College and Research Libraries (ACRL).

In the ACS talk "Rapid Dissemination of Chemical Information for people and machines using Open Notebook Science", I spoke for the first time in detail about the results of the open modeling that Andrew Lang and I carried out on the open collection of melting points we assembled, starting with the recently released Alfa Aesar dataset.

We used Skype and Google Presenter with the help of Peter Murray-Rust on site at the conference and it went fairly well I think. Henry Rzepa had a good question about polymorphism possibly being responsible for different melting points from various sources. I don't think that is the problem in most of these cases but we can certainly spend some time investigating the reports of polymorphism for cases where the information is available. One of the big problems is that we don't know the history of the sample used for a melting point from most sources like chemical vendor sites. At least in journal articles we might be told which solvent was used to crystallize the sample. If multiple sources agree on a certain melting point and there is one outlier, I think it is reasonable to assume that the common melting point is likely to correspond to the thermodynamically favored polymorph. This might not be correct in all cases but - without the means to discover more information about the sample histories - I think it makes sense to proceed in this way. Since we don't consider polymorphism in our modeling, there is an implicit assumption that - in the case of polymorphism - we are dealing with the thermodynamically most stable form.

My ACRL talk "Is there a role for Trust in Science?" focused more on the Chemical Information Validation study and its outcomes. There were several good questions at the end. One particularly good comment addressed my speculation that within a few years, the open models in most of the useful chemical spaces will be sufficiently good that it will be as easy to Google a melting point or a solubility as it is now to get driving directions. The question was: weren't we simply transferring trust from one information source to another, namely these models? I don't think the concept of trust applies in these cases because the training sets, the descriptors and the performance of the models are (and will be) open. This is in sharp contrast with most commercial software generating predictions for solubility and melting points - these are generally black boxes because either the training set, the model or the descriptors are not open.

Evan Curtin is the May 2011 RSC ONS Challenge Winner

Evan Curtin, a freshman chemistry student working under the supervision of Jean-Claude Bradley at Drexel University, is the May 2011 Royal Society of Chemistry Open Notebook Science Challenge Award winner. He wins a cash prize from the RSC.

Evan's work has focused on synthesizing aromatic imines and measuring their solubility in a number of organic solvents. This will allow us to generate Abraham descriptors for this class of compounds in order to predict their solubility in 70+ solvents. Coupled with our new model that takes the temperature dependence of solubility into account, this should greatly facilitate optimal solvent prediction for this and related reactions.

Imine formation is of particular interest to the UsefulChem group because it is the first step of the Ugi reaction, which we have used to synthesize compounds with anti-malarial activity. It is also a simple, convenient reaction in its own right for testing our Solvent Selector's ability to predict optimal conditions (solvent and temperature) for isolating products by precipitation.

Evan's synthesis experiments are available here:
http://usefulchem.wikispaces.com/Exp263
http://usefulchem.wikispaces.com/Exp262
http://usefulchem.wikispaces.com/Exp261


and his solubility experiments are listed here:

http://onschallenge.wikispaces.com/Exp207
http://onschallenge.wikispaces.com/Exp206
http://onschallenge.wikispaces.com/Exp205
http://onschallenge.wikispaces.com/Exp204
http://onschallenge.wikispaces.com/Exp201
http://onschallenge.wikispaces.com/Exp198
http://onschallenge.wikispaces.com/Exp197

Three more RSC ONS Awards will be made during 2011. Submissions from students in the US and the UK are still welcome.
For more information see:
http://onschallenge.wikispaces.com
http://onschallenge.wikispaces.com/RSCAwards2010

Validating Melting Point Data from Alfa Aesar, EPI and MDPI

I recently reported that Alfa Aesar publicly released their melting point dataset for us to use to take into account temperature in solubility measurements. Since then, Andrew Lang, Antony Williams and I have had the opportunity to look into the details of this and other open melting point datasets. (See here for links and definitions of each dataset)

An initial evaluation by Andy found that the Alfa Aesar collection yielded better correlations with selected molecular descriptors compared to the Karthikeyan dataset (originally from MDPI), an open collection of melting points used by several researchers to provide predictive melting point models. This suggested that the quality of the Alfa Aesar dataset might be higher.

Inspection of the Karthikeyan dataset did reveal some anomalies that may account for the poorer correlations. First, there were several duplicates - identical compounds with different melting points, sometimes radically different (up to 176 C apart). A total of 33 duplicates (66 measurements) were found with a difference in melting points greater than 10 C (see the ONSMP008 dataset). Here are some examples.
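
As a rough sketch of how this kind of duplicate flagging can be automated (pandas assumed; column names and most values are illustrative, and this is not the actual workflow behind ONSMP008):

    import pandas as pd

    # One row per (structure, reported melting point); values are illustrative.
    df = pd.DataFrame({
        "smiles": ["CCO", "CCO", "c1ccccc1", "c1ccccc1"],
        "mp_c":   [-114.1, -130.0, 5.5, 5.0],
    })

    spread = df.groupby("smiles")["mp_c"].agg(["min", "max", "count"])
    spread["range_c"] = spread["max"] - spread["min"]

    # Flag structures reported more than once whose values differ by more than 10 C
    suspects = spread[(spread["count"] > 1) & (spread["range_c"] > 10)]
    print(suspects)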


A second problem we ran into involved difficulty processing some of the SMILES in the Karthikeyan collection. Most of these involved SO2 groups. For example, an attempt to view the following SMILES string in ChemSketch ends up with two extra hydrogens on the sulfur:

[S+2]([O-])([O-])(OCC#N)c1ccc(C)cc1

Other SMILES strings render with 5 bonds on a carbon and ChemSketch draws these with a red X on the problematic atom. See for example this SMILES string:

O=C(OC=1=C2C=CC=CC2=NC=1c1ccccc1)C


Note that the sulfur compounds appear to render correctly on Daylight's Depict site:

In total 311 problematic SMILES from the Karthikeyan collection were removed (see ONSMP009).
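
We used ChemSketch, OpenBabel and OpenEye tools for this check; purely as an illustration, a similar filtering pass could be scripted with RDKit as a stand-in, dropping any SMILES that fails to parse or sanitize (this sketch does not assert which of these particular strings RDKit accepts):

    from rdkit import Chem

    smiles_list = [
        "[S+2]([O-])([O-])(OCC#N)c1ccc(C)cc1",   # charge-separated SO2 group from the post
        "O=C(OC=1=C2C=CC=CC2=NC=1c1ccccc1)C",    # string that drew with five bonds on a carbon
        "c1ccccc1",                              # benzene, included as a control
    ]

    kept, rejected = [], []
    for smi in smiles_list:
        # MolFromSmiles returns None (with a warning) when parsing or sanitization fails
        mol = Chem.MolFromSmiles(smi)
        (kept if mol is not None else rejected).append(smi)

    print(len(kept), "kept;", len(rejected), "rejected")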

With the accumulation of melting point sources, overlapping coverage is revealing likely incorrect values. For example, 5 measurements are reported for phenylacetic acid.

Four of the values cluster very close to 77 C and the other - from the Karthikeyan dataset - is clearly an outlier at 150 C.

In order to predict the temperature dependence for the solutes in our database, Andy collected the EPI experimental melting points, which are listed under the predicted properties tab in ChemSpider (ultimately from the EPA). (There are predicted EPI values there but we only used the ones marked exp).

This collection of 150 compounds was then listed in a spreadsheet (ONSMP010) and each entry was marked as having only an EPI value (44 compounds) or having at least one other measurement from another source (106 compounds). Out of those having at least one more value, 10 reported significant differences (> 5C) between the measurements. Upon investigation, many of these point strongly to the error lying with the EPI dataset. For example, the EPI melting point for phenyl salicylate is over 85 C higher than that reported by both Sigma-Aldrich and Alfa Aesar.


These preliminary results suggest that as much as 10% of the EPI experimental melting point dataset is significantly in error. Only a systematic analysis over time will reveal the full extent of the deficiencies.

So far the Alfa Aesar dataset has not produced many outliers when other sources are available for comparison. However, even here, there are some surprising results. One of the most well-studied organic compounds - ethanol - is listed with a melting point of -130 C by Alfa Aesar, clearly an outlier from the other values clustered around -114 C.

When downloading the Karthikeyan dataset from Cheminformatics.org, a Trust Level field indicates: "High - Original Author Data".

It would be nice if it were that simple. Unfortunately there are no shortcuts. There is no place for trust in science. The best we can do is to collect several measurements from truly independent sources and look for consensus over time. Where consensus is not obvious and information sources are exhausted, performing new measurements will be the only remaining way to make progress.

The idea that a dataset has been validated - and can be trusted completely - simply because it is attached to a peer-reviewed paper is a dangerous one. This is perhaps the rationale used by projects such as Dryad, where datasets are not accepted unless they are associated with a peer-reviewed paper. Peer review was not designed to validate datasets - even if we wanted it to, reviewers don't typically have access to enough information to do so.

The usefulness of a measurement depends much more on the details in the raw data, reached by following the chain of provenance (when available), than on where it is published. To be fair, in the case of melting point measurements, there really isn't that much additional experimental information to provide, except perhaps an NMR of the sample to prove that it was completely dry. In such a case, we have no choice but to use redundancy until a consensus number is finally reached.

Open modeling of melting point data

The contribution of Alfa Aesar melting point data to our open collection has facilitated the validation of a significant amount of the entire dataset. However, this process of curation is never-ending. A good example is the discovery of an error in one of the sources for the melting point of warfarin. Following David Weinberger's post about our melting point explorer, his brother Andy noticed a problem and this enabled us to fix it.

In a way, creating an open environment to make it easy to find and report errors - as well as add new data - complicates scientific evaluation. In order to report a reproducible process and outcome, it is necessary to take a snapshot of the dataset. Choosing the exact composition of a dataset for a particular application is somewhat arbitrary. Aside from selecting a threshold for excluding measurements that deviate too much, compounds may be excluded based on their type.

For the sake of clarity, we archived the various datasets we created from multiple sources with brief descriptions of the filtering and merging at each step. From the perspective of an organic chemist, ONSMP013 is probably the most useful at this time. It contains averaged measurements for 12634 organic compounds and excludes salts, inorganics and organometallics. The original file provided by Alfa Aesar contained several of these excluded compounds and can be obtained from ONSMP000. It might be interesting at some point to create a collection of melting points for inorganics or salts. We would welcome contributions of melting point collections built with different filters.

One of the advantages of ONSMP013 is that it is possible to generate CDK descriptors for each entry (and these are included in the spreadsheet). Because no commercial software is needed to generate the descriptors, the modeling is fully transparent - and anyone can extend it.

With this in mind, Andrew Lang has used ONSMP013 to generate a Random forest melting point model (MPM002). The most important descriptors turned out to be the number of hydrogen bond donors and the Topological Polar Surface Area (TPSA). The scatter plot below shows the correlation (R2 = 0.79) between the predicted and experimental values. (color represents TPSA and size relates to H-bond donors)
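
For anyone who wants to experiment with this kind of model, here is a minimal sketch (not Andy's actual code). It assumes ONSMP013 has been exported as a CSV with a melting point column and a few CDK descriptor columns - the column names below are illustrative - and uses scikit-learn's random forest:

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    data = pd.read_csv("ONSMP013.csv")                    # assumed export of the spreadsheet
    descriptors = ["nHBDon", "TopoPSA", "MW", "XLogP"]    # illustrative descriptor columns
    X, y = data[descriptors], data["mp"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    model = RandomForestRegressor(n_estimators=500, random_state=42)
    model.fit(X_train, y_train)

    print("R2 on held-out data:", r2_score(y_test, model.predict(X_test)))
    print(dict(zip(descriptors, model.feature_importances_)))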


Andy has described in much more detail the rationale for selecting the Random forest approach over a linear model in MPM001. He has also compared the performance of CDK descriptors versus those from a commercial program for a small set of drug melting points in MPM003.

The Random forest model (MPM002) is also now available as a web service by entering the ChemSpiderID (CSID) of a compound in a URL. See this example for benzoic acid. If experimental results exist they will appear on top and a link to obtain the predicted melting point will appear underneath.

Note that the current web service for predicting melting points can be slow - it may take a minute to process.

Additional web services for melting point data will be listed on the ONS web services wiki.

Towards the automated discovery of useful solubility applications

Last week, I came across (via David Bradley) a paper by an MIT group regarding the desalination of water using a very clever application of solubility behavior:

Anurag Bajpayee, Tengfei Luo, Andrew Muto and Gang Chen, "Very low temperature membrane-free desalination by directional solvent extraction", Energy Environ. Sci., 2011 (article, summary)

The technique simply involves the heating of saltwater with molten decanoic acid to 40-80 C. Some water dissolves into the decanoic acid, leaving the salt behind. The layers are then separated and, upon cooling to 34C, sufficiently pure water separates out. Any traces of decanoic acid are inconsequential since this compound is already present in many foods at higher levels.

From a technological standpoint, I can't think of a reason why this solution could not have been discovered and implemented 100 years ago. It makes you wonder how many other elegant solutions to real problems could be uncovered by connecting the right pieces together.

To me, this is where the efforts of Open Science and the automation of the scientific process will pay off first. For this to happen on a global level, two key requirements must be met:

1) Information must be freely available, optimally as a web service (measurements if possible - otherwise a predicted value, preferably from an Open Model)
2) There has to be a significantly automated way of identifying what is important enough to be solved.

Since we have been working on fulfilling the first requirement for solubility data, I first looked at our available services to see if there was anything there that could have pointed towards this solution.

Although we have a measured (0.0004 M) and predicted (0.001 M) room temperature solubility of decanoic acid in water, our best prediction service can't handle the reverse: the solubility of water in decanoic acid. For that we would need the Abraham descriptors for decanoic acid as a solvent, and those are not yet available as far as I'm aware.

Also, we use a model to predict solubility at different temperatures - but it assumes that the solute is miscible with the solvent at its melting point. This is probably a reasonable assumption for the most part but it fails when the solute and the solvent are too radically dissimilar (e.g. water/hydrophobic organic compounds). In this particular application, decanoic acid melts at 31C and the process occurs in the 34-80 C range.

But even if we had the necessary models (and corresponding web services) for the decanoic acid/water/NaCl system, could it have been flagged in an automated way as being potentially "useful" or even "interesting"?

For utility assessment, humans are still the best source. Luckily, they often record this information tagged with common phrases in the introductory paragraphs of scientific documents. (In fact, this is the origin of the UsefulChem project.) For example, a Google search for "there is a pressing need for" AND solubility mostly returns results that provide reasonable answers to the question of what a useful application of solubility might be. I have summarized the initial results in this sheet.
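
To give a hint of how this phrase mining could eventually be automated, here is a minimal sketch; the trigger phrases beyond the one above and the sample text are illustrative, and a real version would pull introductory paragraphs from a search API or a full-text corpus:

    import re

    TRIGGERS = ["there is a pressing need for", "there is an urgent need for"]

    def mine_needs(paragraphs):
        """Return candidate 'problems worth solving' from paragraphs that also mention solubility."""
        hits = []
        for text in paragraphs:
            lowered = text.lower()
            if "solubility" not in lowered:
                continue
            for trigger in TRIGGERS:
                for match in re.finditer(re.escape(trigger), lowered):
                    # keep the clause that follows the trigger phrase
                    clause = text[match.end():match.end() + 120].split(".")[0].strip()
                    hits.append(clause)
        return hits

    sample = ["There is a pressing need for new materials for efficient CO2 separation, "
              "and gas solubility in the membrane is a key parameter."]
    print(mine_needs(sample))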

The first result is:
"there is a pressing need for new materials for efficient CO2 separation" from a Macromolecules article in 2005. The general problem needing solving would correspond to "global warming/CO2 sequestration" and the modeling challenge would be "gas solubility".

Analyzing the first 9 results in this way gives us the following problem types:

  1. global warming/CO2 sequestration
  2. fire control
  3. global warming/refrigeration fluid
  4. AIDS prevention
  5. Iron absorption in developing countries
  6. agriculture/making phosphate from rock bioavailable
  7. water treatment/flocculation
  8. natural gas purification/environmental
  9. waste water treatment

and the following modeling challenges:

  1. gas solubility
  2. polymer solubility
  3. hydrofluoroether solubility
  4. solubility of drug in gels
  5. inorganics
  6. inorganics/pH dependence of solubility
  7. polymer solubility/flocculation/colloidal dispersions
  8. gas solubility
  9. inorganics

These preliminary results are instructive. The problem types are broad and varied - and I think they will be helpful to keep in mind as we continue to work on solubility. The modeling challenges can be compared directly with our existing services - and none of them overlap at this time! All of these involve either gases, polymers, gels, salts, inorganics or colloids, while our services are strictly for small, non-ionic organic compounds in liquid solvents.

Part of the reason for our focus on these types of compounds relates to our underlying objective of assessing and synthesizing drug-like compounds. But a more important consideration is what type of information is available and what can be processed with existing cheminformatics tools. Currently most cheminformatics tools deal only with organic chemicals, with essential sources such as ChemSpider and the CDK providing measurements, models, descriptors, etc.

Even though some inorganic compounds are on ChemSpider, most of the properties are unavailable. Consider the example of sodium chloride:


This doesn't mean that the situation is hopeless but it does make the challenge much more difficult. Solubility measurements and models for inorganic salts do exist (for example see Abdel-Halim et al.) but they are much more fragmented.

With the feedback we obtain from this search phrase approach - and hopefully help from experts in the chemistry community - we can piece together a federated service to provide reasonable estimates for most types of solubility behavior.

I think that this desalination solution will prove to be a good test for automated (or at least semi-automated) scientific discovery in the realm of open solubility information. In order to pass the test, the phrase searching algorithm should eventually identify desalination as a "useful problem to solve" and should connect with the predicted behavior of water/NaCl/decanoic acid (or other similar compound).

Luckily we have Don Pellegrino on board. His expertise on automated scientific discovery should prove quite valuable for this approach.

Alfa Aesar melting point data now openly available

A few weeks ago, John Shirley - Global Marketing Manager at Alfa Aesar - contacted me to discuss the Chemical Information Validation results I posted from my 2010 Chemical Information Retrieval class. Our research showed that Alfa Aesar was the second most common source of chemical property information from the class assignment.
We explored some possible ways that we could collaborate. With our recent report on the use of melting point measurements to predict temperature-dependent solubility curves, the Alfa Aesar melting point collection could prove immensely useful for our Open Notebook Science solubility project.

However, since we are committed to working transparently, the only way we could accept the dataset is if it were shared as Open Data. I am extremely pleased to report that Alfa Aesar has agreed to this requirement and we hope that this gesture will encourage other chemical companies to follow suit.

The initial file provided by Alfa Aesar did not store melting points in a database-ready format - it included ranges, non-numeric characters and entries reporting decomposition or sublimation. One of the benefits we could provide back to the company was cleaning up the melting point field to pure numerical values ready for sorting and other database processing. This processed collection contains 12986 entries. Note that these entries are not necessarily different chemical compositions, since they refer to specific catalog entries with different purities or packaging.
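
The kind of clean-up involved is easy to picture with a small sketch (hypothetical rules, not the exact ones we applied): ranges are collapsed to a midpoint, and entries reporting decomposition or sublimation are set aside.

    import re

    def clean_mp(raw):
        """Convert a raw catalog melting point string to a single value in deg C, or None."""
        text = raw.strip().lower()
        if "dec" in text or "subl" in text:
            return None                      # decomposition/sublimation: handle separately
        # pull out the numbers; a dash between digits is treated as a range separator
        nums = [float(x) for x in re.findall(r"(?<![\d.])-?\d+(?:\.\d+)?", text)]
        if not nums:
            return None
        return sum(nums) / len(nums)         # midpoint of a range, or the value itself

    for raw in ["121-123 °C", "ca. 45", "-114", "250 dec."]:
        print(raw, "->", clean_mp(raw))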

For our purposes of prioritizing organic chemicals for solubility modeling and applications, we curated this initial dataset by collapsing redundant chemical compositions and excluding inorganics (including organometallics) and salts. We did retain organosilicon, organophosphorus and organoboron compounds. Because the primary key for all of our projects depends on ChemSpiderIDs, all compounds were assigned CSIDs, by deposition in the ChemSpider database where necessary. SMILES were also provided for each entry, as well as a corresponding link to the Alfa Aesar catalog page. This curated collection contains 8739 entries.

For completeness, we thought it would be useful to merge the Alfa Aesar curated dataset with other collections for convenient federated searches. We thus added the Karthikeyan melting point dataset, which has been used in several cases to model melting point predictions. This dataset was downloaded from Cheminformatics.org. Although we were able to use most of the structures in that collection, a few hundred were left out because of difficulties in resolving some of the SMILES, perhaps related to differences in the algorithms used by OpenBabel and OpenEye. Hopefully this issue will be resolved in a simple way and the whole dataset can be incorporated in the near future. The curated Karthikeyan collection contains 4084 entries.

Similarly the smaller Bergstrom dataset was included after processing the original file to a curated collection of 277 drug molecules.

Finally, the melting point entries from the ChemInfo Validation sheet itself, generated by student contributions, were added, bringing the collection to its current total of 13,436 Open Data melting point values. We believe that this is currently the largest such collection and that it should facilitate the development of completely transparent and free models for the prediction of melting points. As we have argued recently, improved access to measured or predicted melting points is critical to the prediction of the temperature dependence of solubility.

In addition to providing the melting point data in tabular format, Andrew Lang has created a convenient web based tool to explore the combined dataset. A drop down menu at the top allows quick access to a specific compound and reports the average melting point as well as a link to the information source. In the case of an Alfa Aesar source, a link to the catalog is provided, where the compound can be conveniently ordered if desired.
In another type of search, a SMARTS string can be entered with an optional range limit for the melting points. In the following example 14 hits are obtained for benzoic acid derivatives with melting points between 0C and 25C. Clicking on an image will reveal its source. (BTW even if you don't know how to perform sophisticated SMARTS queries, simply looking up the SMILES for a substructure on ChemSpider or ChemSketch will likely be sufficient for most types of queries).
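
Behind the scenes this kind of query boils down to a substructure match plus a numeric filter. A rough local equivalent - illustrative only, and not Andy's implementation - might use RDKit against a toy table; the 0-25 C window mirrors the example above, so the toy rows here happen to return no hits:

    import pandas as pd
    from rdkit import Chem

    # Toy melting point table; the real explorer queries the full combined dataset.
    data = pd.DataFrame({
        "smiles": ["OC(=O)c1ccccc1", "CCO"],    # benzoic acid, ethanol
        "mp_c":   [122.0, -114.0],
    })

    pattern = Chem.MolFromSmarts("c1ccccc1C(=O)O")   # benzoic acid substructure

    def matches(smiles):
        mol = Chem.MolFromSmiles(smiles)
        return mol is not None and mol.HasSubstructMatch(pattern)

    hits = data[data["smiles"].apply(matches) & data["mp_c"].between(0, 25)]
    print(hits)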

Preliminary tests on a Droid smartphone indicate that these search capabilities work quite well.

Finally, I would like to thank Antony Williams, Andrew Lang and the people at Alfa Aesar (now added as an official sponsor) who contributed many hours to collecting, curating and coding for the final product we are presenting here. We hope that this will be of value to the researchers in the cheminformatics community for a variety of open projects where melting points play a role.

ONS Solubility Challenge Book cited in a Langmuir nanotechnology paper

An interesting application of the data from the Open Notebook Science Solubility Challenge has recently been reported in Langmuir: "Enhanced Ordering in Gold Nanoparticles Self-Assembly through Excess Free Ligands" by Cindy Y. Lau, Huigao Duan, Fuke Wang, Chao Bin He, Hong Yee Low and Joel K. W. Yang (Feb 24, 2011).

The context is as follows, and the reference is to Edition 3 of the ONS Solubility Challenge Book.

Although to our best knowledge there lacks literature value of OA solubility in the two solvents, the 10-fold better solubility of 1-otadecylamine (sic), the saturated version of oleylamine, in toluene than hexane is in line with our hypothesis.(33) This increased solubility caused the OA molecules that were originally attached to the AuNPs to gradually detach from the AuNPs, which is supported by our observations in poor AuNP stability and surface-pressure isotherms.

This is a nice application of solubility to understand and control the behavior of gold nanoparticles. It is in line with some of the applications I discussed at a recent Nanoinformatics conference, where I think there is a place for the interlinking of information between solubility and nanotechnology databases.

I have to admit that it is somewhat ironic to see this citation in Langmuir, given the controversy about a year ago (post and FF discussion) regarding the citation of non-traditional literature.

Science Online 2011 Thoughts

On January 15, 2011 I co-moderated a Science Online 2011 session on Open Notebook Science with Antony Williams and Carl Boettiger. The projector failed so we did our best to introduce the topic without relying on visual aids. My main objective was to demonstrate that it is not necessary for researchers (or their machines) to interface with the actual lab notebook to benefit from the information generated from the work. By introducing simple and rapid abstraction steps, both solubility and reaction information can be converted to web services for a variety of uses. As long as a link to the original lab notebook page (including the raw data) is attached, no information is lost and details can be investigated on demand.

One of the most powerful tools to use in this context is the tracking of chemical entities as ChemSpider IDs. This enables direct access to many other web services which Andrew Lang and I have leveraged to generate our own services. Tony spoke a bit more about this in his part and outlined some of the benefits and frustrations with crowdsourcing. Carl spoke eloquently about his experiences with Open Notebook Science as a graduate student for computational projects. The slides from all of us are provided below.

The overall tone of the discussion during our session was quite positive and productive. This was the case with all of the other sessions I attended, as it has been in prior years. The Science Online conference has evolved to attract a large proportion of people advocating Open Science. The presenters and the audience feel that they are among friends and the result is usually a free and easy exchange of ideas. Not all conferences and symposia relating to the online aspects of science share this. I have seen many examples where the "online science" theme is overrun by Closed Science proponents, for example commercial databases or Electronic Laboratory Notebook (ELN) vendors. Hopefully this conference will retain its Open Science focus in the future.

Kaitlin Thaney proved to be a very effective moderator during her session on "The Digital Toolbox: What's Needed?" and she stirred up some insightful discussion. I also enjoyed Steve Koch's session (co-moderated with Kiyomi Deards and Molly Keener) on "Data Discoverability: Institutional Support Strategies". Steve shared a particularly compelling example of the collaborative benefits of Open Notebook Science, where a computational research group came across images and videos from one of his group's notebooks and incorporated these in their paper - with all due credit acknowledged.

I very much appreciated the opportunity to catch up with old friends and meet some new ones. I had never met Carl Boettiger in person before and we had some very interesting discussions about Open Science and Open Education. It was good to meet Mark Hahnel from FigShare and explore possible paths for data sharing. I had some nice chats with Antony Williams, Steve Koch, Steven Bachrach, Heather Piwowar and Ana Nelson.

The Saturday evening banquet proved to be surprisingly entertaining. Despite the sedate title of her talk, "Out on a Limb: Challenges of Training Scientists to Communicate", Meg Lowman pounded the audience with a hilarious performance. Science comedian Brian Malow kicked this up a notch with some very clever material. Later on, using a brilliant comedic judo technique, he repeated some choice derisive comments he received from his performances on YouTube. I hope he comes back next year!

The Spectral Game with ChemDoodle

In the summer of 2009, we published an article on the Spectral Game. This game is based on spectra uploaded as Open Data (in JCAMP-DX format) on ChemSpider (currently about 2000 H NMRs and a few C NMRs, IRs and NIRs). Students get points by clicking on the molecule associated with the spectrum on display.

Although this has proved to be a useful tool to teach spectroscopy (especially H NMR), there have been some limitations, which are related to the use of Java (JSpecView) to provide an interactive display of the spectra.

1) Spectra do not display properly on Macs - there are problems with the "right-click" options in JSpecView. It took me a really long time to understand why some of my friends were really unimpressed by JSpecView. When I recognized that they were all Mac users I took a look and it became clear.

2) Spectra do not display at all on smartphones because of the Java components

I am very happy to report that these issues have been overcome (for the most part) using ChemDoodle. Through a collaborative effort between Kevin Theisen, Andrew Lang, Antony Williams and myself, we now have a non-Java based version of the Spectral Game at SpectralGame.com.

The game plays well on Mac, iPhone and iPad. However I have seen it fail on 2 Androids so there are still a few kinks to work out. Luckily I happen to be teaching NMR right now in my organic chemistry course so my students will be testing out the ChemDoodle version extensively.

There are some really nice additional features as well. My favorite is the auto-scaling of the integration line when zooming in. In the JSpecView version, integration is problematic because, when zooming into high field peaks, the start of the integration line does not reset to zero and this requires several iterations of changing the integration offset to get a usable measurement.

Another advantage in the ChemDoodle design is the simplicity of the interface. There are no right-click options: everything available is clearly labeled at all times (toggle integration, reset spectrum and view header information). This makes the game easier to learn and play.

I would especially like to thank Kevin Theisen for being so responsive on the ChemDoodle end. I was skeptical that we would have a playable game for this term but he addressed all of our major issues very quickly.

Predicting temperature-dependent solubility for solvent selection

During the summer of 2010 I reported on the Solvent Selector web service that Andrew Lang and I constructed. The idea was to flag potential solvents with a high solubility for the reactants and a low solubility for the product, so that the work-up would require a simple filtration.

If available, the Solvent Selector service uses measured solubility values. If not, it attempts to predict the room temperature solubility using one of two models based on Abraham descriptors.
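
For readers unfamiliar with the approach, the Abraham solvation equation is usually written in the general form below; the descriptor definitions are standard, but the specific coefficient sets behind our two models are not reproduced here.

    \log SP = c + e\,E + s\,S + a\,A + b\,B + v\,V

Here SP is the solute property being modeled (for example a water-to-solvent partition coefficient or a solubility ratio), E is the excess molar refraction, S the dipolarity/polarizability, A and B the hydrogen-bond acidity and basicity, V the McGowan volume, and c, e, s, a, b and v are coefficients characteristic of the solvent.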

We now have modified the Solvent Selector so that it takes into account temperature. Andrew has inserted a thermometer icon next to each solvent in the report. When clicked, a plot is displayed over the entire range of temperatures where the solvent is a liquid. Curves for each starting material and the product are provided - and hovering over a data point provides numerical values.

There are several ways this resource could be used by chemists.

For reactions where some starting materials are not soluble enough at room temperature, the reaction could be carried out at a higher temperature. A higher temperature might also be desirable simply to speed up the reaction. Being able to predict the solubility of the product at that higher temperature would allow the course of the reaction to be monitored by the appearance of a precipitate.

For reactions where the solubility of the product is too high at room temperature, the curves could be used to estimate how low one could cool the reaction mixture without any chance that one of the starting materials would precipitate out. For example, consider the following Ugi reaction.


An optimization study was performed and found that methanol and ethanol provided much better yields than THF (see the JoVE article). This makes sense from a solubility standpoint: the room temperature solubility of the Ugi product is less than 0.05 M in methanol and ethanol but 0.26 M in THF (Solvent Selector results).


However, by clicking on the thermometer icon for the THF entry, one gets the following temperature curves for solubility.

By hovering over the curve for the Ugi product we find that the predicted solubility in THF at -78 C (conveniently, a dry ice/acetone bath) is 0.01 M, while that for the starting material Boc-glycine is 0.65 M, well above the 0.5 M concentration at the start of the reaction. This means that even if the reaction did not take place to a significant extent, we would not expect the starting material to precipitate at -78 C. Any precipitate should be the pure product.

Of course another obvious application is for solvent selection for re-crystallization.

How it works

For some time now we have been collecting literature on the temperature dependence of solubility in various systems (live Mendeley collection here - be patient, it might take a minute to display). Although the equations vary depending on the specific approach, there seem to be the following commonalities:

  1. An assumption is made that miscibility is reached at the melting point of the solute.
  2. The log of the solubility is linearly proportional to the inverse of the temperature in Kelvin.

I have looked into a few examples that we have of solubility measured over a temperature range and the assumptions above do seem to hold. This means that with only a room temperature solubility and a melting point for a given solute, the solubility at any temperature can be interpolated or extrapolated. (The concentration at the miscibility point is calculated from the predicted density of the solute divided by its molecular weight.)
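
A minimal sketch of that interpolation (my own illustrative code, not the implementation behind the Solvent Selector; density in g/mL, melting point in C, solubility in mol/L):

    import math

    def solubility_at_t(s_rt, mp_c, density, mw, temp_c, rt_c=25.0):
        """Estimate solubility (mol/L) at temp_c from a room temperature value and a melting point.

        Assumes ln(S) is linear in 1/T and that the solute is miscible at its melting
        point, where its concentration is taken as density*1000/mw (mol/L of pure solute).
        """
        t_rt = rt_c + 273.15
        t_mp = mp_c + 273.15
        c_misc = density * 1000.0 / mw
        slope = (math.log(c_misc) - math.log(s_rt)) / (1.0 / t_mp - 1.0 / t_rt)
        intercept = math.log(s_rt) - slope / t_rt
        return math.exp(slope / (temp_c + 273.15) + intercept)

    # Illustrative call only (made-up inputs): a solute with S = 0.05 M at 25 C,
    # melting at 150 C, density 1.2 g/mL, MW 300 g/mol, cooled to -78 C.
    print(solubility_at_t(0.05, 150.0, 1.2, 300.0, -78.0))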

Although there are situations where two liquids are not miscible because of extreme dissimilarity (e.g. methanol and hexane), for the most part our experience shows that the first assumption is valid. Also, we are not correcting for changes in density for the solute or solvent at different temperatures. Nevertheless, when only a single solubility measurement in a given solvent and a melting point are known, this simple model may prove to be of use as a rough guide for reaction design or re-crystallization. We'll report on its practicality over time as we put it to use.

Mirza PhD defense on the Ugi reaction for anti-malarial screening

My student Khalid Baig Mirza defended his Ph.D. thesis at Drexel University on December 6, 2010. It was nice to see all the pieces come together - even though there are still a few intriguing puzzles left to be fully resolved. Like most research projects, every answer generates several interesting additional questions and Khalid did a good job of showing what these key issues are for his work.

In his presentation, Khalid first discusses Open Notebook Science and his contribution to the sodium hydride oxidation controversy. Then he describes the UsefulChem project, involving the use of the Ugi reaction as an approach to synthesizing new anti-malarial agents, including a few unexpected side reactions and challenges. Finally he presents an overview of the ONS Solubility Challenge and its application to organic synthesis.

Visualizing Social Networks in Open Notebooks

Increasing the role of automation in the scientific process has long been a fundamental objective of Open Notebook Science. The automatic discovery of new connections in open scientific work is potentially a very important contribution to this end.

Visualizing social networks within and between Open Notebooks is certainly a good first step. Luckily, our Reaction Attempts project has already abstracted the key elements of organic chemical reactions within a collection of Open Notebooks. This means that creating connection maps between people and chemicals can be attempted with reliable and semantically unambiguous database sources.

The Reaction Attempts database records the identity of reactants and products as ChemSpiderIDs for each reaction within a collection of notebooks. The name of the researcher, the solvent, the yield (when available) and a few other key identifiers are also recorded.
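
To give a feel for the kind of abstraction involved, here is a toy sketch (illustrative names and IDs, not the actual Reaction Attempts data or Don's code) that links researchers through the ChemSpiderIDs they have in common:

    import networkx as nx

    # Each record: one researcher and the ChemSpiderIDs appearing in one reaction attempt.
    reactions = [
        {"researcher": "student_a", "csids": [1001, 1002]},
        {"researcher": "student_b", "csids": [1002, 2005]},
        {"researcher": "student_c", "csids": [3001]},
    ]

    G = nx.Graph()
    for rxn in reactions:
        G.add_node(rxn["researcher"], kind="person")
        for csid in rxn["csids"]:
            G.add_node(csid, kind="chemical")
            G.add_edge(rxn["researcher"], csid)

    # Researchers end up in the same connected component when a chain of shared chemicals links them.
    for component in nx.connected_components(G):
        people = sorted(n for n in component if G.nodes[n]["kind"] == "person")
        print(people)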

We are very fortunate that Don Pellegrino, an IST student at Drexel, has selected the analysis of networks within Open Notebooks as part of his Ph.D. work. He has started to report his progress on our wiki and is eager to receive any feedback as the work progresses (his FriendFeed account is donpellegrino).

Don's first report is available here. He is using the Open Source software Gephi for visualization and has provided all of the data and code on the associated wiki page. (also see Tony Hirst's description of mapping ONS work which provided some very useful insights) Don has provided a detailed report of his findings but I think the most important can be seen in the global plot below.
This represents a map connecting people through chemicals. The large top right structure represents the connections within the UsefulChem project and the main circle represents the activity of graduate student Khalid Mirza who was the most active on this project. The crescent structure to the right of the circle represents other students - mainly undergraduates - who worked with the same chemicals as Khalid.

At the top left there are 3 isolated small networks, representing completely separate projects: the sodium hydride (NaH) oxidation study, Dustin Sprouse and Sebastian Petrik. I'll be posting about Sebastian's work in a future post.
Near the bottom middle there is another small network connected to the main group by a single link mediated by 2,2-dimethoxyethylamine.

This represents the overlap between Open Notebooks (Wolfle from Todd group and Mirza from Bradley group) that I mentioned previously.

I think that automatically discovering such connections as they occur could be a really useful outcome of this network analysis work. For example, the researchers could be alerted by email that a new potentially interesting overlap between their projects now exists. This could accelerate new collaborations.
A key challenge in Don's work is to figure out the right questions so that the results will be genuinely useful and novel to the researchers involved and the research community. I'm optimistic that he will succeed. As a separate outcome, just learning how researchers collaborate and record their work over time is bound to be interesting.
For a description of Don's planned work over the next several months take a look at his full Thesis Proposal: "Proposal of a System and Methods for Integrating Literature and Data".

Chemical Information Validation Results from Fall 2010

As I mentioned earlier, one of the outcomes of my Fall 2010 Chemical Information Retrieval class was the collection of chemical property information from different sources in a database format. Now that the course is over, this has resulted in 567 measurements for 24 compounds (including one compound, EGCG, carried over from the previous term). I have curated the dataset to ensure that the original numbers, conversions to common units, categorizations, etc. are correct. Links to the information source or to a screenshot of the source are available for each entry - so if I missed something, anyone can unambiguously verify it for correction.

The dataset is available as a Google Spreadsheet. Andrew Lang has also created a web based interface: the ChemInfo Validation Explorer. By simply specifying the compound of interest and the property using drop-down menus, the list of measurements from the relevant sources is provided, with values more than one standard deviation from the mean marked in orange. Links to the information source, or an image in cases where the information source cannot be directly linked, are provided in the results. Here is an example for the boiling point of benzene.

The visualization and analysis of the data was greatly facilitated by the use of Tableau Public. After downloading the free program anyone can easily re-create the queries in this post by first downloading the dataset as an Excel document then importing into Tableau Public. Interactive charts can then be freely hosted on the TP server and embedded as I have done in this post below.

The students were shown how to search both commercial and free information sources and were given complete freedom for which compounds and chemical properties to target. The results can be analyzed from the perspective of a reasonable sampling of the current state of chemical information available to the average chemist. The 5 most frequently obtained properties were melting point, density, boiling point, flash point and refractive index.


The information sources were categorized and are reported below by frequency. Chemical vendor sites were by far the most frequently used information source.

It is important to note that the information source does not represent the method by which the measurements were found. The source is simply the end of the chain of provenance: the document that provides no specific reference for the reported measurement. For example, even though ChemSpider was frequently used as a search engine, it would not be listed as an information source when it provided links to other sources (mainly MSDS sheets) for properties. ChemSpider was treated as a source for some predicted properties.


The chemical vendor Sigma-Aldrich was the most frequently used information source, followed by Alfa Aesar. Wolfram Alpha - categorized as a "free database" - was third. Oxford University follows closely behind in fourth place and is categorized as an "academic website", hosting MSDS sheets. Many universities host MSDS sheets but the Oxford web site seems to turn up most frequently in chemical property queries on search engines.

The fifth most frequent information source was Wikipedia, reflecting the fact that specific references are usually not provided there for chemical properties. Like ChemSpider, Wikipedia was categorized as a "crowdsourced database".

Flagging Outliers

One of the advantages of this type of collection is that it is much easier to identify outliers. In the case of non-aqueous solubility data, we were able to create an outlier bot to automatically flag potentially problematic results. Since different properties may have very different typical variabilities, outliers are most easily discovered by comparisons within the same property.
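
The flagging itself is straightforward once the measurements are in a table. A minimal sketch (illustrative compound names and values, an arbitrary cut-off, and pandas assumed):

    import pandas as pd

    df = pd.DataFrame({
        "compound": ["compound_a"] * 2 + ["compound_b"] * 3,
        "property": ["melting point"] * 5,
        "value_k":  [490.0, 410.0, 242.0, 242.0, 246.0],
    })

    stats = df.groupby(["compound", "property"])["value_k"].agg(["mean", "std", "count"])
    stats["sd_over_mean"] = stats["std"] / stats["mean"]

    # Flag properties measured more than once with a large relative spread (cut-off is arbitrary)
    print(stats[(stats["count"] > 1) & (stats["sd_over_mean"] > 0.05)])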

For example consider the following plot showing the standard deviation to mean ratio for melting point measurements.


This reveals that the average melting point for EGCG is suspect. At this point, an easy way to inspect the results is to use the Validation Explorer and look at the individual measurements.

By clicking on the images we can verify that the numbers have been correctly copied from the primary sources. In this case we can also ascertain that the sources - a peer reviewed paper and the Merck Index - are considered by most chemists to be generally reliable. There is no compelling reason at this point to weight one result over the other, and one has to be careful when using the average value for any practical application. (Note that all temperature data are recorded in Kelvin. A zero-based scale is necessary to ensure that the standard deviation to mean ratio is meaningful.)

The next flagging hit in this collection is the melting point of cyclohexanone. In this case 5 results are returned and the Validation Explorer highlights the Alfa Aesar value as being more than one standard deviation from the average.

However, one has to be careful when assessing this and assuming that the Alfa Aesar value is most likely to be the odd one out. Notice that 3 of the values - Sigma-Aldrich, Acros and Wolfram Alpha - are identical. The most likely explanation is that all three used the same information source and should thus be counted as a single measurement.

We can test this hypothesis by looking for cases where Sigma-Aldrich, Acros and Wolfram Alpha don't share identical values. As shown below, for melting point measurements, there is no case where the values don't match.


The same is true for boiling points:

However, in the case of flash points it is clear that the three are not using a common data source.


Using the data we collected - and will continue to collect - we could start to identify which data sources are likely using the same ultimate sources and avoid over-counting measurements. This would save time in searching since one would know which sources to check for a particular property while avoiding duplication. This information is extremely difficult to obtain using other approaches.

The same type of outlier analysis can be performed for all the properties collected in this study.

I believe that there is much more useful analysis to be done on this dataset, especially for chemistry librarians. When this class is run next year, more data will be added. In the meantime, contributions from other sources would be welcome.

Nanoinformatics 2010 Conference Report

On November 3, 2010 I presented on "The implications of Open Notebook Science and other new forms of scientific communication for Nanoinformatics" at the Nanoinformatics 2010 conference.

The presentation first covers the use of the laboratory knowledge management system SMIRP for nanotechnology applications at Drexel University during the period 1999-2001. The exporting of single experiments from SMIRP and their publication to the Chemistry Preprint Archive is then described, followed by the evolution to Open Notebook Science in 2005. Abstraction of semantic structure from ONS projects in the areas of drug discovery and solubility is then detailed as an efficient mechanism for providing web services and machine-readable data feeds.
This was a terrific opportunity to tie together my current ONS projects with my work in nanotechnology about 10 years ago, when the focus was to capture laboratory information in a structured format so that autonomous agents could begin to replace human workflows. I found it really interesting that the most active workflows back then were related to processing reference information. It took a team of students to find, photocopy and scan many of our key papers, with all the problems that come with training and managing new students. Today, obtaining relevant papers and extracting metadata is not so much of a challenge with tools like Mendeley. I ended the talk with a mention of our use of Mendeley tags to share dynamic links to article collections.
Another important development over the course of the past decade is the availability of free and hosted tools to easily communicate research. This includes wikis, blogs, Nature Precedings, institutional repositories, Google Spreadsheets and many others. It also includes some failed attempts like the Chemistry Preprint Archive.
I didn't anticipate in the late 90s just how crucial openness would prove to be for the evolution of the automation of the scientific process. It isn't my impression that there is currently a consensus on this point. Obviously it is possible to leverage automation in very clever ways for private use. But I think that exponential impact requires very low barriers to contribution (human or not) that can only be achieved with openness and transparency.
I have been very impressed with the ideas and projects discussed at this conference. Open sharing of nanotechnology data and integration of resources are clearly high priority items for many in this community.
As we have shown with our Reaction Attempts and ONS Solubility projects, abstracting meaningful semantic structure is necessarily field specific. One of the exciting opportunities to result from this meeting is finding ways to interconnect our solubility dataset with the nanotechnology community resources. I have met with a few people who would like to collaborate on this and I will be sure to report on our progress.

Dana Vanderwall on Cheminformatics at Drexel

Dana Vanderwall, Associate Director of Cheminformatics at Bristol-Myers Squibb, presented for my last Chemical Information Retrieval class on December 2, 2010.
The first part covered "Cheminformatics & The evolving relationship between data in the public domain & pharma" and included a general discussion of modern drug discovery and the details of a malaria dataset recently released from the pharmaceutical industry to the public.
The second part described a project based on "Molecular Clinical Safety Intelligence", where tracking side effects from approved drugs can help in the design of new drugs.
It was a very nice way to close out the course, showing very practical applications of the concepts we covered over the term. The recording is available below.

Elizabeth Brown’s guest lecture for ChemInfo Retrieval

Elizabeth Brown from the Binghamton University Libraries presented on "Web 0.0/1.0/2.0/3.0 and Chemical Information" on October 21, 2010 as a guest lecturer for my fifth class on Chemical Information Retrieval this term.

Beth made an interesting analogy with art to illustrate the differences between these communication platforms. What struck me during her presentation was the similarity between the current state of the semantic web (Web 3.0) and the state of computerized searching in chemistry when I was a graduate student in the early 90s. I had the chance to do just one substructure search at the time and it had to be done through an expert librarian. The search had to be carefully planned because of the expense.

From what I recall, the perception at the time was that computerized searching was impractical and perhaps even unnecessary. After all, "the way" to search for chemical information was to spend a weekend in the library systematically going through Chemical Abstracts volumes. This had "worked" for a long time for the chemistry community and doing things differently was considered superfluous and even wasteful.

Today, "the way" to search for chemical information is to use expensive databases to perform a targeted search and extract the information from mainly toll access peer reviewed journals. If you are off the academic grid you have to rely on free information on the web which is rarely associated with a chain of provenance. As my students will attest, even with access to the best tools, it still takes a lot of time to find the information and compare it for consistency.

The unfamiliarity with computerized searching in the early 90s is now the common attitude towards the semantic web. This is understandable because its availability is limited and people don't understand what to do with the tools that are available. Web services that we provide (derived from other services from ChemSpider and other sources) are usable by anyone who can open up a Google Spreadsheet and copy and paste a URL but it will take time for people to understand how to incorporate these into their workflows.

I think that in 10 years the semantic web will simply be part of the infrastructure. Currently, when you type a word in a browser or word processing software the system knows enough to alert you to possible misspelling by underlining a word. Similarly, in the near future, typing or specifying in some way a chemical compound will automatically pull in all relevant measured or calculated properties and provide suggestions in the context of a chemical reaction under consideration. Access to information will be free and unencumbered and the chain of provenance will be clear.

Instead of taking hours to find and process property data for single compounds, I think students in the future will be handling large libraries of compounds, looking for the best synthetic targets for their applications.

The Meaning of Data panel at a class on the Rhetoric of Science

Update: Lawrence Souder provided the audio for the presentation and panel questions.

On October 6, 2010 I had the pleasure to participate in a panel discussion at Lawrence Souder's class on the Rhetoric of Science. I gave a brief presentation on "The Meaning of Data" to kick off the discussion. I provided a scientist's perspective of data - with the basic idea that at best data are evidence. Data cannot be treated as irrefutable facts since there is always some uncertainty and many assumptions must be made in their interpretation. I also argued that, although uncertainty cannot be eliminated completely, transparency goes a long way to reducing it.

The students in the class did not have a science background and I think it was informative for them to explore a different perspective from their other exposure to science through popular media or even textbooks. The term "fact" is thrown around a lot in these information sources to simplify but it doesn't reflect how scientists think about data. We discussed how unbelievably wrong the scientific details in movies and TV shows such as NCIS and House can be. Nevertheless, one student pointed out that these shows can be effective in attracting people to science - as long as it was understood that the details were probably incorrect.
Because of market forces we recognized that science portrayed in the popular media would have a strong tendency to be exaggerated or oversimplified resulting in the phenomenon of "hype". A related effect can distort the way scientists communicate with each other, where there is a perception that exposing ambivalence in the form of seemingly contradictory data may not be in the researcher's best interest. This is a strong deterrent to the general adoption of transparency.
We explored the possible evolution of scientific communication as new tools such as blogs and wikis become increasingly used by scientists. Of course the issue of claims of priority was brought up and I discussed how this issue was handled in my own research work.
We debated whether these new forms of communication would alter the language used in scientific communication - even in traditional journals. I think that with the advent of the semantic web many researchers will start to write in a way that is understandable to humans as well as machines, their new target audience. One student remarked that the new generation coming up is very used to texting and that type of succinct communication is certainly in line with machine readability.
The assigned reading for the class involved a study of the language used by the Nobel laureates who discovered the buckyball to describe their research:

"In Praise of Carbon, In Praise of Science - The Epideictic Rhetoric of the 1996 Nobel Lectures in Chemistry by Christian Casper". This paper demonstrates that both personality and the perceived target audience can dramatically affect the language and focus that scientists use to explain their research.

Dynamic links to private tagged Mendeley collections

My close collaborators and I have been using Mendeley as a convenient way to share PDFs of journal articles. Not all of us have access to the same libraries so links are not enough - we need the full documents. We also use Dropbox for redundancy, but Mendeley allows tagging and recording notes, which is very handy for everyone in the group.

Now that Mendeley is providing an API, Andrew Lang has written code that significantly leverages the information in our private ONS collection. We can now create public links that return the most updated results for specific tags, including multiple tags (which I don't think you can do on Mendeley). For example the following link returns all articles in the ONS collection tagged with "science2.0" and "chemistry":

The results include available information from Mendeley, including the title, authors, journal citation, doi, url, tags and the abstract. Because this information is public the PDFs can't be provided but the hyperlinks make it as convenient as possible.

At the end of the report the full list of all available tags for the ONS collection is provided. A more refined or different search can be done immediately simply by checking boxes and hitting the submit button. Because the tags are controlled by the users of the private collection, these links can be useful when discussing an ongoing project and referring to a very specific topic. For example, we have been collecting examples of articles where a Ugi reaction is carried out and the product precipitates. This link provides an updated report on that very narrow topic:

http://showme.physics.drexel.edu/onsc/mendeley/?tags=Ugi+precipitate
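
Behind such a link, the logic is essentially a client-side tag filter over the collection. A hypothetical sketch (illustrative field names, not the Mendeley API schema or Andy's code, and assuming the collection's documents have already been downloaded):

    def filter_by_tags(documents, required_tags):
        """Return only the documents that carry every requested tag."""
        wanted = {t.lower() for t in required_tags}
        return [doc for doc in documents
                if wanted.issubset({t.lower() for t in doc.get("tags", [])})]

    docs = [
        {"title": "Example paper 1", "tags": ["Ugi", "precipitate"]},
        {"title": "Example paper 2", "tags": ["Ugi", "solubility"]},
    ]
    print(filter_by_tags(docs, ["Ugi", "precipitate"]))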

There are still 2 major limitations to this service:

1) The search is very slow (can take a minute or two) because there is no way currently to use the Mendeley API to selectively return results based on tags. Every search requires initially returning all results for the collection (currently a few hundred).
2) Notes are currently not returned. If the API is updated to include these the usefulness would increase dramatically. For example in the results for the above query I took notes of the conditions involved in the Ugi precipitate for each paper. With the current format, one has to read each paper to find the relevant information.
Progress on our Mendeley related services will be posted on the ONSwebservices wiki.

Open Notebook Science in Drug Discovery at Opal Event

I presented on "Open Notebook Science in Drug Discovery" on August 24, 2010 at a panel on Industry and Academia, part of the Opal Event "Drug Discovery: Easing the Bottleneck".

I only had about 15 minutes to present so I could not go into much detail but I did want to highlight the most recent work Andrew Lang and I (also with Peter Li from ChemTaverna) carried out involving solubility prediction and web services. Most of the attendees were from industry and I appropriately used the recent GSK malaria data sharing to introduce the talk. It is clear that there is a role for Open Science in drug discovery and I think that industry involvement will continue to increase in this area.

My co-panelist Rathindra Bose from Ohio University presented on his group's development of a novel cancer treatment compound based on platinum. He made the point that academic research complements that from industry by being able to explore more speculative hypotheses. The dominant hypothesis for the mechanism of action of platinum based drugs is binding with DNA. By exploring alternative scenarios, his group found an active platinum drug that does not bind with DNA.
During the preceding session on the Emergence of Biologics in Drug Discovery, Albert Giovanella from the University of Pennsylvania School of Medicine gave a particularly enlightening talk about comparing biologics with small molecule drugs. Although biological drugs tend to have less toxicity, the overall cost to bring them to market is still quite high and their cost to the consumer may be so high as to limit their impact. It looks like it will not be generally easy to translate new biomedical knowledge to a widespread impact on human health.