User:Walkerma/Sandbox5

In preparation for the January 29th IRC meeting, I (User:Walkerma) posted a request on the Chemical Information Listserve. The request was picked up on the ChemSpider Blog and (I believe) at least one other Listserve. This page summarises the responses; I have matched them, so that response no. 3 for question 1 is from the same person as response no. 3 for question 2. At the end I've added more general responses from two other people.

The original post

Thanks to work by Antony Williams, we (the chemists on Wikipedia) are currently validating the structural data on Wikipedia, and we are discussing the best way to present the information such as SMILES, InChIs, InChIKeys. To that end, I'd like to ask group members who use Wikipedia to reply to me (no need to clutter up the listserve) with their thoughts on the following:

  1. Do you ever search Wikipedia, or the Internet in general, for a structure using SMILES, InChI or InChIKey?
  2. Do you ever copy/paste such identifiers FROM Wikipedia into Google, etc, in order to do a search?

(I am well aware of the reliability question with Wikipedia, but let's not open up Pandora's box with that issue!)

We mainly want to find out if people need to SEE such identifiers in the article - bearing in mind they are designed for machines. We could hide them so a machine would see them but a casual reader would not, or "semi-hide" them (reader clicks to see). We could also place them on data pages such as this one: [1]

We are discussing this issue in our next IRC meeting - please join us if this is of special interest to you.[2]

Thanks for your time, Martin A. Walker (etc.)

Responses to question 1

Do you ever search Wikipedia, or the Internet in general, for a structure using SMILES, InChI or InChIKey?
  1. I wouldn't think of using the big three linear nomenclatures to search anything on the Net, but from my limited knowledge of all 3, I don't think SMILES would be preferable. (Personally, I'd rather use names, since the source of what I'm using, typically the general media, would only furnish names anyway.) The answer to your question depends on the audience. Chemistry professionals would be more likely to use linear notations than the general (hopefully at least somewhat technically literate) public. Resources like Google and Wikipedia are coming to the attention of patent professionals as they enter the realm of prior art in patents. Unlike more classic forms of publication, they get rather tricky to use both in writing patents and in searching them.
  2. As a daily searcher of biochemical structures for a decade, I'll tell you that Wikipedia is way down the list of places I look for that. There are so few structures there compared with the millions at PubChem and ChemSpider that it's not worth the bother. I will search Google before Wikipedia. Granted, Wikipedia occasionally comes up, but not enough to look there first. I do use Wikipedia a few times a week for other things, but not for chemical structures. I do sometimes resort to SMILES or InChI when a common name is lacking and the complicated and chopped-up chemical name would get too many incorrect hits. This also answers your question about utility: they're not much needed when there is a good common name for the compound, and essential when the name is a multi-hyphenated IUPAC-type name.
  3. I search multiple internet sources (e.g. ChemSpider, Wikipedia, search engines, online catalogs, etc) using various chemical identifiers (common names, IUPAC names, CAS number, SMILES, INCHI, etc). Since I didn't see your original posting, I don't have the full context for your questions.
  4. YES
  5. Yes and I would use it more if I knew that I would come up with more useful data, including Wikipedia entries. I copy SMILES or InChI strings from Chemdraw and paste them in internet searches.
  6. A bit; I would like to see more chemistry information located here. From my experience, you need to use at least 2 of these, ideally all 3, to ensure searchability.
  7. never used them for a search. IMHO it's not for humans to compose such strings of characters and I have no software for composing them.
  8. Yes. I use InChIMatic:[3]
  9. not yet, no




Responses to question 2

Do you ever copy/paste such identifiers FROM Wikipedia into Google, etc, in order to do a search?
  1. I suppose that would be a facile ability if the answer to the 1st question is Yes.
  2. NA
  3. However, my primary input is that regardless of whether or not the various identifiers are immediately visible on a Wikipedia page, I would really benefit from being able to access them from the compound entry with no more than one click. Ideally, I would like to have them available without having to make an additional click, but I also recognize that the majority of the people who use Wikipedia aren't necessarily chemists and might not realize how to take advantage of this information. Please let me know if you would like additional detail.
  4. YES
  5. This could be very useful, too. It doesn't matter if it's visible or not, a link "copy SMILES" would be enough. Thanks for asking.
  6. On occasion, I have been experimenting with this a bit w/r to chemistry and was surprised to find hits. As publishers like RSC start to adopt identifiers, for compounds (some for patents are starting to show up on open access sites), I believe this will become more useful. It seems that the time to start is now.
  7. I tried once copying/pasting some SMILES or InChI (I don't remember which) formulas from an article but without success. There possibly were some mistakes in the abstruse strings of characters, but who can check them?
  8. No. But come to think of it, it's pretty simple to add a link to a Google InChI query, for example: http://www.google.com/search?q=%22InChI%3D1%2FC10H8%2Fc1-2-6-10-8-4-3-7-9%2810%295-1%2Fh1-8H%22
  9. Not into Google, but I'm likely to want to paste them into databases of chemical structures (PubChem perhaps?); I'm not doing it yet, as the info seems too sparse to be worth bothering with. Separately, I followed your link, then went to the methanol page; there's already a slot there (rather than on the data page) for 'identifiers'. The SMILES is incorrect (lowercase, should be upper, for methanol; hmm, but it looks like it might be the chembox template that's transformed the original "CO" into "co") and the InChI is missing. I'd favour having one or more standard structure representations of molecules on pages about molecules. And I'd favour using de facto standards (such as mol or sdf) as well as more sharable ones like SMILES and InChI.
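The Google InChI query link suggested in response 8 can be generated mechanically. The following is only a minimal sketch in Python using the standard library; the function name `google_inchi_query` is hypothetical, and the InChI used is the one encoded in the example URL from that response.

```python
from urllib.parse import quote


def google_inchi_query(inchi):
    """Return a Google search URL for the exact (double-quoted) InChI string."""
    # safe="" makes quote() percent-encode "/" as well, as in the example URL.
    return "http://www.google.com/search?q=" + quote(f'"{inchi}"', safe="")


print(google_inchi_query("InChI=1/C10H8/c1-2-6-10-8-4-3-7-9(10)5-1/h1-8H"))
```

A template could emit such a link next to each stored InChI, so that a reader never has to copy the string by hand.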

Some general comments

Comments 1

OK, fine, but have you considered the reliability question with SMILES, InChI, and InChIKey strings, not to mention chemical names?

The first three of those are dependent not only on the particular drawing styles used, but also on the program used to generate them! Consider http://en.wikipedia.org/wiki/Glucose. It is physically impossible to write a SMILES, InChI, or InChIKey string that represents "glucose" without specifying a specific cyclic and anomeric form. Can't be done. The formats don't support those concepts.

It *is* possible to generate strings that represent, say, the chain form. There are an infinite (literally) number of SMILES strings that can represent the chain form of glucose (or any structure), so they will never be useful as search keys (for copying and pasting into Google, etc., as you describe). InChI and InChIKey strings are *theoretically* unique, but theory is NOT practice. On the Wikipedia page, there is a block of 8 structural diagrams following the section on isomers, with the first two diagrams showing the Fischer and Mills forms of the straight chain of D-glucose. As a chemist, you'll surely agree that those two diagrams are 100% identical, representing the exact same chemical substance. Now go use the InChI software to generate InChI strings, and you'll find that the strings are *not* identical. The InChI software doesn't recognize Fischer projections, so it sees the Fischer diagram as having no stereochemistry at all, and the InChI you get is really the InChI for 2,3,4,5,6-pentahydroxyhexanal, not the one for D-glucose. The InChI strings generated for every aldohexose drawn as a Fischer projection will be identical. The ring forms are even worse. The nice chair forms drawn on that page will be recognized as having 2 stereocenters (I think, or maybe 3). If they are drawn in Mills form, the InChI program will recognize the proper 5 stereocenters. If they are drawn in perspective without wedged and bold bonds (which are commonly omitted in perspective diagrams), then the InChI program will see zero stereocenters. The diagrams represent identical structures -- and chemists agree that the diagrams represent identical structures -- but the InChI strings are not the same.
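The collapse described above, where every aldohexose drawn as a Fischer projection yields the same string, can be illustrated with a toy model. This is only a sketch: it does not call the InChI software, the two identifier functions are hypothetical stand-ins, and each isomer is modelled simply as the R/S descriptors at the four stereocentres (C2 to C5) of the open chain.

```python
from itertools import product

# Toy model: one tuple of R/S descriptors (C2..C5) per open-chain aldohexose.
isomers = list(product("RS", repeat=4))


def stereo_blind_id(descriptors):
    """Stand-in for a generator that sees no stereochemistry in the drawing."""
    return "2,3,4,5,6-pentahydroxyhexanal"


def stereo_aware_id(descriptors):
    """Stand-in for a generator that perceives all four stereocentres."""
    prefix = ",".join(f"{pos}{d}" for pos, d in zip(range(2, 6), descriptors))
    return f"({prefix})-2,3,4,5,6-pentahydroxyhexanal"


blind = {stereo_blind_id(i) for i in isomers}
aware = {stereo_aware_id(i) for i in isomers}
print(len(isomers), len(blind), len(aware))  # prints: 16 1 16
```

In this model sixteen distinct substances share one stereo-blind identifier, which is why such a string cannot serve as a reliable search key for D-glucose alone.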

FWIW, that is not actually a limitation of InChI strings, but rather an implementation-specific dependency. We've actually made sure that ChemDraw 11.0 *will* generate identical InChI strings in the cases described above. That would totally eliminate the problem if we were foolish enough to believe that everyone was using ChemDraw. For practical purposes, as long as anyone is using any other software, it doesn't make much difference at all.

The bottom line is that none of those identifiers can be used to identify chemical substances *reliably*. They can be used, sure. They might give the right answer some of the time. They definitely will give the wrong answer some of the time. Anyone who suggests otherwise is doing a disservice to the chemical community. You can include them or not, but don't be fooled into thinking that they are more or less helpful, important, or accurate than any other information.

You asked.

---

And as long as I'm looking at the Glucose page, I might as well point out a few of the more-obvious errors:

C(C1C(C(C(C(O1)O)O)O)O)O is not a SMILES string for glucose. It is a SMILES string for the stereo-unspecified 6-(hydroxymethyl)tetrahydro-2H-pyran-2,3,4,5-tetraol.

6-(hydroxymethyl)oxane-2,3,4,5-tetrol is not a IUPAC name. That is to say, it isn't a IUPAC name for *anything*. It definitely isn't a IUPAC name for glucose.

(2R,3R,4S,5R,6R)-6-(hydroxymethyl)tetrahydro-2H-pyran-2,3,4,5-tetraol is a IUPAC name, but for beta-D-galactose, not any form of glucose.

The best IUPAC name for glucose is... "glucose". There are plenty of weird things in the IUPAC nomenclature recommendations, but sometimes they get things right, too.


To comment on the statements above, I want to clarify that he IS right about the limitations of chemical structure representations, InChI generation and nomenclature.

Also, regarding names and SMILES etc., I did not check Glucose in this pass through the database. I was given two Excel files to start with and had to filter through those, and glucose wasn't in there. I have since found there are many other structures/chemicals in Wikipedia that were not in the files either, so there are likely many 10s I did not catch on this trawl through the DB. But I did get almost 5000 others! The situation with carbohydrates is very complex and I brought up some aspects of it here. There are so many different ways to represent such structures that there needs to be a WP:CHEM discussion/decision regarding how to represent them. There is a distinct difference between the structure IMAGE shown and that used to generate the appropriate identifiers. For example, this structure is using stereo bonds to communicate perspective (hashed/wedge). It is clear what is intended to be communicated, but are they stereobonds? There are many examples of "bad structures" drawn on Wikipedia...it is clear what they are communicating, but they CANNOT be fed into algorithms for appropriate SMILES and InChI generation. The wrong answer will result. What we are dealing with is VERY complex...trying to create an electronic library of "accurate" chemical structures that can be used to generate SMILES, InChI, InChIKey. Different SMILES will likely be generated by different tools...ChemDraw, ChemSketch, OpenEye, Daylight, OpenBabel etc. There are SMILES on Wikipedia that I cannot convert with ANY of the tools I have at hand and I don't know where they came from. There are MANY names that are simply NOT IUPAC at all. Relative to the comments about the names... 6-(hydroxymethyl)oxane-2,3,4,5-tetrol DOES convert to a molecule consistent with the connectivities of a glucose-like molecule. A pyran ring with appropriate substitutions of hydroxyls etc. So...is it a IUPAC name?
(2R,3R,4S,5R,6R)-6-(hydroxymethyl)tetrahydro-2H-pyran-2,3,4,5-tetraol is the name for beta-D-Galactopyranose. Not glucose. This is as commented above. These are the challenges we have ahead of us everyone...a myriad of details such as appropriate structure representations, checking stereochemistry at a level of fine detail, checking CAS numbers are valid, choosing the "right tools" to generate all the SMILES, InChIStrings and InChIKeys etc. Add to that just identifying all of the chemicals/structures on Wikipedia to validate and you can see the magnitude of the challenge. While I agree with the majority of what's been said, I do believe there IS value in generating InChIStrings, SMILES and InChIKeys, if only to provide a way for people to convert back to a structure representation; otherwise they will simply have to redraw. (InChIKey CANNOT be converted back to a structure of course.) There are still a myriad of issues with this but I've already mentioned many of these elsewhere... InChIStrings will NOT get indexed appropriately by Google in many cases. InChI is NOT a cure-all. There are issues. Also true for SMILES. Molfile association with the record may be ideal for downloading the structure but it is not going to get indexed by search engines. And so the challenges and the conversation must continue....--ChemSpiderMan (talk) 07:07, 29 January 2008 (UTC)

Comments 2

I think this is a great idea, but I don't think SMILES can be the linking mechanism, because SMILES strings are not unique, nor is there a universal means of generating a unique SMILES. For example:

  OCC     ethanol
  C(C)O   ethanol
  CCO     ethanol
  C(O)C   ethanol

are all valid SMILES for ethanol. InChI strings would be a suitable unique structural linking form because:

  1. There is a reference implementation.
  2. It is designed to be unique.

This means there can be accurate and unambiguous links based on structure, generated and perceived at multiple different locations.
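The ethanol example can be checked with a short sketch. The code below is not a real SMILES toolkit: it assumes a tiny subset of SMILES (single-letter uppercase atoms, implicit single bonds, parentheses for branches), and the function names are hypothetical. It compares strings by a brute-force canonical form of the molecular graph, which is only practical for very small molecules.

```python
from itertools import permutations


def parse_simple_smiles(smiles):
    """Parse a tiny SMILES subset into (atom symbols, set of bonds)."""
    atoms, bonds, stack, prev = [], set(), [], None
    for ch in smiles:
        if ch == "(":
            stack.append(prev)       # remember the branch point
        elif ch == ")":
            prev = stack.pop()       # return to the branch point
        elif ch.isalpha() and ch.isupper():
            atoms.append(ch)
            idx = len(atoms) - 1
            if prev is not None:
                bonds.add(frozenset((prev, idx)))
            prev = idx
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    return atoms, bonds


def canonical_form(smiles):
    """Smallest (labels, edges) pair over all atom renumberings."""
    atoms, bonds = parse_simple_smiles(smiles)
    best = None
    for perm in permutations(range(len(atoms))):
        labels = [None] * len(atoms)
        for old, new in enumerate(perm):
            labels[new] = atoms[old]
        edges = sorted(tuple(sorted(perm[a] for a in bond)) for bond in bonds)
        candidate = (tuple(labels), tuple(edges))
        if best is None or candidate < best:
            best = candidate
    return best


ethanol = ["OCC", "C(C)O", "CCO", "C(O)C"]
print(len({canonical_form(s) for s in ethanol}))          # prints: 1
print(canonical_form("COC") == canonical_form("CCO"))     # prints: False
```

The four strings differ as text but reduce to one graph, while dimethyl ether (COC) does not, which is the distinction a plain text search cannot make.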

Now, this won't help all the other pathologies associated with structural representation like

-N(=O)(=O) vs -[N+](-[O-])=O, and protonation states of acids and such...

but it is a nice first step.

Of course then we get to other interesting ideas like trying to recognise chemical names and turning them into InChI strings, but that's probably further along the path.

Wouldn't it be great to paste in an InChI into Google and have it go fetch? Or click on a chemical reference in an article and have it go find all articles for that substance. Google is probably smart enough to index the InChI strings as is.

Chemistry will always be a "poor relative" on the information superhighway, but if we can manage to express ourselves in the common language, text, then we may be able to keep up.

Look forward to hearing how your work progresses...

Google cannot index all InChIStrings as they are...they get split after a certain number of characters etc...thus the InChIKey. InChIs can deal with a lot of the complexities of tautomers. Pasting InChIs into Google to "go fetch" DOES work for certain structures but has imperfections for sure...a long conversation. My focus for the work I am doing right now is to try and get at least consistency between the article name, an "appropriate" structural representation to match the article name, "a" SMILES string, "a" systematic name, and an InChIString and InChIKey associated with the structure. It will be imperfect but I GUARANTEE better than what's on WP at present as we are already collectively catching MANY errors. But, this first pass will not be enough.--ChemSpiderMan (talk) 07:13, 29 January 2008 (UTC)
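The idea behind a short, fixed-length key for a long identifier can be sketched as follows. This is an illustration only and NOT the InChIKey algorithm: the function `short_key` and its letter mapping are hypothetical, and the only property shown is that a hash gives a search-friendly token of constant length that cannot be converted back into the original string.

```python
import hashlib
import string


def short_key(identifier, length=14):
    """Map an arbitrarily long identifier to a fixed-length, letters-only key.

    Illustration only: this is not how a real InChIKey is computed.
    """
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    # Map each digest byte onto A-Z (slightly biased, acceptable for a sketch).
    return "".join(string.ascii_uppercase[b % 26] for b in digest[:length])


naphthalene = "InChI=1/C10H8/c1-2-6-10-8-4-3-7-9(10)5-1/h1-8H"
key = short_key(naphthalene)
print(len(key), key.isalpha())  # prints: 14 True
```

Because the key contains no punctuation, a search engine has no natural place to split it, at the cost that the structure must be looked up rather than decoded.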