Wikipedia:WikiProject Chemistry/IRC discussions/6 May 2008
--- Log opened Tue May 06 12:04:12 EDT 2008
12:04 -!- You're now known as dmacks_logging
12:04 <+Physchim62> who is logging?
12:05 <+Physchim62> that answers!
12:05 -!- mode/#wikichem [+v dmacks_logging] by ChanServ
12:05 <walkerma> dmacks_logging perhaps?
12:05 <+dmacks_logging> heh:)
12:05 <+Rifleman_82> only logging?
12:05 <+dmacks_logging> I'm here for a little bit.
12:05 <walkerma> OK, has everyone had a chance to look at the CAS SDF file?
12:06 <+dmacks_logging> ya
12:06 <+Rifleman_82> the last i saw were the strange barium nitrate structures
12:06 <+Physchim62> me, yes, it is one-quarter included in CAVer
12:06 <+Rifleman_82> CAVer?
12:06 <walkerma> Have you looked at the whole file, Andrew?
12:07 <+Physchim62> CAVer is my personal database, with which I try to solve problems which can't be done openly on WP
12:07 <+dmacks_logging> (tangent: these ionic-form messes might explain why SciFinder gives me such incorrect results when I search for ion-related structures!)
12:08 <+Rifleman_82> no, i don't have an SDF reader. only been following the email discussions
12:08 <walkerma> Yes - even if you look at the print copy of Chem Abs, they've got a policy of handling things that way
12:08 <+Physchim62> yes, it's utter crap. I don't know how CAS made the file, but it is a simple error
12:08 <walkerma> (With neutral forms, not ionic)
12:08 <+Rifleman_82> any reader to recommend?
12:08 <walkerma> It's not a mistake, it's their policy
12:09 <+Rifleman_82> the policy is a mistake?
12:09 <+Rifleman_82> i've come across this on scifinder searches and it was initially quite confusing
12:09 <+Physchim62> for those who have received ChemSpiderMan's email, I, personally, do not think this is a problem for inorganic compounds
12:10 <walkerma> Andrew: The policy is a bit like our policy where we say, "We will represent charged structures as neutral unless there is a good reason to say otherwise"
12:10 <+Physchim62> RM, yes, they have a strange policy ;)
12:10 <+dmacks_logging> It's consistent, though it's "less than ideal"...um, for almost any use I can conceive.
12:11 <walkerma> It probably made sense in 1955, or whenever they made the policy!
12:11 <walkerma> But as PC points out, it shouldn't be a problem
12:11 <+Physchim62> the real reason is that "dinitrate" in CAS nomenclature means something different from "bis(nitrate)" in IUPAC nomenclature
12:12 <+Physchim62> dinitrate = N2O5(4-)
12:13 <+dmacks_logging> Interesting, PC!
12:13 <+Physchim62> it is not a problem, I have already verified the inorganic structures vs. my database
12:14 <walkerma> I checked with Antony, and he agreed that we should be able to use names for many inorganics, to cross connect CAS nos with our records. So, PC, your news is very good!
12:14 <walkerma> To find out that we can in practice, not just in theory
12:14 <+Physchim62> inorganics are about 10% of their dataset, which is equivalent to their weight in WP
12:15 <walkerma> I thought we had more than that?
12:15 <walkerma> On WP
12:16 <+Physchim62> depends how you count it
12:16 <walkerma> Wikipedia:WikiProject Chemicals/Inorganics lists 1889 inorganics
12:16 <walkerma> Wikipedia:WikiProject Chemicals/Inorganics
12:19 <+Physchim62> there is nothing inconsistent, just a question of what you define as an "inorganic". If I take the strict IUPAC definition, which I must do to create PINs, inorganics are about 10% of our articles
12:20 <walkerma> Do you have any idea how many inorganics are on their list but not on ours?
12:21 <walkerma> And how many of theirs WERE already in CAVer?
12:21 <walkerma> (CAVer = PC's collection built by combining Andrew's SDF + the inorganics list + some other data)
12:22 <+Physchim62> not certain because I hadn't completed checking articles
12:22 <walkerma> (CAVer should have nearly all compounds from PC)
12:22 <walkerma> Sorry, all compounds from WP!
12:22 <+Rifleman_82> i created an SDF?
12:22 <+Rifleman_82> oh, my list
12:22 <+Rifleman_82> emmanuel's too?
12:23 <walkerma> Sorry, I meant Antony's SDF, I'm getting confused!
12:23 <walkerma> I notice that barium nitrate (in their listing) is named kinda like the way they represent it: "nitric acid, barium salt". Is that causing us to miss it when we try to connect it to "barium nitrate"?
12:23 <+Physchim62> that is not a problem
12:24 <walkerma> How do you make the connection - is it via pseudonym, or via formula, then manual curation?
12:26 <+Physchim62> OK, to answer the question quickly, it's about 100: go to http://en.wikipedia.org/wiki/User:Physchim62/Wishlist and see the items without ICSC links: those are the inorganic structures from the CAS list for which I have not yet found articles
12:27 <+dmacks_logging> All three anisidines are stubbed on anisidine
12:28 <+Physchim62> for compounds without carbon atoms, there is no problem
12:28 <+Physchim62> yes, that's a problem with my database
12:29 <+dmacks_logging> AIBN is listed without parens.
12:29 <+Physchim62> again
12:30 <walkerma> So, you're using formula to link CAS no. to article?
12:30 <+dmacks_logging> Ah, compounds without C *are* the problem?
12:30 <+Physchim62> I have concentrated on inorganics, because it's my speciality and because Antony was doing organics
12:30 <+Physchim62> no, *I* am linking by name
12:31 <+dmacks_logging> Oh wait, /me was looking for ICSC-no-Article, not no-ICSC:/
12:31 <+Physchim62> ?
12:31 <+Physchim62> I do that as well!
12:32 * dmacks_logging completely confused what the problem is here that we're trying to solve.
12:32 * Physchim62 sympathises
12:32 <walkerma> The problem is to ensure that the CAS no that is given to us by CAS is for the same compound that we have on WP
12:33 <+Physchim62> OK, good call chair
12:33 <+Physchim62> for example, Warfarin
12:33 <+Rifleman_82> and it's complicated by CAS' convention, e.g. barium nitrate
12:34 <+Rifleman_82> ?
12:34 <walkerma> Remember that the same "compound" may be represented by many different CAS nos.
12:34 <+Physchim62> Warfarin has the number 81-81-2, according to CAS
12:36 <+Physchim62> it also has other CAS numbers for its two enantiomers
12:37 <walkerma> One for the racemic mixture, another for "unspecified"
12:37 <walkerma> Even with NaCl, I'd bet they have one CAS# for halite, another for sodium chloride
12:37 <walkerma> So we must be very careful or our validation effort will be worthless
12:38 <walkerma> Also, on a related topic (partly inspired by this CAS to WP problem as well), see Antony's blog:
12:38 <walkerma> http://www.chemspider.com/blog/care-in-nomenclature-handling-and-why-visual-inspection-will-remain.html
12:38 <+Physchim62> it won't be worthless
12:39 <+Rifleman_82> just to check, warfarin as a drug is racemic?
12:39 <+Physchim62> *I* don't know
12:39 <+Rifleman_82> that's what it says on Warfarin
12:39 <+Physchim62> but there are three different CAS numbers
12:40 <walkerma> From WP:"Warfarin consists of a racemic mixture of two active optical isomers - R and S forms - each of which is cleared by different pathways. S-warfarin has five times the potency of the R-isomer with respect to vitamin K antagonism.[5]"
12:40 <+dmacks_logging> Ref 5 concurs that it's an R/S mix.
12:40 <+Physchim62> one of which I got from the CAS list and the other two from the European Chemical Agency
12:41 <+dmacks_logging> (...but only "in roughly equal proportion"...not explicitly 50/50 racemic?)
12:42 <walkerma> dmacks: Is that just legal-ese, for patent purposes, or is it genuine chemistry?
12:42 <+dmacks_logging> That's a quote from the JAmCollCardio article.
12:43 <+dmacks_logging> (which doesn't answer your question very well, I know)
12:43 <walkerma> I just wonder if they're non chemists hedging a bit, that's all!
12:43 <+dmacks_logging> Actually, they also do use the word "racemic"
12:43 <+Physchim62> OK, so what are we meant to be doing?
12:43 <+dmacks_logging> So they're not "playing it safe with legalese", they just don't know the science:)
12:44 <walkerma> Possibly! Anyway, we should get back to the main issue, which is what to do next
12:44 * Physchim62 knows he has the data, but can't find it quickly
12:44 <walkerma> PC: Thanks for all your work so far, and thanks to Antony in absentia
12:45 <+Physchim62> Unless someone says "No"...
12:45 <walkerma> PC: Is there something else you need to raise on that?
12:46 <+Physchim62> I shall take charge of compounds without carbon atoms
12:46 <+Physchim62> which is about 10% of the data set, if you don't count elements
12:48 <+Physchim62> for the rest, I have a list of about 1200 compounds, for which I have been able to verify data against other sources
12:49 <+Physchim62> Antony says he can match 2000 or more to his database
12:50 <+Physchim62> all of this leaves us with a black spot of about 2-3000 compounds
12:51 <walkerma> I think we should be able to squeeze quite a few more out of the 2-3000
12:52 <walkerma> Once we find ways to connect the CAS name/formula to our name/formula
12:52 <walkerma> But those data will have to be manually curated
12:52 <walkerma> I chatted with Antony last night. He proposes that we aim to "roll out" batches of perhaps 500 compounds at a time
12:52 <walkerma> I think that would be a good idea.
12:53 <walkerma> What do others think of that idea?
12:53 <+dmacks_logging> seems reasonable
12:57 <+Physchim62> it's the only way
12:59 <+Physchim62> I don't know how to do it practically
13:00 <walkerma> So what remains to be done:
13:00 <walkerma> 1. Find which compounds are on both the CAS list and on WP. Antony has done this for organics, and PC is completing the job. Let's call that the CAS/WP Intersection List (CWIL).
13:00 <walkerma> 2. Go through the CWIL manually checking the structures and names to see if they make sense.
13:00 <walkerma> 3. Meanwhile work out a system for uploading the data onto WP, with vandalism protection,
13:00 <walkerma> 4. Once we have a set of 500 checked, we upload it to WP. Rinse and repeat until all of CWIL has been uploaded.
13:00 <walkerma> 5. Then try and check the CAS list further to try and squeeze more matches from it, using manual checks.
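The five steps above can be sketched in code. This is only a minimal illustration, assuming the CAS list is available as (CAS number, name) pairs and the WP side as a list of article titles; the function names `intersection` and `batches` are hypothetical, and matching on lower-cased names is a simplification of the name, formula and InChI matching discussed in this log. Every match would still need the manual check described in step 2.

```python
from itertools import islice


def intersection(cas_entries, wp_titles):
    """Build a rough CAS/WP Intersection List (CWIL).

    cas_entries: iterable of (cas_number, name) pairs (assumed input shape).
    wp_titles:   iterable of article titles.
    Matching is case-insensitive on the exact name only, so anything
    found here is a candidate for manual curation, not a verified match.
    """
    titles = {title.strip().lower(): title for title in wp_titles}
    matched = []
    for cas_number, name in cas_entries:
        key = name.strip().lower()
        if key in titles:
            matched.append((cas_number, titles[key]))
    return matched


def batches(items, size=500):
    """Yield successive lists of at most `size` items (the 'rinse and repeat' step)."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

With 1200 checked entries, `batches` would yield two uploads of 500 and a final one of 200.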
13:00 -!- itub [n=tubert@lalo.chemie.unibas.ch] has joined #wikichem
13:00 -!- mode/#wikichem [+v itub] by ChanServ
13:01 <walkerma> Hi itub! We're one hour into the discussion, I hope you knew...
13:01 <+itub> hi
13:01 <+itub> I didn't know, I just heard about the discussion
13:02 <+Physchim62> the CWIL for inorganic is only about 500 compounds (excluding organometallics and formates, etc.)
13:02 <+itub> I'll read the logs tomorrow, I guess
13:02 <walkerma> Welcome, anyway
13:02 <walkerma> Physchim62: So we could treat the inorganics as one single upload. Could you do the curation on that list?
13:03 <+Physchim62> walkerma, the problem is that we *DON'T* know which compounds are described on WP
13:04 <+Physchim62> I have already volunteered to do that: the inorganic crossover is about 500 compounds
13:04 <walkerma> PC: "we *DON'T* know which compounds are described on WP" - are you referring to the One Page Many Compounds problem?
13:06 <+Physchim62> walkerma, yes, that is *exactly* the problem I'm talking about
13:06 <walkerma> But that shouldn't be a problem within an SDF file, right?
13:06 <walkerma> Where we have an exact CAS# to compound match
13:06 <walkerma> Agreed?
13:07 <+Physchim62> it is still a problem with an SDF file, because we still need to know which compounds are described in which articles
13:08 <+Physchim62> no.
13:08 <walkerma> As I see it, we need simply to validate the SDF file itself, then work out how best to present the data from the SDF file on the appropriate WP page
13:08 <walkerma> Example: Tartaric acid
13:08 <walkerma> If they give us only one CAS# for the unspecified stereochem form
13:08 <+Physchim62> how do you think we find the correct WP page?
13:09 <walkerma> Then the chembox makes it clear that the CAS# refers to that form
13:09 <walkerma> The SDF file already has a link to the WP page in it
13:09 <walkerma> (Our SDF file)
13:09 <+Physchim62> which SDF file?
13:10 <walkerma> Antony's SDF file
13:10 <+Physchim62> CAS have systematically given us racemic forms
13:10 <walkerma> OK, that's probably very good if they are consistent in that
13:11 <+Physchim62> Antony's SDF file is useless to the general user of WP
13:11 <walkerma> Not if we have a bot that is uploading data using the SDF data as its source
13:11 <+Physchim62> Antony's SDF file used WP as its source!
13:12 <walkerma> Or if someone really clever works out how to upload it into the persondaten collection?!!
International Chemical Identifier | |
---|---|
InChI= | |
InChIKey= | |
CASRN= | |
PIN= |
13:12 <+Physchim62> I know that he has included CAS data into it
13:12 <+dmacks_logging> Aw man. [[Sulfanilic acid]] explicitly states that "it is a zwitterion", and then we use the neutral form in infobox.
13:13 -!- Physchim62 [n=Physchim@unaffiliated/physchim62] has quit ["What did you say this button does?"]
13:13 <+dmacks_logging> Leads to (probably correct) conclusion that mp, solubility, etc aren't for "the compound as drawn":(
13:13 <walkerma> dmacks - remember that we agreed (our policy!) to abolish all zwitterions in structure boxes
13:13 <walkerma> For the sake of consistency
13:13 -!- Physchim62 [n=Physchim@unaffiliated/physchim62] has joined #wikichem
13:13 -!- mode/#wikichem [+v Physchim62] by ChanServ
13:13 <+dmacks_logging> Yup. I'm just noticing an inconsistency that this (good IMO) policy triggered.
13:14 <walkerma> Otherwise you have a curation nightmare- you have to debate every compound that could possibly form a zwitterion
13:14 <walkerma> It's like amino acids, people just have to know that the way it's drawn is just a representation
13:15 <walkerma> Physchim62: To come back to the SDF
13:15 <+dmacks_logging> (sorry for off-topicing)
13:15 <+Physchim62> yep, I'm back, sorry
13:15 <+Physchim62> ;)
13:16 <walkerma> Antony and myself have been checking things in there manually, logging all inconsistencies, errors etc
13:16 <+Physchim62> and...
13:16 <walkerma> I've done #2001- about #3300, I had agreed to do 2001-4000
13:17 <walkerma> Antony has my (long) list of "issues"
13:17 <walkerma> So as I see it, we
13:17 <walkerma> have done the following:
13:17 <+Physchim62> I have one "issue": http://en.wikipedia.org/wiki/Chloramine-T
13:18 <walkerma> WP --> SDF(dirty) --> SDF(clean) --> SDF (clean with CAS)
13:18 <walkerma> We can then do the final step and upload it into WP again - now clean
13:19 <+Physchim62> but the CAS SDF is filthy
13:20 <+Physchim62> both myself and Antony reckon that WP is 95% correct on CASRNs: that is more than can be said for the .sdf file that they sent us
13:20 <walkerma> The CAS numbers should not be filthy!
13:21 <+Physchim62> they are, unless you use them correctly
13:21 <walkerma> Part of our validation process will be to find all CAS nos that don't match with ours - yes, Antony says about 5% - and we may be able to check those with CAS again if necessary.
13:22 <walkerma> Antony said that about 19 out of 20 do match with ours, and we can start uploading those
13:23 <+Physchim62> my *correction rate* for the moment is less than 1/1000
13:23 <+Physchim62> ie, we have had bad CAS nos but not many
13:25 <walkerma> So then, a necessary part of the curation work will be for us to manually check if our CAS # matches with CAS's CAS#. If yes, mark as OK; if no, flag the entry as a problem entry.
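The OK-or-flag check described here could look roughly like the following. It is a sketch under assumptions, not anyone's actual workflow: the function names are hypothetical, and the inputs are assumed to be plain CAS number strings. The one well-established piece is the CAS check digit rule: multiply each digit before the check digit by its position counted from the right, sum the products, and take the result modulo 10. A number that fails this test is mistyped regardless of which compound it was meant for.

```python
import re


def cas_check_digit_ok(casrn):
    """Return True if the string is a well-formed CAS number with a valid check digit."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", casrn)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight 1 for the rightmost digit before the check digit, 2 for the next, and so on.
    total = sum(weight * int(d) for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def compare(wp_casrn, cas_casrn):
    """Return 'OK' if both numbers are valid and identical, else a 'FLAG: ...' note."""
    for label, value in (("WP", wp_casrn), ("CAS", cas_casrn)):
        if not cas_check_digit_ok(value):
            return f"FLAG: {label} number fails check digit"
    if wp_casrn != cas_casrn:
        return "FLAG: numbers differ"
    return "OK"
```

A "numbers differ" flag does not say which side is wrong; as noted in this discussion, one compound can carry several CAS numbers (racemate, single enantiomers, unspecified stereochemistry), so flagged entries still need a human to look at them.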
13:26 <+Physchim62> if CAS has given us a number which we can't interpret, yes, that's a problem
13:27 <+Physchim62> but CAS has said that it will give us synonyms as well (one per compound)
13:27 <walkerma> But we can just flag that, and then revisit all the flagged entries, right?
13:27 <+Physchim62> Tony has been using InChIs, I've been using names, with synonyms we would both have another angle of attack on the database
13:29 <+Physchim62> the problem is, Martin, what do *you* want the next step to be
13:29 <walkerma> I'll repost my proposed plan:
13:30 <walkerma> 1. Find which compounds are on both the CAS list and on WP. Antony has done this for organics, and PC is completing the job. Let's call that the CAS/WP Intersection List (CWIL).
13:30 <walkerma> 2. Go through the CWIL manually checking the structures and names to see if they make sense.
13:30 <walkerma> 3. Meanwhile work out a system for uploading the data onto WP, with vandalism protection,
13:30 <walkerma> 4. Once we have a set of 500 checked, we upload it to WP. Rinse and repeat until all of CWIL has been uploaded.
13:30 <walkerma> 5. Then try and check the CAS list further to try and squeeze more matches from it, using manual checks.
13:30 <walkerma> And what we should do before we close, is to agree who will do what.
13:30 <walkerma> PC: It sounds like you have the inorganics portion under control, right?
13:30 <walkerma> Will you have time to complete that?
13:31 <+Physchim62> inorganics are not the problem
13:32 <walkerma> Antony and I were planning on completing the curation of the organics - with help from others, if they want to pitch in
13:32 <walkerma> Antony has done some already and so have I
13:33 <+Physchim62> I have already done points 1 and 2 for inorganics and organometallics, I just need to upload them
13:33 <+Physchim62> and for c1 organics, for that matter
13:34 <walkerma> So we're at #3 for inorganics, then?
13:35 <walkerma> How did the test of the database upload go, PC?
13:35 <walkerma> Is the Persondaten approach going to work?
13:35 <+Physchim62> I would say #4, in that I want to verify safety data at the same time
13:36 <+Physchim62> I have also included ICSCs and the EU database in CAVer
13:37 <+Physchim62> to upload, I need to learn how to write a bot (or get a better internet connection)
13:38 <walkerma> How did you plan to do the upload? Into Persondaten format, or directly into articles, or what?
13:39 <+Physchim62> the persondaten approach will work, yes. I don't know if it's the best solution, but it is obviously an improvement on the current situation
13:39 <+Physchim62> My plan is: to see what happens
13:40 <+Physchim62> I will upload to articles, one-by-one and by-hand
13:40 <walkerma> OK: So we need someone (Beetstra?) to write a bot??
13:41 <+Physchim62> let me do the inorganics first!
13:41 <walkerma> (I think Beetstra is travelling at the moment, though)
13:41 <walkerma> But if we're at #4, don't we need the bot now?
13:42 <+Physchim62> yes, but for only 10% of the compounds!
13:42 <walkerma> Or were you thinking only manual upload, with bots only used to watch the data?
13:42 <+Physchim62> we need the bot for the other 90%
13:43 <+Physchim62> or we need time
13:43 <+Physchim62> ;)
13:43 <walkerma> How many compounds do you have ready for upload?
13:43 <+Physchim62> about 500 inorganics
13:44 <+Physchim62> about the same for organics, but I would prefer to wait on those
13:44 <walkerma> OK, so if we are uploading 500 at a time, how about making upload #1 the inorganics?
13:45 <+Physchim62> well, whatever you do, I'm going to do that
13:46 <+Physchim62> but it is hardly a question of "uploading 1000 at a time"
13:47 <walkerma> But if you upload the first 500 this month, Antony and I can probably have 500 organics totally ready by June
13:47 <walkerma> Ready for upload
13:48 <+Physchim62> if we can get a format which is agreed by the BotOwners ;)
13:49 <+Rifleman_82> gotta go, tired
13:49 <walkerma> OK, that's part of step #3 on the above list, and I think why we need to talk to bot writers now
13:49 <+Rifleman_82> we'll talk about the wikichem idea when antony is available
13:49 <walkerma> Bye RM82! Thanks!
13:49 <+Rifleman_82> :)
13:49 <+Physchim62> bye
13:49 <+Physchim62> thanks
13:49 <+Rifleman_82> sorry can't really help the discussion, but where i can help i'll try
13:49 <+Rifleman_82> cya
13:50 -!- Rifleman_82 [n=blahblah@wikipedia/Rifleman-82] has quit []
13:51 <walkerma> Right, we should close soon, I think
13:51 <walkerma> As I see it, urgent issues are:
13:52 <walkerma> What PC mentioned, find a format for bot upload and get a bot written. Do we have anyone except Beetstra who can do this?
13:52 <walkerma> And second, get curating the data by hand
13:52 <walkerma> PC has done the inorganics, but the organics still need work
13:53 * dmacks_logging has completely zero time for the next few weeks:(
13:53 <+Physchim62> the second might be more important than the first
13:53 <walkerma> I think Antony and I have done around 2000-3000
13:53 <walkerma> but perhaps 2000 of the original list remain
13:53 <walkerma> Dmacks - I understand - exam time
13:54 <walkerma> I'll email RM and ask him for help, but if we just keep a steady stream of releases, 500 at a time - I think that will be fine.
13:54 <walkerma> We've waited since January for this, after all
13:55 <+dmacks_logging> I experimented with the "external indexable page" alternative to persondata, google likes it.
13:56 <walkerma> What is that? Please forgive my ignorance
13:56 <walkerma> Is it in straight HTML, not wiki?
13:57 <+dmacks_logging> Pushing the non-volatile and/or "long strings that users don't need to see" off of the chemical's main page.
13:57 <+dmacks_logging> It's purely a wiki game.
13:57 <walkerma> The data pages?
13:57 <+dmacks_logging> Bonus: allows bot to monitor/sync changes to that data without seeing "changes" to other parts of the article.
13:58 <walkerma> Or something else?
13:58 <+dmacks_logging> Something like the data pages, but for infobox info too (solves the "inchi are long, break the layout")
14:00 <+dmacks_logging> (googling for InChI=1/C6H7NO3S/c7-5-1-3-6(4-2-5)11(8,9)10/h1-4H,7H2,(H,8,9,10)/f/h8H gets you one click from the chemical)
14:00 <+Physchim62> I must go
14:00 <+Physchim62> ttfn
14:00 <walkerma> OK, bye, and thanks!
14:01 -!- Physchim62 [n=Physchim@unaffiliated/physchim62] has quit ["What did you say this button does?"]
14:02 <walkerma> That sounds interesting, dmacks_logging. Can YOU write a bot to handle that?
14:02 -!- itub [n=tubert@lalo.chemie.unibas.ch] has left #wikichem []
14:03 <+dmacks_logging> Probably not too hard (don't have to parse all the useless parts of the wikipages (i.e., any actual prose:). Have never tried to write a bot, it's top of my things-to-play-with-after-finals.
14:04 <walkerma> Well, if you could do that after finals it would be great! BTW, I'll be travelling to see family in England right after our finals, and I'll mostly be offline for three weeks May 22- June 14.
14:04 <walkerma> I'm sure you could get permission for such a bot, if you're willing to write one
14:05 <+dmacks_logging> Okay, I'm off for the week after Labor Day and then will work on it.
14:06 <+dmacks_logging> And should be off now too. Thanks as usual for chairing:)
14:07 <walkerma> Wonderful! I think we have a workable plan then - if all goes well, we can start uploading the data during June. You and PC should probably discuss the options of Persondaten vs separate wikipage. I know he has further plans for the future, and we need to include those.
14:07 <+dmacks_logging> okay
14:07 <walkerma> OK, I want to get on as well. Thanks a lot!
--- Log closed Tue May 06 14:07:57 EDT 2008