Wikipedia:WikiProject Chemistry/IRC discussions/6 May 2008

--- Log opened Tue May 06 12:04:12 EDT 2008

12:04 -!- You're now known as dmacks_logging

12:04 <+Physchim62> who is logging?

12:05 <+Physchim62> that answers!

12:05 -!- mode/#wikichem [+v dmacks_logging] by ChanServ

12:05 <walkerma> dmacks_logging perhaps?

12:05 <+dmacks_logging> heh:)

12:05 <+Rifleman_82> only logging?

12:05 <+dmacks_logging> I'm here for a little bit.

12:05 <walkerma> OK, has everyone had a chance to look at the CAS SDF file?

12:06 <+dmacks_logging> ya

12:06 <+Rifleman_82> the last i saw were the strange barium nitrate structures

12:06 <+Physchim62> me, yes, it is one-quarter included in CAVer

12:06 <+Rifleman_82> CAVer?

12:06 <walkerma> Have you looked at the whole file, Andrew?

12:07 <+Physchim62> CAVer is my personal database, with which I try to solve problems which can't be done openly on WP

12:07 <+dmacks_logging> (tangent: these ionic-form messes might explain why SciFinder gives me such incorrect results when I search for ion-related structures!)

12:08 <+Rifleman_82> no, i don't have an SDF reader. only been following the email discussions

12:08 <walkerma> Yes - even if you look at the print copy of Chem Abs, they've got a policy of handling things that way

12:08 <+Physchim62> yes, it's utter crap. I don't know how CAS made the file, but it is a simple error

12:08 <walkerma> (With neutral forms, not ionic)

12:08 <+Rifleman_82> any reader to recommend?

12:08 <walkerma> It's not a mistake, it's their policy

12:09 <+Rifleman_82> the policy is a mistake?

12:09 <+Rifleman_82> i've come across this on scifinder searches and it was initially quite confusing

12:09 <+Physchim62> for those who have received ChemSpiderMan's email, I, personally, do not think this is a problem for inorganic comounds

12:09 <+Physchim62> *compounds

12:10 <walkerma> Andrew: The policy is a bit like our policy where we say, "We will represent charged structures as neutral unless there is a good reason to say otherwise"

12:10 <+Physchim62> RM, yes, they have a strange policy ;)

12:10 <+dmacks_logging> It's consistent, though it's "less than ideal"...um, for almost any use I can conceive.

12:11 <walkerma> It probably made sense in 1955, or whenever they made the policy!

12:11 <walkerma> But as PC points out, it shouldn't be a problem

12:11 <+Physchim62> the real reason is that "dinitrate" in CAS nomenclature means something different from "bis(nitrate)" in IUPAC nomenclature

12:12 <+Physchim62> dinitrate = N2O5(4-)

12:13 <+dmacks_logging> Interesting, PC!

12:13 <+Physchim62> it is not a problem, I have already verified the inorganic structures vs. my database

12:14 <walkerma> I checked with Antony, and he agreed that we should be able to use names for many inorganics, to cross connect CAS nos with our records. So, PC, your news is very good!

12:14 <walkerma> To find out that we can in practice, not just in theory

12:14 <+Physchim62> inorganics are about 10% of their dataset, which is equivalent to their weight in WP

12:15 <walkerma> I thought we had more than that?

12:15 <walkerma> On WP

12:16 <+Physchim62> depends how you count it

12:16 <walkerma> Wikipedia:WikiProject Chemicals/Inorganics lists 1889 inorganics

12:16 <walkerma> Wikipedia:WikiProject Chemicals/Inorganics

12:19 <+Physchim62> there is nothing inconsistent, just a question of what you define as an "inorganic". If I take the strict IUPAC definition, which I must do to create PINs, inorganics are about 10% of our articles

12:20 <walkerma> Do you have any idea how many inorganics are on their list but not on ours?

12:21 <walkerma> And how many of theirs WERE already in CAVer?

12:21 <walkerma> (CAVer = PC's collection built by combining Andrew's SDF + the inorganics list + some other data)

12:22 <+Physchim62> not certain because I hadn't completed checking articles

12:22 <walkerma> (CAVer should have nearly all compounds from PC)

12:22 <walkerma> Sorry, all compounds from WP!

12:22 <+Rifleman_82> i created an SDF?

12:22 <+Rifleman_82> oh, my list

12:22 <+Rifleman_82> emmanuel's too?

12:23 <walkerma> Sorry, I meant Antony's SDF, I'm getting confused!

12:23 <walkerma> I notice that barium nitrate (in their listing) is named kinda like the way they represent it: "nitric acid, barium salt". Is that causing us to miss it when we try to connect it to "barium nitrate"?

12:23 <+Physchim62> that is not a problem

12:24 <walkerma> How do you make the connection - is it via pseudonym, or via formula, then manual curation?

12:26 <+Physchim62> OK, to answer the question quickly, it's about 100: go to http://en.wikipedia.org/wiki/User:Physchim62/Wishlist and see the items without ICSC links: those are the inorganic structures from the CAS list for which I have not yet found articles

12:27 <+dmacks_logging> All three anisidines are stubbed on anisidine

12:28 <+Physchim62> for compounds without carbon atoms, there is no problem

12:28 <+Physchim62> yes, that's a problem with my database

12:29 <+dmacks_logging> AIBN is listed without parens.

12:29 <+Physchim62> again

12:30 <walkerma> So, you're using formula to link CAS no. to article?

12:30 <+dmacks_logging> Ah, compounds without C *are* the problem?

12:30 <+Physchim62> I have concentrated on inorganics, because it's my speciality and because Antony was doing organics

12:30 <+Physchim62> no, *I* am linking by name

12:31 <+dmacks_logging> Oh wait, /me was looking for ICSC-no-Article, not no-ICSC:/

12:31 <+Physchim62> ?

12:31 <+Physchim62> I do that as well!

12:32 * dmacks_logging completely confused what the problem is here that we're trying to solve.

12:32 * Physchim62 sympathises

12:32 <walkerma> The problem is to ensure that the CAS no that is given to us by CAS is for the same compound that we have on WP

12:33 <+Physchim62> OK, good call chair

12:33 <+Physchim62> for example, Warfarin

12:33 <+Rifleman_82> and it's complicated by CAS' convention, e.g. barium nitrate

12:34 <+Rifleman_82> ?

12:34 <walkerma> Remember that the same "compound" may be represented by many different CAS nos.

12:34 <+Physchim62> Warfarin has the number 81-81-2, according to CAS

12:36 <+Physchim62> it also has other CAS numbers for its two enantiomers

12:37 <walkerma> One for the racemic mixture, another for "unspecified"

12:37 <walkerma> Even with NaCl, I'd bet they have one CAS# for halite, another for sodium chloride

12:37 <walkerma> So we must be very careful or our validation effort will be worthless
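
[Aside, not from the log: CAS Registry Numbers carry a check digit, so a mistyped number can often be caught before any manual comparison. The sketch below is a minimal Python illustration; the function name `is_valid_casrn` is an assumption made here, not something the participants used. A passing checksum only shows that the number is well formed, not that it belongs to the compound or stereochemical form described in a given article.]

```python
import re


def is_valid_casrn(casrn: str) -> bool:
    """Return True if the string looks like a CAS number and its check digit is correct."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", casrn)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost digit
    # before the check digit, then compare the sum modulo 10.
    total = sum(weight * int(d) for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == check_digit


print(is_valid_casrn("81-81-2"))    # warfarin
print(is_valid_casrn("7647-14-5"))  # sodium chloride
print(is_valid_casrn("81-82-2"))    # a mistyped variant with one digit changed
```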

12:38 <walkerma> Also, on a related topic (partly inspired by this CAS to WP problem as well), see Antony's blog:

12:38 <walkerma> http://www.chemspider.com/blog/care-in-nomenclature-handling-and-why-visual-inspection-will-remain.html

12:38 <+Physchim62> it won't be worthless

12:39 <+Rifleman_82> just to check, warfarin as a drug is racemic?

12:39 <+Physchim62> *I* don't know

12:39 <+Rifleman_82> that's what it says on Warfarin

12:39 <+Physchim62> but there are three different CAS numbers

12:40 <walkerma> From WP:"Warfarin consists of a racemic mixture of two active optical isomers - R and S forms - each of which is cleared by different pathways. S-warfarin has five times the potency of the R-isomer with respect to vitamin K antagonism.[5]"

12:40 <+dmacks_logging> Ref 5 concurs that it's an R/S mix.

12:40 <+Physchim62> one of which I got from the CAS list and the other two from the European Chemical Agency

12:41 <+dmacks_logging> (...but only "in roughly equal proportion"...not explicitly 50/50 racemic?)

12:42 <walkerma> dmacks: Is that just legal-ese, for patent purposes, or is it genuine chemistry?

12:42 <+dmacks_logging> That's a quote from the JAmCollCardio article.

12:43 <+dmacks_logging> (which doesn't answer your question very well, I know)

12:43 <walkerma> I just wonder if they're non chemists hedging a bit, that's all!

12:43 <+dmacks_logging> Actually, they also do use the word "racemic"

12:43 <+Physchim62> OK, so what are we meant to be doing?

12:43 <+dmacks_logging> So they're not "playing it safe with legalese", they just don't know the science:)

12:44 <walkerma> Possibly! Anyway, we should get back to the main issue, which is what to do next

12:44 * Physchim62 knows he has the data, but can't find it quickly

12:44 <walkerma> PC: Thanks for all your work so far, and thanks to Antony in absentia

12:45 <+Physchim62> Unless someone says "No"...

12:45 <walkerma> PC: Is there something else you need to raise on that?

12:46 <+Physchim62> I shall take charge of compounds without carbon atoms

12:46 <+Physchim62> which is about 10% of the data set, if you don't count elements

12:48 <+Physchim62> for the rest, I have a list of about 1200 compounds, for which I have been able to verify data against other sources

12:49 <+Physchim62> Antony says he can match 2000 or more to his database

12:50 <+Physchim62> all of this leaves us with a black spot of about 2-3000 compou`ends

12:50 <+Physchim62> *compounds

12:51 <walkerma> I think we should be able to squeeze quite a few more out of the 2-3000

12:52 <walkerma> Once we find ways to connect the CAS name/formula to our name/formula

12:52 <walkerma> But those data will have to be manually curated

12:52 <walkerma> I chatted with Antony last night. He proposes that we aim to "roll out" batches of perhaps 500 compounds at a time

12:52 <walkerma> I think that would be a good idea.

12:53 <walkerma> What do others think of that idea?

12:53 <+dmacks_logging> seems reasonable

12:57 <+Physchim62> it's the only way

12:59 <+Physchim62> I don't know how to do it practically

13:00 <walkerma> So what remains to be done:

13:00 <walkerma> 1. Find which compounds are on both the CAS list and on WP. Antony has done this for organics, and PC is completing the job. Let's call that the CAS/WP Intersection List (CWIL).

13:00 <walkerma> 2. Go through the CWIL manually checking the structures and names to see if they make sense.

13:00 <walkerma> 3. Meanwhile work out a system for uploading the data onto WP, with vandalism protection,

13:00 <walkerma> 4. Once we have a set of 500 checked, we upload it to WP. Rinse and repeat until all of CWIL has been uploaded.

13:00 <walkerma> 5. Then try and check the CAS list further to try and squeeze more matches from it, using manual checks.
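
[Aside, not from the log: steps 1 and 4 above could be sketched as follows. This is a minimal Python illustration under stated assumptions: records are plain dictionaries, the matching key is a lower-cased name (the log mentions names, formulas and InChIs as possible keys), and `build_cwil` and `batches` are hypothetical helper names. Exact-key matching of this kind would miss pairs such as "nitric acid, barium salt" and "barium nitrate", which is one reason the manual check in step 2 remains necessary.]

```python
from itertools import islice


def build_cwil(cas_records, wp_records, key):
    """Pair each CAS-list record with the WP record that has the same key, if any."""
    wp_index = {key(rec): rec for rec in wp_records}
    return [(cas, wp_index[key(cas)]) for cas in cas_records if key(cas) in wp_index]


def batches(items, size=500):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Illustrative records only; the field names are assumptions.
cas_list = [
    {"name": "Warfarin", "casrn": "81-81-2"},
    {"name": "Sodium chloride", "casrn": "7647-14-5"},
]
wp_list = [{"name": "warfarin", "article": "Warfarin"}]

cwil = build_cwil(cas_list, wp_list, key=lambda rec: rec["name"].strip().lower())
print(len(cwil))                                        # 1
print([len(b) for b in batches(range(1200), size=500)])  # [500, 500, 200]
```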

13:00 -!- itub [n=tubert@lalo.chemie.unibas.ch] has joined #wikichem

13:00 -!- mode/#wikichem [+v itub] by ChanServ

13:01 <walkerma> Hi itub! We're one hour into the discussion, I hope you knew...

13:01 <+itub> hi

13:01 <+itub> I didn't know, I just heard about the discussion

13:02 <+Physchim62> the CWIL for inorganic is only about 500 compounds (excluding organometallics and formates, etc.)

13:02 <+itub> I'll read the logs tomorrow, I guess

13:02 <walkerma> Welcome, anyway

13:02 <walkerma> Physchim62: So we could treat the inorganics as one single upload. Could you do the curation on that list?

13:03 <+Physchim62> walkerma, the problem is that we *DON'T* know which compounds are described on WP

13:04 <+Physchim62> I have already volunteered to do that: the inorganic crossover is about 500 compounds

13:04 <walkerma> PC: "we *DON'T* know which compounds are described on WP" - are you referring to the One Page Many Compounds problem?

13:06 <+Physchim62> walkerma, yes, that is *exactly* the problem I'm talking about

13:06 <walkerma> But that shouldn't be a problem within an SDF file, right?

13:06 <walkerma> Where we have an exact CAS# to compound match

13:06 <walkerma> Agreed?

13:07 <+Physchim62> it is still a problem with an SDF file, because we still need to know which compounds are described in which articles

13:08 <+Physchim62> no.

13:08 <walkerma> As I see it, we need simply to validate the SDF file itself, then work out how best to present the data from the SDF file on the appropriate WP page

13:08 <walkerma> Example: Tartaric acid

13:08 <walkerma> If they give us only one CAS# for the unspecified stereochem form

13:08 <+Physchim62> how do you think we find the correct WP page?

13:09 <walkerma> Then the chembox makes it clear that the CAS# refers to that form

13:09 <walkerma> The SDF file already has a link to the WP page in it

13:09 <walkerma> (Our SDF file)

13:09 <+Physchim62> which SDF file?

13:10 <walkerma> Antony's SDF file

13:10 <+Physchim62> CAS have systematically given us racemic forms

13:10 <walkerma> OK, that's probably very good if they are consistent in that

13:11 <+Physchim62> Antony's SDF file is useless to the general user of WP

13:11 <walkerma> Not if we have a bot that is uploading data using the SDF data as its source

13:12 <+Physchim62> Antony's SDF file used WP as its source!

13:12 <walkerma> Or if someone really clever works out how to upload it into the persondaten collection?!!

13:12 <+Physchim62> I know that he has included CAS data into it

13:12 <+dmacks_logging> Aw man. [[Sulfanilic acid]] explicitly states that "it is a zwitterion", and then we use the neutral form in infobox.

13:13 -!- Physchim62 [n=Physchim@unaffiliated/physchim62] has quit ["What did you say this button does?"]

13:13 <+dmacks_logging> Leads to (probably correct) conclusion that mp, solubility, etc aren't for "the compound as drawn":(

13:13 <walkerma> dmacks - remember that we agreed (our policy!) to abolish all zwitterions in structure boxes

13:13 <walkerma> For the sake of consistency

13:13 -!- Physchim62 [n=Physchim@unaffiliated/physchim62] has joined #wikichem

13:13 -!- mode/#wikichem [+v Physchim62] by ChanServ

13:13 <+dmacks_logging> Yup. I'm just noticing an inconsistency that this (good IMO) policy triggered.

13:14 <walkerma> Otherwise you have a curation nightmare- you have to debate every compound that could possibly form a zwitterion

13:14 <walkerma> It's like amino acids, people just have to know that the way it's drawn is just a representation

13:15 <walkerma> Physchim62: To come back to the SDF

13:15 <+dmacks_logging> (sorry for off-topicing)

13:15 <+Physchim62> yep, I'm back, sorry

13:15 <+Physchim62> ;)

13:16 <walkerma> Antony and myself have been checking things in there manually, logging all inconsistencies, errors etc

13:16 <+Physchim62> and...

13:16 <walkerma> I've done #2001- about #3300, I had agreed to do 2001-4000

13:17 <walkerma> Antony has my (long) list of "issues"

13:17 <walkerma> So as I see it, we

13:17 <walkerma> have done the following:

13:17 <+Physchim62> I have one "issue": http://en.wikipedia.org/wiki/Chloramine-T

13:18 <walkerma> WP --> SDF(dirty) --> SDF(clean) --> SDF (clean with CAS)

13:18 <walkerma> We can then do the final step and upload it into WP again - now clean

13:19 <+Physchim62> but the CAS SDF is filthy

13:20 <+Physchim62> both myself and Antony reckon that WP is 95% correct on CASRNs: that is more than can be said for the .sdf file that they sent us

13:20 <walkerma> The CAS numbers should not be filthy!

13:21 <+Physchim62> they are, unless you use them correctly

13:21 <walkerma> Part of our validation process will be to find all CAS nos that don't match with ours - yes, Antony says about 5% - and we may be able to check those with CAS again if necessary.

13:22 <walkerma> Antony said that about 19 out of 20 do match with ours, and we can start uploading those

13:23 <+Physchim62> my *correction rate* for the moment is less than 1/1000

13:23 <+Physchim62> ie, we have had bad CAS nos but not many

13:25 <walkerma> So then, a necessary part of the curation work will be for us to manually check if our CAS # matches with CAS's CAS#. If yes, mark as OK; if no, flag the entry as a problem entry.

13:26 <+Physchim62> if CAS has given us a number which we can't interpret, yes, that's a problem

13:27 <+Physchim62> but CAS has said that it will give us synonyms as well (one per compound)

13:27 <walkerma> But we can just flag that, and then revisit all the flagged entries, right?
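
[Aside, not from the log: the "mark as OK, otherwise flag and revisit" rule described above can be sketched in a few lines of Python. The function name `check_entry`, the field names and the second record are hypothetical and exist only for illustration.]

```python
def check_entry(wp_casrn, cas_casrn):
    """Return "OK" when both sources give the same number, otherwise "FLAG"."""
    if wp_casrn is None or cas_casrn is None:
        # A missing number on either side cannot be confirmed automatically.
        return "FLAG"
    return "OK" if wp_casrn.strip() == cas_casrn.strip() else "FLAG"


# Illustrative entries: "wp" is the number in the article, "cas" the number supplied by CAS.
entries = [
    {"article": "Warfarin", "wp": "81-81-2", "cas": "81-81-2"},
    {"article": "Hypothetical entry", "wp": "81-82-2", "cas": "81-81-2"},
]

# Collect the flagged entries so they can be revisited by hand later.
flagged = [e["article"] for e in entries if check_entry(e["wp"], e["cas"]) == "FLAG"]
print(flagged)
```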

13:27 <+Physchim62> Tony has been using InChIs, I've been using names, with synonyms we would both have another angle of attack on the database

13:29 <+Physchim62> the problem is, Martin, what do *you* want the next step to be

13:29 <walkerma> I'll repost my proposed plan:

13:30 <walkerma> 1. Find which compounds are on both the CAS list and on WP. Antony has done this for organics, and PC is completing the job. Let's call that the CAS/WP Intersection List (CWIL).

13:30 <walkerma> 2. Go through the CWIL manually checking the structures and names to see if they make sense.

13:30 <walkerma> 3. Meanwhile work out a system for uploading the data onto WP, with vandalism protection,

13:30 <walkerma> 4. Once we have a set of 500 checked, we upload it to WP. Rinse and repeat until all of CWIL has been uploaded.

13:30 <walkerma> 5. Then try and check the CAS list further to try and squeeze more matches from it, using manual checks.

13:30 <walkerma> And what we should do before we close, is to agree who will do what.

13:30 <walkerma> PC: It sounds like you have the inorganics portion under control, right?

13:30 <walkerma> Will you have time to complete that?

13:31 <+Physchim62> inorganics are not the problem

13:32 <walkerma> Antony and I were planning on completing the curation of the organics - with help from others, if they want to pitch in

13:32 <walkerma> Antony has done some already and so have I

13:33 <+Physchim62> I have already done points 1 and 2 for inorganics and organometallics, I just need tonupload them

13:33 <+Physchim62> *to upload

13:33 <+Physchim62> and for c1 organics, for that matter

13:34 <walkerma> So we're at #3 for inorganics, then?

13:35 <walkerma> How did the test of the database upload go, PC?

13:35 <walkerma> Is the Persondaten approach going to work?

13:35 <+Physchim62> I would say #4, in that I want to verify safety data at the same time

13:36 <+Physchim62> I have also included ICSCs and the EU database into CAVer

13:37 <+Physchim62> to upload, I need to learn how to write a bot (or get a better internet connection)

13:38 <walkerma> How did you plan to do the upload? Into Persondaten format, or directly into articles, or what?

13:39 <+Physchim62> the persondaten approach will work, yes. I don't know if it's the best solution, but it is obviously an improvement on the current situation

13:39 <+Physchim62> My plan is: to see what happens

13:40 <+Physchim62> I will upload to articles, one-by-one and by-hand

13:40 <walkerma> OK: So we need someone (Beetstra?) to write a bot??

13:41 <+Physchim62> let me do the inorganics first!

13:41 <walkerma> (I think Beetstra is travelling at the moment, though)

13:41 <walkerma> But if we're at #4, don't we need the bot now?

13:42 <+Physchim62> yes, but for only 10% of the compounds!

13:42 <walkerma> Or were you thinking only manual upload, with bots only used to watch the data?

13:42 <+Physchim62> we need the bot for the other 90%

13:43 <+Physchim62> or we need time

13:43 <+Physchim62> ;)

13:43 <walkerma> How many compounds do you have ready for upload?

13:43 <+Physchim62> about 500 inorganics

13:44 <+Physchim62> about the same for organics, but I would prefer to wait on those

13:44 <walkerma> OK, so if we are uploading 500 at a time, how about making upload #1 the inorganics?

13:45 <+Physchim62> well, whatever you do, I'm going to do that

13:46 <+Physchim62> but it is hardly a question of "uploading 1000 at a time"

13:47 <walkerma> But if you upload the first 500 this month, Antony and I can probably have 500 organics totally ready by June

13:47 <walkerma> Ready for upload

13:48 <+Physchim62> if we can get a format which is agreed by the BotOwners ;)

13:49 <+Rifleman_82> gotta go, tired

13:49 <walkerma> OK, that's part of step #3 on the above list, and I think why we need to talk to bot writers now

13:49 <+Rifleman_82> we'll talk about the wikichem idea when antony is available

13:49 <walkerma> Bye RM82! Thanks!

13:49 <+Rifleman_82> :)

13:49 <+Physchim62> bye

13:49 <+Physchim62> thanks

13:49 <+Rifleman_82> sorry can't really help the discussion, but where i can help i'll try

13:49 <+Rifleman_82> cya

13:50 -!- Rifleman_82 [n=blahblah@wikipedia/Rifleman-82] has quit []

13:51 <walkerma> Right, we should close soon, I think

13:51 <walkerma> As I see it, urgent issues are:

13:52 <walkerma> What PC mentioned, find a format for bot upload and get a bot written. Do we have anyone except Beetstra who can do this?

13:52 <walkerma> And second, get curating the data by hand

13:52 <walkerma> PC has done the inorganics, but the organics still need work

13:53 * dmacks_logging has completely zero time for the next few weeks:(

13:53 <+Physchim62> the second might be more important than the first

13:53 <walkerma> I think Antony and I have done around 2000-3000

13:53 <walkerma> but perhaps 2000 of the original list remain

13:53 <walkerma> Dmacks - I understand - exam time

13:54 <walkerma> I'll email RM and ask him for help, but if we just keep a steady stream of releases, 500 at a time - I think that will be fine.

13:54 <walkerma> We've waited since January for this, after all

13:55 <+dmacks_logging> I experimented with the "external indexable page" alternative to persondata, google likes it.

13:56 <walkerma> What is that? Please forgive my ignorance

13:56 <walkerma> Is it in straight HTML, not wiki?

13:57 <+dmacks_logging> Pushing the non-volatile and/or "long strings that users don't need to see" off of the chemical's main page.

13:57 <+dmacks_logging> It's purely a wiki game.

13:57 <walkerma> The data pages?

13:57 <+dmacks_logging> Bonus: allows bot to monitor/sync changes to that data without seeing "changes" to other parts of the article.

13:58 <walkerma> Or something else?

13:58 <+dmacks_logging> Something like the data pages, but for infobox info too (solves the "inchi are long, break the layout")

14:00 <+dmacks_logging> (googling for InChI=1/C6H7NO3S/c7-5-1-3-6(4-2-5)11(8,9)10/h1-4H,7H2,(H,8,9,10)/f/h8H gets you one click from the chemical)

14:00 <+Physchim62> I must go

14:00 <+Physchim62> ttfn

14:00 <walkerma> OK, bye, and thanks!

14:01 -!- Physchim62 [n=Physchim@unaffiliated/physchim62] has quit ["What did you say this button does?"]

14:02 <walkerma> That sounds interesting, dmacks_logging. Can YOU write a bot to handle that?

14:02 -!- itub [n=tubert@lalo.chemie.unibas.ch] has left #wikichem []

14:03 <+dmacks_logging> Probably not too hard (don't have to parse all the useless parts of the wikipages (i.e., any actual prose:). Have never tried to write a bot, it's top of my things-to-play-with-after-finals.

14:04 <walkerma> Well, if you could do that after finals it would be great! BTW, I'll be travelling to see family in England right after our finals, and I'll mostly be offline for three weeks May 22- June 14.

14:04 <walkerma> I'm sure you could get permission for such a bot, if you're willing to write one

14:05 <+dmacks_logging> Okay, I'm off for the week after Labor Day and then will work on it.

14:06 <+dmacks_logging> And should be off now too. Thanks as usual for chairing:)

14:07 <walkerma> Wonderful! I think we have a workable plan then - if all goes well, we can start uploading the data during June. You and PC should probably discuss the options of Persondaten vs separate wikipage. I know he has further plans for the future, and we need to include those.

14:07 <+dmacks_logging> okay

14:07 <walkerma> OK, I want to get on as well. Thanks a lot!

--- Log closed Tue May 06 14:07:57 EDT 2008