User talk:Jakob.scholbach/Archives/2008/May

From Wikipedia, the free encyclopedia

This is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.

Zeteo still rules, id suggestions

Howdy, I just wanted to mention again zeteo is extremely helpful and I appreciate your work on it.

Since you might be coding some feature requests, I thought I would throw in some easy ones. Can we have separate fields for the MathSciNet review number and the Zentralblatt review number? Sometimes both reviews are nice, and sometimes the id field is already overloaded. If you still have room, you could consider a PubMed id too. If you still need something to code for zeteo, you could write auto-converters for the citations that already have {{MathSciNet}} templates.

Thanks again. JackSchmidt (talk) 20:53, 21 April 2008 (UTC)

Thanks for your feedback. The reason for the current structure is that I didn't want a field for every template that is out there (I'm currently adding the physics references, and they probably have their own templates as well). I see the use of the Zentralblatt and PubMed ids, but what prevents us from putting every extra id of this kind into the "additional id" field? This gets exported to "id" in the citation template, which is where I think it should show up. Let me ask it this way: where do you want the PubMed id to show up in the exported reference template? For example

Feynman, Richard P. (1995), Six Easy Pieces, The Penguin Group, MR 0123456, arXiv:0123456, ISBN 978-0-14-027666-4

would be the reasonable output, I guess.

What I could do is show the several additional ids in the zeteo interface in a more proper way. Are you talking about that, too? Perhaps you could give me a sample entry which causes you pain :) ? Jakob.scholbach (talk) 21:11, 21 April 2008 (UTC)
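A minimal sketch of the export discussed above, assuming the separate identifiers are kept in a small mapping and joined into the single "id" string of the exported reference. The field names (`mr`, `zbl`, `pmid`, `arxiv`) and the fixed ordering are assumptions for illustration, not zeteo's actual schema.

```python
# Hypothetical field names and display formats; zeteo's real schema
# is not shown in this discussion.
ID_FORMATS = [
    ("mr", "MR {}"),
    ("zbl", "Zbl {}"),
    ("pmid", "PMID {}"),
    ("arxiv", "arXiv:{}"),
]


def join_ids(ids):
    """Render the identifiers that are present, in a fixed order."""
    parts = [fmt.format(ids[key]) for key, fmt in ID_FORMATS if ids.get(key)]
    return ", ".join(parts)


# The two ids from the Feynman example above.
print(join_ids({"arxiv": "0123456", "mr": "0123456"}))
```

Keeping the ids apart internally and joining them only at export time would let the form show one small box per id while the citation template still receives a single "id" value.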

Yes, exactly what I was thinking, both good and bad. In some sense, it is just a user interface problem since the id field on the web form is too small for adding 3 or 4 ids. With fields for doi, isbn, oclc, and url, there is rarely need for more than 2 more ids. I agree it is difficult to know when to stop adding fields (maybe that time has already come).

Perhaps the form could recognise templates, and display/store them in separate fields, but have only one "Add new id" box?

Often I need to implicitly use the review to source my opinion of the work, and so it seems fair to give both the MR and the ZBl version of the story. Some older famous papers are done retrospectively on MR, but the original review appears from JFM on ZBl. I assume the same is true in BioStats, where a paper may be given a favorable review in only one of MR versus PubMed, or perhaps a Cold War article is favorably reviewed in exactly one of MR versus the Russian RZ. Practically speaking, I often use both because the ZBl review is well written, but in German (on the English Wikipedia), while MR just plagiarised the abstract.

I think it is a good idea to separate them by meaning, but I would not argue. For instance, some papers are available from arxiv, via DOI, and from the author's homepage. I think it is good to store all three URLs since URLs so often stop working. However, if they had been stored as three URLs, it would be hard to tell which one to make the title link to, and which ones to just list as alternates. If they were stored under the arxiv, doi, and url fields (all but the arxiv already exists), then a uniform decision could be made (use DOI, show arxiv largely, and show url last, as it is most likely to break in the future). JackSchmidt (talk) 21:38, 21 April 2008 (UTC)
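The "uniform decision" described above could be sketched roughly as follows, assuming the three locations are stored in separate `doi`, `arxiv` and `url` fields. The function name and the return shape are hypothetical; only the priority order (DOI first, arXiv next, plain URL last) comes from the comment above.

```python
def pick_links(doi=None, arxiv=None, url=None):
    """Return (title_link, alternates), preferring DOI, then arXiv, then URL."""
    candidates = []
    if doi:
        candidates.append(("doi", "https://doi.org/" + doi))
    if arxiv:
        candidates.append(("arxiv", "https://arxiv.org/abs/" + arxiv))
    if url:
        # Listed last because it is the most likely to break in the future.
        candidates.append(("url", url))
    if not candidates:
        return None, []
    return candidates[0], candidates[1:]


# Made-up identifiers, for illustration only.
title_link, alternates = pick_links(doi="10.1000/xyz", url="http://example.org/paper.pdf")
print(title_link)
print(alternates)
```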

OK, I get it. I think I will try to improve the interface to cover the needs you point out. On the other hand, you and your situation seem to be at the forefront of referencing...

My main focus at the moment is filling the db with data. This is also pretty challenging. Perhaps you have a good idea for the following problem: I'm filling the db with lots of references and their corresponding authors. It happens that different persons have the same last name and first name. I need to find a somewhat reliable means to decide, or at least let zeteo make a suggestion, whether the person who is already in the db is probably the same person as the author of the ref I'm about to add. The only information I have at hand is that person A has written something and person B (with the same name as A) has written something else. Currently, I more or less check whether the titles of the works seem to belong together and then I take the same author (if I'm not too lazy or tired to check this...). Right now, this problem does not yet occur too often, but I guess it will become a main challenge. Any ideas how to address this issue? Jakob.scholbach (talk) 10:18, 22 April 2008 (UTC)

This is called name authority. For books of interest to the Library of Congress, you can use their website to search their authority records. These records usually have a list of pseudonyms or alternate names, a list of representative works the author wrote, and some biographical information if available (usually from the book jacket's "About the author"). The records are available systematically (though they are somewhat copyrighted) using Z39.50 or the PHP interface to the software "yaz", from the German National Library and the Library of Congress (and some universities, and some national libraries).

However, for academic journals, especially in mathematics and physical sciences, it is unlikely that the author you need has a record. In the case of mathematics, MathSciNet has done some work to create an authority database. It is far from perfect (especially for 1940-1970), but usually errs on the side of "these are two different people", which is easier to fix later (push two piles together, versus sort one giant pile). For instance, searching for Albert Cohen shows you both "Cohen, Albert" and "Cohen, Albert(F-PARIS6)" who are likely the same person. However, this information does not show up in any of the exported formats, so I am not positive how to use it systematically. ZBl is misbehaving for me today. I forget if it has good name authority work done.

I believe PubMed and ERIC (education literature) have some name authority, but I do not use them much. I don't think "citeseer" or "web of science" does any authority work at all, but I rarely use them, so could easily be mistaken. JackSchmidt (talk) 13:31, 22 April 2008 (UTC)

As for the first problem: I've now provided a quick'n'dirty fix using a multiline text field. So there is visually some more space. Jakob.scholbach (talk) 19:43, 22 April 2008 (UTC)

Is it better now with the additional ids?

As for the naming issue, what do you think of the following idea? To compare the "collinearity" of two references that belong either to one author or to two persons with the same name, make three Google queries, "Author Ref1", "Author Ref2" and "Author Ref1 Ref2", and use a Bayesian probability formula together with a threshold (to be determined...) to decide whether the two references really belong to the same author. Whatever method I use, it must be completely automatic. I have some 3 million reference tags from all of en.wp. After programming and testing some thousand items manually, I want to add the stuff automatically. (Another issue which comes up sometimes is that people use different formats for authors, i.e. one has to be able to distinguish between first names and last names, but I think I can do this with an additional database of first names.) So, still a lot of work. But the more often I use zeteo, the greater I think the concept is... Jakob.scholbach (talk) 17:04, 29 April 2008 (UTC)
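One way the three hit counts might be combined is sketched below. This is not the Bayesian formula alluded to above, which is never spelled out. It swaps in a simple overlap ratio, the joint count divided by the smaller of the two single counts, as a rough estimate of how often the rarer reference appears together with the other one. The hit counts are plain inputs (no search engine is queried here), and the default threshold of 0.2 is an arbitrary placeholder.

```python
def overlap_score(hits_ref1, hits_ref2, hits_both):
    """Share of the rarer reference's hits that also mention the other one."""
    smaller = min(hits_ref1, hits_ref2)
    if smaller == 0:
        # No hits at all for one reference: no evidence either way.
        return 0.0
    # Search counts are approximate, so cap the ratio at 1.
    return min(hits_both / smaller, 1.0)


def same_author_guess(hits_ref1, hits_ref2, hits_both, threshold=0.2):
    """Suggest 'same author' when the overlap reaches the (assumed) threshold."""
    return overlap_score(hits_ref1, hits_ref2, hits_both) >= threshold


# Made-up counts for "Author Ref1", "Author Ref2" and "Author Ref1 Ref2".
print(overlap_score(200, 50, 10))
print(same_author_guess(200, 50, 1))
```

Dividing by the smaller count keeps a very common title from swamping the score, but it does nothing about dirty source pages that list unrelated namesakes together.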

Your basic idea is fine: take a large body of work, and compare frequencies. This is used for automatic assignment of keywords, and there has been a lot of research on it. However, it is important to choose the body of work carefully.

The problem with data mining from google is that its data is dirty. You might be using a page called "Strange coincidences in author names, a list of completely unrelated people who have the same name and their complete works" and not know it. More realistic examples might be wikipedia database dumps, or google books results. There are a few ring theorists with the same name, and some textbooks (indexed by google books) cite both and even have a little paragraph or two about each. Google could easily return a hit for Author Ref1 Ref2 even if you included a biography. In other words, using google is very likely to mush piles together, not artificially split them. It is virtually impossible to split them (if there are k piles of n titles each, then mushing is O(k) and de-mushing is O(nk)).

If you could reach an agreement with the AMS (or maybe Zentralblatt?) about using their author database, I think the results would be far superior. Basically, in most respectable fields virtually ALL journal articles are indexed. I believe medical publications back to the middle ages are all indexed (though not all are available digitally, all were at the very least collected by the people behind pubmed). In other words, you may not need to create the name authority data, you may only need to query it.

If you do need to create it, then your idea is probably sound, especially if you can somehow restrict your searches to cleaner data. It will definitely produce errors that are very hard to correct, but that is unavoidable. JackSchmidt (talk) 13:30, 1 May 2008 (UTC)

Hm, OK. The thing is that I would need a database containing authors of all domains, not only math. The math part is already done. I have played around with the Google idea a bit. From a few examples, I see that it does not seem to work that well. The problem is that some titles occur extremely often, which distorts the calculation. Other cases exhibit no hits at all (even when two references of apparently the same author are combined). I will think about a semantic approach. It would already be very good if I'm able to distinguish a work in medicine from something in physics, say. Then I may in the worst case glue different medical authors (with the same name) together, but this may not be too bad a drawback. I can't believe that it's impossible to reliably distinguish Observational Astrophysics from Peripartum concentrations of beta endorphin and cortisol and maternal mood states (both authored by "Smith, R"; in one case it was fortunately "Smith, Robert", though). By the way, even PubMed is not able to distinguish between authors with the same name ("Smith, R" gives 2400 hits, which cannot all be the same person). Jakob.scholbach (talk) 18:27, 1 May 2008 (UTC)
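A toy sketch of the coarse "medicine versus physics" split mentioned above, assuming nothing more than a hand-made keyword list per field. The keyword sets are invented for illustration and far too small for real use. A real system would need a proper vocabulary or subject thesaurus.

```python
import re

# Hypothetical, hand-picked keywords per field (illustration only).
FIELD_KEYWORDS = {
    "physics": {"astrophysics", "quantum", "relativity", "particle"},
    "medicine": {"cortisol", "endorphin", "maternal", "peripartum", "clinical"},
}


def guess_field(title):
    """Return the field whose keywords match the title best, or None."""
    words = set(re.findall(r"[a-z]+", title.lower()))
    scores = {field: len(words & kw) for field, kw in FIELD_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def fields_differ(title_a, title_b):
    """True only when both titles get a field and the fields disagree."""
    a, b = guess_field(title_a), guess_field(title_b)
    return a is not None and b is not None and a != b


print(guess_field("Observational Astrophysics"))
```

With such a split, two works by "Smith, R" would be kept apart only when `fields_differ` is true. Everything else would still need a finer test.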

I could not find anything on PubMed to sort the various R. Smiths. PubMed is a combination of several databases, and sometimes it is possible to do more refined searches when restricted to a single database (for instance on Academic Search Premier, which is good for social sciences). I wasn't able to find such a feature on PubMed today, but I wrote to my online database and medical librarianship profs to ask. Unfortunately, I don't recall either having name authority as a huge hobby, but both have pretty substantial knowledge, so they might be able to suggest something. If it is still a problem in June, I'll look through the old MADS archives and see if there is someone there who has done some work on it, but it is a long shot, as MADS never took off like MODS. JackSchmidt (talk) 18:57, 1 May 2008 (UTC)

One other idea that might interest library science people more is deciding if two citations are on the same basic subject (name authority is just hard, I don't think research will change that, but subject authority, people love that). I *know* the pay version of PubMed has some amazing subject cataloging done. The subject terms are arranged in a tree, and you can ask for articles that match a subject term or any of its more specific incarnations. The principal rule in subject cataloging is to choose the (few) terms which are exactly as specific as the article itself. However, most search engines force you to move along the hierarchy manually. If you could navigate the subject tree (which you can in the pay version), then you could take an article, move all of its subject terms up a few levels in generality, and then search down for the article you've got in hand. That would at least group the papers roughly by subject. So perhaps there are still 5 distinct R. Smiths in endocrinology, but we've gotten rid of 30 others. The main problem is whether the subject tree ("thesaurus") is available for a reasonable price in a reasonable format. MeSH probably has details, but I have to run. JackSchmidt (talk) 19:06, 1 May 2008 (UTC)

LocatorPlus.gov has some alternative databases. NLM's card catalog has some authority work done, and covers some journals. You'll probably have to test it to see if the coverage in your sample is good enough. I have found no more than three persons under one name header, and have found a few true authority headings. I have not actually found the place to display the whole authority record, but I suspect it is there.

There is the MeSH browser at http://www.nlm.nih.gov/mesh/MBrowser.html to help out with the subject headings. Search for Simvastatin, then click on the tree number D02.455.426.559.847.638.400.900 to see the concept tree. JackSchmidt (talk) 15:54, 2 May 2008 (UTC)
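The "move the subject terms up a few levels, then search down" idea can be sketched on dotted tree numbers like the one above, assuming each article already comes with its list of tree numbers. Dropping trailing segments generalizes a term, and a prefix match at a segment boundary checks whether another term lies under it. The function names and the default of two levels are assumptions, and the second tree number in the usage lines is made up for illustration.

```python
def generalize(tree_number, levels):
    """Drop the last `levels` segments, keeping at least the top segment."""
    parts = tree_number.split(".")
    keep = max(1, len(parts) - levels)
    return ".".join(parts[:keep])


def is_under(tree_number, ancestor):
    """True if tree_number equals ancestor or lies below it in the tree."""
    return tree_number == ancestor or tree_number.startswith(ancestor + ".")


def roughly_same_subject(terms_a, terms_b, levels=2):
    """Generalize article A's terms, then look for article B's terms below them."""
    ancestors = {generalize(t, levels) for t in terms_a}
    return any(is_under(t, a) for t in terms_b for a in ancestors)


simvastatin = "D02.455.426.559.847.638.400.900"
made_up = "D02.455.426.559.847.638.400.123"  # hypothetical sibling term
print(generalize(simvastatin, 2))
print(roughly_same_subject([simvastatin], [made_up]))
```

Grouping this way only narrows the candidates. Several distinct authors with the same name can still fall under the same generalized subject.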

Categories: User talk archives
