User:ProteinBoxBot

From Wikipedia, the free encyclopedia

Proposal overview (brief version)

This bot will create ~10,000 pages corresponding to mammalian genes. Each new page will be seeded with content from databases in the public domain. This content will include information about the gene's symbol, description, function, genomic location, structure and identifiers. ITK (gene) shows how each page will look.


ProteinBoxBot specs

  • Content has already been assembled as part of a non-WP project. Data will be provided to ProteinBoxBot as an XML or CSV file.
  • For each mammalian gene, a new gene page will be created that corresponds to the HUGO-approved symbol.
    • If a page with that name already exists, the gene will be flagged for manual review.
  • A protein infobox will be created and populated with relevant data. (Manually created example: ITK (gene))
  • A redirect will be created from the full gene name. (For example: IL2-inducible T-cell kinase)
    • If a page with that name already exists, the gene will be flagged for manual review.


  • In the trial phase, only 10 gene pages will be created.
  • The bot will check User_talk:ProteinBoxBot and stop if there are any new messages.
  • The bot will cap edits at 10 per minute.
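
The specs above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the bot's actual code: the CSV column names (`symbol`, `name`), the helper names (`load_records`, `plan_pages`, `EditThrottle`, `run`), and the choice to still create the gene page when only the redirect title is taken are all hypothetical. The actual wiki calls are left as caller-supplied functions (`apply_edit`, `has_new_messages`).

```python
import csv
import io
import time
from collections import deque


def load_records(csv_text):
    """Parse CSV text; the 'symbol' and 'name' columns are assumed, not specified."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def plan_pages(records, existing_titles):
    """Return (actions, flagged) following the page and redirect rules in the specs."""
    taken = set(existing_titles)
    actions, flagged = [], []
    for rec in records:
        symbol, full_name = rec["symbol"], rec["name"]
        if symbol in taken:
            # A page with the HUGO symbol as its title already exists.
            flagged.append((symbol, "gene page title already exists"))
            continue
        actions.append(("create", symbol))
        taken.add(symbol)
        if full_name in taken:
            # Assumption: the gene page is still created; only the redirect is held back.
            flagged.append((symbol, "redirect title already exists"))
        else:
            actions.append(("redirect", full_name, symbol))
            taken.add(full_name)
    return actions, flagged


class EditThrottle:
    """Allow at most max_edits edits within any sliding window of `window` seconds."""

    def __init__(self, max_edits=10, window=60.0, clock=time.monotonic, sleep=time.sleep):
        self.max_edits, self.window = max_edits, window
        self.clock, self.sleep = clock, sleep
        self.stamps = deque()

    def _prune(self, now):
        # Forget edits that have aged out of the window.
        while self.stamps and now - self.stamps[0] >= self.window:
            self.stamps.popleft()

    def wait(self):
        now = self.clock()
        self._prune(now)
        if len(self.stamps) >= self.max_edits:
            # Sleep until the oldest edit in the window expires.
            self.sleep(self.window - (now - self.stamps[0]))
            now = self.clock()
            self._prune(now)
        self.stamps.append(now)


def run(actions, apply_edit, has_new_messages, throttle):
    """Apply actions in order; stop as soon as the bot's talk page has new messages."""
    applied = []
    for action in actions:
        if has_new_messages():
            break
        throttle.wait()
        apply_edit(action)
        applied.append(action)
    return applied
```

For the trial phase, one option would be to pass only the first 10 records to `plan_pages`, which guarantees that no more than 10 gene pages are created.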


Proposal overview (full version)

The text below is taken, in part, from previous discussions on Wikipedia:WikiProject_Molecular_and_Cellular_Biology/Proposals and User_Talk:WillowW.

There is consensus that there are roughly 25,000 – 30,000 genes in the human genome, and a comparable number for all “close” relatives (primates, mouse, dog, rat, etc.). However, a big problem is that not everyone agrees on the exact list of genes, and moreover, people can’t agree on what to call genes that everyone agrees exist. ITK (gene) is a reasonable example: the page links to many different databases, each of which has a different ID for the same gene. Keeping track of all the cross-references between databases is quite a chore, so many “gene portals” have been created to keep those cross-references up to date as each database evolves and new ones are added. We at GNF have one too (called SymAtlas), and we are somewhat unusual in that we also use our gene portal to present data (primarily gene expression data) that we’ve generated and released into the public domain.

SymAtlas (and other gene portals) are great at displaying structured data – information stored in tables and databases. But they are not good for storing (much less collecting and displaying) “free text” information, which is of course a strength of a wiki. My proposal is pretty straightforward. We can take our structured content in SymAtlas (which we’ve collected and maintain from a large number of public databases) and use these data to seed protein infoboxes for all genes en masse. We’ll also hyperlink from SymAtlas to Wikipedia to give our (somewhat sizeable) user community a place to add that “free text” knowledge. The stubs will hopefully lower the barrier for SymAtlas users to contribute (since they’ll be editing a page rather than creating one). In turn, the Wikipedia community can contribute the extensive editing, beautifying, vandalism-fighting, and everything-else expertise (and also the domain knowledge contributed by the MCB project) and really help things take off.

By analogy, biology journals often publish “review articles” which summarize the current knowledge in the literature for a particular gene or gene family. I would love it if the Wikipedia community maintained (and SymAtlas linked to) a continually-updated review article for every gene in the mammalian genome.

The game plan? Here are my current thoughts...

  • Create test page for approval of stub design (currently, ITK (gene) is the guinea pig)
  • Create bot to take structured content (from XML or database dump) and create WP content. See preliminary design specs below for the bot.
  • Create/append 5-10 pages using bot -- again, get consensus on approval
  • Let bot run loose...


Anyone interested in helping make this a reality? Let me know...


Other notes and future ideas

General

  • Add note to talk page saying changes were made, leave comments (somewhere?)
  • Mechanism for users to interrupt actions of bot
  • List of changes on bot talk page?
  • Defining and adding categories?
  • Store update history in database?
  • adding links to page (e.g., "ITK") from alternate symbols (e.g., EMT; LYK; PSCTK2; MGC126257; MGC126258) and full gene name (e.g., IL2-inducible T-cell kinase)
    • is redirecting from alternate symbols really a good idea? How would one list ITK on the EMT disambiguation page?
  • changing redirects so that primary title is HGNC name
    • maybe just flag these for manual inspection
  • modify orthologs box to automatically adjust rows and columns based on data
  • possibly add a comment to the protein box area saying that changes (to the protein box only) will be overwritten by the next bot update; this may save us from having to worry about manual edits -- AND/OR -- allow users to manually enter a comment in the protein box to prevent the bot from overwriting it
  • get structure image from RCSB
    • not sure yet how to get links from genes to PDB entries
    • RCSB public domain license is here.
  • add a comment <!--Add additional text here--> to make it clear where people can/should edit...
  • create a WP category for every GO category?
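
One way the overwrite-warning and opt-out comments floated above might work is sketched below. Everything here is an assumption for illustration: the marker strings (`BEGIN`, `END`, `OPT_OUT`) and the function names are invented, and no actual template or bot convention is implied. Only the `<!--Add additional text here-->` hint comes from the notes themselves.

```python
# Hypothetical marker comments; none of these strings are an agreed convention.
BEGIN = ("<!-- ProteinBoxBot: begin; changes inside this box "
         "will be overwritten by the next bot update -->")
END = "<!-- ProteinBoxBot: end -->"
OPT_OUT = "<!-- ProteinBoxBot: do not overwrite -->"
EDIT_HINT = "<!--Add additional text here-->"


def wrap_box(box_wikitext):
    """Surround the bot-generated box with markers and add the edit hint after it."""
    return f"{BEGIN}\n{box_wikitext}\n{END}\n{EDIT_HINT}\n"


def replace_box(page_text, new_box):
    """Swap the bot-managed region for new_box unless an editor has opted out.

    Returns (page_text, status); text outside the markers is never touched.
    """
    start = page_text.find(BEGIN)
    end = page_text.find(END, start + len(BEGIN)) if start != -1 else -1
    if start == -1 or end == -1:
        return page_text, "no_managed_box"
    region = page_text[start + len(BEGIN):end]
    if OPT_OUT in region:
        # An editor placed the opt-out comment inside the box: leave it alone.
        return page_text, "skipped_opt_out"
    updated = page_text[:start] + BEGIN + "\n" + new_box + "\n" + page_text[end:]
    return updated, "replaced"
```

Because only the text between the two markers is replaced, free text that users add after the box would survive a bot update under this scheme.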


Actions

  • If no page exists, create page and add GNF protein box
  • Else if page exists but it’s a redirect, continue on target page
  • Else if page exists but no protein box, flag for inspection to determine if it’s the gene page
    • on approval that page is relevant to gene, add GNF protein box
    • on approval that page is not relevant to gene, create new “NAME_(gene)” page and set up disambiguation page
  • Else if page exists and has protein box, flag for manual inspection/integration?
  • For any changes, store version of data
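
The decision tree above might be sketched as a small function. This is a sketch under assumptions, not a definitive implementation: the `Page` class, the `BOX_MARKER` template name, the action labels, and the redirect hop limit are all hypothetical, and pages are modelled as an in-memory dictionary rather than fetched from the wiki. A redirect whose target does not exist is flagged rather than created, which is a choice the list above does not address.

```python
from dataclasses import dataclass
from typing import Optional

BOX_MARKER = "{{GNF_Protein_box"  # hypothetical template name, for illustration only


@dataclass
class Page:
    title: str
    text: str = ""
    redirect_to: Optional[str] = None  # target title if this page is a redirect


def decide_action(title, pages, max_hops=5):
    """Return (action, title) for one gene, following redirects to the target page."""
    followed_redirect = False
    for _ in range(max_hops):
        page = pages.get(title)
        if page is None:
            if followed_redirect:
                # Assumption: a redirect to a missing page needs a human look.
                return ("flag_broken_redirect", title)
            return ("create_page_with_box", title)
        if page.redirect_to is not None:
            title, followed_redirect = page.redirect_to, True
            continue
        if BOX_MARKER in page.text:
            return ("flag_existing_box", title)
        # Page exists without a protein box: someone must decide if it is the gene page.
        return ("flag_relevance_check", title)
    return ("flag_redirect_loop", title)


def resolve_review(title, is_gene_page):
    """Follow-up actions once a reviewer has answered the relevance check."""
    if is_gene_page:
        return [("add_box", title)]
    return [("create_page_with_box", f"{title}_(gene)"),
            ("set_up_disambiguation", title)]
```

Every returned action would also be a natural place to store the version of the data that was written, as the last bullet suggests.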


Updates – quarterly?

  • Compare update to previous version – if no changes then stop
  • Compare previous version to current Wikipedia version (of protein box)
    • if same, then erase and update
    • if different, then flag for manual inspection/integration
  • if many things are flagged, then possibly develop an automated merging protocol
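
The update comparison above amounts to a three-way check between the previous bot version, the new data, and the protein box currently on Wikipedia. A minimal sketch, assuming the three versions are available as wikitext strings and that whitespace-only differences should not count as manual edits (both assumptions; the function names are hypothetical):

```python
def normalize(box_text):
    """Strip surrounding whitespace per line so trivial edits compare as equal."""
    return "\n".join(line.strip() for line in box_text.strip().splitlines())


def plan_update(previous_bot, new_data, current_wiki):
    """Return 'stop', 'overwrite', or 'flag' for one protein box."""
    prev, new, cur = (normalize(t) for t in (previous_bot, new_data, current_wiki))
    if new == prev:
        return "stop"        # nothing changed upstream since the last run
    if cur == prev:
        return "overwrite"   # nobody edited the box, so it is safe to replace
    return "flag"            # box was edited by hand: manual inspection/integration


def flagged_fraction(decisions):
    """Share of boxes flagged in a batch, to judge whether automated merging is needed."""
    return decisions.count("flag") / len(decisions) if decisions else 0.0
```

A high value from `flagged_fraction` over a quarterly run would be one signal that an automated merging protocol is worth developing.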