User:ProteinBoxBot/Ideas

Proposal overview (brief version)

This bot will create ~10,000 pages corresponding to mammalian genes. Each new page will be seeded with content from databases in the public domain. This content will include information about the gene's symbol, description, function, genomic location, structure and identifiers. ITK (gene) shows how each page will look. The full specifications are listed here.

Odds 'n ends

List of things that need to be done before any further PBB runs.

  • change images to come from EBI
  • turn off summary updates if any wikilinking is detected in PBB_Summary
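
The wikilink check in the second item could be sketched roughly as follows. This is only an illustration, not PBB's actual code: it assumes the bot has the current PBB_Summary text available as a plain string, and the function names are made up for the example.

```python
import re

# Matches an internal wikilink such as [[Kinase]] or [[T cell|T-cell]].
# External-link style markers like [1] are deliberately not matched.
WIKILINK_RE = re.compile(r"\[\[[^\[\]]+\]\]")


def summary_has_wikilinks(summary_text: str) -> bool:
    """Return True if the summary text contains at least one [[...]] link."""
    return WIKILINK_RE.search(summary_text) is not None


def should_update_summary(summary_text: str) -> bool:
    """Hypothetical policy: skip the automated summary update once
    someone has added wikilinks to the summary by hand."""
    return not summary_has_wikilinks(summary_text)
```

A bot run would then call should_update_summary() on the existing summary text before overwriting it, and leave the summary alone when the result is False.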

Proposal overview (full version)

The text below is taken, in part, from previous discussions at Wikipedia:WikiProject_Molecular_and_Cellular_Biology/Proposals and User_Talk:WillowW.

There is consensus that the human genome contains roughly 25,000 – 30,000 genes, with a comparable number in all “close” relatives (primates, mouse, dog, rat, etc.). However, a big problem is that not everyone agrees on the exact list of genes, and moreover, people can’t agree on what to call the genes that everyone agrees exist. ITK (gene) is a reasonable example. You’ll see there links to many different databases, all of which have a different ID for the same gene. Keeping track of all the cross-references between databases is quite a chore, so many “gene portals” have been created, essentially to keep up to date as each database evolves and new ones are added. We at GNF have one too (called SymAtlas), and we are somewhat unique in that we also use our gene portal to present data (primarily gene expression data) that we’ve generated and released to the public domain.

SymAtlas (and other gene portals) are great at displaying structured data – information stored in tables and databases. But they are not good for storing (much less collecting and displaying) “free text” information, which is of course a strength of a wiki. My proposal is pretty straightforward. We can take our structured content in SymAtlas (which we’ve collected and maintain from a large number of public databases) and use these data to seed protein infoboxes for all genes en masse. We’ll also hyperlink from SymAtlas to Wikipedia to give our (somewhat sizeable) user community a place to add that “free text” knowledge. The stubs will hopefully lower the barrier for SymAtlas users to contribute (since they’ll be editing a page rather than creating one). In turn, the Wikipedia community can contribute the extensive editing, beautifying, vandalism-fighting, and everything-else expertise (and also the domain knowledge contributed by the MCB project) and really help things take off.

By analogy, biology journals often publish “review articles” which summarize the current knowledge in the literature for a particular gene or gene family. I would love it if the Wikipedia community maintained (and SymAtlas linked to) a continually-updated review article for every gene in the mammalian genome.

Anyone interested in helping make this a reality? Let me know...


Other notes and future ideas

NOTE: The items below are thoughts for the future and are not included in the initial proposed specs.

  • Upload snapshots of all PDB images -- create a gallery?
  • Load PPI from Entrez Gene User_talk:ProteinBoxBot/Archives/Archive1#Interaction_partners
  • Revise references treatment to allow primary research articles (User_talk:ProteinBoxBot/Archives/Archive1#A_suggestion)
  • Add GeneRIFs and references from Uniprot
  • add reference to GO section of infobox linking Entrez Gene
  • Add a legend to the protein infobox, especially to explain what the expression profiles mean and how they were generated. See User_talk:ProteinBoxBot/Archives/Archive1#Some_comments_and_a_question
  • Add note to talk page saying changes were made, leave comments (somewhere?)
  • import and display EC number
  • import and display protein domain information (through Uniprot/PFAM/COGs) See previous discussion.
  • link to HPRD
  • add MCB template to talk page
  • change images to upload to Wikipedia:Wikimedia Commons
  • Create more precise PDB caption by using the PDB "title"
  • Change PDB image name to correspond to the PDB ID, not the gene Symbol
  • add a "update_PDB_image" tag in PBB_controls so that people can turn off automated edits for that part of the infobox specifically
  • UniProt fields: PFAM, "Protein name", "Synonyms", FUNCTION, DOMAIN, SUBCELLULAR LOCATION, CATALYTIC ACTIVITY, COFACTOR, SUBUNIT, and WEB RESOURCE
  • Mechanism for users to interrupt actions of bot
  • List of changes on bot talk page?
  • Defining and adding categories?
  • Store update history in database?
  • create a WP category for every GO category? (Piggy back with Enzyme class effort?)
  • expand to create pages for each disease using {{Infobox_Disease}}
  • second bot to wikilink common biology concepts, specifically on pages with PBB_Controls
  • tag review articles in "Further reading" section with REVIEW (see User_talk:ProteinBoxBot/Archives/Archive1#Alternative_Idea)
  • change {{Gene}} templates to internal wikilinks
  • change spacing pattern (e.g., [1])
  • integration with WikiPathways.org
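
As a rough illustration of how a per-section switch such as the proposed "update_PDB_image" tag might be read, here is a minimal sketch. It assumes that PBB_Controls appears on the page as an ordinary template with name = value parameters, that a missing flag means the bot may edit, and that the template contains no nested templates; none of this is confirmed by the notes above, and the function names are hypothetical.

```python
import re


def parse_controls(wikitext: str) -> dict:
    """Extract name = value flags from a {{PBB_Controls | ...}} call.

    Assumes a flat template call; nested templates or pipes inside
    values are not handled by this simple sketch.
    """
    match = re.search(r"\{\{\s*PBB_Controls\s*\|(.*?)\}\}", wikitext, re.DOTALL)
    if match is None:
        return {}
    flags = {}
    for part in match.group(1).split("|"):
        if "=" in part:
            name, value = part.split("=", 1)
            flags[name.strip()] = value.strip().lower()
    return flags


def bot_may_update(flags: dict, name: str) -> bool:
    """Assumed convention: the bot may edit unless the flag is set to
    something other than 'yes'. A missing flag counts as 'yes'."""
    return flags.get(name, "yes") == "yes"
```

With this convention, an editor who wants to keep a hand-picked structure image would set update_PDB_image to "no", and the bot would skip only that part of the infobox.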

Done

  • get structure image from RCSB Done!
    • not sure yet how to get links from genes to PDB entries
    • RCSB public domain license is here or here.
  • modify orthologs box to automatically adjust rows and columns based on data Done! (I think)...
  • possibly add a comment to the protein box area saying that changes (to the protein box only) will be overwritten by the next bot update; this may save us from having to worry about manual edits -- AND/OR -- allow users to manually enter a comment in the protein box to prevent the bot from overwriting Done! through the PBB_Controls template.
  • use "Category: Human proteins" instead of simply "Proteins" Done!
  • add "Category: Gene from chromosome N" Done!

Removed

  • second bot to create redirects from gene aliases Removed! better for a human to do
  • add a comment <!--Add additional text here--> to make it clear where people can/should edit... Removed! better constrain areas for PBB edits
  • changing redirects so that primary title is HGNC name
    • maybe just flag these for manual inspection Removed! A human should handle anything with regards to page moves.
  • adding links to page (e.g., "ITK") from alternate symbols (e.g., EMT; LYK; PSCTK2; MGC126257; MGC126258) and full gene name (e.g., IL2-inducible T-cell kinase)
    • is redirecting from alternate symbols really a good idea? How would one list ITK on the EMT disambiguation page? Removed! Better that a human does this.

Some suggestions

I think it was a great idea to pick up summaries from Entrez Gene. Would it be possible also to take the following information from Entrez Gene and include it in the article (let's consider NAT2 as an example):

  1. EC number - to establish relation with enzymes in WP
  2. Compartment/localization (cytoplasm in this case)
  3. Alternative names of the protein
  4. Links to PFAM and COGs (through "Conserved Domains" link in Entrez Gene entry) that would greatly help to relate the article for individual protein with article for family
  5. Link to Human Protein Reference Database entry - great resource in WP and Human Genome context
  6. Content of "Function", "Catalytic activity", and "Subcellular location" fields from UniProt entry?

That would significantly facilitate all further work with articles generated by the bot. Thanks a lot for making this bot. It is already very helpful! Biophys 01:45, 4 November 2007 (UTC)

Good suggestions. My comments, in order...
  1. EC Number is high on the list, especially with the parallel effort on EC categories. The data is loaded in the database, and we're just waiting for the next release cycle to put it on the server that PBB queries. Then we'll have to see how much work it is to incorporate that data into PBB. Actually, if you want to make it easier for JonSDSUGrad (and hence more likely for EC to be included sooner rather than later), it would be great if you (or anyone else) would modify the Template:GNF_Protein_box to allow the additional optional parameter and then manually modify one example protein to serve as a template.
  2. "Cytoplasm" should definitely have been caught. Check out p53 for example -- cellular compartment is definitely one of the top-level GO categories. The only thing I can think of is that the "cytoplasm" tag was added since we last uploaded the NCBI data in our local mirror. Just checked, and in the next data update coming, NAT2 is indeed annotated with cytoplasm.
  3. PBB does list aliases next to the official gene name in the "Symbol(s)" section of the infobox. I believe we get that from the section that NCBI labels as "Also known as" ([2]).
  4. yup, will definitely figure out how to get protein family info in at some point. We're working off the files in NCBI's ftp site, so if you know specifically which file we can get this data from, please let me know. I can't find it in any of these files, which is where we get most of our other data.
  5. HPRD should be relatively straightforward to add. I'll add it to the list of things to do.
  6. I would guess that these would overlap heavily with the GO annotations directly in NCBI, no? Also, can't seem to find that information (here, for example) under the headings you list.
And I'll end with one last general thought. There is clearly a ton of "structured data" out there (generally following a tag / value format) that we could include, but there are also a ton of web sites that are very good at displaying structured data (primarily the ones we're taking data from). I think we at WP would be foolish to try to compete directly and include everything under the sun. Rather, I'd rather just provide the highlights of the structured data and links to get readers within a couple of clicks of all available information. Where WP really shines, of course, is for all that "unstructured data" -- primarily the free text knowledge that people write and contribute -- for which there is no current resource to capture and display. I think I stated it somewhere that my goal for PBB is to nucleate free-text contributions that will eventually grow into a continually-updated gene-specific review article for every gene in the human genome. Anyway, bottom line is that we have a pretty functional stub already and the first priority will be on expanding coverage of genes. AndrewGNF 04:20, 5 November 2007 (UTC)
Thank you for the good and precise answer! Some comments in reverse order. I agree with your 'general thought' - in general. Of course, we should not duplicate other resources. But a few points mentioned here would be helpful in the WP context, and you seem to agree.
6. I believe that UniProt records in the mentioned fields are significantly more reliable and informative than GO records. Hence my suggestion. It would be helpful to include this information in the body of the article, like the Entrez Gene summaries. If the fields are empty (which is sometimes the case), there is nothing to include, of course. But if this is difficult, then do not do it. The bot is already very helpful.
5. Thank you. Links to HPRD would definitely help. I will think about other points. Biophys 16:53, 5 November 2007 (UTC)
On #6, can you point me to the exact place in a record that has this info? I can't seem to find where it's located... Thanks, AndrewGNF 18:37, 5 November 2007 (UTC)

ProteinBoxBot 2.0

From a WP perspective, we ideally need the following in the future.

1. Articles about all human genes generated by the bot and list(s) of the genes, probably arranged by chromosomes and by locations in the chromosomes.
2. Articles about all SMART domains (could also be generated by a bot). It is the protein domains (not polypeptide chains) that are the units of protein evolution and function.
3. An article about each human protein generated by the bot should also include: (a) a Table with the domains of this protein (for a graphical view of domain structure see this: [3]), and (b) a Table of human protein interaction partners (note that such Tables in Entrez Gene are redundant and may include the same partner several times). The Tables will be more readable than text. Each domain and each protein partner in the Tables should be automatically wikilinked, which is easy. Together with the Summary, that would be a really comprehensive encyclopedic resource! Biophys 18:25, 14 November 2007 (UTC)
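
The partner Table in point 3(b) could be sketched along these lines. This is only a minimal illustration, assuming the bot already has the partner gene symbols as a list of strings (possibly with repeats, as in Entrez Gene); the function names and the one-column layout are made up for the example.

```python
def unique_partners(symbols):
    """Drop repeated partner symbols while keeping first-seen order.

    Comparison ignores case and surrounding whitespace.
    """
    seen = set()
    unique = []
    for symbol in symbols:
        cleaned = symbol.strip()
        key = cleaned.upper()
        if key and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique


def partners_wikitable(symbols):
    """Render a one-column wikitable with each partner wikilinked."""
    lines = ['{| class="wikitable"', "! Interaction partner"]
    for symbol in unique_partners(symbols):
        lines.append("|-")
        lines.append(f"| [[{symbol}]]")
    lines.append("|}")
    return "\n".join(lines)
```

Calling partners_wikitable() on a list with repeated symbols yields one row per distinct partner, each wrapped in [[...]] so that the link turns blue once the partner's own page exists.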