User:ProteinBoxBot/Project proposals
Introduction
The ProteinBoxBot (PBB) aims to create or enhance ~10,000 Wikipedia pages corresponding to human genes. PBB harvests data from various public databases and formats them for display in Wikipedia. We hope (and preliminary data suggest) that this structured data from the public domain will then seed contributions of unstructured knowledge from the broader biological community. Although we offer several specific project ideas below, any student proposal that broadly fits this mission will be considered!
Background
The first version of PBB was created by JonSDSUGrad in Java, in part using the Java Wiki Bot Framework. As of February 2008, PBB has been used to amend ~650 existing gene pages and create ~8,000 gene pages. (All PBB pages are listed here.) This project has been supported by the Molecular and Cellular Biology Wikiproject. By all accounts, Version 1 of PBB has been quite successful. Over 65% of PBB pages show up on the first page of a Google search (see figure), and edits to newly created PBB pages account for approximately half of all gene page edits. The source code of PBB has been released at Google Code.
Despite the success of Version 1, many enhancements to the PBB code and the resulting Wikipedia pages are needed. The ideas below represent some possible projects that will further improve the utility of the gene page stubs and attract more editors to these pages. The expertise required for the projects listed below ranges from that of a beginning undergraduate in computer science, biology, or bioinformatics to that of a master's student.
Questions?
Questions about PBB or any of the projects below? Post on the talk page! This is intrinsically a wiki-based project (not just for the project's output, but for coordination, brainstorming, and discussion), so you might as well learn now! And although there will be one official mentor for projects, everything we do is in the context of the Wikipedia community, which effectively serves as an unlimited supply of mentors.
Ideas List
Create web form to auto-create PBB content based on Entrez Gene ID
Description: This project essentially puts a web front end on the existing PBB code. Individual Wikipedia users often want to create or update PBB content for their favorite gene, and a web form that takes a gene identifier, runs the PBB code, and produces formatted wikicode would be well used. This tool would be modeled after a similar tool available here.
Difficulty: Low
Skills required: Past experience with HTML and CGI or ability to learn quickly.
Automatically turn off summary updates if wikilinking is detected in PBB_Summary
Description: PBB uploads a text summary in the {{PBB Summary}} template. Often, editors add wikilinks in that text without turning off the automatic summary update in {{PBB Controls}}. This enhancement would automatically detect wikilinking and turn off the summary update.
Difficulty: Low
Skills required: Java familiarity
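The detection step described above could be sketched roughly as follows. This is only an illustration, not the existing PBB code: the class and method names are hypothetical, the wikilink pattern is a simplification of real wiki syntax, and the `update_summary` parameter name inside {{PBB Controls}} is an assumption that should be checked against the live template.

```java
import java.util.regex.Pattern;

// Hypothetical helper; not part of the released PBB source.
final class SummaryUpdateGuard {
    // Matches [[Target]] and [[Target|label]]; a simplification of wikilink syntax.
    private static final Pattern WIKILINK =
            Pattern.compile("\\[\\[[^\\[\\]]+\\]\\]");

    // ASSUMPTION: {{PBB Controls}} carries a parameter written "update_summary = yes".
    private static final Pattern UPDATE_SUMMARY =
            Pattern.compile("(update_summary\\s*=\\s*)yes", Pattern.CASE_INSENSITIVE);

    // True if the summary text contains at least one wikilink.
    static boolean containsWikilink(String summaryText) {
        return WIKILINK.matcher(summaryText).find();
    }

    // Rewrites "update_summary = yes" to "update_summary = no", leaving other parameters alone.
    static String disableSummaryUpdate(String controlsWikitext) {
        return UPDATE_SUMMARY.matcher(controlsWikitext).replaceAll("$1no");
    }

    // Returns the controls wikitext, switched off only when an editor has added wikilinks.
    static String guard(String summaryText, String controlsWikitext) {
        return containsWikilink(summaryText)
                ? disableSummaryUpdate(controlsWikitext)
                : controlsWikitext;
    }

    public static void main(String[] args) {
        // Illustrative inputs only.
        String summary = "This gene encodes a [[protein kinase]].";
        String controls = "{{PBB Controls | update_page = yes | update_summary = yes}}";
        System.out.println(guard(summary, controls));
    }
}
```

Fetching the page and saving the modified template are left out here; they would go through whatever wiki interface the bot already uses.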
Automate uploads of PDB images
Description: PBB uploads two-dimensional protein images for all gene pages that have a protein structure available (e.g., Image:PBB_Protein_CDK2_image.jpg). This new module of PBB will periodically upload new images according to the current release of PDB. Images should be categorized by protein family classification (SCOP).
Difficulty: Medium/low
Skills required: Familiarity with programmatic interfaces with wikis is useful, but not required.
Add protein domain information to gene pages
Description: See discussion here.
Difficulty: Medium
Skills required: Java, basic understanding of protein structure
Systematic classification of gene pages by protein family
Description: Genes and proteins are also classified by protein family. For example, this page shows all the protein domains that are found in the gene BTK. (Protein domain IDs begin with "IPR"...) This new module of PBB would parse these data from database dumps, create the appropriate categories in Wikipedia, and assign all relevant genes to each category.
Difficulty: Medium/hard
Skills required: Familiarity with molecular biology and/or programmatic interfaces with wikis is useful, but not required.
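The core of the categorization step (turning a gene-to-domain mapping into one category per domain) might look something like the sketch below. It assumes the database dump has already been parsed into a map from gene symbol to domain IDs. The class name, the category naming scheme, and the gene symbols and IDs in `main` are made-up placeholders, not real records.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper; not part of the released PBB source.
final class FamilyCategorizer {
    // Inverts gene -> domain IDs into domain ID -> genes, keeping only IDs that begin with "IPR".
    static SortedMap<String, SortedSet<String>> genesByDomain(Map<String, List<String>> domainsByGene) {
        SortedMap<String, SortedSet<String>> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : domainsByGene.entrySet()) {
            for (String id : entry.getValue()) {
                if (!id.startsWith("IPR")) {
                    continue; // not a protein domain ID of the kind described above
                }
                result.computeIfAbsent(id, k -> new TreeSet<>()).add(entry.getKey());
            }
        }
        return result;
    }

    // ASSUMPTION: the category naming scheme is invented for illustration.
    static String categoryTag(String iprId) {
        return "[[Category:InterPro domain " + iprId + "]]";
    }

    public static void main(String[] args) {
        // Placeholder gene symbols and IDs, not real database entries.
        Map<String, List<String>> input = new TreeMap<>();
        input.put("GENE_A", Arrays.asList("IPR900001", "IPR900002"));
        input.put("GENE_B", Arrays.asList("IPR900001"));
        genesByDomain(input).forEach((id, genes) ->
                System.out.println(categoryTag(id) + " <- " + genes));
    }
}
```

Sorted collections are used so that repeated bot runs produce the same output order, which keeps page diffs small.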
Visualization of local protein-interaction neighborhood
Description: Proteins often physically interact with other proteins to form protein complexes. This new PBB module will manage protein interaction data from the public domain. Specifically, it will download data from HPRD (e.g., [1]) to identify interaction partners for each protein. Then this module will populate a new Protein Interactions template for all gene stubs at Wikipedia.
Difficulty: Medium/hard
Skills required: Familiarity with programmatic interfaces with wikis is useful, but not required.
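As a rough sketch of the two steps described above (identifying interaction partners, then populating a template), the following assumes the HPRD download has already been parsed into pairs of gene symbols. The class name, the template name, and its parameters are hypothetical, since the Protein Interactions template does not exist yet, and the symbols in `main` are placeholders.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.StringJoiner;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper; not part of the released PBB source.
final class InteractionNeighborhood {
    // Builds a symmetric partner map from undirected interaction pairs.
    static SortedMap<String, SortedSet<String>> partners(List<String[]> pairs) {
        SortedMap<String, SortedSet<String>> map = new TreeMap<>();
        for (String[] pair : pairs) {
            // Design choice for this sketch: skip malformed rows and self-interactions.
            if (pair.length != 2 || pair[0].equals(pair[1])) {
                continue;
            }
            map.computeIfAbsent(pair[0], k -> new TreeSet<>()).add(pair[1]);
            map.computeIfAbsent(pair[1], k -> new TreeSet<>()).add(pair[0]);
        }
        return map;
    }

    // ASSUMPTION: template name and parameter layout are invented for illustration.
    static String render(String gene, SortedSet<String> genePartners) {
        StringJoiner links = new StringJoiner(", ");
        for (String partner : genePartners) {
            links.add("[[" + partner + "]]");
        }
        return "{{Protein Interactions | gene = " + gene + " | partners = " + links + "}}";
    }

    public static void main(String[] args) {
        // Placeholder symbols, not real interaction data.
        List<String[]> pairs = Arrays.asList(
                new String[] {"GENE_A", "GENE_B"},
                new String[] {"GENE_C", "GENE_A"});
        partners(pairs).forEach((gene, set) -> System.out.println(render(gene, set)));
    }
}
```

Because the partners are stored in a set, duplicate rows in the input collapse to a single entry per partner.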
Add {{Wikiproject MCB}} to gene talk pages
Description: On every PBB run, check that {{Wikiproject MCB}} appears on the gene's talk page. If it does not, add it.
Difficulty: Medium
Skills required: Java
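The check itself is small; a minimal sketch is given below, assuming the talk page wikitext has already been fetched and that `null` stands for a talk page that does not exist yet. The class and method names are hypothetical, and placing the banner at the top of the page is an assumption about the desired layout.

```java
import java.util.regex.Pattern;

// Hypothetical helper; not part of the released PBB source.
final class TalkPageBanner {
    private static final String BANNER = "{{Wikiproject MCB}}";

    // Tolerates case differences, an underscore in the name, and template parameters.
    private static final Pattern BANNER_PATTERN = Pattern.compile(
            "\\{\\{\\s*Wikiproject[ _]MCB\\s*(\\||\\}\\})", Pattern.CASE_INSENSITIVE);

    // True if the banner template already appears in the talk page wikitext.
    static boolean hasBanner(String talkWikitext) {
        return talkWikitext != null && BANNER_PATTERN.matcher(talkWikitext).find();
    }

    // Returns wikitext that is guaranteed to contain the banner.
    // ASSUMPTION: null means the talk page is missing; the banner goes at the top.
    static String ensureBanner(String talkWikitext) {
        if (hasBanner(talkWikitext)) {
            return talkWikitext; // nothing to do
        }
        if (talkWikitext == null || talkWikitext.isEmpty()) {
            return BANNER + "\n";
        }
        return BANNER + "\n" + talkWikitext;
    }

    public static void main(String[] args) {
        System.out.print(ensureBanner(null));
        System.out.println(ensureBanner("== Discussion ==\nSome comment."));
    }
}
```

Reading and saving the talk page are omitted; only the decision logic is shown.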
Various technical fixes
Description: Implement various technical fixes, including:
- Change the uploaded PDB image name to use the PDB ID instead of the gene symbol, e.g., Image:PBB Protein 2CPC image.jpg instead of Image:PBB Protein OBSL1 image.jpg (made obsolete by the bigger PDB project above)
- Change the spacing pattern in the output (e.g., [2])
- Change expression images to upload to Wikipedia:Wikimedia Commons
- Tag review articles in the "Further reading" section with REVIEW (see User_talk:ProteinBoxBot/Archives/Archive1#Alternative_Idea)
- Remove PubMed IDs for large-scale cloning papers, e.g., PubMed
Difficulty: Low
Skills required: Java