Talk:Million Book Project
From Wikipedia, the free encyclopedia
Status
It's 2005, and a quick Google search doesn't turn up much about the current status of the project; the latest info in the FAQ at http://www.library.cmu.edu/Libraries/MBP_FAQ.html#current is current only as of June 2004. Does anyone have more information on this?
-- Schultz.Ryan 01:11, 12 Jan 2005 (UTC)
I note that the following annotation of the web page shows that work is continuing.
March 20, 2006 -- http://www.library.cmu.edu/Libraries/MBP_FAQ.html Denise Troll, Associate Dean of University Libraries, troll@andrew.cmu.edu
Ms. Denise Troll Covey currently has the title Principal Librarian for Special Projects, Carnegie Mellon. She can be contacted at the e-mail address shown above.
Recent addition
The mass of text below was cut and pasted into the article by Denise Troll Covey. I've removed it from the article and placed it here, in case anyone is up to turning it into an encyclopedia article and wants to re-add it. -- Stbalbach 21:15, 30 April 2007 (UTC)
Million Book Project update as of April 30, 2007
The Million Book Project has exceeded its goal of digitizing one million books by 2007. The Project inspired other large-scale digitization projects, including Google Book Search, by changing worldwide thinking about the presentation of material found in books.
Leveraging the $3,000,000 provided by the National Science Foundation for equipment and travel, the Million Book Project attracted international partners and matching funds exceeding US$100 million. To date the Project has scanned over 1.4 million books in China, India and Egypt, and made great strides in research areas relevant to large-scale, multi-lingual database storage and retrieval.
Though the initial term of the Million Book Project has ended, much work remains to be done. Project partners plan to continue to work together on the following issues:
- Intellectual property: Copyright remains the biggest barrier to creating the digital library. In the United States, all materials published after 1963 are protected by copyright for the life of the author plus seventy years. Materials published prior to 1923 are out of copyright. In the interim from 1923 through 1963, copyright required renewal. An estimated 90% of the materials published during this period were not renewed and are therefore out of copyright. However, renewal records must be consulted for each title to determine its copyright status. Copyright renewal records were scanned to enable online consultation, and later re-keyed by Distributed Proofreaders to improve accuracy and facilitate searching. Project partner Michael Lesk developed the search system. Nevertheless, the labor cost of manually searching individual titles is prohibitive for large-scale projects. Partners at the Internet Archive are developing software to automate this process.
- Machine translation and summarization: The vision of the universal digital library includes automatic translation from any language to any language of both queries submitted and content retrieved. Million Book Project director and director of the Language Technologies Institute (LTI) at Carnegie Mellon, Dr. Jaime Carbonell, has been exploring context-based machine translation, a technique that mines the broad resources of the web to find examples to facilitate translation. Project partners in China and India are also working on machine translation. India, a country with eighteen official languages, is heavily invested in this work. LTI is also developing summarization technology. Automated summaries can help address the dual problems of information overload and lack of time by quickly enabling users to determine relevance if not find exactly what they need. In combination with machine translation, automated summaries can provide people with access to information that might never be translated into their native language. The implications for teaching, learning, research and innovation would be profound.
- Improving and providing centralized access to the metadata: The initial plan of the Million Book Project was to host the entire collection at Carnegie Mellon and to have mirror sites around the world. File transfer, however, turned out to be a significant problem for technical and political reasons. Given these hurdles and developments in distributed computing over the past five years, the current plan is for each country to host the material that it scans, but to provide centralized access to the metadata. Inaccuracies and non-standard cataloging practices must be addressed to make this possible. This will be a primary focus of work over the next year.
- Usability: The books in the Million Book collection are stored as TIFF files, one file per page. The files are large, and fetching each page can be tedious over slow or busy networks. Project director Dr. Raj Reddy is exploring two options: correcting the optical character recognition text to provide HTML versions of the books, or converting the books to Portable Document Format (PDF). HTML and PDF files are much smaller than TIFF files, so transmission would be much faster. Time is a critical factor for students and faculty. The time between page fetches affects reading comprehension. Work must be done to improve the usability of the collection.
- Growing the collection: In addition to the work and research described above, project partners aim to continue efforts begun in 2005 to create a critical mass of best practices literature in agriculture around the world. In partnership with the Food and Agriculture Organization, the National Agricultural Library and relevant university libraries, additional agricultural materials will be scanned and added to the Million Book collection. Project partners will also continue to add to the collection books and other materials in different languages and disciplines.
- Diversity and education: The Million Book Project has always had goals in support of diversity and education. Our efforts to provide a multi-lingual digital library aim to address the inordinate amount of web content in the English language and the inordinate amount of web content of dubious quality for teaching, learning and scholarship. In conjunction with work on machine translation and summarization, the Project looks to a future where all people can find the quality information they need free-to-read on the web.
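For anyone reworking the intellectual property item above, the three-way rule it describes (before 1923, 1923 through 1963, after 1963) can be sketched as below. This is a minimal sketch that follows the wording of the update only; the names Status and us_copyright_status and the renewed parameter are hypothetical and do not come from any Million Book Project or Internet Archive software.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    PUBLIC_DOMAIN = "out of copyright"
    IN_COPYRIGHT = "in copyright"
    CHECK_RENEWAL = "renewal record must be consulted"


def us_copyright_status(pub_year: int, renewed: Optional[bool] = None) -> Status:
    """Classify a US title using only the rule stated in the update.

    renewed is None when the renewal record has not been looked up yet
    (hypothetical parameter, used here for illustration).
    """
    if pub_year < 1923:
        # "Materials published prior to 1923 are out of copyright."
        return Status.PUBLIC_DOMAIN
    if pub_year > 1963:
        # "All materials published after 1963 are protected by copyright."
        return Status.IN_COPYRIGHT
    # 1923 through 1963: status depends on whether copyright was renewed.
    if renewed is None:
        return Status.CHECK_RENEWAL
    return Status.IN_COPYRIGHT if renewed else Status.PUBLIC_DOMAIN


if __name__ == "__main__":
    for year, renewed in [(1910, None), (1940, None), (1940, False), (1970, None)]:
        print(year, renewed, us_copyright_status(year, renewed).value)
```

The middle branch is where the manual labor described in the update comes from: every title from 1923 through 1963 needs its own renewal lookup before it can be classified.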
Many students rely on the web as their information resource, turning first to Google or another internet search engine to mine the surface web and only secondarily to library-licensed, restricted-access resources in the deep web. Print is a third, somewhat unpopular choice. The lack of quality information on the surface web and its impact on student learning is a primary driver of the Million Book Project. The problem is particularly acute in the sciences, where little relevant information is out of copyright and therefore readily available for digitization. Public policy and innovative technology, like machine translation and summarization of factual book content, must be explored to meet the needs and expectations of students, scholars, and lifelong learners everywhere.
The best practices and research initiatives that result from the project will continue to be shared with librarians and scientists worldwide through formal and informal channels. Applied research and best practices will enhance the quality of digitized materials, storage and delivery systems, and ultimately the users’ experience. Providing powerful tools and free-to-read access to materials in many disciplines will support education and lifelong learning. Free access to agricultural collections can help reduce hunger and food insecurity. The Million Book collection will be indexed by Google and other popular search engines. The Project will continue to drive research agendas in many areas, from computer science to public policy.