Multi-document summarization

Multi-document summarization is an automatic procedure for extracting information from multiple texts written about the same topic. The resulting summary report allows individual users, as well as professional information consumers, to quickly familiarize themselves with the information contained in a large cluster of documents. In this way, multi-document summarization systems complement news aggregators, taking the next step in coping with information overload.

Key benefits

Multi-document summarization creates information reports that are both concise and comprehensive. With different opinions brought together and outlined, every topic is described from multiple perspectives within a single document. While the goal of a brief summary is to simplify information search and save time by pointing to the most relevant source documents, a comprehensive multi-document summary should itself contain the required information, limiting the need to access the original files to cases where refinement is required. Automatic summaries present information extracted from multiple sources algorithmically, without any editorial touch or subjective human intervention, which is intended to keep them free of editorial bias.

Technology challenges

The multi-document summarization task has turned out to be much more complex than summarizing a single document, even a very large one. This difficulty arises from the inevitable thematic diversity within a large set of documents. A good summarization technology aims to combine adherence to the main theme and completeness with good readability and conciseness. The Document Understanding Conferences, conducted annually by NIST, have developed sophisticated evaluation criteria for techniques that take on the multi-document summarization challenge.

An ideal multi-document summarization system does not simply shorten the source texts but presents information organized around the key aspects, so as to represent the wider diversity of views on the topic. When such quality is achieved, an automatic multi-document summary is perceived more like an overview of a given news topic. This implies that such text compilations should also meet the other basic requirements for an overview text compiled by a human.
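As a rough illustration of how adherence to the main theme might be approximated, the following sketch scores each sentence by how many of the source documents share its words and returns the highest-scoring sentences. It is a minimal extractive example under simplifying assumptions, not the method of any particular system: the function names, the crude tokenizer, and the scoring rule are all hypothetical choices made for this example, and no stop-word filtering is applied.

```python
import re
from collections import Counter


def split_sentences(text):
    """Split text on sentence-ending punctuation (a crude heuristic)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(text):
    """Lowercased alphanumeric tokens of a text."""
    return re.findall(r"[a-z0-9]+", text.lower())


def summarize(documents, max_sentences=3):
    """Return the sentences whose words are shared by the most documents."""
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(words(doc)))

    scored = []
    for doc_index, doc in enumerate(documents):
        for position, sentence in enumerate(split_sentences(doc)):
            sentence_words = set(words(sentence))
            if not sentence_words:
                continue
            # Average number of documents that contain each word.
            score = sum(doc_freq[w] for w in sentence_words) / len(sentence_words)
            scored.append((score, doc_index, position, sentence))

    # Highest score first; ties are broken by original document order.
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))
    return [sentence for _, _, _, sentence in scored[:max_sentences]]
```

For example, calling `summarize(documents, max_sentences=1)` on a small set of reports about the same event returns the sentence whose vocabulary is most widely shared across the set. A practical system would also need to address ordering, readability, and redundancy, which this sketch ignores.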

The quality criteria for a multi-document summary are as follows:

  • clear structure, including an outline of the main content items, from which it is easy to navigate to the full-text sections
  • text within sections is divided into meaningful paragraphs
  • gradual transition from more general to more particular thematic aspects
  • good readability.

The last point deserves an additional note. Special care is taken to ensure that the automatic overview shows:

  • no "information noise" unrelated to the topic carried over from the respective documents (e.g., web pages)
  • no dangling references to what is not mentioned or explained in the overview
  • no text breaks in the middle of a sentence
  • no semantic redundancy.
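The redundancy requirement in the list above can be illustrated with a small sketch. It keeps a candidate sentence only if its word overlap (Jaccard similarity) with every sentence already kept stays below a threshold. Word overlap is only a rough stand-in for semantic similarity, and the function names and the threshold value of 0.6 are assumptions made for this example rather than settings taken from any real system.

```python
import re


def tokens(sentence: str) -> set[str]:
    """Lowercased word set of a sentence (a deliberately crude tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Word-overlap similarity between two token sets, in the range [0, 1]."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def drop_redundant(sentences: list[str], threshold: float = 0.6) -> list[str]:
    """Keep each sentence only if it is not too similar to one already kept.

    The threshold is an illustrative value, not a recommended setting.
    """
    kept: list[str] = []
    kept_tokens: list[set[str]] = []
    for sentence in sentences:
        current = tokens(sentence)
        if all(jaccard(current, previous) < threshold for previous in kept_tokens):
            kept.append(sentence)
            kept_tokens.append(current)
    return kept
```

Because the filter is greedy, the order of the input matters: the first of two near-duplicate sentences is the one that survives.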

Real-life systems

Multi-document summarization technology is now coming of age, a view supported by the range of advanced web-based systems that are currently available.

Newsblaster is a system that helps users find the news that is of most interest to them. The system automatically collects, clusters, categorizes, and summarizes news from several sites on the web (CNN, Reuters, Fox News, etc.) on a daily basis, and it provides a user-friendly interface for browsing the results.

NewsInEssence may be used to retrieve and summarize a cluster of articles from the web. It can start from a URL and retrieve documents that are similar, or it can retrieve documents that match a given set of keywords. NewsInEssence also downloads hundreds of news articles daily and produces news clusters from them.
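The two retrieval modes described above, matching a set of keywords and grouping similar articles into clusters, can be sketched generically as follows. This is not the implementation of NewsInEssence or of any other system named here; the function names, the word-overlap similarity, and the threshold of 0.3 are hypothetical choices made only to illustrate the idea.

```python
import re


def word_set(text):
    """Lowercased alphanumeric tokens of a text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def match_keywords(articles, keywords):
    """Return the articles that contain every one of the given keywords."""
    wanted = {keyword.lower() for keyword in keywords}
    return [article for article in articles if wanted <= word_set(article)]


def cluster(articles, threshold=0.3):
    """Greedy single-pass clustering by word overlap with each cluster's first article.

    The threshold is an illustrative value, not a recommended setting.
    """
    clusters = []
    for article in articles:
        current = word_set(article)
        for group in clusters:
            seed = word_set(group[0])
            union = current | seed
            if union and len(current & seed) / len(union) >= threshold:
                group.append(article)
                break
        else:
            # No existing cluster was similar enough: start a new one.
            clusters.append([article])
    return clusters
```

Each resulting cluster could then be passed to a summarizer. Comparing only against the first article of each cluster keeps the sketch short, at the cost of making the result depend on the order in which articles arrive.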

NewsFeed Researcher is a news portal that performs continuous automatic summarization of documents initially clustered by news aggregators (e.g., Google News). NewsFeed Researcher is backed by a free online engine covering major events related to business, technology, and U.S. and international news. The tool is also available in an on-demand mode that allows a user to build a summary on any selected topic.

As high-quality multi-document summaries come to resemble overviews written by a human, it cannot be ruled out that their use of extracted text snippets will one day raise copyright issues. Such cases should be considered in light of the fair use concept in copyright law.

