Text segmentation

From Wikipedia, the free encyclopedia

Text segmentation is the process of dividing written text into words or other similar meaningful units, such as sentences or topics. The term applies to mental processes used by humans when reading text, and to artificial processes implemented in computers, which are the subject of natural language processing.

The problem may appear relatively trivial for written languages that have explicit word boundary markers, such as the word spaces of written English or the distinctive initial, medial and final letter shapes of Arabic. When such clues are not consistently available, the task often requires non-trivial techniques, such as statistical decision-making, lookups in large dictionaries, and consideration of syntactic and semantic constraints.
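As a rough illustration of the dictionary-based approach, the sketch below applies greedy forward maximum matching to text written without word spaces. The function name, the toy dictionary, and the single-character fallback for unknown material are assumptions made for this example; practical systems typically combine dictionaries with statistical models.

```python
def max_match(text, dictionary):
    """Greedy forward maximum matching.

    At each position, take the longest dictionary word that starts there.
    If no dictionary word matches, emit a single character and move on.
    """
    max_word_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one matches.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy dictionary, invented for this example.
TOY_DICTIONARY = {"the", "table", "down", "there"}

print(max_match("thetabledownthere", TOY_DICTIONARY))
```

Because the method is greedy, it can commit to a long word that makes the rest of the string unsegmentable, which is one reason statistical decision-making is often added on top.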

In natural language processing (NLP), text segmentation involves determining the boundaries between words and sentences. This is not as simple as searching for punctuation marks: a period may appear inside a dollar amount, and a semicolon may terminate an XML entity reference (for example &amp;).

When processing plain text, tables of abbreviations that contain periods ("Mr.", for example) can help prevent incorrect assignment of sentence boundaries. Some text segmentation systems take advantage of markup such as HTML and of known document formats such as PDF to provide additional evidence for sentence and paragraph boundaries.
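A minimal sketch of the abbreviation-table idea for plain text follows. The abbreviation list, the function name, and the rule that a boundary candidate is a period, exclamation mark or question mark followed by whitespace are all assumptions chosen for illustration.

```python
import re

# Illustrative abbreviation table; a real one would be much larger.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e."}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text at sentence-final punctuation, skipping known abbreviations."""
    sentences = []
    start = 0
    # Candidate boundary: '.', '!' or '?' immediately followed by whitespace.
    # A period inside "$3.50" is followed by a digit, so it is never a candidate.
    for match in re.finditer(r"[.!?](?=\s)", text):
        end = match.end()
        # The whitespace-delimited token that ends with this punctuation mark.
        token = text[start:end].split()[-1].lower()
        if token in abbreviations:
            continue  # "Mr." and similar do not end a sentence
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Mr. Smith paid $3.50 for it. He left."))
```

This sketch never splits after a listed abbreviation, so an abbreviation that genuinely ends a sentence would be missed; handling that case needs further evidence, such as the capitalization of the next word.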

A document may contain multiple topics, and the task of computerized text segmentation may be to discover these topics automatically and segment the text accordingly. The topic boundaries may be apparent from section titles and paragraphs. In other cases one needs to use techniques similar to those used in document classification. Many different approaches have been tried.[1][2]
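One simple family of approaches looks for drops in vocabulary overlap between neighbouring blocks of text. The sketch below is a deliberately simplified illustration of that idea and is not the method of either cited work; the window size, the threshold, and all function names are assumptions.

```python
import math
import re
from collections import Counter


def bag_of_words(sentences):
    """Count lowercase alphabetic words across a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(re.findall(r"[a-z]+", sentence.lower()))
    return counts


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def topic_boundaries(sentences, window=2, threshold=0.1):
    """Return indices i such that a topic boundary is placed before sentence i."""
    boundaries = []
    for gap in range(1, len(sentences)):
        left = bag_of_words(sentences[max(0, gap - window):gap])
        right = bag_of_words(sentences[gap:gap + window])
        # Low lexical overlap across the gap suggests a change of topic.
        if cosine(left, right) < threshold:
            boundaries.append(gap)
    return boundaries


sentences = [
    "the cat chased the mouse",
    "the mouse hid from the cat",
    "stocks fell sharply today",
    "investors sold stocks today",
]
print(topic_boundaries(sentences))
```

The sketch performs no stop-word removal or stemming, so on realistic text the threshold would need tuning and common function words would inflate the similarity scores.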

Effective natural language processing systems and text segmentation tools usually operate on text from specific domains and sources. For example, processing text from medical records is a very different problem from processing news articles or real estate advertisements.

The process of writing text segmentation tools starts with collecting a large corpus of text in an application domain. There are two general approaches:

  • Analyzing the text manually and writing custom software
  • Annotating the sample corpus with boundary information and applying machine learning
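To make the second approach more concrete, the sketch below turns an annotated corpus into labeled examples for sentence boundary detection. It assumes a hypothetical annotation format with one sentence per line, and the feature names are invented for illustration; the resulting pairs could be passed to any classifier.

```python
def labeled_candidates(annotated_lines):
    """Build (features, is_boundary) pairs for every token ending in a period.

    Assumed annotation format: each line holds exactly one sentence, so the
    last token of a line is a true sentence boundary.
    """
    # Flatten the corpus into (token, ends_sentence) pairs so that features
    # can look at the token that follows, even across a line break.
    tokens = []
    for line in annotated_lines:
        words = line.split()
        for i, word in enumerate(words):
            tokens.append((word, i == len(words) - 1))

    examples = []
    for i, (token, ends_sentence) in enumerate(tokens):
        if not token.endswith("."):
            continue  # only periods are treated as boundary candidates here
        next_token = tokens[i + 1][0] if i + 1 < len(tokens) else ""
        features = {
            "token_lower": token.lower(),
            "token_length": len(token),
            "next_is_capitalized": next_token[:1].isupper(),
        }
        examples.append((features, ends_sentence))
    return examples


corpus = ["Mr. Smith arrived.", "He sat down."]
for features, is_boundary in labeled_candidates(corpus):
    print(features["token_lower"], is_boundary)
```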

References

  1. Freddy Y. Y. Choi (2000). "Advances in domain independent linear text segmentation". Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics (ANLP-NAACL-00): 26–33.
  2. Jeffrey C. Reynar (1998). "Topic Segmentation: Algorithms and Applications" (PDF). IRCS-98-21. University of Pennsylvania. Retrieved on 2007-11-08.