Speech corpus

From Wikipedia, the free encyclopedia

A speech corpus (or spoken corpus) is a database of speech audio files and text transcriptions in a format that can be used to create acoustic models (which can then be used with a speech recognition engine).

A corpus is one such database; corpora is the plural form and refers to several such databases.

There are two types of speech corpora:

  • (1) Read speech, which includes:
      • Book excerpts
      • Broadcast news
      • Lists of words
      • Sequences of numbers
  • (2) Spontaneous speech, which includes:
      • Dialogs - between two or more people (includes meetings);
      • Narratives - a person telling a story;
      • Map-tasks - one person explains a route on a map to another;
      • Appointment-tasks - two people try to find a common meeting time based on individual schedules.
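
As an illustration only, the pairing of audio files with transcriptions and the read/spontaneous distinction described above could be modeled roughly as follows. This is a minimal sketch, not a standard corpus format: the names Utterance, SpeechType and count_by_type, the field layout, the file paths and the sample transcripts are all assumptions made up for the example.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class SpeechType(Enum):
    """The two broad corpus types named in the article."""
    READ = "read"
    SPONTANEOUS = "spontaneous"


@dataclass(frozen=True)
class Utterance:
    """One corpus entry: an audio file paired with its transcription."""
    audio_path: str          # hypothetical path to the recording
    transcript: str          # text transcription of the audio
    speech_type: SpeechType  # read or spontaneous
    genre: str               # e.g. "sequence of numbers", "map-task"


def count_by_type(corpus):
    """Count how many utterances fall under each speech type."""
    return Counter(u.speech_type for u in corpus)


# Invented sample entries, for illustration only.
corpus = [
    Utterance("audio/0001.wav", "one two three four",
              SpeechType.READ, "sequence of numbers"),
    Utterance("audio/0002.wav", "turn left at the old mill",
              SpeechType.SPONTANEOUS, "map-task"),
    Utterance("audio/0003.wav", "so then we went home",
              SpeechType.SPONTANEOUS, "narrative"),
]

counts = count_by_type(corpus)
```

Real corpora usually carry further metadata (speaker, recording conditions, time alignment), which this sketch leaves out.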

A special kind of speech corpus is the non-native speech database, which contains speech with a foreign accent.

External links