Speech corpus


A speech corpus (or spoken corpus) is a database of speech audio files and their text transcriptions, stored in a format that can be used to build acoustic models. A speech recognition engine can then use those models.

Corpus is the singular form and refers to one such database. Corpora is the plural and refers to several such databases.
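To make the pairing of audio and transcription concrete, the following is a minimal sketch in Python. It assumes a hypothetical tab-separated manifest with one utterance per line (audio path, then transcript); the `Utterance` class, the `load_manifest` function, and the file names are illustrative and do not describe the format of any particular corpus.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Utterance:
    """One corpus entry: an audio file and its text transcription."""
    audio_path: Path
    transcript: str


def load_manifest(lines):
    """Parse 'audio_path<TAB>transcript' lines into Utterance records.

    The tab-separated layout is an assumption made for this sketch.
    """
    corpus = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        audio, text = line.split("\t", 1)
        corpus.append(Utterance(Path(audio), text))
    return corpus


# Hypothetical manifest content, used only for illustration.
manifest = [
    "wav/utt001.wav\tthe quick brown fox",
    "wav/utt002.wav\tone two three four",
]
corpus = load_manifest(manifest)
```

An acoustic-model training tool would typically consume such audio and transcript pairs, although the exact file layout differs from toolkit to toolkit.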

There are two types of speech corpora:

(1) Read Speech - which includes:

Book excerpts
Broadcast news
Lists of words
Sequences of numbers

(2) Spontaneous Speech - which includes:

Dialogs - between two or more people (includes meetings);
Narratives - a person telling a story;
Map-tasks - one person explains a route on a map to another;
Appointment-tasks - two people try to find a common meeting time based on individual schedules.
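The two-level classification above can be sketched as a small lookup structure in Python. The names `SpeechStyle`, `SUBTYPES`, and `style_of` are hypothetical and chosen only for this illustration; the subtype labels are taken from the lists above.

```python
from enum import Enum


class SpeechStyle(Enum):
    """The two top-level types of speech corpora."""
    READ = "read"
    SPONTANEOUS = "spontaneous"


# Subtypes as listed in the article, grouped by top-level type.
SUBTYPES = {
    SpeechStyle.READ: [
        "book excerpts",
        "broadcast news",
        "lists of words",
        "sequences of numbers",
    ],
    SpeechStyle.SPONTANEOUS: [
        "dialogs",
        "narratives",
        "map-tasks",
        "appointment-tasks",
    ],
}


def style_of(subtype):
    """Return the top-level style that a subtype label belongs to."""
    for style, names in SUBTYPES.items():
        if subtype.lower() in names:
            return style
    raise ValueError(f"unknown subtype: {subtype!r}")
```

For example, `style_of("Map-tasks")` returns `SpeechStyle.SPONTANEOUS`, while a label that is not in either list raises `ValueError`.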

Non-native speech databases, which contain speech with a foreign accent, are a special kind of speech corpus.

External links

BAS – Bavarian Archive for Speech Signals
Switchboard - ISIP's Switchboard database
VoxForge - open source speech corpora

Categories: Computational linguistics | Corpus linguistics | Speech recognition

This page was last modified 22:01, 31 March 2008 by Wikipedia user SynergeticMaggot. Based on work by Wikipedia user(s) Jonsafari, SmackBot, Martin 889, Simeon, Calltech and Kmaclean and Anonymous user(s) of Wikipedia.
All text is available under the terms of the GNU Free Documentation License. (See Copyrights for details.)