A collection of linguistic data, either compiled as written texts or as a transcription of recorded speech. The main purpose of a corpus is to verify a hypothesis about language - for example, to determine how the usage of a particular sound, word, or syntactic construction varies. Corpus linguistics deals with the principles and practice of using corpora in language study.
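As a rough illustration of using a corpus to check a hypothesis about usage, the sketch below compares how often one word occurs in two small sub-corpora. The sample sentences, the tokenizing pattern, and the function name relative_frequency are invented for illustration only; a real study would use a properly sampled corpus and a proper tokenizer.

```python
import re
from collections import Counter


def relative_frequency(texts, word):
    """Occurrences of `word` per 1,000 tokens across a list of texts."""
    # Naive tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = [t for text in texts for t in re.findall(r"[a-z']+", text.lower())]
    if not tokens:
        return 0.0
    return 1000 * Counter(tokens)[word] / len(tokens)


# Toy sub-corpora, made up for this example.
written = ["We shall consider the results.", "The committee shall meet on Monday."]
spoken = ["yeah we will look at it", "I will call you later okay"]

for label, texts in (("written", written), ("spoken", spoken)):
    print(label, round(relative_frequency(texts, "shall"), 1))
```

Normalizing to a rate per 1,000 tokens, rather than comparing raw counts, keeps the comparison meaningful when the two sub-corpora differ in size.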
Types of corpus
Annotated corpus - An annotated corpus may be considered to be a repository of linguistic information, because the information which was implicit in the plain text has been made explicit through concrete annotation.
Comparable (reference) corpus - A type of corpus used for comparison of different languages.
Monitor corpus - A type of corpus which is a growing, non-finite collection of texts, of primary use in lexicography. A monitor corpus reflects language change by growing at a constant rate while leaving the relative weight of its components (i.e. its balance), as defined by the parameters, untouched.
Monolingual corpus - A type of corpus which contains texts in a single language.
Multilingual corpus - A type of corpus which consists of small collections of individual monolingual corpora (or sub-corpora). These use the same or similar sampling procedures and categories for each language but contain completely different texts in the several languages. A multilingual corpus covering two languages is called a bilingual corpus.
Parallel (aligned) corpus - A type of multilingual corpus where texts in one language and their translations into other languages are aligned, sentence by sentence, preferably phrase by phrase.
Reference corpus - A type of corpus that is composed on the basis of relevant parameters and should include spoken and written, formal and informal language representing various social and situational strata.
Spoken corpus - A type of corpus that contains texts of spoken language.
Unannotated corpus - A type of corpus that is in the raw state of plain text; opposed to an annotated corpus.
Speech corpus - A large collection of audio recordings of spoken language. Most speech corpora also have additional text files containing transcriptions of the words spoken and the time at which each word occurs in the recording.
Speech corpora can be divided into two types:
(1) Read Speech - which includes:
• Book excerpts;
• Broadcast news;
• Lists of words;
• Sequences of numbers.
(2) Spontaneous Speech - which includes:
• Dialogs & Meetings - between two or more people;
• Narratives - a person telling a story;
• Map-tasks - one person explains a route on a map to another;
• Appointment-tasks - two people try to find a common meeting time based on individual schedules.
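The pairing of recordings with time-stamped transcriptions described under Speech corpus can be sketched as a small data structure. This is only one possible representation: the class names, field names, file name, and timings below are hypothetical and do not correspond to any particular corpus format.

```python
from dataclasses import dataclass


@dataclass
class TimedWord:
    word: str
    start: float  # seconds from the beginning of the recording
    end: float    # seconds from the beginning of the recording


@dataclass
class Utterance:
    audio_file: str  # path to the recording (hypothetical name below)
    style: str       # "read" or "spontaneous"
    words: list      # list of TimedWord, in spoken order

    def transcript(self):
        """Plain-text transcription without timing information."""
        return " ".join(w.word for w in self.words)

    def words_between(self, t0, t1):
        """Words that lie entirely within the interval [t0, t1] seconds."""
        return [w.word for w in self.words if w.start >= t0 and w.end <= t1]


# Invented example: a read sequence of numbers.
utt = Utterance(
    audio_file="example_001.wav",
    style="read",
    words=[
        TimedWord("one", 0.00, 0.35),
        TimedWord("two", 0.40, 0.78),
        TimedWord("three", 0.85, 1.30),
    ],
)

print(utt.transcript())
print(utt.words_between(0.3, 1.0))
```

Keeping the timings next to each word makes it possible to look up the stretch of audio that corresponds to any part of the transcription.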