Corpus

In law, a corpus (Latin: "body") is a collection of documents and sources. See Corpus Juris Civilis.


In linguistics, a corpus (plural corpora) is a large and structured set of texts (now usually electronically stored and processed). A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus). Multilingual corpora that have been specially formatted for side-by-side comparison are called parallel corpora.
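
As a rough illustration of the side-by-side idea, a parallel corpus can be pictured as a list of aligned units, each holding the same sentence in several languages. The sketch below is only one possible layout; the language codes "en" and "fr", the sample sentences, and the helper side_by_side are assumptions made for this example and do not come from any particular corpus format.

```python
# A monolingual corpus: texts in one language only.
monolingual = ["The dog sleeps.", "The cat sleeps."]

# A toy parallel corpus: each unit aligns one sentence across languages.
# The keys "en" and "fr" are illustrative labels chosen for this sketch.
parallel = [
    {"en": "The dog sleeps.", "fr": "Le chien dort."},
    {"en": "The cat sleeps.", "fr": "Le chat dort."},
]

def side_by_side(corpus, left, right):
    """Return one 'left | right' line per aligned unit."""
    return [f"{unit[left]} | {unit[right]}" for unit in corpus]

for line in side_by_side(parallel, "en", "fr"):
    print(line)
```

Keeping each aligned pair in one record means a comparison never has to guess which sentence in one language corresponds to which sentence in the other.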

To make corpora more useful for linguistic research, they are often subjected to a process known as part-of-speech tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. More generally, adding any such information to a corpus is called tagging.
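
To make the idea of tags concrete, here is a minimal sketch of a lookup-based tagger. It is a toy, not how real POS-taggers work (those typically use statistical models and context); the lexicon, the tag names (DET, NOUN, VERB, ADJ, UNK) and the functions tag_sentence and format_tagged are all assumptions made for this example.

```python
# Toy lexicon mapping lowercase word forms to part-of-speech tags.
# Both the entries and the tag names are illustrative assumptions.
LEXICON = {
    "the": "DET",
    "a": "DET",
    "dog": "NOUN",
    "cat": "NOUN",
    "chases": "VERB",
    "sleeps": "VERB",
    "small": "ADJ",
}

def tag_sentence(sentence, lexicon=LEXICON, unknown="UNK"):
    """Return (word, tag) pairs for a whitespace-tokenized sentence.

    Words missing from the lexicon receive the `unknown` tag.
    """
    return [(word, lexicon.get(word.lower(), unknown)) for word in sentence.split()]

def format_tagged(tagged):
    """Render tagged words in word/TAG notation, one common way to store tags."""
    return " ".join(f"{word}/{tag}" for word, tag in tagged)

print(format_tagged(tag_sentence("The small dog chases a cat")))
# prints: The/DET small/ADJ dog/NOUN chases/VERB a/DET cat/NOUN
```

A pure lookup cannot resolve words whose part of speech depends on context (for example, "sleeps" as a verb versus a plural noun), which is why practical taggers look at neighbouring words as well.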

Corpora are the main knowledge base in corpus linguistics.

wikipedia.org dumped 2003-03-17 with terodump