In linguistics, a corpus (plural corpora) is a large, structured set of texts, now usually stored and processed electronically. A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus). Multilingual corpora that have been specially formatted for side-by-side comparison are called parallel corpora.
To make corpora more useful for linguistic research, they are often subjected to a process known as part-of-speech tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. More generally, the process of adding any such information to a corpus is called tagging (or annotation).
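As a rough illustration of what adding tags to a text looks like, the following minimal Python sketch assigns a part-of-speech tag to each word by simple dictionary lookup. The lexicon, the tag names (DET, NOUN, VERB, ADP, X) and the function names are assumptions made for this example only; real taggers generally rely on context and statistical models rather than a fixed word list.

```python
# Toy lexicon mapping words to part-of-speech tags.
# Hypothetical data for illustration; not taken from any real tag set or corpus.
LEXICON = {
    "the": "DET",
    "cat": "NOUN",
    "sat": "VERB",
    "on": "ADP",
    "mat": "NOUN",
}


def tag_sentence(sentence, lexicon=LEXICON, unknown="X"):
    """Return a list of (word, tag) pairs using a plain dictionary lookup.

    Words missing from the lexicon receive the `unknown` tag.
    """
    return [(word, lexicon.get(word.lower(), unknown)) for word in sentence.split()]


def to_tagged_text(tagged):
    """Render tagged tokens in an inline word/TAG format."""
    return " ".join(f"{word}/{tag}" for word, tag in tagged)


tagged = tag_sentence("The cat sat on the mat")
print(to_tagged_text(tagged))
# prints: The/DET cat/NOUN sat/VERB on/ADP the/DET mat/NOUN
```

A lookup of this kind cannot resolve words whose part of speech depends on context (for example "run" as a noun or a verb), which is one reason practical POS-taggers take the surrounding words into account.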
Corpora are the main knowledge base in corpus linguistics.