CORPUS FOR WHAT

Authors

  • Eşref Adalı

Keywords:

Corpus, Parallel Corpus, Balanced Corpus, Unbalanced Corpus, Language Modelling, Zipf's Law

Abstract

A corpus is a collection of words and phrases assembled for processing a language. A competent corpus must be large and must cover all the features of the language. Corpora can be divided into two kinds, "balanced" and "unbalanced". An unbalanced corpus is simply a large corpus: it contains many texts, and the source of those texts is not important. A balanced corpus is created by taking equal lengths of text from every field in the language. Corpora can also be divided into two classes by purpose: those built for linguistic (grammatical) study and those built for Natural Language Processing (NLP). A corpus prepared to evaluate developments in a language should be balanced, and the texts it includes should represent the language as well as possible. This type of corpus is mainly used by linguists. For NLP studies, the corpus must be as large as possible. It is appropriate to select its texts from different fields, but balancing them is not so important.
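As an illustration of the "equal lengths of text from every field" idea, the following is a minimal sketch and not the author's procedure. It assumes whitespace tokenization and measures length in tokens; the function name `build_balanced_corpus` and the parameter `tokens_per_field` are hypothetical names introduced here.

```python
from typing import Dict, List


def build_balanced_corpus(texts_by_field: Dict[str, List[str]],
                          tokens_per_field: int) -> Dict[str, List[str]]:
    """Take the same number of tokens from every field.

    Assumption: tokens are whitespace-separated words, and "equal length"
    means an equal token count per field.
    """
    balanced: Dict[str, List[str]] = {}
    for field, texts in texts_by_field.items():
        tokens: List[str] = []
        for text in texts:
            tokens.extend(text.split())
            if len(tokens) >= tokens_per_field:
                break  # enough material collected for this field
        if len(tokens) < tokens_per_field:
            # A field that cannot supply its share would unbalance the corpus.
            raise ValueError(f"field {field!r} has fewer than "
                             f"{tokens_per_field} tokens")
        balanced[field] = tokens[:tokens_per_field]
    return balanced
```

An unbalanced corpus, by contrast, would simply concatenate all available texts without the per-field cap.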

Published

2023-05-09