What Is a Text Corpus?
Three basic understandings about corpora: a corpus stores language material that has actually occurred in real language use; a corpus is a basic resource that carries linguistic knowledge, with the computer as its medium; and raw corpus material must be analyzed and processed before it becomes a useful resource.
Corpus
- Research on bilingual corpora can be roughly divided into three categories:
- The first is research on alignment techniques for bilingual corpora. Scholars in China and abroad have proposed a variety of strategies and methods, and many programs and tools for aligning bilingual or multilingual corpora have appeared [Gale 1993];
- The second is research on the applications of bilingual corpora. For example, bilingual corpora play a very important role in statistical machine translation [Brown 1990], example-based machine translation [Nagao 1984], and bilingual dictionary compilation [Klavans and Tzoukermann 1990];
- The third is the design, collection, encoding, and management of bilingual corpora. The better-known corpus encoding schemes include the TEI text encoding standard and the CES standard, both of which are based on the SGML markup language.
- A bilingual or multilingual corpus is a corpus containing more than one language. There are two types: parallel corpora and comparable corpora. In a parallel corpus, the texts in two or more languages are translations of each other, so it can be used for translation studies or machine translation research. In a comparable corpus, the texts in two or more languages are not translations of each other, but they come from the same domain and cover similar topics, so it can only be used for contrastive study of the languages.
- For the first two categories of research, China has done a great deal of follow-up work. For the third category, that is, the construction, encoding, and management of bilingual corpora, especially bilingual corpora involving Chinese, relatively little has been explored.
- At present, the largest corpus exchange platform in China is Tmxmall Corpus Mall.
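
To make the alignment task in the first research category more concrete, here is a minimal sketch of length-based sentence alignment, loosely in the spirit of the approach cited above as [Gale 1993]. It is not that method's probabilistic model: the function names (`align`, `length_cost`), the bead penalties, and the scaling factor are all illustrative assumptions. The idea is only that sentences which are translations of each other tend to have similar lengths, so dynamic programming can choose the cheapest sequence of 1-1, 1-2, 2-1, 1-0, and 0-1 pairings.

```python
from typing import List, Optional, Tuple

# Allowed pairing shapes (source sentences, target sentences) with a flat
# penalty for each. The values are illustrative assumptions, not tuned.
BEADS = {(1, 1): 0.0, (1, 0): 5.0, (0, 1): 5.0, (2, 1): 2.0, (1, 2): 2.0}


def length_cost(src_len: int, tgt_len: int) -> float:
    """Penalize pairings whose two sides differ a lot in character length."""
    total = src_len + tgt_len
    return 0.0 if total == 0 else abs(src_len - tgt_len) / total * 10.0


def align(src: List[str], tgt: List[str]) -> List[Tuple[List[str], List[str]]]:
    """Return the cheapest sequence of (source group, target group) pairings."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back: List[List[Optional[Tuple[int, int]]]] = [
        [None] * (m + 1) for _ in range(n + 1)
    ]
    cost[0][0] = 0.0

    # Forward dynamic programming over prefixes of both sentence lists.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + penalty + length_cost(s_len, t_len)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)

    # Walk the back-pointers from the end to recover the pairings.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    english = ["Hello.", "How are you today?", "Fine."]
    french = ["Bonjour.", "Comment allez-vous", "aujourd'hui ?", "Bien."]
    for src_group, tgt_group in align(english, french):
        print(src_group, "<->", tgt_group)
```

A real aligner would replace the hand-picked costs with a statistical model of length ratios and often add lexical cues. The output of such a step is what turns a pair of raw translated texts into a usable parallel corpus.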