What Is Natural Language Processing?
Natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language, drawing on linguistics, computer science, and mathematics. Because research in this field deals with natural language, the language people use every day, it is closely related to linguistics, but with an important difference: natural language processing is not the general study of natural language itself, but the development of computer systems, especially software systems, that can communicate effectively in natural language. It is therefore a part of computer science.
- Chinese name: 自然语言处理
- Foreign name: natural language processing
- Application areas: computer science, artificial intelligence
- Abbreviation: NLP
Detailed introduction to natural language processing
- Language is an essential characteristic that distinguishes humans from other animals: of all living things, only humans have the capacity for language. Much of human intelligence is closely tied to language. Human logical thinking takes linguistic form, and most human knowledge is recorded and passed down in language. Natural language processing is therefore an important, even core, part of artificial intelligence.
- Communicating with computers in natural language has long been sought after. It has both obvious practical significance and important theoretical significance: people could use computers in the language they are most accustomed to, without spending time and energy learning the various unnatural, unfamiliar computer languages, and it would also help us further understand human language ability and the mechanisms of intelligence.
- Achieving natural language communication between humans and computers means that computers must be able both to understand the meaning of natural language text and to express given intentions and ideas in natural language text. The former is called natural language understanding, the latter natural language generation, and natural language processing generally comprises these two parts. Historically, there has been much more research on natural language understanding than on natural language generation, but this situation has been changing.
- Whether natural language understanding or natural language generation, both are far from being as simple as people originally imagined; they are very difficult. Given the current state of theory and technology, a universal, high-quality natural language processing system remains a long-term goal, but for specific applications, practical systems with considerable natural language processing ability have appeared, and some have been commercialized or even industrialized. Typical examples include natural language interfaces to multilingual databases and expert systems, various machine translation systems, full-text information retrieval systems, automatic summarization systems, and so on.
- Natural language processing, that is, realizing natural-language communication between humans and computers, or realizing natural language understanding and natural language generation, is very difficult. The root cause of the difficulty is the wide variety of ambiguity that exists at every level of natural language text and dialogue.
- A Chinese text is, formally, a string of Chinese characters (including punctuation marks and so on). Characters compose words, words compose phrases, phrases compose sentences, and sentences in turn compose paragraphs, sections, chapters, and articles. At every one of these levels (character, word, phrase, sentence, paragraph, and so on), and in moving from each level to the next, there is ambiguity and polysemy: a string of the same form can, in different scenarios or contexts, be understood as different sequences of words or phrases with different meanings. In general, most of these ambiguities can be resolved from the corresponding context and scenario, so that no real ambiguity remains; this is why we usually do not notice the ambiguity of natural language and can communicate correctly. At the same time, we also see that resolving ambiguity requires an extremely large amount of knowledge and reasoning. How to collect and organize this knowledge completely, how to find suitable forms to store it in computer systems, and how to use it effectively to eliminate ambiguity are all extremely difficult tasks. This is not something a few people can accomplish in a short time; it is long-term, systematic work.
- The above says that one Chinese text, one string of Chinese characters (including punctuation marks and so on), may have multiple meanings; this is a major difficulty and obstacle in natural language understanding. Conversely, an identical or similar meaning can also be expressed by many different texts or character strings.
- There is therefore a many-to-many relationship between the forms (strings) of natural language and their meanings. This is in fact part of the charm of natural language. From the standpoint of computer processing, however, we must eliminate ambiguity, and some consider this the central problem of natural language understanding: converting potentially ambiguous natural language input into some kind of unambiguous internal computer representation.
- Because ambiguity is so widespread, eliminating it requires a great deal of knowledge and reasoning, which poses great difficulties for linguistics-based and knowledge-based methods. Although these methods were the mainstream of natural language processing research for decades and produced many theoretical and methodological achievements, they achieved little in building systems capable of handling large-scale real text; most of the systems developed were small-scale research demonstration systems.
- At present there are two problems. On the one hand, grammars so far have been limited to analyzing isolated sentences; the constraints and effects of context and the speech situation on a sentence have not been studied systematically, and there are as yet no clear rules for how the same sentence means different things on different occasions or from different speakers, so research on pragmatics must be strengthened to solve this gradually. On the other hand, people do not understand a sentence by grammar alone; they also draw on a great deal of relevant knowledge, including everyday knowledge and specialized knowledge, which cannot all be stored in a computer. Therefore, a text understanding system can only be built within a limited vocabulary, a limited set of sentence patterns, and specific topics; as the storage capacity and speed of computers increase greatly, it may become possible to expand this scope appropriately.
- The problems above have become major obstacles to applying natural language understanding in machine translation, and they are one reason why the quality of today's machine translation systems falls far short of the ideal; and translation quality is the key to the success or failure of a machine translation system. The Chinese mathematician and linguist Professor Zhou Haizhong pointed out in his classic paper "Fifty Years of Machine Translation" that to improve the quality of machine translation, the first problem to solve is the language itself, not program design; building a machine translation system out of programs alone certainly cannot improve translation quality. Moreover, as long as humans do not yet understand how the brain performs fuzzy recognition and logical judgment of language, machine translation cannot possibly reach the level of "faithfulness, expressiveness, and elegance".
History of Natural Language Processing
- The earliest research on natural language understanding was machine translation. In 1949, the American scientist Warren Weaver first proposed the idea of machine translation. In the 1960s there was large-scale machine translation research abroad that consumed enormous sums, but people clearly underestimated the complexity of natural language, and the theory and technology of language processing were still immature, so little progress was made. The main method was to store a large dictionary of corresponding translations of words and phrases in two languages, match entries one to one during translation, and technically adjust only the word order of the output. But translating everyday language is far from that simple; in many cases one must consider the meaning before and after a sentence to translate it correctly.
- Beginning around the 1990s, great changes took place in the field of natural language processing. Two distinct characteristics of this change are:
- (1) On the input side, a natural language processing system must be able to process large-scale real text, rather than just a few entries and typical sentences as in earlier research systems. Only then can the developed system have real practical value.
- (2) On the output side, given that truly understanding natural language is very difficult, the system is not required to understand the text deeply, but it must be able to extract useful information from it: for example, automatically extracting index terms, filtering, retrieval, automatically extracting important information, automatic summarization of natural language text, and so on.
- At the same time, because of the emphasis on "large scale" and "real text", basic work in the following two areas has also been emphasized and strengthened.
- (1) The development of large-scale corpora of real text. Large-scale corpora of real text, annotated at various depths, are the foundation for studying the statistical properties of natural language; without them, statistical methods are water without a source.
- (2) The compilation of large-scale dictionaries rich in information. Machine-usable dictionaries on the scale of tens of thousands to hundreds of thousands of words, containing rich information (such as word collocations), are very important for natural language processing.
Natural language processing related content
- Natural language processing (NLP) is an area of computer science, artificial intelligence, and linguistics that focuses on the interaction between computers and human (natural) language; it is therefore closely related to the field of human-computer interaction. Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input; others involve natural language generation.
- Modern NLP algorithms are based on machine learning, especially statistical machine learning. The machine learning paradigm differs from earlier approaches to language processing, which usually involved hand-coding large sets of rules directly.
- Many different classes of machine learning algorithms have been applied to natural language processing tasks. These algorithms take as input a large set of "features" generated from the input data. Some of the earliest algorithms, such as decision trees, produced hard if-then rules similar to handwritten rules and are no longer common. More and more research has instead focused on statistical models, which make soft, probabilistic decisions by attaching real-valued weights to each input feature. Such models have the advantage of being able to express the relative certainty of many different possible answers rather than committing to only one, which produces more reliable results when the model is included as a component of a larger system.
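To make the contrast concrete, here is a minimal Python sketch of such a soft, probabilistic model: a tiny logistic classifier over bag-of-words features. The training examples, labels, and learning rate are invented for illustration; a real system would learn weights from a large corpus.

```python
import math
from collections import defaultdict

# Toy labeled examples (invented for illustration): 1 = positive, 0 = negative.
train = [("good great film", 1), ("great acting", 1),
         ("bad boring film", 0), ("boring plot", 0)]

weights = defaultdict(float)  # one real-valued weight per word feature
bias = 0.0

def predict(text):
    """Return P(label = 1 | text) under the logistic model."""
    score = bias + sum(weights[w] for w in text.split())
    return 1.0 / (1.0 + math.exp(-score))

# A few epochs of stochastic gradient descent on the log-loss.
for _ in range(50):
    for text, label in train:
        error = label - predict(text)
        for w in text.split():
            weights[w] += 0.1 * error
        bias += 0.1 * error

print(predict("great film"))   # high probability: a soft, graded decision
print(predict("boring film"))  # low probability, not a hard yes/no rule
```

Unlike a hard if-then rule, the output is a probability, so a larger system can weigh this answer against other evidence.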
- Natural language processing research has gradually shifted from lexical semantics to the semantics of narrative understanding. Human-level natural language processing, however, is an AI-complete problem: solving it is equivalent to solving the central problem of artificial intelligence, making computers as intelligent as people, that is, strong AI. The future of natural language processing is therefore generally bound up with the development of artificial intelligence as a whole. [1]
Natural language processing related technologies
- Data sparseness and smoothing techniques
- Statistical methods over large-scale data combined with a limited training corpus inevitably produce data sparseness, leading to zero-probability problems, in keeping with the classic Zipf's law. For example, when IBM researchers (Brown et al.) trained a trigram model on a 366M-word English corpus, 14.7% of the trigrams and 2.2% of the bigrams in the test corpus had never appeared in the training corpus.
- A definition of the data sparseness problem: "The problem of data sparseness, also known as the zero-frequency problem, arises when analyses contain configurations that never occurred in the training corpus. Then it is not possible to estimate probabilities from observed frequencies, and some other estimation scheme that can generalize from the training data has to be used." (Dagan)
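A minimal Python sketch of how sparseness shows up in practice: count the fraction of test n-grams never seen in training. The two tiny token lists are stand-ins for real training and test corpora.

```python
def ngrams(tokens, n):
    """All contiguous n-token sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Tiny stand-ins for real training and test corpora.
train_tokens = "the cat sat on the mat and the dog sat".split()
test_tokens = "the dog sat on the rug".split()

for n in (2, 3):
    seen = set(ngrams(train_tokens, n))
    test = ngrams(test_tokens, n)
    unseen = sum(1 for g in test if g not in seen)
    print(f"{n}-grams unseen in training: {unseen}/{len(test)}")
```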
- Many attempts and efforts have been made to put the theoretical models into practical use, giving rise to a series of classic smoothing techniques. Their common idea is to discount the conditional probabilities of n-grams that have appeared and redistribute that probability mass to n-grams that have not, so that every n-gram has non-zero probability; after smoothing, the probabilities still sum to 1. The main techniques are as follows.
Add-one (Laplace) Smoothing
- Add-one smoothing, also known as Laplace's law, pretends that every n-gram occurs at least once in the training corpus. Taking the bigram as an example, the formula is:
- $P_{\mathrm{add1}}(w_i \mid w_{i-1}) = \dfrac{C(w_{i-1} w_i) + 1}{C(w_{i-1}) + |V|}$
- where $C(\cdot)$ is a count in the training corpus and $|V|$ is the vocabulary size (the number of distinct words).
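A minimal Python sketch of the add-one estimate above, with a toy token list standing in for a real training corpus:

```python
from collections import Counter

# A toy training corpus standing in for real data.
tokens = "the cat sat on the mat and the dog sat on the mat".split()
unigram = Counter(tokens)
bigram = Counter(zip(tokens, tokens[1:]))
V = len(unigram)  # vocabulary size: number of distinct word types

def p_addone(w_prev, w):
    """Add-one (Laplace) estimate of P(w | w_prev)."""
    return (bigram[(w_prev, w)] + 1) / (unigram[w_prev] + V)

print(p_addone("the", "cat"))  # seen bigram: slightly discounted
print(p_addone("the", "rug"))  # unseen bigram: non-zero instead of zero
```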
Good-Turing Smoothing
- The basic idea of Good-Turing smoothing is to use the frequency of frequencies (how many distinct n-grams occur with each count) to re-estimate counts. An n-gram observed $c$ times is given the adjusted count:
- $c^* = (c + 1)\,\dfrac{N_{c+1}}{N_c}$
- where $N_c$ is the number of distinct n-grams that occur exactly $c$ times in the training corpus.
- A direct practical improvement is not to smooth n-grams whose count exceeds a certain threshold (typically 8 to 10). For further refinements, see "Simple Good-Turing".
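A minimal Python sketch of the Good-Turing adjustment, again on a toy corpus. Real systems (such as Simple Good-Turing) first smooth the $N_c$ curve, since at higher counts $N_{c+1}$ quickly becomes zero:

```python
from collections import Counter

# Bigram counts from a toy corpus; N[c] is the "frequency of frequencies",
# the number of distinct bigrams observed exactly c times.
tokens = "the cat sat on the mat and the dog sat on the mat".split()
counts = Counter(zip(tokens, tokens[1:]))
N = Counter(counts.values())

def good_turing(c):
    """Adjusted count c* = (c + 1) * N_{c+1} / N_c; falls back to the raw
    count when N_c or N_{c+1} is zero (real systems smooth the N_c curve)."""
    if N[c] == 0 or N[c + 1] == 0:
        return float(c)
    return (c + 1) * N[c + 1] / N[c]

for c in sorted(N):
    print(c, "->", round(good_turing(c), 3))
```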
Interpolation Smoothing
- Both Add-one and Good-Turing smoothing treat all unseen n-grams alike, which is inevitably unreasonable, since different unseen events need not be equally probable. Linear interpolation smoothing instead combines higher-order and lower-order models linearly, using low-order n-gram models to interpolate the high-order one: when there is not enough data to estimate the probability of a high-order n-gram, the lower-order models can often provide useful information. For a trigram model (the formula originally shown in Figure 1):
- $P(w_i \mid w_{i-2} w_{i-1}) = \lambda_3 \hat{P}(w_i \mid w_{i-2} w_{i-1}) + \lambda_2 \hat{P}(w_i \mid w_{i-1}) + \lambda_1 \hat{P}(w_i)$, with $\lambda_1 + \lambda_2 + \lambda_3 = 1$,
- where the $\hat{P}$ are maximum-likelihood estimates. In the extended, context-dependent form (originally shown in Figure 2), the weights depend on the conditioning history, i.e. each $\lambda_j$ becomes $\lambda_j(w_{i-2} w_{i-1})$.
- The weights $\lambda$ can be estimated with the EM algorithm. The specific steps are as follows:
- First, split the data into three parts: training data, held-out data, and test data;
- Then, build the initial language models from the training data and choose initial weights $\lambda$ (for example, all equal);
- Finally, iteratively optimize the $\lambda$ with the EM algorithm to maximize the probability of the held-out data (a sketch is given below).
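A minimal Python sketch of the procedure above, reduced to a bigram-unigram interpolation (two weights instead of three, and toy corpora in place of real training and held-out data): the E-step assigns each held-out bigram fractional credit across the component models, and the M-step re-normalizes that credit into new weights.

```python
from collections import Counter

# Toy stand-ins: training data builds the component models, held-out data
# tunes the interpolation weights, and test data would measure quality.
train = "the cat sat on the mat and the dog sat on the mat".split()
held_out = "the dog sat on the mat".split()

unigram = Counter(train)
bigram = Counter(zip(train, train[1:]))
total = sum(unigram.values())

def p_uni(w):
    return unigram[w] / total

def p_bi(w_prev, w):
    return bigram[(w_prev, w)] / unigram[w_prev] if unigram[w_prev] else 0.0

lam = [0.5, 0.5]  # initial weights [bigram, unigram], chosen uniform
for _ in range(20):
    expected = [0.0, 0.0]
    for w_prev, w in zip(held_out, held_out[1:]):
        parts = [lam[0] * p_bi(w_prev, w), lam[1] * p_uni(w)]
        z = sum(parts)
        for i in range(2):            # E-step: fractional credit per model
            expected[i] += parts[i] / z
    s = sum(expected)
    lam = [e / s for e in expected]   # M-step: re-normalize into new weights

print("lambda_bigram=%.3f, lambda_unigram=%.3f" % tuple(lam))
```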
Overview of Natural Language Processing
Basic Theory of Natural Language Processing
- Automata, formal logic, statistics, machine learning, Chinese linguistics, formal grammar theory
Natural language processing language resources
- Corpora, dictionaries
Key technologies for natural language processing
- Chinese character encoding, lexical analysis, syntactic analysis, semantic analysis, text generation, speech recognition
Natural language processing application systems
- Text classification and clustering, information retrieval and filtering, information extraction, question answering systems, pinyin-to-Chinese-character conversion systems, new information detection
Natural language processing controversy
- Although the new trends described above have brought results to the field of natural language processing, from the standpoint of theory and method, the difficulty of collecting, organizing, representing, and effectively applying large amounts of knowledge has pushed these systems to rely ever more heavily on statistical and other "simple" methods or techniques, and those methods now seem to be approaching their limits. A widely debated question in the natural language processing community is therefore whether further major progress requires mainly a theoretical breakthrough, or whether it can be achieved by improving and optimizing existing methods. The answer is unclear. In general, more linguists favor the former opinion and more engineers the latter. The answer may lie in the middle: "deep" methods based on knowledge and reasoning should be combined with "shallow" methods based on statistics.
Datasets for natural language processing
- The basis of natural language processing is the variety of natural language processing datasets, such as tc-corpus-train (a corpus training set); Chinese and English news corpora for text classification research; multi-dimensional Chinese VSM models in ARFF format generated by feature selection methods such as information gain and chi-square; Chinese DBLP resources of 10,000 randomly extracted papers; Chinese word segmentation lexicons for unsupervised segmentation algorithms; UCI evaluation and ranking data; sentiment analysis datasets with initialization instructions; and so on.
Natural language processing tools
OpenNLP
- OpenNLP is a Java-based machine learning toolkit for processing natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, and parsing.
FudanNLP
- FudanNLP is a toolkit developed mainly for Chinese natural language processing. It also contains machine learning algorithms and datasets for implementing these tasks. The toolkit and the datasets it contains are licensed under the LGPL 3.0; the development language is Java.
- Features:
- 1. Text classification, news clustering
- 2. Chinese word segmentation, part-of-speech tagging, named entity recognition, keyword extraction, dependency parsing, time phrase recognition
- 3. Structured learning, online learning, hierarchical classification, clustering, exact inference
Language Technology Platform (LTP)
- The Language Technology Platform (LTP) is a suite of Chinese language processing systems developed over ten years by the Social Computing and Information Retrieval Research Center at Harbin Institute of Technology. LTP defines an XML-based representation of language processing results and, on top of it, provides a rich and efficient set of bottom-up Chinese language processing modules (six core Chinese processing technologies covering lexical, syntactic, semantic, and other analyses), together with application programming interfaces and visualization tools based on dynamic link libraries (DLLs); it can also be used as a Web Service.
Technical difficulties of natural language processing
Word boundary definition
- In spoken language, words usually run together, and word boundaries are typically defined by choosing the segmentation that makes the given context most fluent and grammatically correct. In written Chinese, there are no delimiters between words at all.
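As an illustration, here is a minimal Python sketch of forward maximum matching, a classic baseline for guessing word boundaries in unsegmented Chinese text. The tiny dictionary is invented for the example; real segmenters use large lexicons plus statistical models.

```python
# A tiny, invented dictionary; real segmenters use large lexicons
# plus statistical models.
DICT = {"自然", "语言", "处理", "自然语言", "自然语言处理", "应用"}
MAX_LEN = max(len(w) for w in DICT)

def forward_max_match(text):
    """Greedily take the longest dictionary word starting at each position,
    falling back to a single character when nothing matches."""
    words = []
    i = 0
    while i < len(text):
        for j in range(min(len(text), i + MAX_LEN), i, -1):
            if text[i:j] in DICT or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

print(forward_max_match("自然语言处理的应用"))
# -> ['自然语言处理', '的', '应用']
```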
Word sense disambiguation
- Many words have more than one meaning, so we must choose the sense that makes the sentence read most naturally.
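A minimal Python sketch of one classic approach, the simplified Lesk algorithm: choose the sense whose dictionary gloss overlaps most with the words of the surrounding sentence. The two glosses for "bank" are invented for illustration; a real system would use a dictionary such as WordNet.

```python
# Invented glosses for two senses of "bank"; a real system would use a
# dictionary such as WordNet.
SENSES = {
    "bank (finance)": "an institution that accepts deposits and lends money",
    "bank (river)": "sloping land beside a body of water",
}

def lesk(sentence):
    """Pick the sense whose gloss shares the most words with the context."""
    context = set(sentence.lower().split())
    return max(SENSES,
               key=lambda s: len(context & set(SENSES[s].split())))

print(lesk("He sat on the bank of the river watching the water"))
# -> 'bank (river)'
```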
Syntactic ambiguity
- The grammars of natural languages are often ambiguous: a single sentence may admit multiple parse trees, and we must rely on semantic and contextual information to choose the most appropriate one.
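A minimal Python sketch of this phenomenon: a CKY-style chart that counts parse trees under a toy grammar (rules, lexicon, and sentence are invented for the example), showing that the classic sentence "I saw the man with the telescope" has two analyses, with the prepositional phrase attached either to the verb phrase or to the noun phrase.

```python
from collections import defaultdict

# A toy grammar in Chomsky normal form (rules, lexicon, and sentence are
# invented for illustration).
RULES = {
    ("NP", "VP"): "S",
    ("V", "NP"): "VP",
    ("VP", "PP"): "VP",
    ("NP", "PP"): "NP",
    ("Det", "N"): "NP",
    ("P", "NP"): "PP",
}
LEXICON = {"I": "NP", "saw": "V", "the": "Det",
           "man": "N", "with": "P", "telescope": "N"}

def count_parses(words, goal="S"):
    """CKY-style chart that counts the distinct parse trees of each span."""
    n = len(words)
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n)]
    for i, w in enumerate(words):
        chart[i][i + 1][LEXICON[w]] = 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            for k in range(i + 1, i + span):
                for (b, c), a in RULES.items():
                    chart[i][i + span][a] += (chart[i][k][b]
                                              * chart[k][i + span][c])
    return chart[0][n][goal]

print(count_parses("I saw the man with the telescope".split()))  # -> 2
```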
Defective or irregular input
- For example, foreign or regional accents in speech processing, or spelling, grammatical, and optical character recognition (OCR) errors in text processing.
Speech acts and plans
- Sentences often mean more than their literal content. For example, to "Can you pass the salt?" a good response is to pass the salt; in most contexts, answering simply "Yes, I can" would be a poor response, although "No" or "It's too far, I can't reach it" would be acceptable. Likewise, if a course was not offered last year, then to the question "How many students failed this course last year?" the answer "The course was not offered last year" is better than the literally correct "zero".