What Is Lexical Analysis?
Lexical analysis (English: lexical analysis ) is a process in computer science that converts a sequence of characters into a sequence of words (Token). The program or function for performing lexical analysis is called a lexical analyzer (Lexer), also called a scanner . Lexical analyzers generally exist in the form of functions for the parser to call. Programs that perform lexical analysis tasks are called lexical analysis programs or lexical analyzers or scanners. [1]
- Chinese name
- lexical analysis
- Field
- Compilation principle
- Nature
- Recognize words according to the lexical rules of the language
- Lexical analysis (English: lexical analysis ) is a process in computer science that converts a sequence of characters into a sequence of words (Token). The program or function for performing lexical analysis is called a lexical analyzer (Lexer), also called a scanner . Lexical analyzers generally exist in the form of functions for the parser to call. Programs that perform lexical analysis tasks are called lexical analysis programs or lexical analyzers or scanners. [1]
- Programs that perform lexical analysis tasks are called lexical analysis programs or lexical analyzers or scanners. Scan the source program from left to right, identify various types of words according to the lexical rules of the language, and generate attribute words for the corresponding words. [1]
Introduction to Lexical Analysis
- Lexical analysis (English: lexical analysis ) is a process in computer science that converts a sequence of characters into a sequence of words (Token). The program or function for performing lexical analysis is called a lexical analyzer (Lexer), also called a scanner . Lexical analyzers generally exist in the form of functions for the parser to call.
- The lexical analysis phase is the first phase of the compilation process and is the basis of compilation. The task at this stage is to read the source program character by character from left to right, that is, to scan the character stream that composes the source program and then recognize words (also called word symbols or symbols) according to the word formation rules. The lexical analysis program accomplishes this task. Lexical analysis programs can be generated automatically using tools such as Lex.
- Lexical analysis is the first and necessary stage of the compiler; the core task of lexical analysis is to scan and recognize words and give qualitative and fixed-length processing to the recognized words; the common way to implement lexical analysis programs: automatic generation , Generated manually. [1]
Lexical Analysis Compilation
- Compilation (compilation) 1. The process of generating a target program from a source program written in a source language using a compiler. 2. Use the compiler to generate the action of the target program. Compiling is to change the high-level language into a binary language that the computer can recognize. The computer only knows 1 and 0. The compiler changes the familiar language to binary. The working process of a compiler translating a source program into a target program is divided into five stages: lexical analysis; syntax analysis; semantic checking and intermediate code generation; code optimization; and target code generation. It is mainly used for lexical analysis and grammatical analysis, also known as source program analysis. Syntax errors are found during analysis, and prompt information is given. [2]
Lexical analysis program
- Organize input, scan, analysis, output; [1]
- Receives source programs in the form of strings, scans the source programs in order according to the input order of the source programs, and at the same time scans, identifies words with independent meanings according to the lexical rules of the language, and generates a token stream equivalent to the source program [1]
- (1) As long as the interface is not modified, the modification made by the lexical analyzer will not affect the entire compiler, and the lexical analyzer is easy to maintain; [1]
- (2) The structure of the entire compiler is simple and clear; [1]
- (3) Effective methods and tools can be used for processing. [1]
Function of lexical analysis program
- Programs that perform lexical analysis tasks are called lexical analysis programs or lexical analyzers or scanners. [1]
- Scan the source program from left to right, identify various types of words according to the lexical rules of the language, and generate attribute words for the corresponding words. [1]
Lexical analysis of words
- The word here is a string, which is the smallest unit of source code. The process of generating words from the input character stream is called tokenization. In this process, the lexical analyzer also classifies the words.
- Lexical analyzers usually don't care about the relationship between words (belonging to grammatical analysis). For example: the lexical analyzer can recognize parentheses as words, but does not guarantee that the parentheses match.
- For the following C language expressions:
- sum = 3 + 2; After wordizing it, you can get the following content:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
- Words are often defined using regular expressions. Lexical analyzer generators like lex support regular expressions. The parser reads the input character stream, recognizes morphemes from it, and finally generates different types of words. If an invalid word is found, an error will be reported.
Lexical Analysis Scanner
- The first stage of lexical analysis is the scanner , which is usually based on a finite state automaton. Scanners are able to identify all sequences of characters that may be contained in words that they can process (a single such sequence of characters is the "morpheme" described earlier). For example, the word "integer" can contain all sequences of numeric characters. In many cases, the type of the word can be deduced based on the first non-whitespace character, and then the subsequent characters can be processed one by one until a character that does not belong to the character set of the type of word appears (that is, the longest consistent principle).
Design and Implementation of Lexical Analyzer
- Design and implementation of source lexical analyzer input and preprocessing of the program: [1]
Lexical analysis input buffer
- Paired and halved complementary input buffer modes.
- n: take the whole power of 2; set the flag "eof" at the end of each half zone to indicate the end of the source program read into the half zone; [1]
- B: pointer to the word w; F: pointer to scan w; [1]
Lexical analysis of the input mode of two buffers
- Preprocessor: (action) [1]
- 1) Reduce memory footprint; [1]
- 2) Lighten the burden of the scanner's substantial processing; [1]
- The main tasks of the preprocessor: 1) filter out non-word components in the source program (such as useless spaces; line breaks, etc.); 2) other preprocessing tasks: filter out comments; macro substitution; Embed. [1]
Lexical Analysis Word Generator
- Word generator
- Tokenization is the process of dividing an input string into words, and then classifying the words. The generated words are then used for parsing.
- For example, the following string: The quick brown fox jumps over the lazy dog
- The computer does not know that these are nine English words separated by spaces, it only knows that this is a string of ordinary 43 characters. The morphemes (here English words) can be separated from the input string by a certain method (here, using spaces as separators). The result after segmentation can be expressed in XML as follows:
- <sentence > <word > The </ word >
- <word > quick </ word >
- <word > brown </ word >
- <word > fox </ word >
- <word > jumps </ word >
- <word > over </ word >
- <word > the </ word >
- <word > lazy </ word >
- <word > dog </ word > </ sentence >
- However, a morpheme is just a type of character string (character sequence). To build a word, the parser needs a second-stage evaluator . The evaluator generates a "value" based on the sequence of characters in the morpheme. This "value" and the type of the morpheme constitute a word that can be sent to the parser. Some morphemes, such as parentheses, do not have a "value" and the evaluator function can return nothing. Evaluators for integers, identifiers, and strings are much more complicated. Estimators sometimes suppress morphemes, and suppressed morphemes (such as blank morphemes and annotation morphemes) are not subsequently sent to the parser.
- For example, a source program fragment of a programming language:
- net_worth_future = (assets-liabilities);
- The following word stream may be generated after parsing (spaces are suppressed):
- NAME "net_worth_future" EQUALS
- OPEN_PARENTHESIS
- NAME "assets"
- MINUS
- NAME "liabilities"
- CLOSE_PARENTHESIS
- SEMICOLON
- Although in some cases it is necessary to write a lexical analyzer by hand, in general lexical analyzers are generated using automated tools.