What Is OCR (Optical Character Recognition)?
OCR (Optical Character Recognition) is the process in which an electronic device (such as a scanner or digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then uses character recognition methods to translate those shapes into computer text. In other words, for printed characters, the text of a paper document is optically converted into a black-and-white dot-matrix image file, and recognition software then converts the text in that image into a text format that word processing software can edit further. How to correct errors or use auxiliary information to improve recognition accuracy is the most important problem in OCR, which is why the term Intelligent Character Recognition (ICR) has also emerged. The main indicators of an OCR system's performance are rejection rate, misrecognition rate, recognition speed, user-friendliness, product stability, ease of use, and feasibility.
Optical character recognition
- The concept of OCR was first proposed by the German scientist Tauschek in 1929; the American scientist Handel later also proposed the idea of using technology to recognize text. The earliest research on printed Chinese character recognition was done by Casey and Nagy of IBM. In 1966 they published the first article on Chinese character recognition, which used template matching to identify 1,000 printed Chinese characters.
- Countries around the world began to study OCR as early as the 1960s and 1970s. Early research concentrated on recognition methods, and the only characters recognized were the digits 0 to 9. Japan, which also uses square characters, is one example: it began studying the basic theory of OCR around 1960, at first with digits as the target. Between 1965 and 1970 some simple products appeared, such as systems that recognized printed postal codes on mail to help post offices sort letters by region. This is why writing the postal code as part of the address is still promoted in many countries.
- In the early 1970s, Japanese scholars began to study Chinese character recognition and did a great deal of work. China's research on OCR technology started relatively late: research on recognizing digits, English letters, and symbols began only in the 1970s, and research on Chinese character recognition began in the late 1970s. In 1986 China launched the "863" high-tech research plan, and the study of Chinese character recognition entered a substantive stage. Tsinghua University's
- Because of the popularity and wide application of scanners, OCR software only needs
- The purpose of an OCR system is simple: to convert an image so that the graphics in it are kept as images, while any tables and text in it are converted into computer text. This reduces the storage required and allows the recognized text to be reused and analyzed. It also saves the labor and time of keyboard entry.
- The process from image to output consists of image input, image pre-processing, text feature extraction, and comparison-based recognition, followed by manual correction of misrecognized text before the result is output.
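The stages listed above can be sketched in a few lines of Python. This is only a toy illustration, not how any particular OCR product works: the 3x3 `TEMPLATES`, the fixed threshold of 128, and all function names are assumptions made for the example, and the comparison step uses plain pixel-difference template matching.

```python
from typing import Dict, List

Bitmap = List[str]  # each row is a string of '#' (ink) and '.' (background)

# Hypothetical 3x3 reference glyphs; a real system would hold far more templates.
TEMPLATES: Dict[str, Bitmap] = {
    "I": [".#.", ".#.", ".#."],
    "L": ["#..", "#..", "###"],
    "T": ["###", ".#.", ".#."],
}


def binarize(gray: List[List[int]], threshold: int = 128) -> Bitmap:
    """Pre-processing: map grayscale pixels (0 = black, 255 = white) to ink or background."""
    return ["".join("#" if px < threshold else "." for px in row) for row in gray]


def extract_features(glyph: Bitmap) -> List[int]:
    """Feature extraction: flatten the bitmap into a 0/1 vector."""
    return [1 if ch == "#" else 0 for row in glyph for ch in row]


def recognize(glyph: Bitmap) -> str:
    """Comparison: return the template name with the fewest differing pixels."""
    features = extract_features(glyph)

    def distance(name: str) -> int:
        reference = extract_features(TEMPLATES[name])
        return sum(a != b for a, b in zip(features, reference))

    return min(TEMPLATES, key=distance)


def correct(text: str, fixes: Dict[int, str]) -> str:
    """Manual correction: an operator overrides characters at the given positions."""
    chars = list(text)
    for index, replacement in fixes.items():
        chars[index] = replacement
    return "".join(chars)


# Image input: a noisy scan of "L" whose bottom-right pixel came out too light.
scan = [[0, 255, 255], [10, 255, 255], [0, 20, 200]]
print(recognize(binarize(scan)))
```

The noisy glyph differs from the "L" template in one pixel and from the others in more, so nearest-template matching still picks "L". Errors that survive this step are what the final manual correction stage is for.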
- 1. Setting the resolution is an important prerequisite for text recognition. Generally, the more image information the scanner provides, the more easily the recognition software can produce a result. However, a higher scanning resolution does not always mean higher recognition accuracy. A resolution of 300 dpi or 400 dpi suits most documents. Match the setting to the original being scanned, and do not set the scanning resolution above the scanner's optical resolution, or you will lose more than you gain. Some typical settings follow, for reference only.
- (1) For size 1, 2, and 3 type, 200 dpi is recommended.
- (2) For size 4 type, 300 dpi is recommended.
- (3) For size 5 and 6 type, 400 dpi is recommended.
- (4) For size 7 and 8 type, 600 dpi is recommended.
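The reference settings above can be written as a small lookup. The sketch below assumes the numbers 1 to 8 are type size numbers in which a larger number means smaller characters. The function names are invented for this example. `glyph_height_px` uses the standard conversion of 72 points per inch to show why smaller type needs a higher resolution to yield the same number of pixels per character.

```python
def recommended_dpi(type_size_number: int) -> int:
    """Return the reference resolution from the list above for a type size number (1 to 8)."""
    if not 1 <= type_size_number <= 8:
        raise ValueError("the reference list only covers sizes 1 to 8")
    if type_size_number <= 3:
        return 200
    if type_size_number == 4:
        return 300
    if type_size_number <= 6:
        return 400
    return 600


def clamp_to_optical(dpi: int, optical_dpi: int) -> int:
    """Never ask for more than the scanner's optical resolution."""
    return min(dpi, optical_dpi)


def glyph_height_px(point_size: float, dpi: int) -> float:
    """Approximate character height in pixels, using 72 points per inch."""
    return point_size * dpi / 72


print(recommended_dpi(5))                         # 400
print(clamp_to_optical(recommended_dpi(8), 300))  # 300
print(glyph_height_px(12, 300))                   # 50.0
```

Clamping matters because a setting above the optical resolution is interpolated by the scanner rather than measured, which adds file size without adding detail.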
- 2. Adjust the brightness and contrast appropriately when scanning so that the scanned file is clearly black and white. This has the greatest effect on the recognition rate. Set brightness and contrast so that the strokes of Chinese characters in the scanned image are thin but unbroken. Before recognition, check the quality of the text in the scanned image. If there are black or dark spots, or the strokes are so thick and dark that they cannot be told apart, the brightness is too low; increase it and try again. If the strokes are uneven or broken, or the outlines of the characters are badly broken, the brightness is too high; reduce it and try again.
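The visual check described above can be approximated roughly in code by measuring how much of the page is ink. This is only a crude proxy for "thin but unbroken strokes", and the cut-off values `low` and `high` are hypothetical numbers chosen for illustration; they would need tuning against real scans.

```python
from typing import List


def ink_ratio(gray: List[List[int]], threshold: int = 128) -> float:
    """Fraction of pixels darker than the threshold (0 = black, 255 = white)."""
    total = sum(len(row) for row in gray)
    if total == 0:
        raise ValueError("empty image")
    dark = sum(1 for row in gray for px in row if px < threshold)
    return dark / total


def brightness_advice(gray: List[List[int]], low: float = 0.05, high: float = 0.35) -> str:
    """Suggest a brightness change from the ink ratio; low and high are assumed cut-offs."""
    ratio = ink_ratio(gray)
    if ratio > high:
        return "increase brightness"  # page is too dark, strokes likely merge
    if ratio < low:
        return "decrease brightness"  # page is too light, strokes likely break
    return "ok"


too_dark = [[0, 0], [0, 30]]
print(brightness_advice(too_dark))
```

A ratio check like this cannot tell merged strokes from a page that simply has dense text, so it is best used to flag scans for a closer look.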
- 3. Choose your scanning software. Good OCR software is the basis of text recognition. In general, do not use the OEM software bundled with the scanner: OEM OCR software has few functions and gives poor results, and some of it does not even support Chinese recognition.
- Use separate image software for scanning. First, OCR software cannot work with every scanner. Second, and most important, an image scanned through the image software's scanning interface is easy to process.
- 4. If the text has formatting such as bold, italics, or first-line indentation, some OCR software will not recognize it, and the formatting will be lost or garbled. If you must scan formatted text, make sure that the recognition software supports formatted-text recognition. You can also turn off style recognition so that the software concentrates on finding the correct characters, regardless of font and font formatting.
- 5. When scanning newspapers or other translucent documents, the text on the back shows through the paper and confuses the glyphs, which greatly hinders recognition. In this case, cover the back of the original with a sheet of black paper and increase the contrast when scanning to reduce the effect of the show-through and improve recognition accuracy.
- 6. Originals for text scanning are normally black and white, but the scanning mode is often set to grayscale. Especially when the original is of poor quality, scanning in grayscale and recognizing after the scanning software has finished processing gives better accuracy. Note that OCR software may determine the threshold by itself, and a difference of a few percentage points in the threshold can affect recognition. The resulting image file will of course be much larger than a black-and-white file. When scanning a large number of documents, test the originals first to find the optimal threshold percentage.
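One common way for software to pick a threshold from a grayscale scan is Otsu's method, which chooses the gray level that best separates dark and light pixels. The source does not say which method any given OCR product uses, so the sketch below is an illustration of that named technique only. The function names are assumptions for this example.

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Return the gray level t (0 to 255) that maximizes between-class variance.

    Pixels with value <= t are treated as dark (ink), the rest as light (paper).
    """
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    dark_count, dark_sum = 0, 0
    for t in range(256):
        dark_count += hist[t]
        dark_sum += t * hist[t]
        if dark_count == 0:
            continue  # no dark pixels yet
        light_count = total - dark_count
        if light_count == 0:
            break  # every pixel is already in the dark class
        dark_mean = dark_sum / dark_count
        light_mean = (sum_all - dark_sum) / light_count
        score = dark_count * light_count * (dark_mean - light_mean) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def as_percent(threshold: int) -> float:
    """Express a 0 to 255 threshold as a percentage, as scanner dialogs often do."""
    return round(100 * threshold / 255, 1)


sample = [10] * 50 + [200] * 50  # half ink, half paper
t = otsu_threshold(sample)
print(t, as_percent(t))
```

For a batch of similar originals, running this on a few test pages gives a starting value that can then be adjusted by a few percentage points while checking the recognition result.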
- 7. For originals that mix graphics and text, first check whether the recognition software supports automatic layout analysis. If it does, the OCR software automatically determines the content, position, and order of the text during recognition, and the text is recognized in the marked order.
- 8. Selecting the scanning areas manually gives better recognition results. After setting the parameters, run a preview and then select the scanning areas. Do not select the whole article as a single area: current layouts mix graphics and text for visual effect, and scanning them as one image impairs OCR. Instead, divide the layout into N areas according to the actual situation. Within each area the font and size of the text should be the same, there should be no graphics or images, and every line should have the same width; if line lengths differ, subdivide further. Generally, up to 10 areas can be selected per scan. Set the order of the recognition areas sensibly for each case. This process may seem tedious, but it is an effective way to improve the recognition rate. Note that recognition areas must not overlap, and every area should be selected before recognition starts. Done this way, the recognition rate is generally above 95%. After proofreading the misrecognized characters, you can move the text into word processing software for further work.
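The rules for manual selection (no overlaps, a sensible order, at most 10 areas) are easy to check mechanically. The sketch below is illustrative: the `Region` class, its pixel coordinates, and the top-to-bottom, left-to-right ordering are assumptions for this example, and multi-column layouts may need a different order.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple

MAX_REGIONS = 10  # the text above says up to 10 areas per scan


@dataclass(frozen=True)
class Region:
    name: str
    left: int
    top: int
    right: int   # exclusive
    bottom: int  # exclusive


def overlaps(a: Region, b: Region) -> bool:
    """True if the two rectangles share at least one pixel."""
    return a.left < b.right and b.left < a.right and a.top < b.bottom and b.top < a.bottom


def find_overlaps(regions: List[Region]) -> List[Tuple[str, str]]:
    """List every pair of region names that overlap."""
    return [(a.name, b.name) for a, b in combinations(regions, 2) if overlaps(a, b)]


def reading_order(regions: List[Region]) -> List[Region]:
    """Order regions top to bottom, then left to right (a simplifying assumption)."""
    if len(regions) > MAX_REGIONS:
        raise ValueError(f"at most {MAX_REGIONS} regions per scan")
    return sorted(regions, key=lambda r: (r.top, r.left))


title = Region("title", 0, 0, 100, 20)
col1 = Region("col1", 0, 20, 50, 100)
col2 = Region("col2", 40, 20, 100, 100)  # starts 10 px inside col1
print(find_overlaps([title, col1, col2]))
print([r.name for r in reading_order([col2, col1, title])])
```

Here `col2` begins inside `col1`, so the pair is reported and one of the two selections should be redrawn before recognition.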
- 9. Place the original so that the text lies in the middle of the scan start line, to minimize distortion from the optical lens. Keep the scanner glass clean and undamaged. If the text is tilted, or part of the original is typeset irregularly, you must correct it with the rotation tool after scanning; otherwise the OCR software treats horizontal strokes as oblique strokes, and recognition accuracy drops sharply. Even so, place originals as straight as possible, because correcting with the rotation tool reduces image quality and makes recognition harder.
- 10. First preview the overall layout and select the area to scan. Then use the "zoom preview" tool to enlarge a small block to full screen, examine the contrast and darkness of the text, and adjust the "threshold" accordingly. The text should end up clear, neither too thick (strokes merging) nor too faint (strokes breaking). A "threshold" of around 80 is generally appropriate. Then scan.
- 11. Use editing tools to erase stains from the image, along with illustrations and dividers in the original layout that do not need recognition, so that the image contains nothing but text. This can greatly improve the recognition rate and reduce the corrections needed afterward.
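Small stains can also be removed automatically before recognition. The sketch below shows one simple approach, not a feature of any specific product: it erases connected groups of ink pixels smaller than `min_size`. The bitmap format ('#' for ink, '.' for background), the 4-neighbour connectivity, and the default `min_size` are assumptions for this example.

```python
from collections import deque
from typing import List


def despeckle(bitmap: List[str], min_size: int = 3) -> List[str]:
    """Erase connected ink components with fewer than min_size pixels."""
    height = len(bitmap)
    width = len(bitmap[0]) if bitmap else 0
    grid = [list(row) for row in bitmap]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if grid[y][x] != "#" or seen[y][x]:
                continue
            # Breadth-first search collects one connected component.
            component = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and grid[ny][nx] == "#" and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(component) < min_size:
                for cy, cx in component:
                    grid[cy][cx] = "."
    return ["".join(row) for row in grid]


page = [
    "#....",
    "..##.",
    "..##.",
    ".....",
]
for row in despeckle(page):
    print(row)
```

Set `min_size` with care: a value that is too large also erases punctuation and small detached strokes, which are legitimate parts of the text.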
- 12. Originals with lower print quality, such as newspapers, do not scan as clean black and white: many black spots appear, and strokes stick together. Both are serious obstacles in Chinese character recognition and severely reduce its accuracy. To get better results, adjust the tone carefully and rescan several times. In addition, because newsprint is thin and mostly of low quality, the scanner lid cannot press the newspaper fully flat (a gap remains), so newspapers generally recognize less well than magazines. A remedy is to press the newspaper down with one or two 16K magazines, which gives reasonably good results.