What Is Speech Processing?

Speech signal processing is a general term for the various processing technologies used to study the speech production process, the statistical characteristics of speech signals, automatic speech recognition, machine synthesis of speech, and speech perception. Because modern speech processing is based on digital computation and implemented with microprocessors, signal processors, or general-purpose computers, it is also called digital speech signal processing.

Chinese name: speech processing
Foreign name: speech signal processing
Applied discipline: communication

Definition

The study of speech signal processing originated in simulations of the vocal organs. In 1939, H. Dudley of the United States demonstrated a simple system simulating the speech production process, which later developed into a digital model of the vocal tract. This model can be used to analyze the spectra and various parameters of speech signals and to conduct research on communication coding and data compression. It can also synthesize speech signals from the spectral characteristics or parameter trajectories obtained by analysis, thereby achieving machine speech synthesis. Speech analysis techniques can likewise be used to realize automatic recognition of speech and automatic identification of speakers. Combined with artificial intelligence techniques, automatic recognition of sentences and automatic understanding of language can also be realized, producing human-computer voice interaction systems and truly giving the computer an auditory function.
Linguistic information is mainly contained in the parameters of the speech signal, so extracting these parameters accurately and quickly is the key to speech signal processing. Commonly used parameters include formant amplitude, frequency, and bandwidth; pitch; and the discrimination of voiced sounds from noise. Later, parameters such as linear prediction coefficients, vocal-tract reflection coefficients, and cepstral parameters were proposed. These parameters reflect only average characteristics of the articulation process, whereas actual speech changes quite rapidly and must be described as a non-stationary random process. Methods for analyzing the non-stationary parameters of speech signals therefore developed rapidly after the 1980s: a set of fast algorithms was proposed, along with new algorithms that use optimization criteria to synthesize the statistical analysis parameters of signals, with good results.
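As a rough illustration (not part of the original entry), the linear prediction coefficients mentioned above can be computed from a frame's autocorrelation with the Levinson-Durbin recursion. The sketch below uses a synthetic damped-sinusoid frame in place of real speech; all names are illustrative.

```python
import math

def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one speech frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LPC normal equations by the Levinson-Durbin recursion.

    Returns (a, err): a[k-1] is the coefficient predicting s[n] from
    s[n-k], and err is the final prediction-error energy.
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                     # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)              # error shrinks at each order
    return a[1:], err

# Synthetic 200-sample "frame": a damped sinusoid standing in for speech.
frame = [math.exp(-0.01 * n) * math.sin(0.3 * n) for n in range(200)]
coeffs, err = levinson_durbin(autocorrelation(frame, 10), 10)
```

Because the frame is strongly predictable, the residual energy `err` comes out far smaller than the frame energy `r[0]`, which is exactly why such coefficients are effective for the coding and compression applications described above.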
As speech processing moved toward practical use, many algorithms were found to be weak against environmental interference, so maintaining the performance of speech signal processing in noisy environments became an important issue and spurred research on speech enhancement. A number of interference-resistant algorithms have since appeared. At present, speech signal processing is increasingly integrated with research on intelligent computing and intelligent robots, and has become an important branch of intelligent information technology.
Speech signal processing is a comprehensive, multidisciplinary technology. It is based on experiments in physiology, psychology, linguistics, and acoustics; guided by the theories of information theory, cybernetics, and systems theory; and has developed into a new discipline through modern techniques such as signal processing, statistical analysis, and pattern recognition. Textbooks such as Speech Analysis, Synthesis and Perception (J. L. Flanagan, 1965), Linear Prediction of Speech (J. D. Markel and A. H. Gray, 1976), and Digital Processing of Speech Signals (L. R. Rabiner and R. W. Schafer, 1978) comprehensively reflect the basic theories, methods, and results of this discipline. An outline of experimental phonetics edited by the Chinese scholars Wu Zongji and Lin Maocan gives detailed experimental methods and data on the physical, physiological, and psychological bases of speech perception and on the characteristics of vowels, consonants, and tones. Research on the cochlea of the auditory organ, begun in the late 1980s, provided a basis for nonlinear speech processing methods. The rapid development of high-speed signal processors and the successful development of neural-network simulation chips created the material conditions for real-time speech processing systems, allowing many speech processing technologies to be applied in production, national defense, and other sectors.
Speech signal processing has a wide range of applications in the communications and defense sectors. Frequency-response correction and compensation techniques for improving speech quality in communications, coding and compression techniques for improving efficiency, and noise cancellation and interference suppression techniques for improving communication conditions are all closely related to speech processing. In defense communications and command, speech processing enables secure voice-band communication under various conditions; integrated voice and data communication in computer networks; speech recognition devices for noisy environments (for example, high-performance fighters, helicopters, and battlefield command posts); noise-cancellation devices to overcome strong interference that degrades speech; speaker recognition and speaker verification; and interactive speech recognition/synthesis interfaces for advanced air traffic control, all of which are important parts of modern command automation. In the financial sector, speaker recognition and speech recognition have begun to be used to provide automatic deposit and withdrawal services driven by the user's voice. In instrumentation and automated production control, speech synthesis is used to read out measurement data and fault warnings. As speech processing technology develops, applications in more sectors can be expected.
Although research on speech processing has a history of nearly 50 years and has achieved many results, it still faces a series of practical problems in theory and method. In coding, medium-rate coding can already provide satisfactory communication quality; can low-rate coding also overcome its quality limitations and meet telephone-quality requirements? In recognition, there is still no reliable method for segmenting continuous speech, recognizing large vocabularies, or recognizing arbitrary speakers. In speech understanding, there is no unified computational method for the qualitative description and quantitative estimation of semantic information. These are important directions for future research.
Figure 1 Schematic diagram of speech recognition technology

Speech understanding

Speech understanding uses artificial intelligence techniques such as knowledge representation and organization to perform automatic sentence recognition and semantic understanding. Its main difference from speech recognition is the full use of grammatical and semantic knowledge.
Speech understanding originated in the United States. In 1971, the Advanced Research Projects Agency (ARPA) funded a huge research project whose goal was called a speech understanding system. Because people have extensive knowledge of speech and can to some extent anticipate what will be said, they are able to perceive and analyze speech. Relying on this extensive knowledge of language and content, and using such knowledge to improve a computer's ability to understand language, is the core of speech understanding research.
The ability to understand can improve a system's performance: it can eliminate noise and babble; it can use the meaning of the context to correct errors and resolve ambiguous semantics; and it can handle ungrammatical or incomplete sentences. Studying speech understanding can therefore be said to be more effective than building a system that painstakingly recognizes each individual word.
In addition to the components required for ordinary speech recognition, a speech understanding system must add a knowledge-processing component. Knowledge processing includes the automatic collection of knowledge, the formation of a knowledge base, and reasoning over and testing of knowledge; ideally the system can also correct its knowledge automatically. Speech understanding can therefore be regarded as the product of combining signal processing with knowledge processing. Speech knowledge includes phonemic, phonetic, prosodic, lexical, syntactic, semantic, and pragmatic knowledge, involving interdisciplinary subjects such as experimental phonetics, Chinese grammar, natural language understanding, and knowledge search.
An early speech understanding system was the HEARSAY system, which uses a common "blackboard" as its knowledge base; surrounding this blackboard is a series of expert systems that extract and search various kinds of knowledge about phonemes, phonetic variation, and so on. A later system that came closer to the project's goals was the HARPY system, which uses a finite-state model of language to combine formerly separate knowledge sources into a single unified network, built by what is called a knowledge compiler. Different understanding systems differ in their strategies and organization for utilizing knowledge.
A perfect speech understanding system is an ideal that people dream of, but it is not a problem that can be completely solved in the short term. Task-oriented speech understanding systems, however, which involve only a limited vocabulary, common sentence patterns, and a fixed group of users, have practical value in certain automation applications, such as airline ticket pre-sale systems, banking, and hotel registration and inquiry systems.

Speech recognition

Speech recognition is a general term for technologies that use a computer to automatically recognize the phonemes, syllables, or words of a speech signal. It is the basis for realizing automatic voice control.
Speech recognition originated in the "dictation typewriter" dream of the 1950s. Having mastered the formant transitions of vowels and the acoustic characteristics of consonants, scientists believed that the process from speech to text could be realized by machine, that is, that ordinary speech could be converted into written text. Theoretical research on speech recognition has been under way for more than 40 years, but it moved into practical applications only after the development of digital and integrated-circuit technology, and many practical results have now been achieved.
Speech recognition generally proceeds through the following steps. (1) Speech preprocessing, including amplitude normalization, frequency-response correction, framing, windowing, and endpoint detection. (2) Analysis of acoustic parameters, including formant frequencies, amplitudes, and other parameters, as well as linear prediction parameters and cepstral parameters. (3) Parameter normalization, mainly normalization along the time axis; commonly used methods are dynamic time warping (DTW) and dynamic programming (DP). (4) Pattern matching, which can use distance criteria, probabilistic rules, or syntactic classification. (5) The recognition decision, in which the final discriminant function gives the recognition result.
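The time-axis normalization step above can be sketched with a minimal scalar DTW (a toy illustration, not from the original entry): in a real recognizer the sequence elements would be acoustic feature vectors per frame, but the dynamic-programming recursion is the same.

```python
def dtw_distance(seq_a, seq_b):
    """Minimum cumulative distance aligning seq_a to seq_b."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # cost[i][j]: best alignment of the first i vs first j elements.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch seq_b
                                 cost[i][j - 1],      # stretch seq_a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

# A slower utterance (the template stretched in time) still matches:
template = [0, 1, 2, 3, 2, 1, 0]
utterance = [0, 0, 1, 1, 2, 3, 3, 2, 1, 0]
print(dtw_distance(template, utterance))  # 0.0 for this pair
```

Because the warping path may repeat elements of either sequence, a word spoken quickly and the same word spoken slowly yield a small distance, which is what makes DTW suitable for the time-normalization step before pattern matching.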
Speech recognition can be classified by recognition content: phoneme recognition, syllable recognition, and word or phrase recognition. It can be classified by vocabulary size: small vocabulary (under 50 words), medium vocabulary (50 to 500 words), large vocabulary (over 500 words), and very large vocabulary (tens of thousands of words). It can be classified by speaking style: recognition of isolated, connected, or continuous speech. It can also be classified by the requirements on the speaker: speaker-dependent recognition, which works only for a specific speaker, and speaker-independent recognition, which works no matter who is speaking. Clearly, the most difficult task is speaker-independent recognition of continuous speech with a large vocabulary.
