How Do I Choose the Best Speech Recognition Software?

Speech recognition is an interdisciplinary subject. Over the past two decades, speech recognition technology has made significant progress and has begun to move from the laboratory to the market. Over the next 10 years it is expected to enter fields such as industry, home appliances, communications, automotive electronics, medical care, home services, and consumer electronics. The American press rated the application of speech recognition dictation machines in some fields as one of the ten major events in computer development in 1997, and many experts regard speech recognition as one of the ten most important technological developments in information technology between 2000 and 2010. The fields it draws on include signal processing, pattern recognition, probability and information theory, the mechanisms of speech production and hearing, and artificial intelligence.

Speech Recognition

Communicating with machines and having them understand what you say is something people have long dreamed of.
In 1952, Davis and colleagues at Bell Labs built the world's first system able to recognize spoken English digits.
Depending on what is being recognized, speech recognition tasks can be roughly divided into three categories: isolated word recognition, keyword recognition (or keyword spotting), and continuous speech recognition. Isolated word recognition identifies isolated words that are known in advance, such as "on" and "off". Continuous speech recognition identifies arbitrary continuous speech, such as a sentence or a paragraph. Keyword spotting also operates on continuous speech, but instead of recognizing every word it only detects where certain known keywords occur, such as finding the words "computer" and "world" in a paragraph.
According to
Speech recognition methods are mainly pattern matching methods.
During the training phase, the user speaks each word in the vocabulary in turn, and the feature vector of each word is stored as a template in the template library.
In the recognition phase, the feature vector of the input speech is compared with each template in the template library in turn, and the template with the highest similarity is output as the recognition result.
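The two phases above can be sketched in a few lines of Python. This is only an illustration: the text does not say how similarity is measured, so the sketch assumes dynamic time warping (DTW) over sequences of feature vectors and treats the smallest distance as the highest similarity. The function names `dtw_distance`, `train`, and `recognize` are hypothetical.

```python
import math


def dtw_distance(a, b):
    """DTW distance between two sequences of equal-dimension feature vectors.

    Assumption: DTW is used as the similarity measure; the article does not
    name one. A smaller distance means a higher similarity.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean distance between frames
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def train(examples):
    """Training phase: keep one feature sequence per vocabulary word."""
    return dict(examples)


def recognize(templates, features):
    """Recognition phase: return the word whose template is closest to the input."""
    return min(templates, key=lambda word: dtw_distance(templates[word], features))


# Toy two-dimensional "features"; real systems would use acoustic features.
templates = train({
    "on": [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0]],
    "off": [[0.0, 1.0], [0.1, 1.0], [0.0, 0.9], [0.0, 1.0]],
})
result = recognize(templates, [[0.9, 0.1], [1.0, 0.0]])
```

DTW is chosen here because the spoken input and the stored template rarely have the same number of frames; any other similarity measure could be substituted in `recognize`.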
The main problems in speech recognition include:
Identification and understanding of natural language. First of all, continuous speech must be decomposed into words, phonemes, and other units. Second, a rule for understanding semantics must be established.
Large amount of speech information. Speech patterns differ not only between speakers but also for the same speaker. For example, a speaker sounds different when talking casually than when talking carefully, and the way a person speaks changes over time.
Vagueness of speech. Different words may sound similar when the speaker is speaking. This in English and
Front-end processing refers to processing the raw speech before feature extraction in order to partially remove noise and the effects of speaker differences, so that the processed signal better reflects the essential features of speech. The most common front-end steps are endpoint detection and speech enhancement. Endpoint detection separates the speech and non-speech portions of the signal and accurately determines where the speech starts and ends. After endpoint detection, subsequent processing can be applied to the speech portion only, which plays an important role in improving both model accuracy and recognition accuracy. The main task of speech enhancement is to remove the effect of environmental noise on speech. A common approach is Wiener filtering, which performs better than other filters when the noise is strong.
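As a rough illustration of endpoint detection, the sketch below marks a frame as speech when its short-time energy exceeds a fraction of the loudest frame's energy. The energy criterion, the function name `detect_endpoints`, and the parameters `frame_len` and `threshold_ratio` are assumptions made for this example; the text does not specify a particular detection method.

```python
def detect_endpoints(samples, frame_len=160, threshold_ratio=0.1):
    """Return (start, end) sample indices of the speech region, or None.

    Assumption: a simple short-time energy threshold; practical detectors
    usually add smoothing and other cues.
    """
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    if not energies or max(energies) == 0:
        return None  # too short, or pure silence
    threshold = threshold_ratio * max(energies)
    active = [i for i, e in enumerate(energies) if e > threshold]
    return active[0] * frame_len, (active[-1] + 1) * frame_len


# Synthetic signal: 320 silent samples, 480 "speech" samples, 160 silent samples.
signal = [0.0] * 320 + [0.5, -0.5] * 240 + [0.0] * 160
endpoints = detect_endpoints(signal)
```

With this synthetic signal the detected region covers samples 320 to 800, so later stages would process only that slice.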
The extraction and selection of acoustic features is an important part of speech recognition. Feature extraction is both a process of heavy information compression and a process of signal deconvolution; its purpose is to let the pattern classifier separate the classes more easily. Because speech signals vary over time, features must be extracted from short segments of the signal, which is known as short-time analysis. Each segment, within which the signal is treated as stationary, is called a frame, and the offset between successive frames is usually 1/2 or 1/3 of the frame length. The signal is usually pre-emphasized to boost high frequencies, and each frame is windowed to reduce the effects of the edges of the short segment.
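The pre-emphasis, framing, and windowing steps can be sketched as follows. The specific values are assumptions for illustration only: a pre-emphasis coefficient of 0.97, a 400-sample frame, a hop of half the frame length, and a Hamming window. The function names are hypothetical.

```python
import math


def pre_emphasize(samples, alpha=0.97):
    """Boost high frequencies with y[n] = x[n] - alpha * x[n-1]."""
    if not samples:
        return []
    return [samples[0]] + [
        samples[n] - alpha * samples[n - 1] for n in range(1, len(samples))
    ]


def hamming(length):
    """Hamming window coefficients (length must be at least 2)."""
    return [
        0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
        for n in range(length)
    ]


def split_frames(samples, frame_len=400, hop=200):
    """Cut the signal into overlapping frames and window each one.

    hop = frame_len / 2 follows the 1/2 frame offset mentioned in the text.
    """
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append([samples[start + n] * window[n] for n in range(frame_len)])
    return frames


emphasized = pre_emphasize([1.0] * 1000)
frames = split_frames(emphasized)
```

Each windowed frame would then go on to a feature computation step, which this sketch leaves out.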
Language models are mainly divided into two types: rule-based models and statistical models. Statistical language models use probability and statistics to capture the statistical regularities of linguistic units.
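One simple kind of statistical language model is a bigram model, which estimates the probability of a word given the word before it. The sketch below is an assumption-laden example rather than the method of any particular system: it uses add-one smoothing, whitespace tokenization, and the hypothetical names `train_bigram` and `bigram_prob`.

```python
from collections import Counter


def train_bigram(sentences):
    """Count word pairs in the training sentences.

    "<s>" and "</s>" are sentence boundary markers added for this sketch.
    """
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(words[:-1])                 # words that have a successor
        bigrams.update(zip(words[:-1], words[1:]))  # (previous word, word) pairs
    vocab = {w for s in sentences for w in s.split()} | {"</s>"}
    return contexts, bigrams, vocab


def bigram_prob(model, prev, word):
    """P(word | prev) with add-one smoothing over the vocabulary."""
    contexts, bigrams, vocab = model
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))


model = train_bigram(["turn the light on", "turn the light off"])
p_seen = bigram_prob(model, "turn", "the")    # pair seen twice in training
p_unseen = bigram_prob(model, "turn", "off")  # pair never seen in training
```

The smoothing keeps unseen word pairs from receiving zero probability, which matters because a zero would rule out a hypothesis entirely during search.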
In continuous speech recognition, search means finding the sequence of word models that best describes the input speech signal, which yields the decoded word sequence. The search is driven by the acoustic model score and the language model score. In practice, it is often necessary to give the language model a high weight and to set a long-word penalty score, both chosen empirically.
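A minimal sketch of how the two scores might be combined when ranking hypotheses is shown below. It works with log probabilities and reads the penalty as a fixed cost per word, which is one common interpretation and an assumption here. The weight of 10.0, the penalty of -0.5, and the hypothesis scores are made-up values for illustration.

```python
def path_score(acoustic_logprob, lm_logprob, num_words,
               lm_weight=10.0, word_penalty=-0.5):
    """Combine the scores: acoustic + lm_weight * LM + word_penalty * word count.

    lm_weight and word_penalty are illustrative; in practice they are tuned.
    """
    return acoustic_logprob + lm_weight * lm_logprob + word_penalty * num_words


def best_hypothesis(hypotheses, **weights):
    """Pick the text of the highest-scoring (text, acoustic, lm) hypothesis."""
    best = max(
        hypotheses,
        key=lambda h: path_score(h[1], h[2], len(h[0].split()), **weights),
    )
    return best[0]


# (text, acoustic log probability, language model log probability), all invented.
hypotheses = [
    ("recognize speech", -120.0, -4.0),
    ("wreck a nice beach", -118.0, -9.0),
]
with_lm = best_hypothesis(hypotheses)
without_lm = best_hypothesis(hypotheses, lm_weight=0.0, word_penalty=0.0)
```

In this toy case the acoustically better hypothesis wins when the language model is ignored, while the weighted language model score favors the more plausible word sequence.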
Viterbi: using dynamic programming, the Viterbi algorithm calculates at each time point the posterior probability of the decoding state sequence given the observation sequence, keeps the path with the largest probability, and
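The dynamic programming idea behind Viterbi decoding can be sketched on a toy hidden Markov model. At each time step the code keeps, for every state, only the most probable path that ends in that state. The two states, the three observation symbols, and all probabilities below are invented for illustration and do not come from a real acoustic model.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most probable state path, probability of that path)."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max((best[-1][p][0] * trans_p[p][s], p) for p in states)
            column[s] = (prob * emit_p[s][obs], prev)
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path, best[-1][path[-1]][0]


states = ("sil", "speech")
start_p = {"sil": 0.6, "speech": 0.4}
trans_p = {
    "sil": {"sil": 0.7, "speech": 0.3},
    "speech": {"sil": 0.4, "speech": 0.6},
}
emit_p = {
    "sil": {"quiet": 0.5, "mid": 0.4, "loud": 0.1},
    "speech": {"quiet": 0.1, "mid": 0.3, "loud": 0.6},
}
path, prob = viterbi(["quiet", "mid", "loud"], states, start_p, trans_p, emit_p)
```

For this input the best path is sil, sil, speech. Real decoders typically work with log probabilities to avoid numerical underflow on long utterances.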
In recent years, especially since 2009, with the development of deep learning research in the field of machine learning and the accumulation of big data corpora, speech recognition technology has developed by leaps and bounds.
1. New technological developments
1) Deep learning research from the field of machine learning has been introduced into the training of acoustic models for speech recognition. Using a multilayer neural network with RBM pre-training greatly improves the accuracy of the acoustic model. Microsoft researchers were the first to make breakthrough progress here: after adopting the deep neural network (DNN) model, the speech recognition error rate fell by 30%, the fastest progress in speech recognition technology in the past 20 years.
2) Most mainstream speech recognition decoders now use a decoding network based on weighted finite-state transducers (WFST). Such a network can integrate the language model, the dictionary, and the shared acoustic unit set into one large decoding network, which greatly improves decoding speed and provides a basis for real-time speech recognition applications.
3) With the rapid development of the Internet and the spread of mobile terminals such as mobile phones, large amounts of text and speech corpora can now be obtained from many channels. This provides rich training resources for the language models and acoustic models used in speech recognition and makes it possible to build general-purpose, large-scale language models and acoustic models. In speech recognition, how well the training data matches the task and how rich it is are among the most important factors in improving system performance. However, annotating and analyzing corpora takes long-term accumulation. With the arrival of the big data era, the accumulation of large-scale corpus resources will be raised to strategic importance.
2. New technology applications
Recently, the most popular applications of speech recognition have been on mobile terminals. Voice dialogue robots, voice assistants, and interactive tools keep appearing, and many Internet companies have invested human, material, and financial resources in research and applications in this area, aiming to win customers quickly through the convenience of voice interaction.
At present, applications outside China are led by Apple's Siri.
In China, systems such as HKUST Xunfei, Yunzhisheng, Shanda, Jietong Huasheng, Sogou Voice Assistant, Zidong Interpreter, and Baidu Voice all use the latest speech recognition technology, and other related products on the market also embed similar technology, directly or indirectly.
