What Is Dynamic Time Warping?

A correct pronunciation should contain all the phonemes that make up the word, connected in the correct order. The duration of each phoneme depends on the phoneme itself and on the condition of the speaker. To improve the recognition rate and compensate for differences in duration when the same sound is pronounced, the input speech signal is stretched or shortened until its length matches that of the standard pattern. This process is called time regularization.
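The simplest form of time regularization stretches or shortens a signal uniformly. The sketch below illustrates that idea with plain linear interpolation. The function name `linear_time_normalize` and the use of a flat list of numbers in place of real speech features are assumptions made for illustration, not something taken from the article. Dynamic time warping differs in that the stretching is allowed to vary along the time axis.

```python
def linear_time_normalize(signal, target_len):
    """Uniformly stretch or shorten `signal` to `target_len` samples.

    Hypothetical helper: `signal` is a plain list of numbers standing in
    for a sequence of speech features.
    """
    if not signal or target_len < 1:
        raise ValueError("signal must be non-empty and target_len >= 1")
    if len(signal) == 1 or target_len == 1:
        return [float(signal[0])] * target_len

    # Map output index k onto a fractional position in the input.
    scale = (len(signal) - 1) / (target_len - 1)
    out = []
    for k in range(target_len):
        pos = k * scale
        i = min(int(pos), len(signal) - 2)  # left neighbour, kept in range
        frac = pos - i
        out.append((1 - frac) * signal[i] + frac * signal[i + 1])
    return out


stretched = linear_time_normalize([0, 2, 4], 5)        # 3 samples -> 5
shortened = linear_time_normalize([0, 1, 2, 3, 4], 3)  # 5 samples -> 3
```

Uniform stretching assumes every part of the word is spoken proportionally faster or slower, which real speech rarely satisfies. That limitation is the motivation for a nonlinear alignment.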

Time series are a common representation of data in most disciplines, and a common processing task is to compare the similarity of two sequences. The two time series being compared may not be equal in length; in speech recognition, for example, different people speak at different speeds.
The speech signal is also highly random: different pronunciation habits, environments, and moods all lead to different pronunciation durations. For example, the last sound of a word may be drawn out or followed by a slight breath. If the drawn-out sound or the breath is mistaken for a phoneme, the endpoint detection of the word becomes inaccurate and the feature parameters change, which distorts the distance measurement and lowers the recognition rate.
In isolated-word speech recognition, a simple and effective method is the Dynamic Time Warping (DTW) algorithm. It is based on the idea of dynamic programming (DP) and solves the problem of matching templates whose pronunciations differ in length. It is one of the earlier, classic algorithms in speech recognition and is used mainly for isolated-word recognition. [1]
Dynamic time warping was proposed by the Japanese scholar Itakura in the 1960s. It stretches or shortens the unknown utterance until it matches the length of the reference template. In the process, the time axis of the unknown word is warped nonuniformly so that its features line up with those of the reference template [1].
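A minimal sketch of the dynamic-programming idea follows. It assumes one-dimensional feature sequences, an absolute-difference local distance, and the common symmetric step pattern (match, insertion, deletion). None of these choices is specified in the article, and the name `dtw_distance` is hypothetical.

```python
def dtw_distance(reference, test, dist=lambda a, b: abs(a - b)):
    """Accumulated DTW distance between two sequences.

    Assumptions (not from the article): scalar features, absolute
    difference as local distance, and steps from (i-1, j), (i, j-1)
    and (i-1, j-1) with no slope constraint.
    """
    n, m = len(reference), len(test)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")

    inf = float("inf")
    # acc[i][j] = cheapest cost of aligning reference[:i] with test[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = dist(reference[i - 1], test[j - 1])
            acc[i][j] = local + min(
                acc[i - 1][j - 1],  # advance in both sequences
                acc[i - 1][j],      # test frame is held (stretched)
                acc[i][j - 1],      # reference frame is held
            )
    return acc[n][m]


# The same shape spoken "slower" (one frame repeated) aligns at zero cost.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([1, 2, 3], [2, 3, 4])
```

Because a frame can be held while the other sequence advances, the two inputs need not have the same length, which is the property the text relies on.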
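Assume that the reference template R consists of the six letters A, B, C, D, E, F and the test template T consists of the four digits 1, 2, 3, 4, and that the distance between every element of R and every element of T is already given. DTW then looks for the alignment path through this 6 × 4 grid of distances whose accumulated distance is smallest.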
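The search over such a grid can be sketched as follows. The distance values in `cost` are invented placeholders chosen only to make the example runnable. They are not the values from the original article, which are not available here. The function name `dtw_path` and the unconstrained three-way step pattern are likewise assumptions.

```python
def dtw_path(cost):
    """Return (total_distance, path) for a precomputed distance grid.

    cost[i][j] is the distance between element i of R and element j of T.
    The path is a list of (i, j) index pairs from (0, 0) to the last cell.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1]
            )

    # Walk back from the end, always moving to the cheapest predecessor.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
        path.append((i - 1, j - 1))
    path.reverse()
    return acc[n][m], path


# PLACEHOLDER distances (rows: A..F of R, columns: 1..4 of T).
cost = [
    [2, 4, 6, 8],  # A
    [1, 3, 5, 7],  # B
    [3, 1, 4, 6],  # C
    [5, 2, 2, 5],  # D
    [6, 4, 1, 3],  # E
    [8, 6, 3, 1],  # F
]
total, path = dtw_path(cost)
labels = [("ABCDEF"[i], "1234"[j]) for i, j in path]
```

Each pair in `labels` says which letter of R is matched to which digit of T. Several letters can map to the same digit, which is how the longer template is compressed onto the shorter one.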
Consider an isolated-word speech recognition system that uses template matching. The whole word is generally used as the recognition unit. In the training phase, the user speaks each word in the vocabulary once; features are extracted from each utterance and stored as a template in the template library. In the recognition phase, features are likewise extracted from the new word to be recognized, and the DTW algorithm computes its distance to each template in the library. The template with the shortest distance is the most similar one, and its word is the recognition result [1].
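The training and recognition phases described above can be sketched as a nearest-template search. Everything concrete here is an assumption for illustration: the names `dtw_distance` and `recognize`, the toy words, and the use of short lists of numbers in place of real extracted speech features.

```python
def dtw_distance(a, b):
    """DTW distance between two scalar sequences, keeping one row in memory.

    Assumes absolute difference as the local distance.
    """
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        curr = [inf]
        for j, y in enumerate(b, start=1):
            curr.append(abs(x - y) + min(prev[j - 1], prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def recognize(features, template_library):
    """Return (word, distance) for the template closest to `features`."""
    return min(
        ((word, dtw_distance(template, features))
         for word, template in template_library.items()),
        key=lambda pair: pair[1],
    )


# "Training": one stored feature sequence per vocabulary word (toy values).
template_library = {
    "yes": [1, 3, 4, 3, 1],
    "no": [5, 5, 2, 1],
}

# "Recognition": a slower utterance with some frames repeated.
word, distance = recognize([1, 1, 3, 4, 4, 3, 1], template_library)
```

In a real system each frame would be a feature vector rather than a single number, and the local distance would be a vector distance, but the matching loop stays the same.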
