I am currently working toward a Ph.D. in Computer Science in the PAROLE group at the LORIA research institute in France.
My dissertation deals with robustness issues in automatic speech recognition and is due in October 2004.
My research is supervised by Irina Illina, Dominique Fohr and
Jean-Paul Haton.
Robust Automatic Speech Recognition
An automatic speech recognition (ASR) system suffers a significant degradation in performance when used in test conditions that do not match its training environment. This mismatch is mostly due to additional noise sources and to discrepancies in channels and speakers.
Those mismatch sources may be non-stationary and little a priori information about them is available.
Several techniques have been proposed to make recognition more robust to such mismatches. They generally fall into three broad categories. In the first, robust signal processing is used to reduce the sensitivity of the speech features to possible distortions. In the second, models of the noise and channel are directly incorporated into the recognition process. In the third, compensation methods modify the feature vectors of the test signal to bring them closer to the trained models. The algorithms studied in my research fall into this last category. More specifically, our approach belongs to the Stochastic Matching (SM) framework.
The fundamentals of SM were proposed by Chin-Hui Lee and Ananth Sankar in 1996. In their papers, the parameters of a compensation function are estimated so as to maximize the likelihood of the transformed speech sequence given the set of acoustic models. The parameters are obtained iteratively through several Expectation-Maximization steps and naturally rely on the optimal sequence of states (i.e. the recognition output).
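As an illustration, the SM objective and its simplest special case can be written as follows. The notation is assumed here and does not come from this page: $Y = (y_1, \dots, y_T)$ is the test sequence, $\Lambda_X$ the trained acoustic models, $\nu$ the parameters of the compensation function, $S$ a state sequence, and $\gamma_t(s)$ the posterior probability of Gaussian $s$ at time $t$ under the current parameters. The second line assumes a purely additive bias $x_t = y_t - b$ and diagonal-covariance Gaussians with means $\mu_{s,i}$ and variances $\sigma_{s,i}^2$.

```latex
\begin{align}
  \hat{\nu} &= \arg\max_{\nu} \, p(Y \mid \nu, \Lambda_X)
             = \arg\max_{\nu} \sum_{S} p(Y, S \mid \nu, \Lambda_X) \\
  \hat{b}_i &= \frac{\sum_{t=1}^{T} \sum_{s} \gamma_t(s) \,
                     \dfrac{y_{t,i} - \mu_{s,i}}{\sigma_{s,i}^{2}}}
                    {\sum_{t=1}^{T} \sum_{s} \gamma_t(s) \,
                     \dfrac{1}{\sigma_{s,i}^{2}}}
\end{align}
```

Each Expectation-Maximization step recomputes $\gamma_t(s)$ with the current bias and then re-estimates $\hat{b}$, so the likelihood of the compensated sequence does not decrease from one step to the next.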
The most interesting aspect of the SM framework is that it does not need any a priori information on the nature or level of the corrupting noise. Theoretically, only the test sentence to be decoded is needed to perform compensation.
Frame-synchronous algorithms are naturally appealing for coping with non-stationary, slowly varying noise sources, even though they often face convergence problems linked to the scarcity of data. Offline compensation algorithms exist and can cope with this sort of varying acoustic environment, but their computation time is not compatible with everyday applications. Our techniques are entirely frame-synchronous: the parameters of the compensation functions are updated at each time frame, in parallel with the recognition process.
Frame Synchronous Compensation Algorithm
In the frame-synchronous compensation mode, the complete statistics (forward-backward probabilities) needed in the classical SM framework are difficult to obtain because
the end of the sentence is not yet available. One solution is
to approximate these statistics by the forward probabilities alone.
The basic idea of our method is as follows.
First, we assume that, during the Viterbi alignment, the states
with the highest forward probabilities give a good model of the
speech observations
[barreaud03b].
Then, the parameters of the mismatch function are estimated so as to increase the
likelihood of the observation given those states.
Consequently, this on-line algorithm performs compensation in parallel with recognition
and does not need any a priori information on the nature of the noise.
Hence, the parameters of the compensation transform are estimated frame by frame.
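The frame-by-frame estimation described above can be sketched as follows. This is a minimal illustration under strong assumptions, not the algorithm of [barreaud03b]: the mismatch function is reduced to an additive bias, each state is a single diagonal-covariance Gaussian, and the state with the highest forward probability is approximated by the state whose Gaussian best explains the compensated frame. The names `log_gauss` and `compensate_stream` are hypothetical.

```python
import math


def log_gauss(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def compensate_stream(frames, means, variances):
    """Estimate an additive bias frame by frame.

    frames: list of feature vectors (lists of floats).
    means, variances: per-state Gaussian parameters (hypothetical
    stand-ins for the states of the acoustic models).
    Returns the compensated frames and the final bias estimate.
    """
    dim = len(frames[0])
    bias = [0.0] * dim
    num = [0.0] * dim  # running sum of (y - mu) / var
    den = [0.0] * dim  # running sum of 1 / var
    compensated = []
    for y in frames:
        # Compensate the incoming frame with the bias estimated so far.
        x = [yi - bi for yi, bi in zip(y, bias)]
        compensated.append(x)
        # Assumption: the best-scoring Gaussian replaces the state with
        # the highest forward probability of a real decoder.
        best = max(
            range(len(means)),
            key=lambda s: log_gauss(x, means[s], variances[s]),
        )
        # Update the statistics and re-estimate the bias (maximum
        # likelihood estimate for the model x = y - b).
        for i in range(dim):
            num[i] += (y[i] - means[best][i]) / variances[best][i]
            den[i] += 1.0 / variances[best][i]
            bias[i] = num[i] / den[i]
    return compensated, bias
```

Because the bias is updated after every frame, the first frames are compensated with a poorly estimated bias, which is the data scarcity issue that frame-synchronous methods face.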
Compared to classical frame-synchronous compensation methods such as Cepstral Mean Normalization or Spectral Subtraction, our algorithms gave significant improvements.
For example, the first version of our algorithm gave up to a 15.5% improvement in word error rate over Spectral Subtraction on the VODIS database.
The French database VODIS (Voice-Operated Driver Information System) was recorded in a moving car in various driving situations by 200 speakers.
Similarly, a 27.8% improvement over frame-synchronous Cepstral Mean Normalization was obtained.
Structure-based compensation
To improve the results of the previously presented method, we proposed a structural state-based
transformation
[barreaud03c].
This approach is motivated by several observations.
First, it is often assumed that observations which are similar
will be affected in a similar manner by variations in the environment.
Hence, a set of subspace-specific transformations should give better results.
Second, subspace-specific transformations face a data scarcity
problem that can be overcome by using a hierarchical transformation:
a tree of transformations.
For each node of this tree, a transformation function is estimated according to the observations
of the current sentence.
If the transformation associated with a node is poorly estimated, that of its
parent is used instead.
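The back-off to the parent node can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of [barreaud03c]: each transformation is a scalar bias, a node counts as poorly estimated when it has seen fewer than `min_count` observations (the source does not give the actual criterion), and the class name `Node` is hypothetical.

```python
class Node:
    """One node of a hypothetical tree of bias transformations."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.count = 0    # number of observations seen by this node
        self.total = 0.0  # sum of the residuals (observation minus model mean)

    def add(self, residual):
        # An observation assigned to a node also informs all its ancestors,
        # so nodes closer to the root are estimated on more data.
        node = self
        while node is not None:
            node.count += 1
            node.total += residual
            node = node.parent

    def bias(self, min_count):
        # Climb toward the root until a node has enough observations.
        node = self
        while node.parent is not None and node.count < min_count:
            node = node.parent
        return node.total / node.count if node.count else 0.0


root = Node("root")
vowels = Node("vowels", parent=root)
fricatives = Node("fricatives", parent=root)
for _ in range(3):
    vowels.add(1.0)
fricatives.add(4.0)

vowel_bias = vowels.bias(min_count=2)          # own estimate, 3 observations
fricative_bias = fricatives.bias(min_count=2)  # backs off to the root
```

With these values the vowel node uses its own estimate, while the fricative node, having seen a single observation, falls back on the root estimate computed from all four observations.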
On-line Compensation for Non-Stationary Noise
As a second step, we explored the ability of our algorithm to cope with
abruptly changing acoustic environments
[barreaud03d].
In real-life environments, ASR systems may face the unexpected and sudden occurrence of noise
(for example, a window being opened while driving).
No information is available on the time of occurrence, the level or the nature of
the sudden noise.
A compensation algorithm should take into account such changes in a
short time period.
In this context, two problems can be explored: the detection of environment changes and the adaptation of the compensation strategy to the new environment.
Consequently, we studied a new version of the previously presented
algorithm. This new version takes into account the abrupt changes in the environment.
At each time frame the distance between the incoming speech frame and the most probable emitting state is computed.
When an abrupt change occurs in the noise environment, this distance changes quickly.
We detect this change using several widely known detection algorithms, such as the Shewhart control chart detection algorithm, the Bayesian information criterion (BIC) and an adaptation of the Spectral Variation Function (SVF).
After this, the bias is set to a value corresponding to the closest environment previously observed.
This approach gave impressive improvements over classical compensation
methods when used on artificially corrupted data (noise added from the middle of a clean test sentence to its end).
For instance, we obtained up to a 32.4% improvement in phoneme error rate over the baseline on this type of data.
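The change-detection step described above can be illustrated with a Shewhart-style control chart applied to the sequence of frame-to-state distances. This is a minimal sketch under assumptions, not the detector used in [barreaud03d]: the control limits are the running mean plus or minus `k` standard deviations, the statistics are restarted after each detection, and the function name `shewhart_changes` and the parameters `warmup` and `k` are hypothetical.

```python
import math


def shewhart_changes(distances, warmup=10, k=3.0):
    """Return the indices where a distance leaves the control limits.

    The limits are mean +/- k * std, estimated on the samples seen since
    the last detected change (at least `warmup` samples are required).
    """
    changes = []
    n, mean, m2 = 0, 0.0, 0.0  # Welford running statistics
    for t, d in enumerate(distances):
        if n >= warmup:
            std = math.sqrt(m2 / (n - 1))
            if abs(d - mean) > k * std:
                changes.append(t)
                # Restart the statistics for the new environment.
                n, mean, m2 = 0, 0.0, 0.0
        # Include the current sample in the running statistics.
        n += 1
        delta = d - mean
        mean += delta / n
        m2 += delta * (d - mean)
    return changes


# Distances jump when noise appears at frame 10.
clean = [1.0, 1.1, 0.9, 1.0, 1.1, 0.9, 1.0, 1.1, 0.9, 1.0]
noisy = [5.0, 5.1, 4.9, 5.0]
detected = shewhart_changes(clean + noisy)
```

In this toy sequence the only detection is at frame 10, where the distance first leaves the limits estimated on the clean frames. Resetting the bias to that of the closest previously observed environment would follow the detection and is not shown here.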