Hayamizu-Tamura Laboratory 
Gifu University Faculty of Engineering

Multi-Modal

 The Multi-Modal Group conducts research on multi-modal information processing, which combines information processing technologies such as speech processing and image processing. Multi-modal processing can be expected to achieve higher accuracy than uni-modal processing.

 To improve the accuracy of speech processing tasks such as speech recognition, robustness in noisy and real-world environments is indispensable. We are therefore researching methods that make speech information processing robust against noise by using image information (around the lips), which is unaffected by acoustic noise.

 Processing that integrates both audio information and image information is generally called “bi-modal” or “multi-modal” information processing. This group mainly conducts research on multi-modal speech recognition.

Speech recognition

 Speech recognition is a technology that converts human speech input to a computer into a character string. With current speech recognition technology, simple content read aloud in a quiet environment can be recognized with nearly 100% accuracy. On the other hand, recognition performance in noisy environments or for spontaneous utterances in natural conversation is still insufficient. We are therefore researching methods to achieve higher recognition performance.

Multi-modal speech recognition

 In this group, we are studying a multi-modal speech recognition method that uses both the speech signal and moving images of the lips during speech. This technique can be expected to improve recognition performance even in noisy environments: intuitively, image recognition of the lips compensates for the parts that are difficult for acoustic speech recognition. Predicting utterance content from images alone is called “lip reading”.
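
As a minimal sketch of this complementarity (hypothetical values throughout, not the laboratory's actual system), the following Python example combines per-frame class posteriors from an acoustic recognizer and a lip reader by weighted log-linear fusion; when the acoustic posterior is ambiguous, the visual stream resolves it.

import numpy as np

def late_fusion(p_audio, p_visual, lam=0.7):
    """Log-linear (late) fusion of per-frame class posteriors.

    p_audio, p_visual: arrays of shape (frames, classes), rows sum to 1.
    lam: stream weight; lowering it trusts the lip-reading stream more,
         e.g. under acoustic noise. The value is illustrative only.
    """
    log_p = lam * np.log(p_audio + 1e-12) + (1.0 - lam) * np.log(p_visual + 1e-12)
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)  # renormalize per frame

# Toy example: the audio is ambiguous, the lips disambiguate.
p_a = np.array([[0.5, 0.5]])           # acoustic model cannot decide
p_v = np.array([[0.9, 0.1]])           # lip reader strongly favors class 0
print(late_fusion(p_a, p_v, lam=0.5))  # fused posterior favors class 0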

1. Recognition model

 We research the construction of models that recognize utterance content. In recent years, recognition models have often been built as deep speech/image recognition (lip reading) systems, drawing on techniques from deep learning.
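
As a rough illustration of such a model (not the laboratory's actual architecture; all layer sizes are hypothetical), the following PyTorch sketch encodes MFCC frames and lip-image features, fuses them by concatenation, and applies a recurrent layer with a per-frame classifier:

import torch
import torch.nn as nn

class AVRecognizer(nn.Module):
    """Toy audio-visual recognition model (all sizes are hypothetical)."""
    def __init__(self, n_audio=39, n_visual=64, n_hidden=128, n_classes=40):
        super().__init__()
        self.audio_enc = nn.Linear(n_audio, n_hidden)    # encode MFCC frames
        self.visual_enc = nn.Linear(n_visual, n_hidden)  # encode lip features
        self.rnn = nn.GRU(2 * n_hidden, n_hidden, batch_first=True)
        self.classifier = nn.Linear(n_hidden, n_classes) # per-frame labels

    def forward(self, audio, visual):
        # audio: (batch, frames, n_audio); visual: (batch, frames, n_visual)
        x = torch.cat([self.audio_enc(audio).relu(),
                       self.visual_enc(visual).relu()], dim=-1)
        h, _ = self.rnn(x)         # temporal modeling over the fused frames
        return self.classifier(h)  # per-frame class logits

model = AVRecognizer()
logits = model(torch.randn(2, 100, 39), torch.randn(2, 100, 64))
print(logits.shape)  # torch.Size([2, 100, 40])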

2. Features and multi-modal integration

 To achieve higher-performance speech recognition, we are studying image features that capture utterance information, as well as effective methods for integrating speech and images. For acoustic features, MFCCs (Mel-Frequency Cepstral Coefficients) are used; for image features, DBNFs (Deep Bottleneck Features, extracted by deep learning) are used.
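
As a sketch of this feature pipeline, MFCCs can be extracted with a common library such as librosa; the DBNF extractor is represented only by a placeholder function here, since its exact architecture is not specified above. The file name and dimensions are hypothetical, and the concatenation at the end is one simple early-integration scheme.

import numpy as np
import librosa

# Load speech and extract 13-dimensional MFCCs (frame rate set by hop_length).
y, sr = librosa.load("utterance.wav", sr=16000)          # hypothetical file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=160)
mfcc = mfcc.T                                            # (frames, 13)

def extract_dbnf(lip_frames):
    """Placeholder for a deep bottleneck feature extractor: in practice,
    the activations of a narrow hidden layer of a trained lip-reading
    network would be returned here. Purely illustrative."""
    return np.random.randn(len(lip_frames), 32)

lip_frames = np.zeros((mfcc.shape[0], 32, 32))           # dummy lip images
dbnf = extract_dbnf(lip_frames)                          # (frames, 32)

# Early integration: concatenate frame-synchronous audio and image features.
fused = np.concatenate([mfcc, dbnf], axis=1)             # (frames, 45)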

3. Large-vocabulary multi-modal speech recognition

 Many multi-modal speech corpora exist for digit-utterance tasks, but few are available for large-vocabulary multi-modal speech recognition. This group is therefore also working on large-vocabulary multi-modal speech recognition, including the construction of such corpora.

4. Real-time speech recognition

 Using the technologies developed by this group, we are building a multi-modal speech recognition system capable of real-time processing.
In this research field, we are also performing multi-modal speech recognition using depth-image information captured with Kinect.
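
As a rough outline of real-time operation (with hypothetical helper functions; the actual capture devices, including the Kinect depth stream, are not modeled here), short synchronized chunks of audio and lip features can be buffered and decoded incrementally:

import numpy as np

CHUNK_FRAMES = 25  # e.g. 0.25 s at a 100 Hz frame rate; illustrative value

def capture_chunk():
    """Placeholder for synchronized capture: a real system would read
    audio from a microphone and lip/depth images from a camera or Kinect."""
    return np.random.randn(CHUNK_FRAMES, 13), np.random.randn(CHUNK_FRAMES, 32)

def recognize_chunk(audio_feats, visual_feats):
    """Placeholder for the recognizer's incremental decoding step."""
    return "<partial hypothesis>"

for _ in range(4):  # a real system loops until the user stops speaking
    audio_feats, visual_feats = capture_chunk()
    print(recognize_chunk(audio_feats, visual_feats))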