PROJECTS AT VISL FINISHED IN 2005



Speech Recognition using SVM

Ronen Karasik
and
Omer Schneider

Instructed by Dori Peleg

VISL laboratory, The Technion.

Abstract

This speech recognition project was developed as a dedicated tool for controlling computer applications.

Speech recognition serves more aspects of our lives than we might imagine, such as applications for the hearing impaired, voice dialing, and voice-to-text (V2T) applications.

It seems that speech recognition is not only a luxury but a need as well. Our goal is to develop a fully integrated I/O system from this basic tool.

The application should recognize 12 words related to voice dialing: the digits 0-9 and the words "send" and "stop" (to begin and end a call).

The system is operated through a GUI (graphical user interface), which controls the basic features.


The problem (background):

To most people, speech recognition by a learning machine looks like a hard problem.

One may ask how sound signals can be recognized or classified and, moreover, how a learning machine can be built.

A learning machine is a powerful method that can be adapted to a wide range of problems.

Sound signal recognition and classification is adapted to the learning machine by "cleaning" the signal and passing it through a probabilistic model.

The solution (basic approach):

The solution is divided into 3 steps:

First:

a. Acquiring the signals and processing them in the time domain: calculating the signal's envelope and truncating the signal to discard noise samples.

The word "Four" : initial

The word "Four" : envelope

The word "Four" : after truncation

b. Transforming them to the frequency domain and applying an LPF (low-pass filter) to the FFT, in order to discard the frequencies where noise is expected to appear.
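The two preprocessing stages above can be sketched roughly as follows. This is a minimal illustration and not the project's code: the function names, the moving-average envelope, the 10% truncation threshold, and the number of spectrum bins kept are all assumptions chosen for the example, and a plain DFT stands in for the FFT so that the sketch needs only the standard library.

```python
import cmath
import math


def envelope(signal, window=5):
    """Moving average of |signal|: a simple amplitude envelope (assumed method)."""
    half = window // 2
    env = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        env.append(sum(abs(s) for s in signal[lo:hi]) / (hi - lo))
    return env


def truncate(signal, env, ratio=0.1):
    """Keep the span between the first and last samples whose envelope
    exceeds ratio * max(envelope); the ratio is a hypothetical choice."""
    threshold = ratio * max(env)
    active = [i for i, e in enumerate(env) if e > threshold]
    if not active:
        return []
    return signal[active[0]:active[-1] + 1]


def dft(signal):
    """Naive discrete Fourier transform (an FFT gives the same values faster)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def low_pass_spectrum(signal, keep):
    """Magnitudes of the lowest `keep` frequency bins; higher bins are dropped."""
    return [abs(c) for c in dft(signal)[:keep]]
```

For a recording made of silence, a short tone, and silence again, `truncate(recording, envelope(recording))` keeps roughly the tone, and `low_pass_spectrum` then returns only the low-frequency part of its spectrum.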

Second:

Modeling the signal as an observation vector.

There are several known methods for creating that vector. The methods used in this project are:
* The LPC algorithm (Linear Predictive Coding), which estimates each sample of the audio signal from previous samples.

* The BINS method: windowing the FFT into bins, where the mean and standard deviation are calculated in each bin.

We obtained 20 coefficients from the LPC algorithm and 20 coefficients from the BINS method (10 bins, with a mean and a standard deviation for each).
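As a rough illustration of the two feature extractors, the sketch below computes LPC coefficients with the autocorrelation method (Levinson-Durbin recursion) and per-bin statistics of a magnitude spectrum. The function names, the choice of the autocorrelation method, the equal-width bins, and the (mean, standard deviation) ordering of the output are assumptions; the project's own implementation may differ.

```python
import statistics


def lpc(signal, order):
    """LPC coefficients a[1..order] such that s[n] is approximated by
    sum(a[k] * s[n - k]), found with the Levinson-Durbin recursion."""
    n = len(signal)
    # Autocorrelation for lags 0..order.
    r = [sum(signal[t] * signal[t - lag] for t in range(lag, n))
         for lag in range(order + 1)]
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        if error <= 0:
            break  # signal already predicted exactly; leave the rest at zero
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1 - k * k
    return a[1:]


def bins_features(spectrum, n_bins):
    """Split a magnitude spectrum into n_bins equal windows and return the
    mean and standard deviation of each (leftover samples are ignored)."""
    size = len(spectrum) // n_bins
    features = []
    for b in range(n_bins):
        window = spectrum[b * size:(b + 1) * size]
        features.append(statistics.fmean(window))
        features.append(statistics.pstdev(window))
    return features
```

With `order=20` the first function returns 20 coefficients, and with `n_bins=10` the second returns 20 values, matching the counts quoted above.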

Third:

Passing the vector into the learning machine.

In this project, the SVM (Support Vector Machine) algorithm was used to create the learning machine.

SVM tries to divide the space (of dimension N) into classes, with maximal margins between the classes.

A two-class example, linearly separated.


In practice, dividing the plane is much harder and cannot be done perfectly, so classification errors have to be allowed:

Allowing errors.
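The idea of a maximal margin that tolerates some errors is commonly written as minimising 0.5*||w||^2 + C * sum(max(0, 1 - y*(w.x + b))), where C sets the price of a margin violation. The sketch below trains such a two-class linear classifier by stochastic subgradient descent. It is only an illustration: the function names, the optimiser, and the values of `c`, `epochs`, and `lr` are assumptions, and the project may well have used a dedicated SVM solver instead.

```python
import random


def train_linear_svm(samples, labels, c=1.0, epochs=200, lr=0.01, seed=0):
    """Soft-margin linear SVM via stochastic subgradient descent.
    Labels must be +1 or -1. Returns the weight vector w and the bias b."""
    rng = random.Random(seed)
    n, dim = len(samples), len(samples[0])
    w = [0.0] * dim
    b = 0.0
    order = list(range(n))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = samples[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            violated = margin < 1
            for j in range(dim):
                # The regulariser shrinks w; a violated margin pulls w toward y*x.
                grad = w[j] / n
                if violated:
                    grad -= c * y * x[j]
                w[j] -= lr * grad
            if violated:
                b += lr * c * y
    return w, b


def predict(w, b, x):
    """Class +1 or -1 according to the side of the hyperplane x falls on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

A smaller `c` tolerates more training errors in exchange for a wider margin; a larger `c` penalises errors more heavily.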

Moreover, sometimes a non-linear classifier is needed:

A case in which a non-linear classifier is needed.

We tried three classifier kernels: linear, polynomial, and radial (RBF).

After these 3 steps, there is a complete system which is ready to classify input voice signals.
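For reference, the three kernels named above have the standard forms sketched below, together with the usual kernel decision function f(x) = sum(alpha_i * y_i * K(x_i, x)) + b. The parameter values (`degree`, `coef0`, `gamma`) and the function names are assumptions for illustration; the values used in the project are not stated here.

```python
import math


def linear_kernel(x, z):
    """K(x, z) = x . z"""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, degree=3, coef0=1.0):
    """K(x, z) = (x . z + coef0) ** degree"""
    return (linear_kernel(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def decision(support_vectors, alphas, labels, bias, kernel, x):
    """Signed score of x; its sign is the predicted class (+1 or -1)."""
    return bias + sum(
        alpha * y * kernel(sv, x)
        for sv, alpha, y in zip(support_vectors, alphas, labels)
    )
```

Swapping the `kernel` argument changes the shape of the decision boundary without changing the rest of the classifier.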

Tools

Performance

The three classifiers were tested with one speaker and with two speakers, using both methods.

 

Reps per Construction \ Num. of Speakers    Radial             Polynomial         Linear

5 Reps \ 1 Speaker                          48.8333, 1.6596    25.4667, 0.5963    25.9767, 0.5445
10 Reps \ 1 Speaker                         38.75, 0.5501      12.35, 0.4894      11.85, 0.6708
18 Reps \ 1 Speaker                         26.7, 3.2904       5.23, 0.256        4.750, 1.1180
20 Reps \ 2 Speakers                        36.8, 0.199        27.5, 1.356        20.275, 4.9961
30 Reps \ 2 Speakers                        27.8, 0.5606       16.33, 1.7029      21.77, 3.8346
38 Reps \ 2 Speakers                        23, 2.5355         11.5, 2.4152       9.35, 6.6914

Conclusions

In the course of the project we reached several conclusions:

a. Despite its theoretical limitations, the linear classifier performed better
    than the other two classifiers, because it does not over-fit the training set.

b. The success of the learning system is affected by:

    Dictionary size: at first we had a dictionary of 100 words.
    This proved very hard to implement, so it was reduced to 12 words.

    Samples: the more samples of each word were given, the better the
    classification became.

    Recording quality: the better the recording environment, the better the
    classification.

    Number of speakers: the system supports one or two classifiers. The more
    classifiers, the higher the computational complexity.

    Coefficient handling: deriving the coefficients from the signal is the most
    crucial stage of the system.

c. It is important to evaluate other methods of creating comparable
   vectors, such as HMM (Hidden Markov Models) and hybrid systems, in
   search of better performance.

d. For a real-time system, the recording environment should be as
    noise-free as possible.

e. The dictionary should be expanded to support more applications and
    needs, while trying not to degrade performance significantly.

Acknowledgments

We are grateful to our project instructor, Dori Peleg, for his help and guidance throughout this work, and to the Lab Supervisor, Johanan Erez.
