Homework 3, due MONDAY 6 February 2005, 11:59 PM

This homework builds on Problem 2 from Homework 1, using the AT&T FSM toolkit. Instead of receiving the output of a neural net, you will get MFCC coefficients for every frame. Your job is to train an acoustic model on these MFCC coefficients and then test on a separate set.

This is a deliberately open-ended assignment: you can choose to generate the acoustic model in several ways.

The single-density Gaussians (i.e., no mixtures) are likely to be the easiest to implement (and may take only a short time in MATLAB). For the other techniques you can reuse the back end of the recognizer (duration/pronunciation/language model) from Homework 1. If you had trouble with that, see me and we'll fix it up.

The MFCC input file will be an ASCII file with one frame per line, in the format:

sent# frame# label C0 C1 C2 .. C12
Sentence numbers start at 0. Frame numbers start at 0, and are reset at the beginning of each utterance. The label is the most likely label given a forced alignment of the training data using another recognizer (I gave you this so that you don't have to implement the EM algorithm if you don't want to). C0 through C12 are the MFCC coefficients.
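If you are not working in MATLAB, reading this format takes only a few lines. The following is a minimal Python sketch, not a required implementation; the function name parse_mfcc_lines is made up for illustration, and the sample lines use only three coefficients per frame instead of the real 13 to keep the example short.

```python
from collections import defaultdict


def parse_mfcc_lines(lines):
    """Group frames by sentence number.

    Each input line is: sent# frame# label C0 C1 ... (whitespace separated).
    Returns {sent: [(frame, label, [coefficients]), ...]} in file order.
    """
    sentences = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        sent, frame, label = int(fields[0]), int(fields[1]), fields[2]
        coeffs = [float(x) for x in fields[3:]]
        sentences[sent].append((frame, label, coeffs))
    return dict(sentences)


# Made-up sample data (3 coefficients per frame, not the real 13).
sample = [
    "0 0 A 0.5 1.0 -0.25",
    "0 1 B 0.0 2.0 0.75",
    "1 0 A 1.5 -1.0 0.25",
]
parsed = parse_mfcc_lines(sample)
```

For the real data you would pass an open file object instead of the sample list, since iterating over a file yields one line at a time.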

You should train your acoustic model to estimate either P(X|Q), where Q is the state (phone) label, if you have a likelihood-based model (Gaussians), or P(Q|X) if you have a discriminative model (neural network). In the latter case you will probably want to divide by P(Q), the prior probability of the phone in the training set, to get a scaled likelihood.
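As a rough illustration of the single-Gaussian option, here is a sketch in Python that fits one Gaussian per phone label and returns -log P(X|Q) for a frame. It assumes a diagonal covariance matrix and a variance floor (var_floor), neither of which is required by the assignment; all function names are made up for this example. The last function shows the scaled-likelihood conversion for a discriminative model, again in the negative log domain.

```python
import math
from collections import defaultdict


def train_diag_gaussians(frames, var_floor=1e-3):
    """Fit one diagonal-covariance Gaussian per label.

    frames: iterable of (label, coeffs) pairs.
    Returns {label: (means, variances)}.
    var_floor is an assumed safeguard against zero variance.
    """
    by_label = defaultdict(list)
    for label, coeffs in frames:
        by_label[label].append(coeffs)
    models = {}
    for label, rows in by_label.items():
        n = len(rows)
        dims = len(rows[0])
        means = [sum(r[d] for r in rows) / n for d in range(dims)]
        variances = [
            max(sum((r[d] - means[d]) ** 2 for r in rows) / n, var_floor)
            for d in range(dims)
        ]
        models[label] = (means, variances)
    return models


def neg_log_likelihood(x, model):
    """Return -log P(x | phone) under a diagonal-covariance Gaussian."""
    means, variances = model
    total = 0.0
    for xd, m, v in zip(x, means, variances):
        total += 0.5 * (math.log(2 * math.pi * v) + (xd - m) ** 2 / v)
    return total


def scaled_neg_log_likelihood(posterior, prior):
    """Return -log(P(Q|X) / P(Q)) for a discriminative model's output."""
    return -(math.log(posterior) - math.log(prior))


# Made-up two-dimensional training frames for two phones.
train = [
    ("A", [0.0, 0.0]), ("A", [2.0, 2.0]),
    ("B", [10.0, 10.0]), ("B", [12.0, 12.0]),
]
models = train_diag_gaussians(train)
cost_a = neg_log_likelihood([1.0, 1.0], models["A"])
cost_b = neg_log_likelihood([1.0, 1.0], models["B"])
```

A frame near the mean of phone A gets a lower cost under A than under B, which is what the recognizer back end needs in order to prefer A for that frame.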

The test data will have a dummy label, since you shouldn't know the "correct" phone before running your acoustic model.

In order to be compatible with the FSMs of the previous homework, your system, after it is trained, should take in one sentence at a time and produce output in the following format (assuming only three possible phones A, B, and C). Note: this is only a schematic example; every test sentence will have more than three frames of input, and each frame has 13 MFCCs. The output should have NxM arcs, where N is the number of phones and M is the number of frames.

Test input:
0 0 _ 0.34 -2.42 0.35 ... 2.42
0 1 _ 0.14 3.32 0.42 ... 0.42
0 2 _ 0.34 0.32 0.34 ... -2.49
Test output:
0 1 A  -log(P(X0|A))
0 1 B  -log(P(X0|B))
0 1 C  -log(P(X0|C))
1 2 A  -log(P(X1|A))
1 2 B  -log(P(X1|B))
1 2 C  -log(P(X1|C))
2 3 A  -log(P(X2|A))
2 3 B  -log(P(X2|B))
2 3 C  -log(P(X2|C))
3
where X0, X1, and X2 are the MFCCs from lines 0, 1 and 2 respectively in the input. You can then compile the sentence using fsmcompile and then compose with the recognizer as you did in the last homework.
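The output format above can be generated mechanically once you have a cost for every (frame, phone) pair. The following Python sketch shows one way to do it; the function name and the four-decimal formatting are illustrative choices, not requirements.

```python
def sentence_to_fsm_lines(frame_costs, phones):
    """Build the text-format FSM for one sentence.

    frame_costs: list with one dict per frame, mapping
                 phone -> -log P(X_t | phone).
    phones:      list of phone labels, in the order arcs should appear.
    Emits one arc per (frame, phone) pair from state t to state t+1,
    followed by a line naming the final state.
    """
    lines = []
    for t, costs in enumerate(frame_costs):
        for phone in phones:
            lines.append(f"{t} {t + 1} {phone} {costs[phone]:.4f}")
    lines.append(str(len(frame_costs)))  # final state = number of frames
    return lines


# Made-up costs for a two-frame sentence with two phones.
costs = [{"A": 1.0, "B": 2.0}, {"A": 0.5, "B": 0.25}]
fsm_lines = sentence_to_fsm_lines(costs, ["A", "B"])
```

Writing these lines to a file, one per line, gives you the text input for fsmcompile. With N phones and M frames the list has NxM arc lines plus the final-state line.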

Files

The training and test data are on stdsun (not the webserver), and can be found in
/class/cse794L/fosler/hw3/
The file "isoword.female.train.small.ascii" contains 600 training sentences of isolated digits. (Yes, this is small!) If you are going to work on stdsun or one of the lcc/suncc servers, you might just want to symbolically link this file to your own work area to save on disk space:
ln -s /class/cse794L/fosler/hw3/isoword.female.train.small.ascii .
I've also included 22 test sentences; the full set is in isoword.test.all.mfc.ascii, but I've also broken them into 1 sentence per file for your convenience in the testdata directory. The correct answers are:
0/1: ONE
2/3: TWO
4/5: THREE
6/7: FOUR
8/9: FIVE
10/11: SIX
12/13: SEVEN
14/15: EIGHT
16/17: NINE
18/19: OH
20/21: ZERO
Full credit will be given for implementing one method. Extra style/glory/bonus points will be given for comparing two methods (e.g., Gaussians vs. Mixture of Gaussians).

NOTE: depending on your method, you may not get all of the words right. Be sure to evaluate how good the results are.
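One simple way to evaluate is to compare the recognized word for each test sentence against the answer key above and report word accuracy. The sketch below is only a suggestion; it assumes you have collected your recognizer's outputs in a dict from sentence number to word (called hypotheses here, a made-up name).

```python
# Answer key from the list above: sentences 2k and 2k+1 are the k-th word.
WORDS = ["ONE", "TWO", "THREE", "FOUR", "FIVE", "SIX",
         "SEVEN", "EIGHT", "NINE", "OH", "ZERO"]
REFERENCE = {i: WORDS[i // 2] for i in range(22)}


def word_accuracy(hypotheses, reference=REFERENCE):
    """Compare recognized words to the reference.

    hypotheses: dict mapping sentence number -> recognized word.
    Returns (accuracy, errors), where errors is a list of
    (sentence, reference word, hypothesized word) for each mismatch.
    A missing hypothesis counts as an error.
    """
    errors = [
        (sent, ref, hypotheses.get(sent))
        for sent, ref in sorted(reference.items())
        if hypotheses.get(sent) != ref
    ]
    accuracy = 1.0 - len(errors) / len(reference)
    return accuracy, errors


# Made-up example: everything right except sentence 18.
hyp = dict(REFERENCE)
hyp[18] = "ZERO"
accuracy, errors = word_accuracy(hyp)
```

Listing the individual errors, and not just the accuracy, makes it easier to discuss in your writeup which words your model confuses.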

When you submit your results, you need to submit two things:


Submission instructions:

Write up all of your answers to the questions in a text editor so that they can be submitted electronically (txt files preferred). Put that file as well as your fsm files (preferably in separate subdirectories for each problem) in a directory called hw3, and use the submit command to send the files to the grader. The syntax of the submit command is:

submit c794aa lab3 hw3

Make sure that your writeup includes enough instructions that we will be able to run your fsms easily. That means telling us which files are which.

Have fun!