This is a deliberately open-ended assignment: you can choose to generate the acoustic model in several ways.
The MFCC input file will be an ASCII file of the format:

    sent# frame# label C0 C1 C2 .. C12

Sentence numbers start at 0. Frame numbers start at 0 and are reset at the beginning of each utterance. The label is the most likely label given a forced alignment of the training data using another recognizer (I gave you this so that you don't have to implement the EM algorithm if you don't want to). C0 through C12 are the MFCC coefficients.
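As a starting point, here is a minimal sketch of reading this format in Python. The function name parse_mfcc_lines and the returned structure are just illustrative choices, not something the assignment requires; it assumes whitespace-separated fields and exactly 13 coefficients per line.

```python
from collections import defaultdict


def parse_mfcc_lines(lines):
    """Group frames by sentence number.

    Returns a dict {sent: [(frame, label, [C0..C12]), ...]}.
    (Hypothetical helper; the layout of the result is an assumption.)
    """
    sentences = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        sent, frame, label = int(fields[0]), int(fields[1]), fields[2]
        coeffs = [float(v) for v in fields[3:]]
        if len(coeffs) != 13:
            raise ValueError(f"expected 13 MFCCs, got {len(coeffs)}: {line!r}")
        sentences[sent].append((frame, label, coeffs))
    return dict(sentences)
```

Since an open file iterates over its lines, you could call it as parse_mfcc_lines(open("isoword.female.train.small.ascii")).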
You should train your acoustic model to predict either P(X|Q), where Q is the state (phone) label, if you have a likelihood-based model (Gaussians), or P(Q|X) if you have a discriminative model (neural network). In the latter case you will probably want to divide by P(Q), the prior probability of the phone in the training set, to get a scaled likelihood P(Q|X)/P(Q), which by Bayes' rule is proportional to P(X|Q).
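To make the two options concrete, here is a rough sketch, not the required method. It assumes one diagonal-covariance Gaussian per phone for the likelihood-based case, and a plain posterior/prior division for the discriminative case. All function names and the var_floor parameter are made up for illustration.

```python
import math
from collections import defaultdict


def train_diag_gaussians(frames, var_floor=1e-3):
    """frames: iterable of (label, coeffs). Returns {label: (means, variances)}.

    One diagonal-covariance Gaussian per phone label (an assumed model choice).
    var_floor keeps variances away from zero for phones with few frames.
    """
    by_label = defaultdict(list)
    for label, coeffs in frames:
        by_label[label].append(coeffs)
    models = {}
    for label, rows in by_label.items():
        n = len(rows)
        dim = len(rows[0])
        means = [sum(r[d] for r in rows) / n for d in range(dim)]
        variances = [
            max(sum((r[d] - means[d]) ** 2 for r in rows) / n, var_floor)
            for d in range(dim)
        ]
        models[label] = (means, variances)
    return models


def neg_log_likelihood(x, model):
    """-log P(x | Q) under a diagonal-covariance Gaussian."""
    means, variances = model
    total = 0.0
    for xd, m, v in zip(x, means, variances):
        total += 0.5 * (math.log(2.0 * math.pi * v) + (xd - m) ** 2 / v)
    return total


def scaled_neg_log_likelihood(posterior, prior):
    """-log(P(Q|X) / P(Q)), for a discriminative model's output."""
    return -(math.log(posterior) - math.log(prior))
```

Working in the negative log domain keeps the numbers manageable and matches the arc costs the FSMs expect.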
The test data will have a dummy label, since you shouldn't know the "correct" phone before running your acoustic model.
In order to be compatible with the FSMs of the previous homework, your system, after it is trained, should take in one sentence at a time and produce output in the following format (assuming only three possible phones A, B, and C). Note: this is only a schematic example; every real test sentence will have more than three frames, and each frame has 13 MFCCs. The output should have N x M arcs, where N is the number of phones and M is the number of frames.
Test input:

    0 0 _ 0.34 -2.42 0.35 ... 2.42
    0 1 _ 0.14 3.32 0.42 ... 0.42
    0 2 _ 0.34 0.32 0.34 ... -2.49

Test output:

    0 1 A -log(P(X0|A))
    0 1 B -log(P(X0|B))
    0 1 C -log(P(X0|C))
    1 2 A -log(P(X1|A))
    1 2 B -log(P(X1|B))
    1 2 C -log(P(X1|C))
    2 3 A -log(P(X2|A))
    2 3 B -log(P(X2|B))
    2 3 C -log(P(X2|C))
    3

where X0, X1, and X2 are the MFCCs from lines 0, 1, and 2 respectively in the input. You can then compile the sentence using fsmcompile and compose it with the recognizer as you did in the last homework.
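One way to produce this output is sketched below. It assumes you already have some cost function that returns -log P(x | phone) for a frame. The names sentence_to_fsm_lines and cost_fn are hypothetical, and the six-decimal formatting is an arbitrary choice.

```python
def sentence_to_fsm_lines(frames, phones, cost_fn):
    """Build the text FSM for one sentence.

    frames:  list of MFCC vectors for the sentence, in frame order.
    phones:  list of phone labels (N of them).
    cost_fn: callable (x, phone) -> -log P(x | phone); supplied by your model.
    Returns N*M arc lines followed by one final-state line.
    """
    lines = []
    for t, x in enumerate(frames):
        for phone in phones:
            # Arc from state t to state t+1, labelled with the phone and its cost.
            lines.append(f"{t} {t + 1} {phone} {cost_fn(x, phone):.6f}")
    lines.append(str(len(frames)))  # final state
    return lines
```

Writing "\n".join(lines) to a file gives you something to hand to fsmcompile.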
The data is in:

    /class/cse794L/fosler/hw3/

The file "isoword.female.train.small.ascii" contains 600 training sentences of isolated digits. (Yes, this is small!) If you are going to work on stdsun or one of the lcc/suncc servers, you might just want to symbolically link this file to your own work area to save disk space:

    ln -s /class/cse794L/fosler/hw3/isoword.female.train.small.ascii .

I've also included 22 test sentences; the full set is in isoword.test.all.mfc.ascii, but I've also broken them into one sentence per file, for your convenience, in the testdata directory. The correct answers are:

    0/1: ONE
    2/3: TWO
    4/5: THREE
    6/7: FOUR
    8/9: FIVE
    10/11: SIX
    12/13: SEVEN
    14/15: EIGHT
    16/17: NINE
    18/19: OH
    20/21: ZERO

Full credit will be given for implementing one method. Extra style/glory/bonus points will be given for comparing two methods (e.g., Gaussians vs. Mixture of Gaussians).
NOTE: depending on your method, you may not get all of the words right. Be sure to evaluate how good the results are.
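For the evaluation, a simple word-accuracy count over the 22 test sentences is probably enough. The sketch below hard-codes the answer key given above (two consecutive sentences per word) and assumes you have collected your recognizer's output as a dictionary from sentence number to word. The function names are illustrative only.

```python
# Answer key from the assignment: sentences 0/1 are ONE, 2/3 are TWO, and so on.
WORDS = ["ONE", "TWO", "THREE", "FOUR", "FIVE", "SIX",
         "SEVEN", "EIGHT", "NINE", "OH", "ZERO"]


def reference_word(sentence_index):
    """Correct word for a test sentence number (0..21)."""
    return WORDS[sentence_index // 2]


def word_accuracy(hypotheses):
    """hypotheses: {sentence_index: recognized_word}.

    Returns (n_correct, n_total, errors), where errors is a list of
    (sentence_index, reference, hypothesis) for the misrecognized sentences.
    """
    errors = [
        (i, reference_word(i), hyp)
        for i, hyp in sorted(hypotheses.items())
        if hyp != reference_word(i)
    ]
    total = len(hypotheses)
    return total - len(errors), total, errors
```

Listing the individual errors, not just the count, makes it easier to compare two methods in your writeup.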
When you submit your results, you need to submit two things:
submit c794aa lab3 hw3
Make sure that your writeup includes enough instructions that we will be able to run your FSMs easily. That means telling us which files are which.
Have fun!