Homework 5, due WEDNESDAY 22 February 2006, 11:59 PM

In this homework, you will experiment with lattice rescoring and language models. The baseline experiment is to train unigram, bigram, and trigram models on the Wall Street Journal (WSJ) data, as we did in class, and then use them to recognize hypotheses in lattices. I have put the WSJtrain file into the files directory; its sentences are slightly modified to be compatible with the lattices in that each one has !SENT_START and !SENT_END.
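
To make the idea of an n-gram model concrete, here is a minimal Python sketch of counting n-grams and forming an unsmoothed bigram estimate. It is not how SRILM works internally (SRILM applies smoothing and backoff), and the two-sentence corpus and the function names are made up for illustration; only the !SENT_START and !SENT_END markers come from the actual data.

```python
from collections import Counter

def count_ngrams(sentences, order):
    """Count n-grams of the given order over whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - order + 1):
            counts[tuple(tokens[i:i + order])] += 1
    return counts

def mle_bigram_prob(bigrams, unigrams, prev, word):
    """Unsmoothed estimate P(word | prev) = c(prev, word) / c(prev)."""
    history = unigrams[(prev,)]
    return bigrams[(prev, word)] / history if history else 0.0

# Toy corpus (hypothetical sentences, not taken from WSJtrain).
corpus = [
    "!SENT_START the market fell !SENT_END",
    "!SENT_START the market rose !SENT_END",
]
unigrams = count_ngrams(corpus, 1)
bigrams = count_ngrams(corpus, 2)

print(mle_bigram_prob(bigrams, unigrams, "the", "market"))   # 1.0
print(mle_bigram_prob(bigrams, unigrams, "market", "fell"))  # 0.5
```

Any bigram that never occurs in training gets probability zero here, which is exactly the problem that smoothing and backoff are meant to address.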

A lattice is just like a finite state grammar, with nodes and links, except that it is a representation of the acoustic hypothesis space. The lattices in the files directory are known as either SLF files (Standard Lattice Format) or, more colloquially, as HTK-format lattices. These particular lattices were generated by Soundar Srinivasan (thanks Soundar!) using a system trained on read Wall Street Journal sentences.

The number of nodes is given by N=xxx, and the number of links is given by L=xxx. Every node is labeled with I=xx, starts at time t=x.xx, and can have only one word (W=xxx) leaving it, possibly over several links. Every link goes from a start node to an end node and has an acoustic score (a=) as well as the original language model score (l=). The latter is disregarded when rescoring with a different language model.
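
If you want to look inside a lattice yourself, the following Python sketch reads the key=value fields described above. It assumes that link lines begin with J= and name their start and end nodes with S= and E=, as in the HTK lattice format; those three field names are not spelled out in this handout, so check them against a real .lat file. The sample lattice and all of its scores are invented.

```python
def parse_slf(text):
    """Parse SLF text into (header, nodes, links); field values stay strings.

    Assumes plain KEY=VALUE tokens separated by whitespace; quoted values
    containing spaces are not handled in this sketch.
    """
    header, nodes, links = {}, {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = dict(tok.split("=", 1) for tok in line.split() if "=" in tok)
        if line.startswith("I="):
            nodes[int(fields["I"])] = fields      # node definition
        elif line.startswith("J="):
            links.append(fields)                  # link definition (assumed J=)
        else:
            header.update(fields)                 # header, e.g. N= and L=
    return header, nodes, links

# Invented toy lattice for illustration only.
sample = """\
VERSION=1.0
N=4 L=4
I=0 t=0.00 W=!SENT_START
I=1 t=0.25 W=the
I=2 t=0.25 W=a
I=3 t=0.60 W=!SENT_END
J=0 S=0 E=1 a=-120.5 l=-1.2
J=1 S=0 E=2 a=-125.0 l=-2.3
J=2 S=1 E=3 a=-200.0 l=-0.7
J=3 S=2 E=3 a=-198.4 l=-0.9
"""

header, nodes, links = parse_slf(sample)
print(header["N"], header["L"])       # node and link counts from the header
print(nodes[1]["W"], nodes[1]["t"])   # word and start time of node 1
print(float(links[0]["a"]))           # acoustic score of the first link
```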

Rather than use the FSM toolkit, we will use the SRILM toolkit to rescore these lattices. In order to decode the lattices, you can use the following incantation:

lattice-tool -read-htk -in-lattice [latticefile] -viterbi-decode -lm [lmfile]
You can also put the lattice file names into a list file and run:
lattice-tool -read-htk -in-lattice-list [latticelist] -viterbi-decode -lm [lmfile]
You will notice that the output has the !SENT_START and !SENT_END markers. Running the script "cleanup.pl" on the output will eliminate these tokens.
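
For reference, removing the markers amounts to something like the Python sketch below. This is only a stand-in to show the idea; I am not claiming it matches what cleanup.pl does line for line, and the function name is made up.

```python
MARKERS = {"!SENT_START", "!SENT_END"}

def strip_markers(line):
    """Drop sentence-boundary markers from one line of decoder output."""
    return " ".join(tok for tok in line.split() if tok not in MARKERS)

print(strip_markers("!SENT_START the market fell !SENT_END"))  # the market fell
```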

You can score the files using

wordscore -r answers -w [outputfile]
This will give the overall word error rate. Using "-v" will give sentence-by-sentence results.
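
As a reminder of what word error rate measures, here is a small Python sketch that counts substitutions, deletions, and insertions with a standard edit-distance alignment and divides by the number of reference words. It assumes whitespace tokenization and is not the wordscore implementation, so small differences in how wordscore normalizes or aligns are possible. The example sentences are made up.

```python
def word_errors(reference, hypothesis):
    """Return (minimum edit operations, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # aligning an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion: reference word missing
                curr[j - 1] + 1,         # insertion: extra hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1], len(ref)

def word_error_rate(pairs):
    """Overall WER over (reference, hypothesis) pairs."""
    errors = total = 0
    for ref, hyp in pairs:
        e, n = word_errors(ref, hyp)
        errors += e
        total += n
    return errors / total

pairs = [
    ("the market fell sharply", "the markets fell"),  # 1 substitution, 1 deletion
    ("stocks rose", "stocks rose today"),             # 1 insertion
]
print(word_error_rate(pairs))  # 3 errors / 6 reference words = 0.5
```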

As in the previous homeworks, all of the executable programs are given in

/class/cse794L/fosler/bin.solaris
/class/cse794L/fosler/bin.linux
You will also need a copy of the tutorial from Thursday in order to figure out how to do things like perplexity calculation using SRILM.
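
As a refresher on what perplexity is, the Python sketch below computes it from a list of per-token log10 probabilities, which is the kind of number a language model assigns to each word. It is a generic definition, not SRILM's code; in particular, whether the end-of-sentence token is counted and how out-of-vocabulary words are treated will change the value, so check the tutorial for the convention SRILM reports.

```python
def perplexity(log10_probs):
    """Perplexity = 10 ** (-(average log10 probability per token))."""
    if not log10_probs:
        raise ValueError("need at least one token probability")
    return 10 ** (-sum(log10_probs) / len(log10_probs))

# Four tokens, each with probability 0.1 (log10 = -1): perplexity is 10.
print(perplexity([-1.0, -1.0, -1.0, -1.0]))
```

A model that is equally unsure among ten words at every position has perplexity 10, which is a useful way to read the numbers you get.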

Assignment

The minimal requirements for this assignment are:

Part 1: train a unigram, bigram, and trigram LM on the WSJtrain corpus. Record the exact commands used so that the TA can replicate if necessary.
Part 2: evaluate the perplexity on the WSJtest corpus.
Part 3: decode the lattices given (the order of the answers is given in order). Give the word error rate for each LM. Why are they different?
Part 4: for each lattice result, evaluate the perplexity of the recognized answers.
Part 5: use the "verbose" options to wordscore (-v) and ngram (-debug 2) to get the word error rates and per-word probabilities. What do you notice about the relationship between LM scores and word errors?

You can get extra credit by trying different ngram training options (e.g., different backoff/smoothing techniques, different counts), and reporting your observations.

Files

The training and test data are on stdsun (not the webserver), and can be found in
/class/cse794L/fosler/hw5
The lattice file names all end in .lat, and the files are in the same directory.

When you submit your results, just submit the answers to the above questions (the LMs themselves are not necessary). Remember to give the training commands so that the TA can debug if your answer does not match hers.


Submission instructions:

Write up all of your answers to the questions in a text editor so that it can be submitted electronically (txt files preferred). Put that file as well as your fsm files (preferably in separate subdirectories for each problem) in a directory called hw5, and use the submit command to send the files to the grader. The syntax of the submit command is:

submit c794aa lab5 hw5

Make sure that your writeup includes enough instructions that we will be able to run your fsms easily. That means telling us which files are which.

Have fun!