This assignment requires you to create a program that builds on the LingPipe part-of-speech tagging tutorial, allowing a more informative evaluation of the system’s performance. Specifically, you are going to create a precision-recall curve. If you haven’t already, you will need to study the part-of-speech tagging tutorial.
A precision-recall curve is a tool from information retrieval that gives a visual representation of the tradeoff between accuracy (precision) and completeness (recall). A good description is at Hinrich Schuetze's IR book website. In order to compute this curve, we need to treat the part-of-speech tagging task as a binary decision. We do this by picking one of the categories (for example, vb) and calling it the target tag. What you are going to do is extract, for every token in the Brown corpus, the probability that the tagger gives to the target tag. You will have to adapt the LingPipe code to provide the necessary information.
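To make the binary-decision idea concrete, here is a minimal sketch of how a precision-recall curve could be computed from per-token records of (correct tag, probability of the target tag). It is not the grader's code, and it makes no LingPipe calls. The class and method names (PrecisionRecallSketch, Scored, curve) are hypothetical, and the toy probabilities in main are made up for illustration. Tokens are ranked by the probability assigned to the target tag, and each rank cutoff k yields one (precision, recall) point.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: precision-recall points for one target tag (hypothetical names). */
class PrecisionRecallSketch {

    /** One token: its correct tag and the tagger's probability for the target tag. */
    static final class Scored {
        final String goldTag;
        final double targetProb;

        Scored(String goldTag, double targetProb) {
            this.goldTag = goldTag;
            this.targetProb = targetProb;
        }
    }

    /**
     * Returns one {precision, recall} pair per rank cutoff k, where the k tokens
     * with the highest target-tag probability count as predicted positives.
     * Returns an empty list if the target tag never occurs as a correct tag.
     */
    static List<double[]> curve(List<Scored> tokens, String targetTag) {
        List<Scored> sorted = new ArrayList<>(tokens);
        // Highest probability first; ties keep their input order (stable sort).
        sorted.sort((a, b) -> Double.compare(b.targetProb, a.targetProb));

        int totalPositives = 0;
        for (Scored s : sorted) {
            if (s.goldTag.equals(targetTag)) totalPositives++;
        }
        List<double[]> points = new ArrayList<>();
        if (totalPositives == 0) return points;

        int truePositives = 0;
        for (int k = 1; k <= sorted.size(); k++) {
            if (sorted.get(k - 1).goldTag.equals(targetTag)) truePositives++;
            double precision = (double) truePositives / k;
            double recall = (double) truePositives / totalPositives;
            points.add(new double[] { precision, recall });
        }
        return points;
    }

    public static void main(String[] args) {
        // Toy data, invented for illustration only.
        List<Scored> tokens = Arrays.asList(
            new Scored("vb", 0.9), new Scored("nn", 0.8),
            new Scored("vb", 0.6), new Scored("nn", 0.1));
        for (double[] p : curve(tokens, "vb")) {
            System.out.println("precision=" + p[0] + " recall=" + p[1]);
        }
    }
}
```

Plotting recall on the x-axis against precision on the y-axis gives the curve. Ranking by cutoff rather than by a fixed probability threshold is a design choice of this sketch; tokens with equal probabilities would need a tie-handling policy in a real evaluation.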
Run LingPipe's RunMedPost code. You can see the current (but for our purposes imperfect) behavior by running:
java -cp build/classes:../../../lingpipe-3.7.0.jar RunMedPost ../../models/pos-en-brown.HiddenMarkovModel
when your current directory is
/path/to/your/lingpipe-3.7.0/demos/tutorial/posTags
Notice that the “CONFIDENCE” section of each output gives 5 possible tags for each token, with probabilities. This is the information that you will need in order to create the precision-recall curve.
Make a copy of RunMedPost.java (in demos/tutorial/posTags/src), naming it RunOSULab1.java. (Java’s file naming conventions mean that you will need to change the name of the top-level class to RunOSULab1 too, or it won’t compile.) While you’re doing that, adjust the code so that the CONFIDENCE section of each output produces 6 alternative tags for each token rather than 5. Run ant in the demos/tutorial/posTags directory to compile your new code. Test it with:
java -cp build/classes:../../../lingpipe-3.7.0.jar RunOSULab1 ../../models/pos-en-general-brown.HiddenMarkovModel
Make sure it does what you expect.
Next, adjust your code so that it takes input sentences from the Brown corpus rather than from the command line (N.B. you will need the data, which is on the course website, and you will need to work out how to access it). Section 3 of the tutorial has valuable code that you will be able to adapt for this purpose. Use the train-a-little, tag-a-little paradigm from the tutorial (with 170,000 characters of priming before any tests, as in the tutorial).
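The control flow of the train-a-little, tag-a-little paradigm can be sketched without any LingPipe calls. Everything named below (OnlineTagger, MostFrequentTagger, evaluate) is a hypothetical stand-in, not part of LingPipe or the tutorial code, and the toy tagger is only there to make the loop runnable. The sketch also assumes that priming is measured by summing token lengths; check the tutorial code for how it actually counts characters.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of a train-a-little, tag-a-little loop (hypothetical names throughout). */
class TrainTagSketch {

    /** Hypothetical stand-in for a trainable tagger; not a LingPipe interface. */
    interface OnlineTagger {
        void train(String[] tokens, String[] tags);
        String[] tag(String[] tokens);
    }

    /** Toy baseline: gives each word the tag it was most often trained with. */
    static final class MostFrequentTagger implements OnlineTagger {
        private final Map<String, Map<String, Integer>> counts = new HashMap<>();

        public void train(String[] tokens, String[] tags) {
            for (int i = 0; i < tokens.length; i++) {
                counts.computeIfAbsent(tokens[i], k -> new HashMap<>())
                      .merge(tags[i], 1, Integer::sum);
            }
        }

        public String[] tag(String[] tokens) {
            String[] out = new String[tokens.length];
            for (int i = 0; i < tokens.length; i++) {
                String best = "nn"; // arbitrary default for unseen words
                int bestCount = 0;
                Map<String, Integer> tagCounts = counts.get(tokens[i]);
                if (tagCounts != null) {
                    for (Map.Entry<String, Integer> e : tagCounts.entrySet()) {
                        if (e.getValue() > bestCount) {
                            best = e.getKey();
                            bestCount = e.getValue();
                        }
                    }
                }
                out[i] = best;
            }
            return out;
        }
    }

    /**
     * Visits sentences in order. Once at least primingChars characters have been
     * trained on, each sentence is tagged BEFORE it is added to the training data.
     * Each corpus entry is {tokens, correctTags}. Returns {tokensTested, tokensCorrect}.
     */
    static int[] evaluate(OnlineTagger tagger, List<String[][]> corpus, long primingChars) {
        long charsTrained = 0;
        int tested = 0;
        int correct = 0;
        for (String[][] sentence : corpus) {
            String[] tokens = sentence[0];
            String[] goldTags = sentence[1];
            if (charsTrained >= primingChars) {       // tag a little
                String[] guess = tagger.tag(tokens);
                for (int i = 0; i < tokens.length; i++) {
                    tested++;
                    if (guess[i].equals(goldTags[i])) correct++;
                }
            }
            tagger.train(tokens, goldTags);           // train a little
            for (String token : tokens) charsTrained += token.length();
        }
        return new int[] { tested, correct };
    }

    public static void main(String[] args) {
        List<String[][]> corpus = Arrays.asList(
            new String[][] { { "the", "dog", "runs" }, { "at", "nn", "vbz" } },
            new String[][] { { "the", "cat", "runs" }, { "at", "nn", "vbz" } });
        // The assignment calls for 170000 here; 5 keeps the toy corpus meaningful.
        int[] result = evaluate(new MostFrequentTagger(), corpus, 5);
        System.out.println(result[1] + " of " + result[0] + " test tokens correct");
    }
}
```

The important property is the ordering inside the loop: a sentence is always tagged by a model that has not yet seen it, so every test case after the priming phase is a genuine held-out case.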
Finally, adjust your code so that it takes the target tag as a command line argument, and prints out the following things for each test case:
A header line (with no spaces) saying:
Test_case_n
For each token in the test case, a line of the form:
The word <tab> The correct tag <tab> The highest scoring tag and its probability <tab> The target tag and its probability <newline>
A separator line consisting of just <newline>
This is the final product: a program that the grader can run to produce a precision-recall curve for any desired tag. Use the format from section 3 of the tutorial, with a colon separating tags and their probabilities. The grader will have code that reads the correct tag and the probabilities for the target tag, and calculates the precision-recall curve.
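As an illustration of the required layout, here is a minimal sketch of a formatter for one test case. It rests on several assumptions: that the header is Test_case_ followed by the test case number, that each tag comes before its probability (tag:probability), and that three decimal places are enough. None of these details is fixed by the description above, so check section 3 of the tutorial for the exact convention. The class and method names are hypothetical.

```java
import java.util.Locale;

/** Sketch of the per-test-case output layout (hypothetical names). */
class TestCaseFormatSketch {

    /** Joins a tag and its probability with a colon, e.g. vb:0.400 (assumed order). */
    static String tagAndProb(String tag, double prob) {
        // Locale.ROOT keeps the decimal separator a period on every machine.
        return tag + ":" + String.format(Locale.ROOT, "%.3f", prob);
    }

    /**
     * Builds the text for one test case: a header line, one tab-separated
     * line per token, and an empty separator line.
     */
    static String formatCase(int caseNumber, String[] words, String[] goldTags,
                             String[] bestTags, double[] bestProbs,
                             String targetTag, double[] targetProbs) {
        StringBuilder sb = new StringBuilder();
        sb.append("Test_case_").append(caseNumber).append('\n');
        for (int i = 0; i < words.length; i++) {
            sb.append(words[i]).append('\t')
              .append(goldTags[i]).append('\t')
              .append(tagAndProb(bestTags[i], bestProbs[i])).append('\t')
              .append(tagAndProb(targetTag, targetProbs[i])).append('\n');
        }
        sb.append('\n'); // separator line
        return sb.toString();
    }

    public static void main(String[] args) {
        // Toy values, invented for illustration only.
        System.out.print(formatCase(1,
            new String[] { "dogs", "run" },
            new String[] { "nns", "vb" },
            new String[] { "nns", "nn" },
            new double[] { 0.95, 0.6 },
            "vb",
            new double[] { 0.01, 0.4 }));
    }
}
```

Building the whole test case as one string keeps the formatting logic separate from the tagging code, which makes it easy to check the layout on its own.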
For fun, here are some extra difficulty items that you can incorporate into your submission
Make sure that your name appears in all files that you turn in.
I prefer you to turn in your code so that it can be executed/tested on the CSE Unix machines.
Regardless of where your code runs, turn in your submission electronically using your CSE account. Create a directory containing only the files that you want to turn in. Use the submit command to send the files to the grader. If your directory is called lab1, the syntax of the submit command is:
> submit c732aa lab1 lab1
I recommend keeping the confirmation message that the submit command sends to you as documentation of the time you submitted.
Last modified: April 6 2009