Programming Assignment 1: Evaluation of Part of Speech Tagging

Due Friday, April 17, 11:59 P.M.
No late homework is allowed. Turn in what you have at the due date/time.

 

 

Introduction

This assignment requires you to create a program that builds on the LingPipe part-of-speech tagging tutorial and allows a more informative evaluation of the system’s performance. Specifically, you are going to create a precision-recall curve. If you haven’t already, you will need to study the part-of-speech tagging tutorial.

 

Precision-recall curves

A precision-recall curve is a tool from information retrieval that gives a visual representation of the tradeoff between accuracy (precision) and completeness (recall). A good description is at Hinrich Schuetze's IR book website. In order to compute this curve, we need to treat the part-of-speech tagging task as a binary decision. We do this by picking one of the categories (for example, vb) and calling it the target tag. For every token in the Brown corpus, you will extract the probability that the tagger gives to the target tag. You have to adapt the LingPipe code to provide the necessary information.
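To make the binary-decision view concrete, here is a minimal sketch of how a precision-recall curve could be computed once each token has a target-tag probability and a correct tag. It is not the grader's code; the class and method names (PrCurveSketch, Scored, curve) are made up for illustration, and it treats every position in the ranking as its own cutoff without any special handling of tied probabilities.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class PrCurveSketch {

    /** One token: the probability given to the target tag, and whether the correct tag is the target. */
    static final class Scored {
        final double targetProb;
        final boolean isTarget;

        Scored(double targetProb, boolean isTarget) {
            this.targetProb = targetProb;
            this.isTarget = isTarget;
        }
    }

    /** Returns one {recall, precision} pair per token, walking down the ranking by target probability. */
    static List<double[]> curve(List<Scored> tokens) {
        // Rank tokens from most to least likely to carry the target tag.
        List<Scored> ranked = new ArrayList<>(tokens);
        ranked.sort(Comparator.comparingDouble((Scored s) -> s.targetProb).reversed());

        int totalPositives = 0;
        for (Scored s : ranked) {
            if (s.isTarget) totalPositives++;
        }

        List<double[]> points = new ArrayList<>();
        int truePositives = 0;
        for (int i = 0; i < ranked.size(); i++) {
            // Accept the top (i + 1) tokens as "target" and score that decision.
            if (ranked.get(i).isTarget) truePositives++;
            double precision = truePositives / (double) (i + 1);
            double recall = totalPositives == 0 ? 0.0 : truePositives / (double) totalPositives;
            points.add(new double[] {recall, precision});
        }
        return points;
    }

    public static void main(String[] args) {
        List<Scored> tokens = new ArrayList<>();
        tokens.add(new Scored(0.9, true));
        tokens.add(new Scored(0.8, false));
        tokens.add(new Scored(0.6, true));
        tokens.add(new Scored(0.1, false));
        for (double[] p : curve(tokens)) {
            System.out.println(p[0] + "\t" + p[1]); // recall <tab> precision
        }
    }
}
```

Plotting precision against recall for these points gives the curve; a tagger that assigns high target-tag probability only to true instances of the target keeps precision high as recall grows.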

 

Requirements

Run LingPipe's RunMedPost code. You can see the current (but for our purposes imperfect) behavior by running:


java -cp build/classes:../../../lingpipe-3.7.0.jar RunMedPost ../../models/pos-en-general-brown.HiddenMarkovModel

 

when your current directory is

 

/path/to/your/lingpipe-3.7.0/demos/tutorial/posTags

 

Notice that the “CONFIDENCE” section of each output gives 5 possible tags for each token, with probabilities. This is the information that you will need in order to create the precision-recall curve.

 

Make a copy of RunMedPost.java (in demos/tutorial/posTags/src), naming it RunOSULab1.java. (Java’s file naming conventions mean that you will need to change the name of the top-level class to RunOSULab1 too, or it won’t compile.) While you’re doing that, adjust the code so that the CONFIDENCE section of each output produces 6 alternative tags for each token rather than 5. Run ant in the demos/tutorial/posTags directory to compile your new code. Test it with:

 

java -cp build/classes:../../../lingpipe-3.7.0.jar RunOSULab1 ../../models/pos-en-general-brown.HiddenMarkovModel

 

 

Make sure it does what you expect.

 

Next, adjust your code so that it takes input sentences from the Brown corpus rather than from the command line. (N.B. you will need the data, which is on the course website, and you will need to work out how to access it.) Section 3 of the tutorial has valuable code that you will be able to adapt for this purpose. Use the train-a-little, tag-a-little paradigm from the tutorial (with 170,000 characters of priming before any tests, as in the tutorial).
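As a rough illustration of the train-a-little, tag-a-little idea, the sketch below tests on each sentence before training on it, but only once enough characters have been seen for priming. It does not use LingPipe: the OnlineTagger interface, the run method, and the way characters are counted (token characters only) are assumptions made for this example, so check Section 3 of the tutorial for how the real code counts and evaluates.

```java
import java.util.ArrayList;
import java.util.List;

final class TrainALittleSketch {

    /** Hypothetical stand-in for a trainable tagger; this is not a LingPipe interface. */
    interface OnlineTagger {
        void train(String[] tokens, String[] tags);
        String[] tag(String[] tokens);
    }

    /**
     * Walks through the corpus once. After primingChars characters of training,
     * each sentence is first tagged (a test case) and only then used for training.
     * Returns the number of test cases.
     */
    static int run(OnlineTagger tagger, List<String[]> sentences,
                   List<String[]> goldTags, long primingChars) {
        long charsTrained = 0;
        int testCases = 0;
        for (int i = 0; i < sentences.size(); i++) {
            String[] tokens = sentences.get(i);
            if (charsTrained >= primingChars) {
                // The tagger has not yet seen this sentence, so this is a fair test.
                tagger.tag(tokens);
                testCases++;
            }
            tagger.train(tokens, goldTags.get(i));
            for (String token : tokens) {
                charsTrained += token.length(); // assumption: count token characters only
            }
        }
        return testCases;
    }

    public static void main(String[] args) {
        // A do-nothing tagger, just to show the control flow.
        OnlineTagger dummy = new OnlineTagger() {
            public void train(String[] tokens, String[] tags) { }
            public String[] tag(String[] tokens) { return new String[tokens.length]; }
        };
        List<String[]> sentences = new ArrayList<>();
        List<String[]> tags = new ArrayList<>();
        sentences.add(new String[] {"The", "dog", "barks"});
        tags.add(new String[] {"at", "nn", "vbz"});
        sentences.add(new String[] {"It", "runs"});
        tags.add(new String[] {"pps", "vbz"});
        // For the assignment the priming threshold would be 170000.
        System.out.println(run(dummy, sentences, tags, 10));
    }
}
```

Testing before training is what keeps the evaluation honest: every test sentence is unseen at the moment it is tagged, yet the whole corpus still ends up as training data.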

 

Finally, adjust your code so that it takes the target tag as a command line argument and prints out the following things for each test case:

 

A header line (with no spaces) saying:

 

Test_case_n

 

For each token in the test case

 

The word <tab> The correct tag <tab> The highest scoring tag and its probability <tab> The target tag and its probability <newline>

 

A separator line consisting of just <newline>

 

This is the final product: a program that the grader can run to produce a precision-recall curve for any desired tag. Use the format from section 3 of the tutorial, with a colon separating tags and their probabilities. The grader will have code that reads the correct tag and the probabilities for the target tag, and calculates the precision-recall curve.
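The output layout described above could be produced by something like the following sketch. The class and method names are invented for illustration, the header is assumed to be Test_case_ followed directly by the test case number (since it must contain no spaces), and the three-decimal probability format is an assumption; match whatever section 3 of the tutorial prints.

```java
import java.util.Locale;

final class TestCaseFormatSketch {

    /** Formats "tag:probability"; three decimals is an assumption, not a requirement. */
    static String tagWithProb(String tag, double prob) {
        return tag + ":" + String.format(Locale.ROOT, "%.3f", prob);
    }

    /**
     * Builds the text for one test case: a header line, one tab-separated
     * line per token, and an empty separator line.
     */
    static String formatTestCase(int n, String[] words, String[] correctTags,
                                 String[] bestTags, double[] bestProbs,
                                 String targetTag, double[] targetProbs) {
        StringBuilder sb = new StringBuilder();
        sb.append("Test_case_").append(n).append('\n');
        for (int i = 0; i < words.length; i++) {
            sb.append(words[i]).append('\t')
              .append(correctTags[i]).append('\t')
              .append(tagWithProb(bestTags[i], bestProbs[i])).append('\t')
              .append(tagWithProb(targetTag, targetProbs[i])).append('\n');
        }
        sb.append('\n'); // separator line
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up probabilities, purely to show the layout.
        System.out.print(formatTestCase(
            1,
            new String[] {"The", "dog"},
            new String[] {"at", "nn"},
            new String[] {"at", "nn"},
            new double[] {0.998, 0.75},
            "vb",
            new double[] {0.0, 0.125}));
    }
}
```

Keeping the formatting in one method makes it easy to check the output against the specification before wiring it into the tagging loop.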


 

 

Extra Credit

For fun, here are some extra difficulty items that you can incorporate into your submission

Resources


An online book on Java

What to turn in

  1. A test report showing that you managed to run your program yourself.
  2. A tar or zip file containing your copy of the posTags demo directory from LingPipe, with your modifications.
  3. A README file that instructs the grader how to run your program. You can assume that the grader has a copy of the brown.zip file, but you need to be completely explicit and clear about how the grader is supposed to tell your program where it is.

Make sure that your name appears in all files that you turn in.

I would prefer that you turn in code that can be executed and tested on the CSE Unix machines.

How to turn it in

Regardless of where your code runs, turn in your submission electronically using your CSE account. Create a directory containing only the files that you want to turn in. Use the submit command to send the files to the grader. If your directory is called lab1, the syntax of the submit command is:

> submit c732aa lab1 lab1

I recommend keeping the confirmation message that the submit command sends you as documentation of when you submitted.


Last modified:  April 6 2009