CSE630 Homework 5: Uncertainty & Classification

Due: Must be submitted by November 21 at 04:00
1 point late penalty for each minute late

Submit instructions:

Write up all of your answers in a text editor so that it can be submitted electronically, and make sure any programs you submit are runnable on the CSE dept unix computers. Put all of the files you want to submit in a directory on STDSUN, and submit it using the command:

> submit c630aa lab5 (lab5_dir)

Where lab5_dir is the directory containing the files you want to submit.

  1. (25 points) You have a twelve sided die (with sides numbered 1-12) and a four sided (pyramid) die (with sides numbered 1-4).
    1. What is the probability of rolling a 3 on the 12-sided die?
    2. What is the probability of rolling a 3 or greater on the 4-sided die?
    3. Rolling the two dice together, what is the probability of rolling a 5 (the total of the two dice)?
    Now, a variant on this theme: the game now happens in two phases: you roll the 12-sided die, giving you score R1. If R1<4, then you roll the 12-sided die again (giving R2). If R1>3, then you roll the 4-sided die to get R2. The total score is R1+R2.
    1. What is the probability of scoring an 8?
    2. What is the probability of scoring a 6?

  2. (35 points) In baseball, two elements that contribute to whether you win or lose a game is whether you have good pitching and whether you have good batting. Consider the following table, which looks at the statistics for major league baseball games, presented as the joint probability of winning, pitching, and batting. To derive this data, each team was labeled as having "GoodPitching" or "BadPitching" and "GoodBatting" or "BadBatting". (In case you're interested, we derived this by looking at the ERA and batting average of the teams; half of the teams in the ranking were labeled "good" and half were "bad").

    Win Loses
    GoodPitching BadPitching GoodPitching BadPitching
    GoodBatting 0.156 0.114 0.116 0.115
    BadBatting 0.117 0.113 0.118 0.151

    1. What is the probability of a team having good pitching, bad batting, and loosing a game? Write the answer as P(...)=###, where you fill in the ... and ### appropriately.
    2. What is the probability of a team loosing a game given that they have good pitching and bad batting? WARNING: this is not the same as the above question. Write the answer as P(...)=###, where you fill in the ... and ### appropriately.
    3. From the chart, it appears that the teams have almost an equal chance of winning no matter how they play. Is it true that as long as you don't have GoodPitching and GoodBatting, your chances of winning are about the same (the chance of winning given the other three combinations of values)? Explain why or why not, supporting your answers with probabilities.

  3. (40 points) Spam filtering:
    Read this wikipedia article on Bayesian reasoning for Spam filtering. Use the dataset of spam emails and Enron emails (non-spam) to train your spam filtering process. The enron emails are available separately in individual directories, or I concatenated them together in the file all-nonspam-cat.txt, so use that file if you have room to store a file that big, otherwise you can use the individual files to make a smaller corpus. After training it from the training data, use this test data to test your classifier. There are around 30 spam emails and 30 nonspam emails to classify.

    To create the classifier, you will need to create a program that tokenizes the input data into word-like units and counts the occurrence of each token in the spam and non-spam categories. Your tokenizer doesn't have to be very accurate, just use whitespace or punctuation to tokenize. You may want to differentiate header/subject/body tokens.

    For 20 points of credit, describe the process in a text writeup, including pseudocode for the training and classification processes, and walk through the classification process on a tiny sample text with only 10 tokens. To receive full credit, turn in a README file and a working program that we can test new input files against with a command-line execution, and report the Precision and Recall of your classifier on the test examples. See the slides from week9 for the equations for Precision and Recall. Your working program should expect an input file containing one message to be classified.
    > spamFilter < newtestfile.txt
    classification: THIS IS SPAM