> submit c630aa lab5 (lab5_dir)
Where lab5_dir is the directory containing the files you want to submit.
| | Win, GoodPitching | Win, BadPitching | Loses, GoodPitching | Loses, BadPitching |
| GoodBatting | 0.156 | 0.114 | 0.116 | 0.115 |
| BadBatting | 0.117 | 0.113 | 0.118 | 0.151 |
To create the classifier, write a program that tokenizes the input data into word-like units and counts the occurrences of each token in the spam and non-spam categories. Your tokenizer doesn't have to be very accurate; splitting on whitespace or punctuation is enough. You may want to differentiate header, subject, and body tokens.
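As a rough illustration of the tokenize-and-count step, here is a minimal Python sketch. The function names (`tokenize`, `count_tokens`), the lowercasing of tokens, and the sample messages are all assumptions made for this example and are not part of the assignment; it also does not separate header, subject, and body tokens.

```python
import re
from collections import Counter

def tokenize(text):
    """Split on runs of non-word characters (whitespace and punctuation).

    Lowercasing is an assumption of this sketch, not a requirement.
    """
    return [tok.lower() for tok in re.split(r"\W+", text) if tok]

def count_tokens(messages):
    """Count how often each token occurs across an iterable of messages."""
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts

# Hypothetical training data: one Counter per category.
spam_counts = count_tokens(["Win cash now!", "Cash prize: claim now"])
nonspam_counts = count_tokens(["Lunch at noon?", "See you at the lab"])
```

Keeping one `Counter` per category means a token that never appeared in a category simply has a count of 0 there, which is convenient when the counts are later turned into per-category statistics.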
For 20 points of credit, describe the process in a text writeup,
including pseudocode for the training and classification processes,
and walk through the classification process on a tiny sample text with
only 10 tokens. To receive full credit, turn in a README file and a
working program that we can test new input files against with a
command-line execution, and report the Precision and Recall of your
classifier on the test examples. See the slides from week 9 for the
equations for Precision and Recall. Your working program should
expect an input file containing one message to be classified.
> spamFilter < newtestfile.txt
classification: THIS IS SPAM
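For reporting Precision and Recall, the following is a minimal Python sketch using the standard definitions with spam as the positive class: Precision = TP / (TP + FP) and Recall = TP / (TP + FN). The notation in the course slides may differ. The function name `precision_recall`, the label strings "spam" and "nonspam", and the sample predictions are hypothetical; returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def precision_recall(predicted, actual, positive="spam"):
    """Compute precision and recall for one positive class.

    predicted and actual are equal-length sequences of labels.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == positive and a == positive)
    fp = sum(1 for p, a in pairs if p == positive and a != positive)
    fn = sum(1 for p, a in pairs if p != positive and a == positive)
    # Guard against division by zero (assumed convention: report 0.0).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical classifier output on five test messages.
predicted = ["spam", "spam", "nonspam", "nonspam", "spam"]
actual = ["spam", "nonspam", "spam", "nonspam", "spam"]
precision, recall = precision_recall(predicted, actual)
```

In this made-up sample there are 2 true positives, 1 false positive, and 1 false negative, so both values come out to 2/3.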