For the first few problems, you can assume that you can tell with 100% accuracy what chip is installed. In the control room, there is a switch that controls what type of chip is installed; at the start of the day there's a 70% chance that it's in anti-polka mode, and 30% chance that it's in dummy mode. When the system is turned on, it starts installing chips into TVs on the conveyor belt. (You can assume that the state of the switch for any time t is the same as the kind of chip installed at time t -- that is, they are the same variable. THIS IS A SIGNIFICANT HINT. IT MAKES THE PROBLEM MUCH SIMPLER.)
However, some devious professor has let their chimpanzee into the control room, and she starts flipping the switches in the control room. The switch can only be flipped between TVs (i.e. we don't install half of a chip). There is a 30% chance between every installation that the switch will be flipped (changing the type of chip installed).
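One way to reason about this setup is as a two-state Markov chain. The sketch below is only an illustration, not a required solution format: the state ordering, the names `INITIAL`, `TRANSITION`, `step`, and `chip_distribution`, and the convention that t = 0 is the first TV of the day are all assumptions made here.

```python
# State order (an assumption for this sketch): index 0 = anti-polka, index 1 = dummy.
INITIAL = [0.7, 0.3]

# The chimp flips the switch with probability 0.3 between installations.
FLIP = 0.3
TRANSITION = [
    [1 - FLIP, FLIP],   # from anti-polka
    [FLIP, 1 - FLIP],   # from dummy
]

def step(dist, transition):
    """Advance a distribution over states by one installation."""
    n = len(dist)
    return [sum(dist[i] * transition[i][j] for i in range(n)) for j in range(n)]

def chip_distribution(t):
    """Return [P(anti-polka), P(dummy)] for TV number t (t = 0 is the first TV)."""
    dist = list(INITIAL)
    for _ in range(t):
        dist = step(dist, TRANSITION)
    return dist
```

For example, `chip_distribution(1)` applies a single transition to the start-of-day distribution.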
Now, what if a second set of chips is installed, controlling the language? There are three chips (LanguageA/LanguageB/LanguageC), and a set of push-buttons that controls them as well (pushing a button for a particular language will exclusively set the chip to that language -- only one button is active at a time). The probability of the buttons being in LanguageA mode at the beginning of the day is 60%, and the probability of LanguageC mode is 20%. Again, the probability of the chimp pushing a different button is 30% between every TV, evenly divided between the other languages (when in state "LanguageA" there's a 15% chance of the LanguageB button being pushed and a 15% chance of the LanguageC button being pushed). Flipping the polka switch and pushing the language buttons are independent events.
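Because the two controls are independent, the joint model over (polka chip, language chip) pairs can be built by multiplying the two individual models. The following is a minimal sketch under some assumptions: the function and variable names are invented for illustration, and the 20% starting probability for LanguageB is inferred as the remainder after LanguageA (60%) and LanguageC (20%).

```python
from itertools import product

POLKA = ["Anti-Polka", "Dummy"]
LANGS = ["LanguageA", "LanguageB", "LanguageC"]

POLKA_INIT = {"Anti-Polka": 0.7, "Dummy": 0.3}
# LanguageB is not stated directly; 0.2 is inferred as 1 - 0.6 - 0.2.
LANG_INIT = {"LanguageA": 0.6, "LanguageB": 0.2, "LanguageC": 0.2}

def polka_trans(src, dst):
    """P(polka chip dst at the next TV | polka chip src now)."""
    return 0.7 if src == dst else 0.3

def lang_trans(src, dst):
    """P(language dst at the next TV | language src now)."""
    return 0.7 if src == dst else 0.15

# Six joint states: every (polka chip, language chip) pair.
STATES = list(product(POLKA, LANGS))

def joint_init(state):
    """Start-of-day probability of a joint state (independence assumed)."""
    polka, lang = state
    return POLKA_INIT[polka] * LANG_INIT[lang]

def joint_trans(src, dst):
    """Joint transition probability as a product of the two independent chains."""
    return polka_trans(src[0], dst[0]) * lang_trans(src[1], dst[1])
```

Each row of the joint transition table still sums to 1, which is a quick sanity check on any hand-built version.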
Now assume that you can't tell what kind of chips are inside the TV. However, you can weigh the TVs, and the chips will probabilistically affect the weight. Use the following table for P(Weight|Chip1, Chip2):
| Chip 1 | Chip 2 | P(weight=heavy \| Chip1, Chip2) | P(weight=medium \| Chip1, Chip2) | P(weight=light \| Chip1, Chip2) |
|---|---|---|---|---|
| Anti-Polka | LanguageA | 0.7 | 0.2 | 0.1 |
| Anti-Polka | LanguageB | 0.5 | 0.3 | 0.2 |
| Anti-Polka | LanguageC | 0.4 | 0.3 | 0.3 |
| Dummy | LanguageA | 0.3 | 0.5 | 0.2 |
| Dummy | LanguageB | 0.1 | 0.6 | 0.3 |
| Dummy | LanguageC | 0.1 | 0.4 | 0.5 |
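With hidden chips and observed weights, the setup resembles a hidden Markov model, and the forward algorithm is one standard way to track beliefs about the chips. The sketch below assumes that the switch and button dynamics from the earlier problems still apply, that LanguageB starts the day at 20% (inferred as the remainder), and that each TV is weighed once; all names are invented for illustration.

```python
from itertools import product

POLKA = ["Anti-Polka", "Dummy"]
LANGS = ["LanguageA", "LanguageB", "LanguageC"]
STATES = list(product(POLKA, LANGS))

POLKA_INIT = {"Anti-Polka": 0.7, "Dummy": 0.3}
LANG_INIT = {"LanguageA": 0.6, "LanguageB": 0.2, "LanguageC": 0.2}  # B inferred

# P(weight | Chip1, Chip2), copied from the table above.
EMISSION = {
    ("Anti-Polka", "LanguageA"): {"heavy": 0.7, "medium": 0.2, "light": 0.1},
    ("Anti-Polka", "LanguageB"): {"heavy": 0.5, "medium": 0.3, "light": 0.2},
    ("Anti-Polka", "LanguageC"): {"heavy": 0.4, "medium": 0.3, "light": 0.3},
    ("Dummy", "LanguageA"): {"heavy": 0.3, "medium": 0.5, "light": 0.2},
    ("Dummy", "LanguageB"): {"heavy": 0.1, "medium": 0.6, "light": 0.3},
    ("Dummy", "LanguageC"): {"heavy": 0.1, "medium": 0.4, "light": 0.5},
}

def transition(src, dst):
    """Joint transition probability; the two controls change independently."""
    polka = 0.7 if src[0] == dst[0] else 0.3
    lang = 0.7 if src[1] == dst[1] else 0.15
    return polka * lang

def forward(observations):
    """Forward algorithm over a non-empty list of observed weights.

    Returns (posterior over chip pairs for the last TV, likelihood of the sequence).
    """
    # alpha[s] = P(observations so far, current state = s)
    alpha = {
        s: POLKA_INIT[s[0]] * LANG_INIT[s[1]] * EMISSION[s][observations[0]]
        for s in STATES
    }
    for obs in observations[1:]:
        alpha = {
            dst: EMISSION[dst][obs]
            * sum(alpha[src] * transition(src, dst) for src in STATES)
            for dst in STATES
        }
    likelihood = sum(alpha.values())
    posterior = {s: a / likelihood for s, a in alpha.items()}
    return posterior, likelihood
```

A call such as `forward(["heavy", "light", "medium"])` returns the belief about the third TV's chips along with the probability of seeing that weight sequence.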
Your job is to train two different classifiers (below) to predict where I eat lunch. I have provided training and test materials in three formats: a text-attribute format, with the name of the restaurant in the last column (train) (test); a coded format, where the text attributes are replaced with integers (starting from 0, in the order above) (train) (test); and a binary "one-hot" version, which codes the nine factors in the first 30 columns and the restaurant in the last 4 (train) (test). A "one-hot" encoding represents a value with n options as n columns, with a 1 in the column for the chosen option and a 0 in all the others (e.g., Oxley's is restaurant 2 of 4, so the last 4 columns would be 0,1,0,0). This last encoding may be useful for training a perceptron. The three training files contain exactly the same data, just in different encodings; the same is true of the test files.
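To make the relationship between the coded and one-hot formats concrete, here is a small sketch of the conversion. The helper names `one_hot` and `encode_row` are invented for illustration, and the option counts in the usage example are hypothetical, not the real attribute sizes from the data files.

```python
def one_hot(index, n_options):
    """Return a list of n_options bits with a single 1 at position index (0-based)."""
    if not 0 <= index < n_options:
        raise ValueError("index out of range for this attribute")
    return [1 if i == index else 0 for i in range(n_options)]

def encode_row(coded_row, sizes):
    """Concatenate the one-hot encodings of each integer-coded column.

    coded_row: integers, one per column, each starting from 0.
    sizes: number of options for each column (hypothetical values in this sketch).
    """
    bits = []
    for value, size in zip(coded_row, sizes):
        bits.extend(one_hot(value, size))
    return bits

# Oxley's is restaurant 2 of 4, i.e. code 1 when counting from 0.
oxleys = one_hot(1, 4)  # [0, 1, 0, 0]
```

With hypothetical sizes `[3, 2, 4]`, the coded row `[2, 0, 1]` becomes nine bits: three for the first column, two for the second, and four for the last.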
You will find three sets of data here: file1(txt) file2(txt) file3(txt). You will need the code you developed in hw1 for reading in data files and for calculating means and standard deviations.
Create a program to estimate the means, standard deviations, and weights of a mixture of Gaussians via the EM algorithm. You will probably want to have three functions: one that performs the expectation step, one that performs the maximization step, and one that runs the outer loop.
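The three-function structure suggested above might look like the following sketch for one-dimensional data. It is an illustration under assumptions, not a reference solution: the function names, the stopping rule (largest change in any mean below `tol`), the `min_std` floor that keeps a component from collapsing onto a single point, and the caller-supplied initial parameters are all choices made here.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def expectation(data, means, stds, weights):
    """E-step: responsibilities[i][k] = P(component k | data[i])."""
    responsibilities = []
    k = len(means)
    for x in data:
        joint = [w * gaussian_pdf(x, m, s) for m, s, w in zip(means, stds, weights)]
        total = sum(joint)
        if total == 0.0:
            # Numerical underflow: fall back to equal responsibility.
            responsibilities.append([1.0 / k] * k)
        else:
            responsibilities.append([j / total for j in joint])
    return responsibilities

def maximization(data, responsibilities, min_std=1e-3):
    """M-step: re-estimate means, standard deviations, and mixture weights."""
    n = len(data)
    k = len(responsibilities[0])
    means, stds, weights = [], [], []
    for j in range(k):
        resp_j = [r[j] for r in responsibilities]
        total = max(sum(resp_j), 1e-12)  # guard against an empty component
        mean = sum(r * x for r, x in zip(resp_j, data)) / total
        var = sum(r * (x - mean) ** 2 for r, x in zip(resp_j, data)) / total
        means.append(mean)
        stds.append(max(math.sqrt(var), min_std))
        weights.append(total / n)
    return means, stds, weights

def run_em(data, means, stds, weights, max_iters=200, tol=1e-8):
    """Outer loop: alternate E and M steps until the means stop moving."""
    for _ in range(max_iters):
        resp = expectation(data, means, stds, weights)
        new_means, new_stds, new_weights = maximization(data, resp)
        shift = max(abs(a - b) for a, b in zip(new_means, means))
        means, stds, weights = new_means, new_stds, new_weights
        if shift < tol:
            break
    return means, stds, weights
```

A call looks like `run_em(data, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])` for a two-component fit. EM is sensitive to initialization, so trying several starting points is a reasonable precaution.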
> submit c730aa lab2 hw2
You may also choose to hand in the non-code portion of the assignment by leaving it in my mailbox by 5pm on the due date.
FOR CODE, YOU MUST SUBMIT ONLINE. TAKE THE TIME TO FIND OUT HOW IF YOU DON'T KNOW HOW.
For the code, you should turn in either
Eric Fosler-Lussier Last modified: Wed Oct 17 21:49:37 EDT 2007