CSE630 Homeworks 5 & 6: Uncertainty, Decision Theory, and Reinforcement Learning

Due: Must be submitted by Wednesday, November 29 at 11:59
1 point late penalty for each minute late

Submit instructions:

Write up all of your answers in a text editor so that it can be submitted electronically. Put all of the files you want to submit in a directory on STDSUN, and submit it using the command:

> submit c630aa lab5 (lab5_dir)

Where lab5_dir is the directory containing the files you want to submit.

This homework is worth 200 points total. The first 100 points will be credited to homework 5, the rest to homework 6. If you're planning to drop homework 6 then you can do only 100 points of the homework.

PART I: Deal or no deal (80 points)

I talked about the game show "Deal or No Deal" in class; if you haven't seen ths show, a contestant chooses one of 26 suitcases which could contain one of the following amounts:

.01 1,000
1 5,000
5 10,000
10 25,000
25 50,000
50 75,000
75 100,000
100 200,000
200 300,000
300 400,000
400 500,000
500 750,000
750 1,000,000

The contestant must then open other suitcases, which gives both the contestant and his/her opponent, the banker, information about amounts that are not in the suitcase. After the contestant opens a certain number of suitcases, the banker makes an offer to buy the case for a certain amount of money, which the contestant can accept (deal) or reject (no deal). If the contestant accepts, then the game is over; if the contestant rejects, then the game continues with more cases being opened.

  1. (10 pts) When the game starts, what's the probability that the contestant has selected a case that is worth $100,000 or more?
  2. (10 pts) In one game, the contestant has opened cases with $100, $200, $75,000, $100,000, $200,000, and $500,000. What is the probability that the contestant has selected a case that is worth $100,000 or more?
  3. (15 pts) As long as a contestant does not open the $1,000,000 case, give the function that describes the probability of having the $1,000,000 case given that the contestant has opened n cases thus far.
  4. (20 pts) Part of the banker's job is to judge how risky a contestant is willing to be. Let's imagine two contestants, A and B, who have identical situations (same cases opened, same offer from the banker). A makes a quick decision to reject the offer. B hesitates for a long time, seems on the verge of taking the deal, and then rejects it. What can you say about the offer and A's and B's preferences? Hint: think lottery.
  5. (10 pts) Clearly, knowing the type of player helps the banker. Given the above question, how can the banker try to figure out whether the player is more like player A or player B early on?
  6. (15 pts) Imagine a scenario where the another player has five cases alive: $1,000, $10,000, $20,000, $400,000, and $750,000. She's offered the expected monetary value of the cases in play, and rejects the offer. Interviewed after the game, she states that if she was given $10,000 more, then she would have taken the deal. What does this say about the bounds of the utility of participating in the game? Your answer should be a set of inequalities.

PART II: Reinforcement Learning (120 points)

In this homework, you have two options: you can treat this as a pencil-and-paper exercise (max 100 points), or as a programming assignment (max 120 points).

A transportation pod has appeared and you get inside. There are three buttons on the control panel: scarlet, gray, and black. If you press one of these buttons it will take you to a place on campus. Some places have a reward: +1 if it's a good place, -1 if it's a bad place. Obviously you want to learn to get to the good places and avoid the bad ones. Any other state has a reward of -.05 (i.e. the agent would like to get to a positive reward rather than hanging around in the environment. Your job is to get an agent to figure out what's the right button to press at every place that the machine goes to (in this case, 10 places). In other words, you want to find the optimal policy from all states.

You should use the Bellman update equation (17.6) to perform value iteration, i.e.:

U(s,i+1) <-- R(s) + gamma * maxa sums' T(s,a,s')U(s',i)

where U(s,i) is the utility of state s at iteration i, R(s) is the reward in state s, T is the transition probability from s to s' given action a. Assume that gamma=1 for these problems.

Problem 1 (70 points)

Problem 1 is a world with deterministic actions, meaning that if you do action a in state s, it will lead to some state s' with probability 1 and for all other states, probability 0. Thus T(s,a,s') = 1 for that state s'.

Use the following algorithm (based on value iteration) to compute the utilities by exploring the space. You will need to keep track of the utilities of the states you encounter.

You should record each state you are in, the "best action" and the updates you make to the utility. For example, if you are in state 1 (which has reward -.05) and the "Purple" action takes you to state 2 (which has utility -.02) and the "Green" action takes you to state 3 (which has utility -.05):

State: 1  Best: Purple  Utility(1)= -.05 + (1*-.02) = -.07 

You may choose to do this via a computer program, in which case you should output this information as well as turn in the code. Here are the descriptions for each state and the transition table for the states. The description file has the following format:
statenumber,place name,image file,reward,comment
You only care aboue the statenumber and reward (possibly the place name as well). The transitions file has the format:
transition_from_state,action,transition_to_state1,prob1,transition_to_state2,prob2,...
For this first problem, the actions are deterministic, so prob1 should be 1.0 and there won't be other states that are transitioned to. Actions are encoded "S" for scarlet, "G" for gray, and "B" for black.

The result you compute should have for each state the utility and policy for that state:

Results
-------
State 1: Utility=0.2 Policy=Scarlet
State 2: Utility=0.3 Policy=Gray
...

(This example may or may not be related to the actual best policy; it's just for demonstation.)

Problem 2 (50 points)

Problem 2 uses the same state space, but now when you choose an action, it is non-deterministic. The probabilities of reaching each state are shown by the action link. The probabilities do not change over time.

Now, when you do the Bellman updates, there may be more than one state in the summation. The transition probabilities T(s,a,s') are given by the actions on the web page.

Again, if you want to do this in a computer program, the transition tables for problem 2 are here.

Write up your answers in the same way: record all of the states you visit and the utility updates, and then the final utilities/policy.