CSE630 Homeworks 5 & 6: Uncertainty, Decision Theory, and Reinforcement Learning
Due: Must be submitted by Wednesday, November 29 at 11:59
1 point late
penalty for each minute late
Submit instructions:
Write up all of your
answers in a text editor so
that it can be submitted electronically. Put all of the files
you want to submit in a directory on STDSUN, and submit it using the command:
> submit c630aa lab5 (lab5_dir)
Where lab5_dir is the directory containing the files you want to
submit.
This homework is worth 200 points total. The first 100 points will be credited to homework 5, the rest to homework 6. If you're planning to drop homework 6 then you can do only 100 points of the homework.
PART I: Deal or no deal (80 points)
I talked about the game show "Deal or No Deal" in class; if you
haven't seen ths show, a contestant chooses one of 26 suitcases which
could contain one of the following amounts:
| .01 | 1,000 |
| 1 | 5,000 |
| 5 | 10,000 |
| 10 | 25,000 |
| 25 | 50,000 |
| 50 | 75,000 |
| 75 | 100,000 |
| 100 | 200,000 |
| 200 | 300,000 |
| 300 | 400,000 |
| 400 | 500,000 |
| 500 | 750,000 |
| 750 | 1,000,000 |
The contestant must then open other suitcases, which gives both the
contestant and his/her opponent, the banker, information about amounts
that are not in the suitcase. After the contestant opens a certain
number of suitcases, the banker makes an offer to buy the case for a
certain amount of money, which the contestant can accept (deal) or
reject (no deal). If the contestant accepts, then the game is over;
if the contestant rejects, then the game continues with more cases
being opened.
- (10 pts) When the game starts, what's the probability that the contestant
has selected a case that is worth $100,000 or more?
- (10 pts) In one game, the contestant has opened cases with $100, $200,
$75,000, $100,000, $200,000, and $500,000. What is the probability
that the contestant has selected a case that is worth $100,000 or
more?
- (15 pts) As long as a contestant does not open the $1,000,000 case, give
the function that describes the probability of having the $1,000,000
case given that the contestant has opened n cases thus far.
- (20 pts) Part of the banker's job is to judge how risky a contestant is
willing to be. Let's imagine two contestants, A and B, who have
identical situations (same cases opened, same offer from the banker).
A makes a quick decision to reject the offer. B hesitates for a long
time, seems on the verge of taking the deal, and then rejects it.
What can you say about the offer and A's and B's preferences? Hint:
think lottery.
- (10 pts) Clearly, knowing the type of player helps the banker. Given the
above question, how can the banker try to figure out whether the
player is more like player A or player B early on?
- (15 pts) Imagine a scenario where the another player has five cases alive:
$1,000, $10,000, $20,000, $400,000, and $750,000. She's offered the
expected monetary value of the cases in play, and rejects the offer.
Interviewed after the game, she states that if she was given $10,000
more, then she would have taken the deal. What does this say about
the bounds of the utility of participating in the game? Your answer
should be a set of inequalities.
PART II: Reinforcement Learning (120 points)
In this homework, you have two options: you can treat this as a
pencil-and-paper exercise (max 100 points), or as a programming
assignment (max 120 points).
A transportation pod has appeared and you get inside. There are three
buttons on the control panel: scarlet, gray, and black. If you press
one of these buttons it will take you to a place on campus. Some
places have a reward: +1 if it's a good place, -1 if it's a bad place.
Obviously you want to learn to get to the good places and avoid the
bad ones. Any other state has a reward of -.05 (i.e. the agent would
like to get to a positive reward rather than hanging around in the
environment. Your job is to get an agent to figure out what's the
right button to press at every place that the machine goes to (in this
case, 10 places). In other words, you want to find the optimal policy
from all states.
You should use the Bellman update equation (17.6) to perform value iteration, i.e.:
U(s,i+1) <-- R(s) + gamma * maxa sums' T(s,a,s')U(s',i)
where U(s,i) is the utility of state s at iteration i, R(s) is the
reward in state s, T is the transition probability from s to s' given
action a. Assume that gamma=1 for these problems.
Problem 1 (70 points)
Problem 1 is a world with deterministic actions, meaning that if you do action a in state s, it will lead to some state s' with probability 1 and for all other states, probability 0. Thus T(s,a,s') = 1 for that state s'.
Use the following algorithm (based on value iteration) to compute the
utilities by exploring the space. You will need to keep track of the
utilities of the states you encounter.
- Start by setting the utilites of all states to 0. (If you encounter a new state in your travels, assume its utility is 0.)
- Enter the problem. This will take you to a random state.
- For every state:
- Compute the Bellman update of the utility for that state. To do this, you will need to note the utility of all of your neighbors. In the deterministic case, you'll be summing over just one s' (but multiple actions!). In a terminal state, there are no possible actions, so the sum over the actions is zero.
- If you are in a non-terminal state, choose an action and you'll be taken to a new state.
- If you are in a terminal state, choose reset and you will be taken to a random starting point.
- When you feel that you have converged (i.e. the utility numbers stop changing or only change a little), compute a policy from the utilities.
You should record each state you are in, the "best action" and the
updates you make to the utility. For example, if you are in state 1
(which has reward -.05) and the "Purple" action takes you to state 2
(which has utility -.02) and the "Green" action takes you to state 3 (which
has utility -.05):
State: 1 Best: Purple Utility(1)= -.05 + (1*-.02) = -.07
You may choose to do this via a computer program, in which case you
should output this information as well as turn in the code. Here are
the descriptions for each
state and the transition table for
the states. The description file has the following format:
statenumber,place name,image file,reward,comment
You only care aboue the statenumber and reward (possibly the place
name as well). The transitions file has the format:
transition_from_state,action,transition_to_state1,prob1,transition_to_state2,prob2,...
For this first problem, the actions are deterministic, so prob1 should
be 1.0 and there won't be other states that are transitioned to.
Actions are encoded "S" for scarlet, "G" for gray, and "B" for black.
The result you compute should have for each state the utility and
policy for that state:
Results
-------
State 1: Utility=0.2 Policy=Scarlet
State 2: Utility=0.3 Policy=Gray
...
(This example may or may not be related to the actual best policy;
it's just for demonstation.)
Problem 2 (50 points)
Problem 2 uses
the same state space, but now when you choose an action, it is
non-deterministic. The probabilities of reaching each state are shown
by the action link. The probabilities do not change over time.
Now, when you do the Bellman updates, there may be more than one state
in the summation. The transition probabilities T(s,a,s') are given by
the actions on the web page.
Again, if you want to do this in a computer program, the transition tables for problem 2
are here.
Write up your answers in the same way: record all of the states you
visit and the utility updates, and then the final utilities/policy.