CSE630 Homework 6: Reinforcement learning

Due: Must be submitted by Friday, June 3 at 11:59 PM
1 point late penalty for each minute late

Submit instructions:

Write up all of your answers in a text editor so that it can be submitted electronically. Put all of the files you want to submit in a directory on STDSUN, and submit it using the command:

> submit c630aa lab6 (lab6_dir)

Where lab6_dir is the directory containing the files you want to submit.

General problem description

In this homework, you have two options: you can treat this as a pencil-and-paper exercise, or as a programming assignment. If you successfully complete this homework (both parts) as a programming assignment, it will be worth an extra 25 bonus points.

A transportation pod has appeared and you get inside. There are three buttons on the control panel: scarlet, gray, and black. If you press one of these buttons it will take you to a place on campus. Some places have a reward: +1 if it's a good place, -1 if it's a bad place. Obviously you want to learn to get to the good places and avoid the bad ones. Any other state has a reward of -.04 (i.e. the agent would like to get to a positive reward rather than hanging around in the environment. Your job is to get an agent to figure out what's the right button to press at every place that the machine goes to (in this case, 10 places). In other words, you want to find the optimal policy from all states.

You should use the Bellman update equation (17.6) to perform value iteration, i.e.:

U(s,i+1) <-- R(s) + gamma * maxa sums' T(s,a,s')U(s',i)

where U(s,i) is the utility of state s at iteration i, R(s) is the reward in state s, T is the transition probability from s to s' given action a. Assume that gamma=1 for these problems.

Problem 1

Problem 1 is a world with deterministic actions, meaning that if you do action a in state s, it will lead to some state s' with probability 1 and for all other states, probability 0. Thus T(s,a,s') = 1 for that state s'.

Use the following algorithm (based on value iteration) to compute the utilities by exploring the space. You will need to keep track of the utilities of the states you encounter.

You should record each state you are in, the "best action" and the updates you make to the utility. For example, if you are in state 1 (which has reward -.04) and the "Blue" action takes you to state 2 (which has utility -.02) and the "Red" action takes you to state 3 (which has utility -.05):

State: 1  Best: Blue  Utility(1)= -.04 + (1*-.02) = -.06 

You may choose to do this via a computer program, in which case you should output this information as well as turn in the code. Here are the descriptions for each state and the transition table for the states. The description file has the following format:
statenumber,place name,image file,reward,comment
You only care aboue the statenumber and reward (possibly the place name as well). The transitions file has the format:
transition_from_state,action,transition_to_state1,prob1,transition_to_state2,prob2,...
For this first problem, the actions are deterministic, so prob1 should be 1.0 and there won't be other states that are transitioned to.

The result you compute should have for each state the utility and policy for that state:

Results
-------
State 1: Utility=0.2 Policy=Blue
State 2: Utility=0.3 Policy=Red
...

(This example may or may not be related to the actual best policy; it's just for demonstation.)

Example

We've created a related example to show you what the output should look like. Note that the reward structure is slightly different here. (You may find it easier to have two browser windows open at once to look at what's going on.)

Problem 2

Problem 2 uses the same state space, but now when you choose an action, it is non-deterministic. The probabilities of reaching each state are shown by the action link. The probabilities do not change over time.

Now, when you do the Bellman updates, there may be more than one state in the summation. The transition probabilities T(s,a,s') are given by the actions on the web page.

Again, if you want to do this in a computer program, the transition tables for problem 2 are here.

Write up your answers in the same way: record all of the states you visit and the utility updates, and then the final utilities/policy.