CSE630 Homework 6: Reinforcement learning

Due: Must be submitted by Saturday Dec 01 at 04:00(a.m.)
1 point late penalty for each minute late

Submit instructions:

Write up all of your answers in a text editor so that it can be submitted electronically. Put all of the files you want to submit in a directory on STDSUN, and submit it using the command:

> submit c630aa lab6 (lab6_dir)

Where lab6_dir is the directory containing the files you want to submit.

General problem description

In this homework, you have two options: you can treat this as a pencil-and-paper exercise, or as a programming assignment. If you complete the programming assignment, it will be worth an extra 20 bonus points.

We have set up a little world for you to explore, treasure-hunt style. The idea is to learn a sequence of actions that will maximize your chance of getting a reward, given that you will start from a random state.

The world is laid out as a number of states. Actions will allow you to transition from one state to another. Each state has a reward: if you find a prize, the reward is +1. If you find something bad, the reward is -1. Both of these states are terminal (i.e., there's no way out). Any other state has a reward of -.04 (i.e. the agent would like to get to a positive reward rather than hanging around in the environment.

Your mission is to develop a policy that will maximize the chance of the agent getting the reward.

You should use the Bellman update equation (17.6) to perform value iteration, i.e.:

U(s,i+1) <-- R(s) + gamma * maxa sums' T(s,a,s')U(s',i)

where U(s,i) is the utility of state s at iteration i, R(s) is the reward in state s, T is the transition probability from s to s' given action a. Assume that gamma=1 for these problems.

Problem 1

Problem 1 is a world with deterministic actions, meaning that if you do action a in state s, it will lead to some state s' with probability 1 and for all other states, probability 0. Thus T(s,a,s') = 1 for that state s'.

Use the following algorithm (based on value iteration) to compute the utilities by exploring the space. You will need to keep track of the utilities of the states you encounter.

You should record each state you are in, the "best action" and the updates you make to the utility. For example, if you are in state 1 (which has reward -.04) and the "Blue" action takes you to state 2 (which has utility -.02) and the "Red" action takes you to state 3 (which has utility -.05):

State: 1  Best: Blue  Utility(1)= -.04 + (1*-.02) = -.06 

You may choose to do this via a computer program, in which case you should output this information as well as turn in the code. Here is the transition table for the states, where each line has the format "state;Reward;#Actions;Action1,#transitions,State1a,Prob1a,State1b,Prob1b;Action2,#transitions,State2a,Prob2a,State2b,Prob2b,...".

The result you compute should have for each state the utility and policy for that state:

Results
-------
State 1: Utility=0.2 Policy=Blue
State 2: Utility=0.3 Policy=Red
...

(This example may or may not be related to the actual best policy; it's just for demonstation.)

Example

We've created a related example to show you what the output should look like. Note that the reward structure is slightly different here. (You may find it easier to have two browser windows open at once to look at what's going on.)

Problem 2

Problem 2 uses the same type of state space (although a different actual space), but now when you choose an action, it is non-deterministic. The probabilities of reaching each state are shown by the action link. The probabilities do not change over time.

Now, when you do the Bellman updates, there may be more than one state in the summation. The transition probabilities T(s,a,s') are given by the actions on the web page.

Again, if you want to do this in a computer program, the transition tables for problem 2 are here.

Write up your answers in the same way: record all of the states you visit and the utility updates, and then the final utilities/policy.