> submit c630aa lab6 (lab6_dir)
Where lab6_dir is the directory containing the files you want to submit.
In this homework, you have two options: you can treat this as a pencil-and-paper exercise, or as a programming assignment. If you complete the programming assignment, it will be worth an extra 20 bonus points.
We have set up a little world for you to explore, treasure-hunt style. The idea is to learn a sequence of actions that will maximize your chance of getting a reward, given that you will start from a random state.
The world is laid out as a number of states. Actions will allow you to transition from one state to another. Each state has a reward: if you find a prize, the reward is +1. If you find something bad, the reward is -1. Both of these states are terminal (i.e., there's no way out). Any other state has a reward of -.04 (i.e. the agent would like to get to a positive reward rather than hanging around in the environment.
Your mission is to develop a policy that will maximize the chance of the agent getting the reward.
You should use the Bellman update equation (17.6) to perform value iteration, i.e.:
U(s,i+1) <-- R(s) + gamma * maxa sums' T(s,a,s')U(s',i)
where U(s,i) is the utility of state s at iteration i, R(s) is the reward in state s, T is the transition probability from s to s' given action a. Assume that gamma=1 for these problems.
Use the following algorithm (based on value iteration) to compute the utilities by exploring the space. You will need to keep track of the utilities of the states you encounter.
State: 1 Best: Blue Utility(1)= -.04 + (1*-.02) = -.06You may choose to do this via a computer program, in which case you should output this information as well as turn in the code. Here is the transition table for the states, where each line has the format "state;Reward;#Actions;Action1,#transitions,State1a,Prob1a,State1b,Prob1b;Action2,#transitions,State2a,Prob2a,State2b,Prob2b,...".
The result you compute should have for each state the utility and policy for that state:
Results ------- State 1: Utility=0.2 Policy=Blue State 2: Utility=0.3 Policy=Red ...
(This example may or may not be related to the actual best policy; it's just for demonstation.)
Now, when you do the Bellman updates, there may be more than one state in the summation. The transition probabilities T(s,a,s') are given by the actions on the web page.
Again, if you want to do this in a computer program, the transition tables for problem 2 are here.
Write up your answers in the same way: record all of the states you visit and the utility updates, and then the final utilities/policy.