> submit c630aa lab6 (lab6_dir)
Where lab6_dir is the directory containing the files you want to submit.
In this homework, you have two options: you can treat this as a pencil-and-paper exercise, or as a programming assignment. If you successfully complete this homework (both parts) as a programming assignment, it will be worth an extra 25 bonus points.
A transportation pod has appeared and you get inside. There are three buttons on the control panel: scarlet, gray, and black. If you press one of these buttons it will take you to a place on campus. Some places have a reward: +1 if it's a good place, -1 if it's a bad place. Obviously you want to learn to get to the good places and avoid the bad ones. Any other state has a reward of -.04 (i.e. the agent would like to get to a positive reward rather than hanging around in the environment. Your job is to get an agent to figure out what's the right button to press at every place that the machine goes to (in this case, 10 places). In other words, you want to find the optimal policy from all states.
You should use the Bellman update equation (17.6) to perform value iteration, i.e.:
U(s,i+1) <-- R(s) + gamma * maxa sums' T(s,a,s')U(s',i)
where U(s,i) is the utility of state s at iteration i, R(s) is the reward in state s, T is the transition probability from s to s' given action a. Assume that gamma=1 for these problems.
Use the following algorithm (based on value iteration) to compute the utilities by exploring the space. You will need to keep track of the utilities of the states you encounter.
State: 1 Best: Blue Utility(1)= -.04 + (1*-.02) = -.06You may choose to do this via a computer program, in which case you should output this information as well as turn in the code. Here are the descriptions for each state and the transition table for the states. The description file has the following format:
statenumber,place name,image file,reward,commentYou only care aboue the statenumber and reward (possibly the place name as well). The transitions file has the format:
transition_from_state,action,transition_to_state1,prob1,transition_to_state2,prob2,...For this first problem, the actions are deterministic, so prob1 should be 1.0 and there won't be other states that are transitioned to.
The result you compute should have for each state the utility and policy for that state:
Results ------- State 1: Utility=0.2 Policy=Blue State 2: Utility=0.3 Policy=Red ...
(This example may or may not be related to the actual best policy; it's just for demonstation.)
Now, when you do the Bellman updates, there may be more than one state in the summation. The transition probabilities T(s,a,s') are given by the actions on the web page.
Again, if you want to do this in a computer program, the transition tables for problem 2 are here.
Write up your answers in the same way: record all of the states you visit and the utility updates, and then the final utilities/policy.