
The value iteration method
In the simplistic example we just saw, to calculate the values of states and actions, we exploited the structure of the environment: there were no loops in the transitions, so we could start from the terminal states, calculate their values, and then proceed to the central state. However, just one loop in the environment makes this approach break down. Let's consider such an environment with two states:

Figure 7: A sample environment with a loop in the transition diagram
We start from state s1, and the only action we can take leads us to state s2. We get the reward r=1, and the only transition from s2 is an action that brings us back to s1, for which we get the reward r=2. So, the life of our agent is an infinite sequence of states [s1, s2, s1, s2, ...]. To deal with this infinite loop, we can use a discount factor γ = 0.9. Now, the question is, what are the values of both states?
The answer is not very complicated, though. Every transition from s1 to s2 gives us a reward of 1 and every back transition from s2 to s1 gives us 2. So, our sequence of rewards will be [1, 2, 1, 2, 1, 2, 1, 2, ...]. As there is only one action available in every state, our agent has no choice, so we can omit the max operation in the formulas (there is only one alternative). The value of every state will be equal to the infinite sum:

V(s_1) = 1 + \gamma (2 + \gamma (1 + \gamma (2 + \ldots))) = \sum_{i=0}^{\infty} (\gamma^{2i} + 2\gamma^{2i+1})

V(s_2) = 2 + \gamma (1 + \gamma (2 + \gamma (1 + \ldots))) = \sum_{i=0}^{\infty} (2\gamma^{2i} + \gamma^{2i+1})
Strictly speaking, we cannot calculate the exact values for our states, but with γ = 0.9, the contribution of every transition quickly decreases over time. For example, after 10 steps, γ^10 = 0.9^10 ≈ 0.35, but after 100 steps it becomes just 0.0000266. Due to this, we can stop after 50 iterations and still get quite a precise estimation:
>>> sum([0.9**(2*i) + 2*(0.9**(2*i+1)) for i in range(50)])
14.736450674121663
>>> sum([2*(0.9**(2*i)) + 0.9**(2*i+1) for i in range(50)])
15.262752483911719
The preceding example gives the gist of a more general procedure, called the value iteration algorithm, which allows us to numerically calculate the values of states and the values of actions of MDPs with known transition probabilities and rewards. The procedure (for values of states) includes the following steps:
- Initialize the values of all states, V_s, to some initial value (usually zero)
- For every state s in the MDP, perform the Bellman update: V_s \leftarrow \max_a \sum_{s'} p_{a,s \to s'} (r_{s,a,s'} + \gamma V_{s'})
- Repeat step 2 for some large number of steps or until changes become too small (see the sketch after this list)
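To make the Bellman update loop concrete, here is a minimal sketch of value iteration applied to the two-state loop environment above. The transition table, the 100-iteration cutoff, and the variable names are illustrative choices for this sketch, not code from this chapter:

GAMMA = 0.9

# States 0 and 1 stand for s1 and s2; each state has a single deterministic action.
# transitions[s] = (next_state, reward)
transitions = {0: (1, 1.0), 1: (0, 2.0)}

values = {0: 0.0, 1: 0.0}          # step 1: initialize all state values to zero
for _ in range(100):               # step 3: repeat the update many times
    new_values = {}
    for s, (next_s, reward) in transitions.items():
        # step 2: Bellman update; the max over actions is trivial here,
        # since every state has exactly one action
        new_values[s] = reward + GAMMA * values[next_s]
    values = new_values

print(values)                      # converges to roughly 14.74 and 15.26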
In the case of action values (that is, Q), only minor modifications to the preceding procedure are required:
- Initialize all Q_{s,a} to zero
- For every state s and every action a in this state, perform the update: Q_{s,a} \leftarrow \sum_{s'} p_{a,s \to s'} (r_{s,a,s'} + \gamma \max_{a'} Q_{s',a'})
- Repeat step 2 (see the sketch after this list)
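The same loop can be written for Q-values; the only change is that we keep one value per (state, action) pair and take the max over the next state's actions inside the update. A minimal sketch, assuming a hypothetical table probs[(s, a)] that lists (probability, next_state, reward) outcomes:

GAMMA = 0.9

def q_iteration_step(q_values, probs, actions):
    """One sweep of the Q-value update over all (state, action) pairs."""
    new_q = {}
    for (s, a), outcomes in probs.items():
        value = 0.0
        for prob, next_s, reward in outcomes:
            # value of the best action in the next state (0.0 if unseen)
            best_next = max(q_values.get((next_s, a2), 0.0) for a2 in actions)
            value += prob * (reward + GAMMA * best_next)
        new_q[(s, a)] = value
    return new_q

# The two-state loop environment again, with a single action labelled 0:
probs = {(0, 0): [(1.0, 1, 1.0)],   # from s1, the action leads to s2 with reward 1
         (1, 0): [(1.0, 0, 2.0)]}   # from s2, back to s1 with reward 2
q_values = {key: 0.0 for key in probs}
for _ in range(100):
    q_values = q_iteration_step(q_values, probs, actions=[0])
print(q_values)                     # converges to roughly 14.74 and 15.26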
Okay, so that's the theory. What about practice? In practice, this method has several obvious limitations. First of all, our state space needs to be discrete and small enough to perform multiple iterations over all states. This is not an issue for FrozenLake-4x4, or even for FrozenLake-8x8 (which exists in Gym as a more challenging version), but for CartPole it's not totally clear what to do. Our observation for CartPole is four float values, which represent some physical characteristics of the system. Even a small difference in those values can have an influence on the state's value. A potential solution is to discretize our observation's values; for example, we can split the observation space of CartPole into bins and treat every bin as an individual discrete state in the space. However, this creates lots of practical problems, such as how large the bin intervals should be and how much data from the environment we need to estimate our values. We will address this issue in subsequent chapters, when we get to the usage of neural networks in Q-learning.
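As an illustration of what such binning could look like, here is a small sketch using NumPy's digitize. The bin edges and the bin count below are arbitrary choices made for this example, not values suggested by the chapter:

import numpy as np

# Hypothetical bin edges for the four CartPole observation components:
# cart position, cart velocity, pole angle, pole angular velocity.
BINS = [
    np.linspace(-2.4, 2.4, 8),    # cart position
    np.linspace(-3.0, 3.0, 8),    # cart velocity (range chosen arbitrarily)
    np.linspace(-0.21, 0.21, 8),  # pole angle, in radians
    np.linspace(-3.0, 3.0, 8),    # pole angular velocity (range chosen arbitrarily)
]

def discretize(observation):
    """Map a 4-component float observation to a tuple of bin indices,
    which can then be used as a discrete state key."""
    return tuple(int(np.digitize(value, bins))
                 for value, bins in zip(observation, BINS))

# Example: discretize((0.01, -0.5, 0.02, 1.3)) returns the tuple (4, 3, 4, 6)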
The second practical problem arises from the fact that we rarely know the transition probabilities for the actions and the rewards matrix. Remember what interface Gym provides to the agent's writer: we observe the state, decide on an action, and only then do we get the next observation and the reward for the transition. We don't know (without peeking into Gym's environment code) what the probability is of getting into state s' from state s by issuing action a. What we do have is just the history of the agent's interaction with the environment. However, in the Bellman update, we need both a reward for every transition and the probability of this transition. So, the obvious answer to this issue is to use our agent's experience as an estimation for both unknowns. Rewards could be used as they are: we just need to remember what reward we got on the transition from s to s' using action a. To estimate probabilities, we need to maintain counters for every tuple (s, a, s') and normalize them.
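A minimal sketch of that bookkeeping, assuming hashable states and actions (the dictionary and function names here are illustrative):

import collections

# transition_counts[(s, a)] maps each observed next state s' to how often it occurred
transition_counts = collections.defaultdict(collections.Counter)
# observed_rewards[(s, a, s')] remembers the reward seen on that transition
observed_rewards = {}

def record_step(state, action, reward, next_state):
    """Store one (s, a, r, s') transition observed from the environment."""
    transition_counts[(state, action)][next_state] += 1
    observed_rewards[(state, action, next_state)] = reward

def estimate_probs(state, action):
    """Normalize the counters into estimated transition probabilities."""
    counts = transition_counts[(state, action)]
    total = sum(counts.values())
    return {next_s: count / total for next_s, count in counts.items()}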
Okay, now let's look at how the value iteration method will work for FrozenLake.