We apply the Bellman Expectation Equation, selecting the utility obtained by taking the best action under the current utility estimate:
\begin{equation} U_{k+1}(s) = \max_{a} \qty(R(s,a) + \gamma \sum_{s'} T(s' | s,a) U_{k}(s')) \end{equation}
This iterative process is called the Bellman backup, or Bellman update.
\begin{equation} U_1, \dots, U_{k}, \dots \to U^{*} \end{equation}
That is, the sequence of estimates eventually converges to the optimal value function. Afterwards, we simply extract the greedy policy from the utility to obtain a policy to use.
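Here is a minimal sketch of this backup and of greedy-policy extraction in Python, assuming the MDP is given as arrays `R[s, a]` and `T[s, a, s']` (these names and layouts are assumptions, not from the source):

```python
import numpy as np

def bellman_backup(U, R, T, gamma):
    """One synchronous Bellman backup over all states.

    U: (S,) current utility estimate
    R: (S, A) reward for taking action a in state s
    T: (S, A, S) transition probabilities T[s, a, s']
    """
    # Q[s, a] = R(s, a) + gamma * sum_{s'} T(s'|s, a) * U(s')
    Q = R + gamma * np.einsum("sap,p->sa", T, U)
    return Q.max(axis=1)          # U_{k+1}(s) = max_a Q[s, a]

def greedy_policy(U, R, T, gamma):
    """Extract the greedy policy from a utility estimate."""
    Q = R + gamma * np.einsum("sap,p->sa", T, U)
    return Q.argmax(axis=1)       # pi(s) = argmax_a Q[s, a]
```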
We stop when the Bellman Residual drops below the desired error threshold:
Bellman Residual
Take the \(L_{\infty}\) norm of \(U_{k+1}-U_{k}\), that is, \(\|U_{k+1} - U_{k}\|_{\infty}\). We call that the Bellman Residual. If this Bellman Residual drops below \(\delta\), it can be shown that the error between \(U^{*}\) (the converged utility) and \(U_{k}\) is at most:
\begin{equation} \epsilon = \frac{\delta \gamma}{(1-\gamma)} \end{equation}
So as long as the Bellman Residual between two successive updates is at most \(\delta\), you know that you are at most \(\epsilon\) away from the optimal utility.
You will note that as the discount \(\gamma \to 1\), this error bound becomes much larger, so you need more iterations to reach the same \(\epsilon\).
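A sketch of the full loop with this residual-based stopping rule, reusing `numpy` and the hypothetical `bellman_backup` from the sketch above; `delta` is the residual threshold:

```python
def value_iteration(R, T, gamma, delta=1e-6):
    """Iterate Bellman backups until the Bellman residual drops below delta."""
    U = np.zeros(R.shape[0])
    while True:
        U_next = bellman_backup(U, R, T, gamma)
        residual = np.max(np.abs(U_next - U))   # ||U_{k+1} - U_k||_inf
        U = U_next
        if residual <= delta:
            break
    # guaranteed: ||U - U*||_inf <= delta * gamma / (1 - gamma)
    return U
```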
Notably, if we extract a greedy policy \(\pi\) from a utility that is within \(\epsilon\) of optimal, the loss of that policy is bounded:
\begin{equation} || U^{\pi} - U^{*} || < \frac{2\epsilon \gamma}{1-\gamma} \end{equation}
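For instance, with illustrative numbers \(\gamma = 0.9\) and \(\epsilon = 0.1\) (not from the source), the bound works out to:
\begin{equation} \|U^{\pi} - U^{*}\| < \frac{2 \times 0.1 \times 0.9}{1-0.9} = 1.8 \end{equation}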
asynchronous value iteration
We choose an ordering of states. We then loop through the entire list, updating the value function, and repeat these sweeps until the values converge.
That is, instead of storing the whole sequence of estimates \(U_{k}\), we keep only the current one in memory and update it in place:
\begin{equation} U(s) \leftarrow \max_{a} \qty(R(s,a) + \gamma \sum_{s'} T(s' | s,a) U(s')) \end{equation}
The idea is that instead of keeping all of \(U_{k-1}\) around until \(U_{k}\) has been computed for every state, we pick an ordering of the states and always use whatever value was calculated last.
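A sketch of this in-place (asynchronous) variant, again assuming `R[s, a]` and `T[s, a, s']` arrays; the state ordering here is simply the natural index order:

```python
import numpy as np

def async_value_iteration(R, T, gamma, delta=1e-6):
    """In-place value iteration: each update immediately uses the latest values."""
    S = R.shape[0]
    U = np.zeros(S)
    while True:
        residual = 0.0
        for s in range(S):                      # a fixed ordering of states
            old = U[s]
            # uses whatever is currently in U, including values updated this sweep
            U[s] = np.max(R[s] + gamma * T[s] @ U)
            residual = max(residual, abs(U[s] - old))
        if residual <= delta:
            return U
```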
time complexity
\begin{equation} O(S^{2}A) \end{equation}
per iteration, where \(S\) is the number of states and \(A\) is the number of actions. Each update must:
- loop over all states in each update
- loop over all actions to compute the max
- loop over all next states to compute the expected utility
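Concretely, one sweep multiplies these three loops together:
\begin{equation} \underbrace{S}_{\text{states}} \times \underbrace{A}_{\text{actions}} \times \underbrace{S}_{\text{next states}} \implies O(S^{2}A) \end{equation}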
POMDP value-iteration
- compute alpha vectors for all one-step plans (i.e. conditional plans that take just one action and then stop)
- use alpha vector pruning to remove any plans that are dominated (sketched below)
- generate all possible two-step conditional plans over all actions, using combinations of the non-pruned one-step plans above as subplans (yes, you can use a one-step plan twice)
- repeat steps 2-3
see also performing value-iteration naively with one-step lookahead in POMDP.
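A rough sketch of steps 1-2 under the simplest pruning rule, pointwise dominance; the array name `R[s, a]` is an assumption, and a real implementation would use an LP-based dominance test over the belief simplex instead:

```python
import numpy as np

def one_step_alpha_vectors(R):
    """Step 1: one-step conditional plans -- act once, then stop.

    Their alpha vectors are just the immediate rewards: alpha_a(s) = R(s, a).
    """
    S, A = R.shape
    return [R[:, a].copy() for a in range(A)]

def prune_dominated(alphas):
    """Step 2 (simplified): drop alpha vectors pointwise-dominated by another.

    Proper alpha vector pruning checks dominance over the whole belief simplex
    with a linear program; pointwise dominance is a conservative approximation.
    """
    kept = []
    for i, a in enumerate(alphas):
        dominated = any(
            j != i and np.all(alphas[j] >= a) and np.any(alphas[j] > a)
            for j in range(len(alphas))
        )
        if not dominated:
            kept.append(a)
    return kept
```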
POMDP Bellman Update
Say you want to extract a policy out of a bunch of alpha vectors.
Let \(\Gamma\) be a set of alpha vectors \(\alpha\); we obtain a new alpha vector \(U'(b) = [U(s_0) \dots U(s_{n})]\) by:
\begin{equation} U'(b) = \max_{a}\qty[R(b,a)+\gamma \qty(\sum_{o}^{}P(o|b,a)\, U^{\Gamma}(b'))] \end{equation}
where \(b'\) is the belief obtained by updating \(b\) after taking action \(a\) and observing \(o\), and where:
\begin{equation} R(b,a) = \sum_{s}^{} R(s,a)b(s) \end{equation}
\begin{align} P(o|b,a) &= \sum_{s}^{} P(o|s,a) b(s) \\ &= \sum_{s}^{} \sum_{s'}^{} T(s'|s,a) O(o|s',a) b(s) \end{align}
and
\begin{equation} U^{\Gamma}(b) = \max_{\alpha \in \Gamma} \alpha^{\top} b \end{equation}
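A minimal sketch of evaluating \(U^{\Gamma}(b)\) and choosing an action by this one-step lookahead; the array layouts `R[s, a]`, `T[s, a, s']`, `O[s', a, o]` and the helper names are assumptions, not from the source:

```python
import numpy as np

def utility_of_belief(Gamma, b):
    """U^Gamma(b) = max over alpha of alpha^T b."""
    return max(alpha @ b for alpha in Gamma)

def belief_update(b, a, o, T, O):
    """Posterior belief after taking action a and observing o."""
    # b'(s') proportional to O(o|s', a) * sum_s T(s'|s, a) b(s)
    b_next = O[:, a, o] * (b @ T[:, a, :])
    return b_next / b_next.sum()

def lookahead_action(Gamma, b, R, T, O, gamma):
    """One-step lookahead: pick the action maximizing the POMDP Bellman update."""
    S, A, _ = T.shape
    n_obs = O.shape[2]
    best_a, best_val = None, -np.inf
    for a in range(A):
        val = b @ R[:, a]                                 # R(b, a)
        for o in range(n_obs):
            # P(o|b, a) = sum_s sum_s' T(s'|s,a) O(o|s',a) b(s)
            p_o = b @ T[:, a, :] @ O[:, a, o]
            if p_o > 0:
                val += gamma * p_o * utility_of_belief(
                    Gamma, belief_update(b, a, o, T, O))
        if val > best_val:
            best_a, best_val = a, val
    return best_a
```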