Big picture: combining offline and online approaches may be the best way to tackle large POMDPs.
Try planning:
- only from where we are (the current belief)
- only over beliefs we can actually reach from it
Take into account three factors:
- uncertainty in the value function
- reachability from the current belief
- actions that are likely optimal
This allows online policy improvement over any base policy.
Setup
Discrete POMDPs, with:
- \(L\), a lower bound on the value function
- \(U\), an upper bound on the value function
- \(b_0\), the current belief
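A minimal sketch (not from the paper; all names are illustrative assumptions) of the bookkeeping this setup implies: each node of the search tree stores its belief, the bounds \(L\) and \(U\), and the tree fields used later for reachability and backups.

```python
from dataclasses import dataclass, field


@dataclass
class BeliefNode:
    belief: dict            # belief state: maps POMDP state -> probability
    L: float                # lower bound on the value at this belief (e.g. from a base policy)
    U: float                # upper bound on the value at this belief (e.g. from an MDP relaxation)
    depth: int = 0          # depth d(b) in the tree rooted at b_0
    prob: float = 1.0       # P(b): probability of reaching this node from b_0
    parent: "BeliefNode | None" = None
    children: dict = field(default_factory=dict)   # (action, observation) -> BeliefNode

    def gap(self) -> float:
        """Uncertainty epsilon(b) = U(b) - L(b)."""
        return self.U - self.L
```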
The algorithm: two main phases
Planning Phase
- at each step, choose the next fringe node to expand (scored with the scheme below)
- expand the chosen node
- propagate the updated bounds back up the tree via POMDP Bellman backups (this loop is sketched below)
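A rough sketch of that planning loop, assuming the `BeliefNode` structure above and hypothetical `expand`, `score`, and `bellman_backup` helpers standing in for the POMDP-specific machinery:

```python
import time


def plan(root, expand, score, bellman_backup, time_budget: float) -> None:
    """Anytime planning: improve the bounds at `root` until time runs out."""
    fringe = [root]                       # unexpanded leaves of the search tree
    deadline = time.monotonic() + time_budget
    while fringe and time.monotonic() < deadline:
        # 1. choose the fringe node with the highest score E(b)
        best = max(fringe, key=score)
        fringe.remove(best)
        # 2. expand it: create one child per (action, observation) pair
        children = expand(best)
        fringe.extend(children)
        # 3. propagate the tightened bounds up the tree via Bellman backups
        node = best
        while node is not None:
            bellman_backup(node)          # recompute L(node), U(node) from its children
            node = node.parent
```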
Best Node Selection
We select the node to expand using three metrics:
- Uncertainty: \(\epsilon(b) = U(b)-L(b)\), the gap between the upper and lower bounds; planning aims to shrink this gap, so nodes with a large gap are the most valuable to expand
- Optimality in actions, \(p(a|b)\), with \(a^{*} = \arg\max_{a} U(a,b)\):
  - AEMS1: \(p(a|b) = \frac{U(a,b)-L(b)}{U(a^{*}, b)-L(b)}\) (“what’s the relative optimality of our action, compared to the best action”)
  - AEMS2: \(p(a|b) = 1\) if \(a=a^{*}\), \(0\) otherwise (“just take the best action”)
- Reachability: \(P(b) = \prod_{i=0}^{d-1} P(o^{(i)} | b^{(i)}, a^{(i)})\, p(a^{(i)}|b^{(i)})\), the probability of reaching a node \(b\) at depth \(d\) from \(b_0\), where lowercase \(p\) is AEMS1 or AEMS2 above and the actions \(a^{(i)}\) come from the best conditional plan found so far
Combining the metrics gives the score of a fringe node \(b\) at depth \(d(b)\):
\begin{equation} E(b) = \gamma^{d(b)} P(b) \epsilon(b) \end{equation}
and at each planning step we expand the fringe node with the largest \(E(b)\).
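A sketch of how this score could be computed, assuming the `BeliefNode` fields above and a discount factor `gamma`; `action_upper_bounds` is a hypothetical map from actions to \(U(a,b)\) at the parent node:

```python
def action_weight_aems2(parent, action_upper_bounds: dict, action) -> float:
    """AEMS2: put all weight on the action with the highest upper bound."""
    # `parent` is unused here; kept only for a uniform interface with AEMS1.
    best_action = max(action_upper_bounds, key=action_upper_bounds.get)
    return 1.0 if action == best_action else 0.0


def action_weight_aems1(parent, action_upper_bounds: dict, action) -> float:
    """AEMS1: weight an action by how close its upper bound is to the best one."""
    best_u = max(action_upper_bounds.values())          # U(a*, b)
    denom = best_u - parent.L
    return (action_upper_bounds[action] - parent.L) / denom if denom > 0 else 0.0


def score(node, gamma: float) -> float:
    """E(b) = gamma^d(b) * P(b) * (U(b) - L(b))."""
    return (gamma ** node.depth) * node.prob * (node.U - node.L)
```

During expansion, a child's \(P(b)\) would be accumulated along the path as `parent.prob * P(o|b,a) * p(a|b)`, using whichever action weight (AEMS1 or AEMS2) was chosen.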
Execution
- execute the best action \(a\) at \(b_0\)
- perceive the new observation \(o\)
- \(b_0 \leftarrow update(b_0,a,o)\)
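Putting the two phases together, a sketch of the overall online loop; `environment`, `build_root`, `best_action`, and `belief_update` are hypothetical stand-ins for the simulator interface, tree construction, the greedy action at the root, and the standard Bayes-filter belief update:

```python
def run_online(b0, environment, build_root, plan, best_action, belief_update,
               time_budget: float, num_steps: int):
    """Alternate planning and execution from the current belief."""
    belief = b0
    for _ in range(num_steps):
        root = build_root(belief)                 # fresh search tree rooted at the current belief
        plan(root, time_budget)                   # planning phase (see the sketch above)
        action = best_action(root)                # execute the best action at b_0
        observation = environment.step(action)    # perceive a new observation
        belief = belief_update(belief, action, observation)   # b_0 <- update(b_0, a, o)
    return belief
```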