Big picture: combining offline and online approaches may be the best way to tackle large POMDPs.
Try planning:
- only from where we are (the current belief)
- only over beliefs we can actually reach from it
Take into account three factors:
- uncertainty in the value function
- reachability from the current belief
- actions that are likely optimal
This allows online policy improvement over any base policy.
Setup
Discrete POMDPs, with:
- \(L\), a lower bound on the value function
- \(U\), an upper bound on the value function
- \(b_0\), the current belief
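A minimal sketch (not from the paper; all names are illustrative assumptions) of the bookkeeping this setup implies: each node of the search tree stores its belief, the bounds \(L\) and \(U\), and the tree fields used later for reachability and backups.

```python
from dataclasses import dataclass, field


@dataclass
class BeliefNode:
    belief: dict            # belief state: maps POMDP state -> probability
    L: float                # lower bound on the value at this belief (e.g. from a base policy)
    U: float                # upper bound on the value at this belief (e.g. from an MDP relaxation)
    depth: int = 0          # depth d(b) in the tree rooted at b_0
    prob: float = 1.0       # P(b): probability of reaching this node from b_0
    parent: "BeliefNode | None" = None
    children: dict = field(default_factory=dict)   # (action, observation) -> BeliefNode

    def gap(self) -> float:
        """Uncertainty epsilon(b) = U(b) - L(b)."""
        return self.U - self.L
```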
The algorithm: two main phases
Planning Phase
- at each step, choose the next fringe node to expand (scored with the scheme below)
- expand the chosen node
- propagate the updated bounds back up the tree via POMDP Bellman backups (this loop is sketched below)
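A rough sketch of that planning loop, assuming the `BeliefNode` structure above and hypothetical `expand`, `score`, and `bellman_backup` helpers standing in for the POMDP-specific machinery:

```python
import time


def plan(root, expand, score, bellman_backup, time_budget: float) -> None:
    """Anytime planning: improve the bounds at `root` until time runs out."""
    fringe = [root]                       # unexpanded leaves of the search tree
    deadline = time.monotonic() + time_budget
    while fringe and time.monotonic() < deadline:
        # 1. choose the fringe node with the highest score E(b)
        best = max(fringe, key=score)
        fringe.remove(best)
        # 2. expand it: create one child per (action, observation) pair
        children = expand(best)
        fringe.extend(children)
        # 3. propagate the tightened bounds up the tree via Bellman backups
        node = best
        while node is not None:
            bellman_backup(node)          # recompute L(node), U(node) from its children
            node = node.parent
```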
Best Node Selection
We select the node to expand using three metrics:
- Uncertainty: \(\epsilon(b) = U(b)-L(b)\), the gap between the upper and lower bounds; planning aims to shrink this gap, so nodes with a large gap are the most valuable to expand
- Optimality in actions, \(p(a|b)\), with \(a^{*} = \arg\max_{a} U(a,b)\):
  - AEMS1: \(p(a|b) = \frac{U(a,b)-L(b)}{U(a^{*}, b)-L(b)}\) (“what’s the relative optimality of our action, compared to the best action”)
  - AEMS2: \(p(a|b) = 1\) if \(a=a^{*}\), \(0\) otherwise (“just take the best action”)
- Reachability: \(P(b) = \prod_{i=0}^{d-1} P(o^{(i)} | b^{(i)}, a^{(i)})\, p(a^{(i)}|b^{(i)})\), the probability of reaching a node \(b\) at depth \(d\) from \(b_0\), where lowercase \(p\) is AEMS1 or AEMS2 above and the actions \(a^{(i)}\) come from the best conditional plan found so far
Combining the metrics gives the score of a fringe node \(b\) at depth \(d(b)\):
\begin{equation} E(b) = \gamma^{d(b)} P(b) \epsilon(b) \end{equation}
and at each planning step we expand the fringe node with the largest \(E(b)\).
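A sketch of how this score could be computed, assuming the `BeliefNode` fields above and a discount factor `gamma`; `action_upper_bounds` is a hypothetical map from actions to \(U(a,b)\) at the parent node:

```python
def action_weight_aems2(parent, action_upper_bounds: dict, action) -> float:
    """AEMS2: put all weight on the action with the highest upper bound."""
    # `parent` is unused here; kept only for a uniform interface with AEMS1.
    best_action = max(action_upper_bounds, key=action_upper_bounds.get)
    return 1.0 if action == best_action else 0.0


def action_weight_aems1(parent, action_upper_bounds: dict, action) -> float:
    """AEMS1: weight an action by how close its upper bound is to the best one."""
    best_u = max(action_upper_bounds.values())          # U(a*, b)
    denom = best_u - parent.L
    return (action_upper_bounds[action] - parent.L) / denom if denom > 0 else 0.0


def score(node, gamma: float) -> float:
    """E(b) = gamma^d(b) * P(b) * (U(b) - L(b))."""
    return (gamma ** node.depth) * node.prob * (node.U - node.L)
```

During expansion, a child's \(P(b)\) would be accumulated along the path as `parent.prob * P(o|b,a) * p(a|b)`, using whichever action weight (AEMS1 or AEMS2) was chosen.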
Execution
- execute the best action \(a\) at \(b_0\)
- perceive the new observation \(o\)
- \(b_0 \leftarrow update(b_0,a,o)\)
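Putting the two phases together, a sketch of the overall online loop; `environment`, `build_root`, `best_action`, and `belief_update` are hypothetical stand-ins for the simulator interface, tree construction, the greedy action at the root, and the standard Bayes-filter belief update:

```python
def run_online(b0, environment, build_root, plan, best_action, belief_update,
               time_budget: float, num_steps: int):
    """Alternate planning and execution from the current belief."""
    belief = b0
    for _ in range(num_steps):
        root = build_root(belief)                 # fresh search tree rooted at the current belief
        plan(root, time_budget)                   # planning phase (see the sketch above)
        action = best_action(root)                # execute the best action at b_0
        observation = environment.step(action)    # perceive a new observation
        belief = belief_update(belief, action, observation)   # b_0 <- update(b_0, a, o)
    return belief
```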