key idea: let’s build a tree such that, after taking the action, the observation is deterministic. Therefore, you get a belief tree with no branching on observations.
DESPOT trees
We make the assumption that the observation received is fixed given the belief and action. That is:
\begin{equation} O(o|b,a) = 1 \end{equation}
for some specific \(o\); every other observation has probability \(0\), for every \(b, a\).
Sample Scenarios
To make such a tree, let's sample a set of scenarios: sequences of actions and observations (because, given a belief and an action, we assume the observation is fixed; so, given an initial belief and an action, you will always go down a single "scenario").
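A minimal sketch of what a sampled scenario could look like, assuming a generative model that becomes deterministic once a pre-drawn random number is fixed. The names `Scenario`, `sample_scenarios`, and `toy_step` are illustrative, not from any reference DESPOT implementation:

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Scenario:
    start_state: int
    rands: List[float]          # one pre-drawn random number per tree depth

def sample_scenarios(belief: List[int], K: int, depth: int, rng: random.Random) -> List[Scenario]:
    """Sample K scenarios: a start state drawn from the belief plus `depth` random numbers."""
    return [
        Scenario(
            start_state=rng.choice(belief),
            rands=[rng.random() for _ in range(depth)],
        )
        for _ in range(K)
    ]

def toy_step(state: int, action: int, rand: float) -> Tuple[int, int, float]:
    """Toy stand-in for a generative model; deterministic once `rand` is fixed."""
    next_state = (state + action) % 3
    observation = next_state if rand < 0.8 else (next_state + 1) % 3
    reward = 1.0 if next_state == 0 else 0.0
    return next_state, observation, reward
```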
Build Tree
Sample a large number of scenarios starting from the initial belief; the tree is the union of the paths those scenarios trace out.
Use Tree
Starting from the node of your current (initial) belief, greedily choose the "best" action, i.e. the one with the highest estimated value.
Evaluate Tree
Take the average discounted future reward over the scenarios that correspond to your starting states.
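Putting the Build / Use / Evaluate steps together, here is a hedged sketch (reusing `Scenario`, `sample_scenarios`, and `toy_step` from the sketch above) that scores candidate plans by replaying them under every scenario, averaging the discounted returns, and greedily picking the first action of the best-scoring plan. Representing a plan as a fixed action sequence is a simplifying assumption; a real DESPOT plan is conditional on observations:

```python
GAMMA = 0.95

def discounted_return(actions, scenario):
    """Replay one scenario under a fixed action sequence; return its discounted reward."""
    state, total, discount = scenario.start_state, 0.0, 1.0
    for action, rand in zip(actions, scenario.rands):
        state, _obs, reward = toy_step(state, action, rand)
        total += discount * reward
        discount *= GAMMA
    return total

def evaluate(actions, scenarios):
    """The "Evaluate Tree" step: average discounted return over the scenarios."""
    return sum(discounted_return(actions, s) for s in scenarios) / len(scenarios)

def greedy_action(candidate_plans, scenarios):
    """The "Use Tree" step: first action of the best-scoring candidate plan."""
    best = max(candidate_plans, key=lambda plan: evaluate(plan, scenarios))
    return best[0]

# Example usage on the toy model above:
rng = random.Random(0)
scenarios = sample_scenarios(belief=[0, 1, 2], K=100, depth=5, rng=rng)
plans = [[a] * 5 for a in (0, 1, 2)]    # toy candidate plans
print(greedy_action(plans, scenarios))
```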
Drawbacks
Vanilla DESPOT is very easy to overfit: with only a finite set of sampled scenarios, the chosen policy can exploit quirks of those particular scenarios rather than the true observation distribution.
Regularized DESPOT
Build a DESPOT to depth \(D\) with \(K\) scenarios; then treat the resulting tree as a conditional plan and run bottom-up dynamic programming to optimize the plan.
Given a set of scenarios \(\Phi\), we maximize, over policies \(\pi\) derivable from the tree, the regularized objective:
\begin{equation} \frac{1}{|\Phi|}\sum_{\phi \in \Phi} V_{\pi,\phi}(b_{0}) - \lambda \lvert\pi\rvert \end{equation}
where \(V_{\pi,\phi}(b_{0})\) is the discounted return of \(\pi\) under scenario \(\phi\), \(\lvert\pi\rvert\) is the size of the policy tree, and \(\lambda\) is a regularization constant penalizing large plans.
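A simplified sketch of that bottom-up DP, assuming each node stores a default-policy lower bound, the scenario-averaged immediate reward per action, the per-observation scenario fractions, and its children. At every node the plan either stops (falling back to the default policy) or keeps the node and pays a penalty \(\lambda\); the exact weighting in the DESPOT paper differs in details, so treat this as a structural illustration only:

```python
from dataclasses import dataclass, field
from typing import Dict

GAMMA, LAMBDA = 0.95, 0.1

@dataclass
class PlanNode:
    default_value: float                                                       # lower bound from a default policy
    action_reward: Dict[int, float] = field(default_factory=dict)             # scenario-averaged immediate reward
    children: Dict[int, Dict[int, "PlanNode"]] = field(default_factory=dict)  # action -> observation -> child node
    obs_weight: Dict[int, Dict[int, float]] = field(default_factory=dict)     # action -> observation -> scenario fraction

def regularized_value(node: PlanNode) -> float:
    """Best regularized value at this node: keep expanding (paying LAMBDA per kept node) or fall back to the default policy."""
    best = node.default_value                   # option 1: stop here and run the default policy
    for a, reward in node.action_reward.items():
        value = reward
        for o, child in node.children.get(a, {}).items():
            value += GAMMA * node.obs_weight[a][o] * regularized_value(child)
        best = max(best, value - LAMBDA)        # option 2: keep this node in the plan, pay the penalty
    return best

# Example: a leaf node simply falls back to its default-policy value.
print(regularized_value(PlanNode(default_value=0.0)))   # -> 0.0
```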
Anytime DESPOT
We build up the DESPOT tree incrementally, maintaining upper and lower bounds on the value function, and try to expand along the scenarios that most reduce the gap between them.
First, pick initial upper and lower bounds. The note on HSVI may help.
- Build and optimize a Bounded DESPOT tree (see below) starting at \(b_{0}\)
- Compute the optimal policy using the Regularized DESPOT expression above
- Execute the best action
- Get an observation
- \(b \leftarrow update(b,a,o)\) (a sketch of this loop follows the list)
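A hedged sketch of this outer loop; every helper below (`plan_with_despot`, `execute`, `update_belief`) is a placeholder standing in for the real component described in the surrounding sections:

```python
import random

def plan_with_despot(belief, rng):
    """Placeholder: build/optimize a bounded DESPOT at `belief` and return the best action."""
    return rng.choice([0, 1, 2])

def execute(action, rng):
    """Placeholder for acting in the real environment and receiving an observation."""
    return rng.choice([0, 1, 2])

def update_belief(belief, action, observation):
    """Placeholder for the Bayes filter update(b, a, o)."""
    return belief

belief = [0, 1, 2]                      # toy initial belief (a set of states)
rng = random.Random(0)
for _ in range(10):                     # run the agent for a few steps
    a = plan_with_despot(belief, rng)   # build + optimize the tree, pick the best action
    o = execute(a, rng)                 # act, then observe
    belief = update_belief(belief, a, o)
```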
Building Bounded DESPOT
- sample a set of scenarios \(\Phi\) at \(b_0\)
- insert \(b_0\) into the tree as the root node
- let \(b \leftarrow b_0\), and, as time permits:
- tighten bounds on \(b\)
- back the upper and lower bounds you found up the tree, as you would with HSVI (a backup sketch follows this list)
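A hedged sketch of a belief node with bounds and of the HSVI-style backup used in this loop; the field names (`upper`, `lower`, `action_reward`, `obs_weight`) are assumptions of this sketch, not from a reference implementation:

```python
from dataclasses import dataclass, field
from typing import Dict

GAMMA = 0.95

@dataclass
class BeliefNode:
    upper: float                                                                  # upper bound on the value at this belief
    lower: float                                                                  # lower bound on the value at this belief
    action_reward: Dict[int, float] = field(default_factory=dict)                # scenario-averaged immediate reward per action
    children: Dict[int, Dict[int, "BeliefNode"]] = field(default_factory=dict)   # action -> observation -> child node
    obs_weight: Dict[int, Dict[int, float]] = field(default_factory=dict)        # action -> observation -> P(o | b, a), estimated from scenarios

def q_bound(node: BeliefNode, a: int, which: str) -> float:
    """One-step lookahead bound on Q(b, a), using the children's upper or lower bounds."""
    return node.action_reward[a] + GAMMA * sum(
        node.obs_weight[a][o] * getattr(child, which)
        for o, child in node.children[a].items()
    )

def backup(node: BeliefNode) -> None:
    """Tighten this node's bounds from its (already tightened) children, as in HSVI."""
    if node.children:
        node.upper = min(node.upper, max(q_bound(node, a, "upper") for a in node.children))
        node.lower = max(node.lower, max(q_bound(node, a, "lower") for a in node.children))
```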
Tightening Bounds
If \(b\) is a leaf of the tree, then add new belief nodes for every action and every observation as children of \(b\).
Then,
\begin{equation} b \leftarrow update(b, a^{*}, o^{*}) \end{equation}
where \(a^{*}\) and \(o^{*}\) are determined by…
- \(a^{*}\): chosen by the IE-MAX heuristic, i.e. the action whose upper bound on \(Q\) is largest
- \(o^{*}\): the observation whose child node has the largest weighted excess uncertainty
If the weighted excess uncertainty at that child is non-zero, we repeat this bound-tightening process from there until it reaches zero.
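Continuing the `BeliefNode` sketch above, here is a hedged sketch of one tightening descent: \(a^{*}\) maximizes the upper bound (the IE-MAX idea), \(o^{*}\) maximizes weighted excess uncertainty, and the recursion repeats until the weighted excess uncertainty is no longer positive. `EPSILON` and the depth-scaled slack follow the HSVI-style termination rule and are assumptions of this sketch:

```python
EPSILON = 0.01   # target gap at the root; an assumption of this sketch

def excess_uncertainty(node: BeliefNode, depth: int) -> float:
    """Gap between the bounds, minus the slack we tolerate at this depth (HSVI-style)."""
    return (node.upper - node.lower) - EPSILON / (GAMMA ** depth)

def tighten(node: BeliefNode, depth: int = 0) -> None:
    """Descend along (a*, o*) while the weighted excess uncertainty is positive, then back up."""
    if not node.children:
        return                      # a full implementation would expand the leaf here (new child per action/observation)
    # a*: the action with the highest upper bound (IE-MAX heuristic)
    a_star = max(node.children, key=lambda a: q_bound(node, a, "upper"))
    # o*: the observation whose child has the largest weighted excess uncertainty
    def weu(o: int) -> float:
        return node.obs_weight[a_star][o] * excess_uncertainty(node.children[a_star][o], depth + 1)
    o_star = max(node.children[a_star], key=weu)
    if weu(o_star) > 0:
        tighten(node.children[a_star][o_star], depth + 1)
    backup(node)                    # back the improved bounds up as the recursion unwinds
```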
DESPOT Theoretical Guarantees
The policy obtained from a regularized DESPOT is near-optimal with high probability; the suboptimality gap shrinks as the number of sampled scenarios \(K\) grows.