Improving PBVI without sacrificing quality.
Initialization
We first initialize HSVI with a set of alpha vectors \(\Gamma\), representing the lower-bound, and a list of tuples of \((b, U(b))\) named \(\Upsilon\), representing the upper-bound. We call the value functions they generate as \(\bar{V}\) and \(\underline V\).
Lower Bound
Set of alpha vectors: best-action worst-state (HSVI1), blind lower bound (HSVI2)
Calculating \(\underline{V}(b)\)
\begin{equation} \underline{V}_{\Gamma} = \max_{\alpha} \alpha^{\top}b \end{equation}
Upper Bound
- solving fully-observable MDP
- Project \(b\) into the point-set
- Projected the upper bound onto a convex hull (HSVI2: via approximate convex hull projection)
Calculating \(\bar{V}(b)\)
Recall that though the lower-bound is given by alpha vectors, the upper bound is given in terms of a series of tuples \((b, U(b)) \in \Upsilon\).
- HSVI1: we figure the upper bound for any given \(b\) by projecting onto the convex hull formed by points on \(\Upsilon\)
- HSVI2: approximate linear projection
Update
Begin with state \(b = b_0\).
Repeat:
at every step, we perform a local update for upper and lower bound using the current \(b\)
- the lower bound is updated using PBVI Backup on \(b, \Gamma\)
- the upper bound is updated using POMDP Bellman Update on \(b, \Upsilon\), putting the new \((b, u(b))\) in the set \(\Upsilon\).
Then, we update our belief via the usual:
\begin{equation} b \leftarrow update(b, a^{*}, o^{*}) \end{equation}
where \(a^{*}\) and \(o^{*}\) are determined by…
IE-MAX Heuristic
IE-MAX Heuristic is used to determine \(a^{*}\), whereby we choose the action such that:
\begin{equation} a^{*} = \arg\max_{a}Q^{(\bar{V})}(b) \end{equation}
yes, we choose the next action which maximizes the upper bound of the utility we can get.
weighted excess uncertainty
weighted excess uncertainty is used to determine \(o^{*}\). Suppose we are depth \(d\) loops in the search tree (i.e. this is our $d$th chain), we define:
\begin{equation} \text{excess}(b,t) = (\bar{V}(b)-\underline{V}(b)) - \epsilon \gamma^{-t} \end{equation}
“how far away are we from converging to a value uncertainty of no more than \(\epsilon\), given we are depth \(t\) in?
and, we choose the observation \(o^{*}\) such that:
\begin{equation} o^{*} = \arg\max_{o} \qty[p(o|b,a^{*}) \text{excess}(update(b,a,o), t+1)] \end{equation}
where,
\begin{align} P(o|b,a) &= \sum_{s}^{} p(o|s,a) b(s) \\ &= \sum_{s}^{} \sum_{s’}^{} T(s’|s,a) O(o|s’,a) b(s) \end{align}