What if the hidden part of our state never changes, or changes deterministically (for instance, in a localization task)? This should make solving a POMDP easier.
POMDP-lite
- \(X\) fully observable states
- \(\theta\) hidden parameter, taking a finite number of values \(\theta_{1 \dots N}\)
- where \(S = X \times \theta\)
We then assume conditional independence between \(x\) and \(\theta\), so the transition factors as \(T = P(x'|\theta, x, a)\), with \(P(\theta'|\theta, x, a) = 1\) for the deterministic successor \(\theta'\) ("our hidden parameter is constant or changes deterministically").
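As a concrete sketch of this factorization, the model can be written with a separate observable-state transition and a deterministic hidden-parameter transition. All names below (`PomdpLiteModel`, `trans_x`, `next_theta`, etc.) are illustrative, not from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class PomdpLiteModel:
    """Illustrative container for the factored model S = X x Theta."""
    x_states: Sequence[int]      # fully observable states X
    thetas: Sequence[int]        # hidden parameter values theta_1..theta_N
    actions: Sequence[int]
    # P(x' | theta, x, a): observable-state transition, conditioned on theta
    trans_x: Callable[[int, int, int, int], float]   # (x_next, theta, x, a) -> prob
    # theta is constant or changes deterministically: theta' = next_theta(theta, x, a)
    next_theta: Callable[[int, int, int], int]
    # R(s, a) with s = (x, theta), called as reward(theta, x, a)
    reward: Callable[[int, int, int], float]
    discount: float = 0.95
```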
Solving
Main idea: if that's the case, we can split the model into a set of MDPs. Because each \(\theta_{j}\) changes deterministically, we can solve an MDP online over \(X\) and \(T\) for each possible initial value of \(\theta\). Then we take the belief over \(\theta\) and sample among these MDPs according to that belief, as sketched below.
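A rough sketch of this idea, assuming the model above; `solve_mdp_for_theta` stands in for whatever online MDP solver is used, and `update_belief` is the standard Bayes update of the belief over \(\theta\) from observed \(x\)-transitions.

```python
import random

def update_belief(model, belief, x, a, x_next):
    """Bayes update of the belief over theta after observing x -> x_next under action a.

    Since theta changes deterministically, each theta maps to next_theta(theta, x, a),
    and its posterior weight is proportional to b(theta) * P(x' | theta, x, a).
    """
    new_belief = {th: 0.0 for th in model.thetas}
    for theta, p in belief.items():
        new_belief[model.next_theta(theta, x, a)] += p * model.trans_x(x_next, theta, x, a)
    z = sum(new_belief.values())
    return {th: p / z for th, p in new_belief.items()} if z > 0 else dict(belief)

def act(model, belief, x, solve_mdp_for_theta):
    """Sample a theta from the current belief and act using that theta's MDP policy."""
    theta = random.choices(list(belief), weights=list(belief.values()))[0]
    policy = solve_mdp_for_theta(model, theta)   # online MDP solve over X for this theta
    return policy(x)
```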
Reward bonus
To trade off exploration and exploitation, we introduce a reward bonus:
- an exploration reward bonus, which encourages information-gathering actions
- a visit count \(\xi(b,x,a)\) of how many times \((b,x,a)\) has been visited; once it exceeds a threshold, the reward bonus is clipped
Whereby:
\begin{equation} RB(b,s,a) = \beta \sum_{s'} P(s'|b,s,a) \, \lVert b_{s'} - b \rVert_{1} \end{equation}
which encourages information gain: successor states whose updated belief \(b_{s'}\) is farther from the current belief \(b\) in \(L_{1}\) distance receive a larger bonus.
Then, we can formulate an augmented reward function \(\tilde{R}(b,s,a) = R(s,a) + RB(b,s,a)\).
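A minimal sketch of the augmented reward, assuming the model and `update_belief` above; the visit-count key and the clipping threshold `m` are my own illustrative choices (the notes only say the bonus is clipped once \(\xi(b,x,a)\) exceeds some number of visits).

```python
def l1_distance(b1, b2):
    """L1 distance between two beliefs over theta."""
    return sum(abs(b1[th] - b2[th]) for th in b1)

def belief_key(belief, ndigits=3):
    """Hashable (rounded) representation of a belief, used to count visits to (b, x, a)."""
    return tuple(round(belief[th], ndigits) for th in sorted(belief))

def augmented_reward(model, belief, x, a, counts, beta=1.0, m=20):
    """R~(b,s,a) = R(s,a) + RB(b,s,a), with the bonus clipped to zero once the
    visit count xi(b,x,a) exceeds the (assumed) threshold m."""
    # Expected immediate reward under the current belief over theta.
    r = sum(p * model.reward(theta, x, a) for theta, p in belief.items())
    if counts.get((belief_key(belief), x, a), 0) > m:
        return r                       # bonus clipped after enough visits
    bonus = 0.0
    for x_next in model.x_states:
        # P(s' | b, s, a), marginalizing theta under the current belief.
        p_next = sum(p * model.trans_x(x_next, theta, x, a) for theta, p in belief.items())
        if p_next > 0:
            b_next = update_belief(model, belief, x, a, x_next)
            bonus += p_next * l1_distance(b_next, belief)
    return r + beta * bonus
```

Here `counts` is assumed to be incremented by the agent whenever it actually executes \(a\) at \((b,x)\); planning only reads it.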
Solution
Finally, at each time step, we take the current belief and assume it stays fixed during planning. This gives an MDP:
\begin{equation} \tilde{V}^{*}(b,s) = \max_{a} \left\{ \tilde{R}(b,s,a) + \gamma \sum_{s'} P(s'|b,s,a) \, \tilde{V}^{*}(b,s') \right\} \end{equation}
which we can solve with any MDP solver; the authors used UCT.
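As a simpler stand-in for UCT under the same fixed-belief assumption, a plain value-iteration sketch over \(X\) (belief held constant, \(\theta\) marginalized out under \(b\)) might look like:

```python
def solve_fixed_belief_mdp(model, belief, counts, n_iters=200, beta=1.0):
    """Value iteration for V~*(b, x) with the belief b frozen during planning.

    Illustrative replacement for the UCT search used by the authors.
    """
    V = {x: 0.0 for x in model.x_states}
    for _ in range(n_iters):
        for x in model.x_states:
            best = float("-inf")
            for a in model.actions:
                q = augmented_reward(model, belief, x, a, counts, beta=beta)
                for x_next in model.x_states:
                    p_next = sum(p * model.trans_x(x_next, theta, x, a)
                                 for theta, p in belief.items())
                    q += model.discount * p_next * V[x_next]
                best = max(best, q)
            V[x] = best
    return V
```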