A Partially Observable Markov Decision Process (POMDP) is an MDP with state uncertainty: the agent cannot observe the true state directly, only observations of it.
Components:
- states
- actions (given state)
- transition function (given state and action)
- reward function
- Belief System
- beliefs
- observations
- observation model \(O(o \mid a, s')\)
As always, we want to find a policy \(\pi\) that solves:
\begin{equation} \underset{\pi \in \Pi}{\text{maximize}}\ \mathbb{E} \qty[ \sum_{t=0}^{\infty} \gamma^{t} R(b_{t}, \pi(b_{t}))] \end{equation}
where \(\pi\) now takes a belief over possible states as input instead of a state, and \(R(b, a) = \sum_{s} b(s)\, R(s, a)\) is the corresponding belief reward.
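As a rough sketch of what these pieces look like in code (the tabular container and names below are my own, not from these notes):

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class POMDP:
    """A discrete POMDP bundled as plain tabular functions (illustrative only)."""
    states: Sequence[int]
    actions: Sequence[int]
    observations: Sequence[int]
    T: Callable[[int, int, int], float]   # T(s, a, s'): transition probability
    R: Callable[[int, int], float]        # R(s, a): reward
    O: Callable[[int, int, int], float]   # O(o, a, s'): observation probability
    gamma: float                          # discount factor

def belief_reward(m: POMDP, b: dict, a: int) -> float:
    """R(b, a) = sum_s b(s) R(s, a): the reward term used in the objective above."""
    return sum(p * m.R(s, a) for s, p in b.items())
```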
observations and states
“where are we, and how sure are we about that?”
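Concretely, the answer comes from the belief update (a discrete Bayes filter); writing \(T(s' \mid s, a)\) for the transition function (my notation, since the notes only name it), after taking action \(a\) and observing \(o\):
\begin{equation} b'(s') \propto O(o \mid a, s') \sum_{s} T(s' \mid s, a)\, b(s) \end{equation}
normalized so that \(\sum_{s'} b'(s') = 1\).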
policy representations
“how do we represent a policy?”
- a tree: conditional plan
- a graph: controller
- with utility: alpha vectors + one-step lookahead
- alternatively, just take the top action of the conditional plan the alpha vector was computed from
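A minimal sketch of the alpha-vector route, assuming each alpha vector is stored together with the top action of the conditional plan it was computed from (the numbers and action names below are made up):

```python
import numpy as np

# Hypothetical alpha vectors: utility per state, paired with the top action
# of the conditional plan each vector was computed from.
alphas = [
    (np.array([1.0, 0.5]), "listen"),
    (np.array([0.2, 2.0]), "open"),
]

def greedy_action(b: np.ndarray) -> str:
    """Pick the alpha vector with the largest dot product with the belief b,
    and return the action it stores."""
    _, best_action = max(alphas, key=lambda av: float(av[0] @ b))
    return best_action

print(greedy_action(np.array([0.7, 0.3])))  # prints "listen" for this belief
```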
policy evaluations
“how good is our policy / what’s the utility?”
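For a conditional plan \(\pi\) with top action \(a\) and subplan \(\pi_o\) after observation \(o\), the utility can be written recursively (standard POMDP policy evaluation; the notation is mine):
\begin{equation} U^{\pi}(s) = R(s, a) + \gamma \sum_{s'} T(s' \mid s, a) \sum_{o} O(o \mid a, s')\, U^{\pi_o}(s') \end{equation}
Stacking \(U^{\pi}(s)\) over states gives the plan's alpha vector \(\alpha\), and the utility of a belief under a set \(\Gamma\) of alpha vectors is \(U^{\Gamma}(b) = \max_{\alpha \in \Gamma} \alpha^{\top} b\).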
policy solutions
“how do we make that policy better?”
exact solutions
approximate solutions
- estimate an alpha vector, and then use a policy representation from above:
upper-bounds for alpha vectors
lower-bounds for alpha vectors
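For instance, QMDP is one common upper-bound approximation: it solves the underlying MDP and reuses its Q-values as alpha vectors, which over-estimates the true value because it pretends the state becomes fully observable after one step. A minimal sketch, assuming tabular arrays (shapes and names are my own):

```python
import numpy as np

def qmdp_alphas(T: np.ndarray, R: np.ndarray, gamma: float, iters: int = 100) -> np.ndarray:
    """QMDP: value-iterate the fully observable MDP and use its Q-values as
    alpha vectors (an upper bound on the true POMDP value).

    T: (A, S, S) transition probabilities, R: (S, A) rewards.
    Returns an (A, S) array; row a is the alpha vector for action a."""
    A, S, _ = T.shape
    Q = np.zeros((A, S))
    for _ in range(iters):
        U = Q.max(axis=0)          # U(s) = max_a Q(a, s)
        Q = R.T + gamma * (T @ U)  # Q(a, s) = R(s, a) + gamma * sum_s' T(a, s, s') U(s')
    return Q

# Usage: pick the action whose alpha vector has the largest dot product with the belief b.
# a_star = int(np.argmax(qmdp_alphas(T, R, gamma) @ b))
```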