an Option (over an MDP) represents a high-level collection of actions. Big picture: abstract your big policy into \(n\) small policies (options), and value-iterate over the expected values of those options.
Markov Option
A Markov Option is given by a triple \((I, \pi, \beta)\)
- \(I \subset S\), the states from which the option may be started
- \(\pi : S \times A \to [0,1]\), the policy followed while the option is executing
- \(\beta(s)\), the probability of the option terminating at state \(s\)
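As a minimal sketch, the triple can be written down directly as a small Python structure; the names here (MarkovOption, initiation_set, policy, termination) are illustrative choices, not from any particular library.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Set

State = Hashable
Action = Hashable

@dataclass
class MarkovOption:
    initiation_set: Set[State]             # I ⊆ S: states the option may be started from
    policy: Callable[[State], Action]      # pi: action selection while the option runs
    termination: Callable[[State], float]  # beta(s): probability of terminating in s

    def can_start(self, s: State) -> bool:
        return s in self.initiation_set
```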
one-step options
You can wrap each primitive action \(a\) as a one-step option, which terminates immediately after that single action (see the sketch after this list):
- \(I = \{s:a \in A_{s}\}\)
- \(\pi(s,a) = 1\)
- \(\beta(s) = 1\)
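A sketch of that wrapping, assuming the MarkovOption structure above and a hypothetical dict action_sets mapping each state \(s\) to its action set \(A_{s}\):

```python
def one_step_option(a: Action, action_sets: dict) -> MarkovOption:
    """Wrap primitive action a as a one-step option, per the definition above."""
    return MarkovOption(
        initiation_set={s for s, acts in action_sets.items() if a in acts},  # I = {s : a in A_s}
        policy=lambda s: a,           # pi(s, a) = 1: always take a
        termination=lambda s: 1.0,    # beta(s) = 1: terminate after one step
    )
```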
option value function
\begin{equation} Q^{\mu}(s,o) = \mathbb{E}\qty[r_{t} + \gamma r_{t+1} + \dots \mid o \text{ initiated in } s \text{ at time } t] \end{equation}
where \(\mu\) is the policy over options (the "option selection process") that takes over once \(o\) terminates
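As a rough illustration, \(Q^{\mu}(s,o)\) can be estimated by Monte Carlo: run \(o\) from \(s\), keep picking options with \(\mu\), and average the discounted return. This sketch assumes the execute_option helper from the SMDP section below, plus hypothetical env.reset_to and is_terminal interfaces.

```python
def estimate_option_value(env, mu, s, o, gamma=0.99, n_rollouts=100,
                          is_terminal=lambda state: False):
    """Monte Carlo estimate of Q^mu(s, o): start with option o in s, then follow mu."""
    total = 0.0
    for _ in range(n_rollouts):
        env.reset_to(s)                              # hypothetical: restart the env in s
        state, ret, discount, current = s, 0.0, 1.0, o
        while not is_terminal(state):                # episodes are assumed to terminate
            state, r, tau = execute_option(env, state, current, gamma)
            ret += discount * r                      # r is already discounted within the option
            discount *= gamma ** tau
            current = mu(state)                      # mu picks the next option
        total += ret
    return total / n_rollouts
```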
semi-markov decision process
a semi-Markov decision process (SMDP) is the process induced at the level of options: the time an option takes becomes part of the transition, while the dynamics inside each option remain an ordinary MDP. Its transition model is
\begin{equation} T(s', \tau \mid s,o) \end{equation}
where \(\tau\) is the time elapsed while the option ran.
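One way to read \(T(s', \tau \mid s, o)\): it is the distribution you sample from by actually running the option until \(\beta\) says stop. A sketch, assuming a hypothetical environment whose step(a) returns (next_state, reward):

```python
import random

def execute_option(env, s, option, gamma=0.99):
    """Run `option` from state s until beta says stop; return one sample
    (s', discounted in-option reward, elapsed time tau) from T(s', tau | s, o)."""
    total, discount, tau = 0.0, 1.0, 0
    while True:
        a = option.policy(s)
        s, r = env.step(a)            # assumed interface: step(a) -> (next_state, reward)
        total += discount * r
        discount *= gamma
        tau += 1
        if random.random() < option.termination(s):  # stop with probability beta(s)
            return s, total, tau
```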
because an option terminates many steps away from where it was initiated, option-level termination induces jumps between distant states, so one backup can propagate value information to many states at once.
intra-option Q-learning
\begin{equation} Q_{k+1}(s_{t},o) = (1-\alpha_{k})Q_{k}(s_{t}, o) + \alpha_{k} \qty(r_{t+1} + \gamma U_{k}(s_{t+1}, o)) \end{equation}
where:
\begin{equation} U_{k}(s,o) = (1-\beta(s))Q_{k}(s,o) + \beta(s) \max_{o' \in O} Q_{k}(s,o') \end{equation}
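A sketch of a single intra-option backup built from these two equations. Here Q is a dict keyed by (state, option name), options maps names to objects with .policy and .termination as in the earlier sketches, and every option whose policy agrees with the action actually taken gets updated from the same experience (all names are illustrative, and Q is assumed to be initialized for every (state, option) pair).

```python
def intra_option_update(Q, options, s_t, a_t, r, s_next, alpha, gamma):
    """One intra-option Q-learning backup from the transition (s_t, a_t, r, s_next)."""
    for name, o in options.items():
        if o.policy(s_t) != a_t:       # only options consistent with the executed action
            continue
        beta = o.termination(s_next)
        # U_k(s', o) = (1 - beta(s')) Q_k(s', o) + beta(s') max_{o'} Q_k(s', o')
        u = (1 - beta) * Q[(s_next, name)] + beta * max(
            Q[(s_next, other)] for other in options
        )
        # Q_{k+1}(s_t, o) = (1 - alpha_k) Q_k(s_t, o) + alpha_k (r + gamma U_k(s_{t+1}, o))
        Q[(s_t, name)] = (1 - alpha) * Q[(s_t, name)] + alpha * (r + gamma * u)
```

Updating every consistent option from one transition is what lets a single experience improve many option values at once, in line with the backup-propagation point above.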