an Option (over an MDP) represents a high-level collection of actions. Big picture: abstract your big policy into \(n\) small policies (options), and value-iterate over the expected values of those options.
Markov Option
A Markov Option is given by a triple \((I, \pi, \beta)\)
- \(I \subset S\), the states from which the option may be started
- \(\pi : S \times A \to [0,1]\), the policy followed while the option is executing
- \(\beta(s)\), the probability of the option terminating at state \(s\)
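As a minimal sketch, the triple can be written down directly as a small Python structure; the names here (MarkovOption, initiation_set, policy, termination) are illustrative choices, not from any particular library.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Set

State = Hashable
Action = Hashable

@dataclass
class MarkovOption:
    initiation_set: Set[State]             # I ⊆ S: states the option may be started from
    policy: Callable[[State], Action]      # pi: action selection while the option runs
    termination: Callable[[State], float]  # beta(s): probability of terminating in s

    def can_start(self, s: State) -> bool:
        return s in self.initiation_set
```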
one-step options
You can wrap each primitive action \(a\) as a one-step option, which terminates immediately after that single action (see the sketch after this list):
- \(I = \{s:a \in A_{s}\}\)
- \(\pi(s,a) = 1\)
- \(\beta(s) = 1\)
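A sketch of that wrapping, assuming the MarkovOption structure above and a hypothetical dict action_sets mapping each state \(s\) to its action set \(A_{s}\):

```python
def one_step_option(a: Action, action_sets: dict) -> MarkovOption:
    """Wrap primitive action a as a one-step option, per the definition above."""
    return MarkovOption(
        initiation_set={s for s, acts in action_sets.items() if a in acts},  # I = {s : a in A_s}
        policy=lambda s: a,           # pi(s, a) = 1: always take a
        termination=lambda s: 1.0,    # beta(s) = 1: terminate after one step
    )
```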
option value function
\begin{equation} Q^{\mu}(s,o) = \mathbb{E}\qty[r_{t} + \gamma r_{t+1} + \dots \mid o \text{ initiated in } s \text{ at time } t] \end{equation}
where \(\mu\) is the policy over options (the "option selection process") that takes over once \(o\) terminates
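As a rough illustration, \(Q^{\mu}(s,o)\) can be estimated by Monte Carlo: run \(o\) from \(s\), keep picking options with \(\mu\), and average the discounted return. This sketch assumes the execute_option helper from the SMDP section below, plus hypothetical env.reset_to and is_terminal interfaces.

```python
def estimate_option_value(env, mu, s, o, gamma=0.99, n_rollouts=100,
                          is_terminal=lambda state: False):
    """Monte Carlo estimate of Q^mu(s, o): start with option o in s, then follow mu."""
    total = 0.0
    for _ in range(n_rollouts):
        env.reset_to(s)                              # hypothetical: restart the env in s
        state, ret, discount, current = s, 0.0, 1.0, o
        while not is_terminal(state):                # episodes are assumed to terminate
            state, r, tau = execute_option(env, state, current, gamma)
            ret += discount * r                      # r is already discounted within the option
            discount *= gamma ** tau
            current = mu(state)                      # mu picks the next option
        total += ret
    return total / n_rollouts
```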
semi-markov decision process
a semi-Markov decision process (SMDP) is the process induced at the level of options: the time an option takes becomes part of the transition, while the dynamics inside each option remain an ordinary MDP. Its transition model is
\begin{equation} T(s', \tau \mid s,o) \end{equation}
where \(\tau\) is the time elapsed while the option ran.
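One way to read \(T(s', \tau \mid s, o)\): it is the distribution you sample from by actually running the option until \(\beta\) says stop. A sketch, assuming a hypothetical environment whose step(a) returns (next_state, reward):

```python
import random

def execute_option(env, s, option, gamma=0.99):
    """Run `option` from state s until beta says stop; return one sample
    (s', discounted in-option reward, elapsed time tau) from T(s', tau | s, o)."""
    total, discount, tau = 0.0, 1.0, 0
    while True:
        a = option.policy(s)
        s, r = env.step(a)            # assumed interface: step(a) -> (next_state, reward)
        total += discount * r
        discount *= gamma
        tau += 1
        if random.random() < option.termination(s):  # stop with probability beta(s)
            return s, total, tau
```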
because an option terminates many steps away from where it was initiated, option-level termination induces jumps between distant states, so one backup can propagate value information to many states at once.
intra-option Q-learning
\begin{equation} Q_{k+1}(s_{t},o) = (1-\alpha_{k})Q_{k}(s_{t}, o) + \alpha_{k} \qty(r_{t+1} + \gamma U_{k}(s_{t+1}, o)) \end{equation}
where:
\begin{equation} U_{k}(s,o) = (1-\beta(s))Q_{k}(s,o) + \beta(s) \max_{o' \in O} Q_{k}(s,o') \end{equation}
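A sketch of a single intra-option backup built from these two equations. Here Q is a dict keyed by (state, option name), options maps names to objects with .policy and .termination as in the earlier sketches, and every option whose policy agrees with the action actually taken gets updated from the same experience (all names are illustrative, and Q is assumed to be initialized for every (state, option) pair).

```python
def intra_option_update(Q, options, s_t, a_t, r, s_next, alpha, gamma):
    """One intra-option Q-learning backup from the transition (s_t, a_t, r, s_next)."""
    for name, o in options.items():
        if o.policy(s_t) != a_t:       # only options consistent with the executed action
            continue
        beta = o.termination(s_next)
        # U_k(s', o) = (1 - beta(s')) Q_k(s', o) + beta(s') max_{o'} Q_k(s', o')
        u = (1 - beta) * Q[(s_next, name)] + beta * max(
            Q[(s_next, other)] for other in options
        )
        # Q_{k+1}(s_t, o) = (1 - alpha_k) Q_k(s_t, o) + alpha_k (r + gamma U_k(s_{t+1}, o))
        Q[(s_t, name)] = (1 - alpha) * Q[(s_t, name)] + alpha * (r + gamma * u)
```

Updating every consistent option from one transition is what lets a single experience improve many option values at once, in line with the backup-propagation point above.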