SARSA(\(\lambda\)) is SARSA with Eligibility Traces (parameterised by \(\lambda\)).
Previous approaches to dealing with Partially Observable Markov Decision Processes (POMDPs):
- memory-based state estimation (beliefs)
- special planning methods
Key question: Can we use MDP reinforcement learning to deal with POMDPs?
Background
Recall MDP SARSA:
\begin{equation} Q(s,a) \leftarrow Q(s,a) + \alpha \left[(r + \gamma Q(s', a')) - Q(s,a)\right] \end{equation}
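For concreteness, here is a minimal sketch of this update in Python (the table `Q` is assumed to be a dict-like structure keyed by state-action pairs; the names are illustrative, not from any particular library):

```python
# A minimal sketch of one tabular SARSA update. Q is assumed to be a
# dict-like table keyed by (state, action); (s, a, r, s_next, a_next)
# come from one step of interaction with the environment.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```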
Recall that sparse rewards with SARSA can take a long time to learn, because it takes many steps for reward information to propagate back through the value estimates.
Hence, we use Eligibility Traces, which keep track of what’s “eligible” for updates:
let \(\lambda\) be a decay parameter; then:
\begin{equation} \delta = r + \gamma Q(s',a') - Q(s,a) \end{equation}
and, we can write:
\begin{equation} Q(s,a) \leftarrow Q(s,a) + \alpha\, \delta\, N(s,a) \end{equation}
whereby the visit counts \(N(s,a)\), acting as eligibility traces, are discounted at each step:
\begin{equation} N(s,a) \leftarrow \gamma \lambda N(s,a) \end{equation}
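Put together, one step of tabular SARSA(\(\lambda\)) might look like the following sketch (assuming `Q` and `N` are `defaultdict(float)` tables keyed by state-action pairs; this is the accumulating-trace variant):

```python
from collections import defaultdict

# One step of tabular SARSA(lambda) with accumulating traces.
# Q and N are assumed to be defaultdict(float) keyed by (state, action).
def sarsa_lambda_step(Q, N, s, a, r, s_next, a_next,
                      alpha=0.1, gamma=0.99, lam=0.9):
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    N[(s, a)] += 1                           # the visited pair becomes eligible
    for key in list(N):
        Q[key] += alpha * delta * N[key]     # credit every eligible pair
        N[key] *= gamma * lam                # then decay its trace
```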
Worry
Our inability to fully observe the state seems to invalidate the premise of \(Q\)-learning, SARSA, etc.
Applying Eligibility Traces to POMDPs
Instead of a visitation count \(N(s,a)\) over states, we initialize eligibility traces \(\eta(x,a)\) over observation-action pairs and proceed in the same way.
At each step, we take action \(a_{t}\) and sample a reward \(r_{t+1}\) and observation \(x_{t+1}\) (remembering our current \(x_{t}, a_{t}\)). Then:
\begin{equation} \eta(x_{t},a_{t}) = 1\ \text{(reset decay)} \end{equation}
and
\begin{equation} \forall (x,a) \neq (x_{t},a_{t}): \eta(x,a) = \gamma \lambda\, \eta_{t-1}(x,a)\ \text{(decay others)} \end{equation}
Using the new \(\eta\) values, we update \(Q(x, a)\) in the usualish manner:
\begin{equation} \delta_{t} := r_{t+1} + \gamma Q(x_{t+1}, a_{t+1}) - Q(x_{t}, a_{t}) \end{equation}
\begin{equation} \forall (x,a): Q(x,a) \leftarrow Q(x,a) + \alpha\, \delta_{t}\, \eta(x,a) \end{equation}
All values of \(\eta\) are reset to \(0\) at the start of each episode.
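Putting these per-step updates together, a minimal sketch of one episode (the `env.reset()`/`env.step()` and `policy(Q, x)` interfaces here are assumptions for illustration, e.g. an \(\epsilon\)-greedy policy over the \(Q\) values for observation \(x\)):

```python
from collections import defaultdict

# One episode of SARSA(lambda) over observations rather than states.
# Q is a defaultdict(float) keyed by (observation, action) and persists
# across episodes; the traces eta are reset to 0 every episode.
def run_episode(env, policy, Q, alpha=0.1, gamma=0.99, lam=0.9):
    eta = defaultdict(float)
    x = env.reset()                     # assumed: returns the first observation
    a = policy(Q, x)
    done = False
    while not done:
        x_next, r, done = env.step(a)   # assumed: (observation, reward, done)
        a_next = policy(Q, x_next)
        delta = r + (0.0 if done else gamma * Q[(x_next, a_next)]) - Q[(x, a)]
        for key in list(eta):
            eta[key] *= gamma * lam     # decay every trace ...
        eta[(x, a)] = 1.0               # ... then reset the current pair to 1
        for key in list(eta):
            Q[key] += alpha * delta * eta[key]  # credit all eligible pairs
        x, a = x_next, a_next
    return Q
```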
Notably, we can formulate \(x\) as a tuple of the two most recent observations, giving us a single step of memory in the past to optimise over.
This requires no belief-state estimation! And note how the “eligibility” of each observation-action pair decays over time, so its influence on updates shrinks until we see the corresponding observation again.
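For the one-step-memory variant, one option (an illustrative assumption, not prescribed by the method) is to wrap the environment so that every observation handed to the learner is the pair of the previous and current raw observations:

```python
# Illustrative wrapper: each "observation" seen by the learner is the
# tuple (previous raw observation, current raw observation), i.e.
# x_t = (o_{t-1}, o_t). Assumes the same env interface as the sketch above.
class OneStepMemory:
    def __init__(self, env):
        self.env = env
        self.prev = None

    def reset(self):
        obs = self.env.reset()
        self.prev = obs
        return (None, obs)              # no history yet at t = 0

    def step(self, action):
        obs, r, done = self.env.step(action)
        x = (self.prev, obs)
        self.prev = obs
        return x, r, done
```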
Aliasing
An important failure mode of this approach is aliasing: you may be in one of two different states that look identical observationally, but taking the same action in those states leads to very different outcomes.