How do we solve a Markov Decision Process with a continuous state space?
Let there be a value function parameterized by \(\theta\):
\begin{equation} U_{\theta}(s) \end{equation}
Let us extract the greedy (value-function) policy with respect to this utility:
\begin{equation} \pi(s) = \arg\max_{a} \qty(R(s,a) + \gamma \sum_{s'}^{} T(s'|s,a) U_{\theta}(s')) \end{equation}
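As a concrete illustration, here is a minimal Python sketch of this greedy policy extraction. The names `R`, `sample_next_states`, and `U_theta` are assumed callables (not from the text); the expectation over \(T(s'|s,a)\) is approximated by averaging over sampled next states.

```python
import numpy as np

def greedy_policy(s, actions, R, sample_next_states, U_theta, gamma=0.95, n_samples=20):
    """Pick the action maximizing R(s,a) + gamma * E[U_theta(s')],
    approximating the expectation over T(s'|s,a) with sampled next states."""
    best_action, best_value = None, -np.inf
    for a in actions:
        next_states = sample_next_states(s, a, n_samples)          # s' ~ T(.|s,a)
        value = R(s, a) + gamma * np.mean([U_theta(sp) for sp in next_states])
        if value > best_value:
            best_action, best_value = a, value
    return best_action
```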
We now create a finite sampling of our state space, which may be infinitely large (for instance, continuous):
\begin{equation} S \subseteq \mathcal{S} \end{equation}
where \(S\) is a set of discrete states \(\{s_1, \dots, s_{m}\}\).
Now, what next?
Generally:
Initialize \(U_{\theta}\), then loop until convergence:
- For each \(s_{i} \in S\), let \(u_{i} = \max_{a} \qty(R(s_{i},a) + \gamma \sum_{s'}^{} T(s'|s_{i},a) U_{\theta}(s'))\), the utility at each sampled discrete state \(s_{i}\)
- Then, fit \(\theta\) so that \(U_{\theta}(s_{i})\) is close to \(u_{i}\) (sketched in code below)
To handle \(T\): take a finite sampling of next states, or fit a function to the transition model.
BUT: Convergence is not guaranteed.
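A minimal sketch of the loop above (approximate / fitted value iteration), assuming two hypothetical helpers not given in the notes: `fit(S, u)` returns a \(\theta\) such that \(U_{\theta}(s_i) \approx u_i\), and `U_of(theta)` returns the callable \(U_{\theta}\). The expectation over \(T\) is approximated by sampling next states, as noted above.

```python
import numpy as np

def fitted_value_iteration(S, actions, R, sample_next_states, fit, U_of,
                           gamma=0.95, n_samples=20, n_iters=100):
    """Approximate value iteration over a finite sample S of a continuous state space."""
    theta = fit(S, np.zeros(len(S)))              # initialize U_theta (here: fit to zeros)
    for _ in range(n_iters):                      # "until convergence" -- not guaranteed!
        U_theta = U_of(theta)
        u = np.empty(len(S))
        for i, s in enumerate(S):                 # Bellman backup at each sampled state s_i
            u[i] = max(
                R(s, a) + gamma * np.mean([U_theta(sp)
                                           for sp in sample_next_states(s, a, n_samples)])
                for a in actions
            )
        theta = fit(S, u)                         # refit theta so U_theta(s_i) ~ u_i
    return theta
```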
There are two main approaches to this approximation:
global approximation
- linear regression: fit a best-fit line of state value vs. utility value
- polynomial fit: a best-fit polynomial, whereby \(U_{\theta}(s) = \theta^{T}\beta(s)\), where each \(\beta_{j}(s)=s^{j-1}\) (see the sketch after this list)
- a frigin neural network (train a model with parameters \(\theta\) that produces the utility for you: \(M_{\theta}(s) = U_{\theta}(s)\))
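For instance, the polynomial (global) fit can be done with ordinary least squares. This sketch assumes a scalar state, and the function names are illustrative; it plugs into the `fit` / `U_of` slots of the loop sketched earlier:

```python
import numpy as np

def poly_features(s, k=4):
    """beta(s) = [1, s, s^2, ..., s^(k-1)] for a scalar state s."""
    return np.array([s**j for j in range(k)])

def fit_poly(S, u, k=4):
    """Least-squares fit of theta so that theta^T beta(s_i) is close to u_i."""
    B = np.stack([poly_features(s, k) for s in S])     # m x k design matrix
    theta, *_ = np.linalg.lstsq(B, u, rcond=None)
    return theta

def U_of_poly(theta, k=4):
    """Return the callable U_theta(s) = theta^T beta(s)."""
    return lambda s: theta @ poly_features(s, k)
```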
local approximation
- make a sampling in your continuous state space to discretize it
- run any utility-computation method you’d like (policy evaluation or value iteration) over those samples to get a set of \(\theta_{i}\), the utility of being in each sampled discrete state \(s_{i}\)
- whenever you need to calculate \(U(s)\) at a particular state, estimate it from the nearby samples (see the sketch after this list)…
- linearly interpolate
- k nearest neighbor
- kernel smoothing
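For local approximation, once you have the sampled states \(s_i\) and their utilities \(\theta_i\), here are minimal estimators for two of the options above (assuming scalar states; the names are illustrative):

```python
import numpy as np

def U_interp(s, S, thetas):
    """1-D linear interpolation between sampled states (S assumed sorted ascending)."""
    return np.interp(s, S, thetas)

def U_knn(s, S, thetas, k=3):
    """k-nearest-neighbor estimate: average the utilities of the k closest samples."""
    nearest = np.argsort(np.abs(np.asarray(S) - s))[:k]
    return np.mean(np.asarray(thetas)[nearest])
```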