How do we solve a Markov Decision Process with a continuous state space?
Let there be a value function parameterized by \(\theta\):
\begin{equation} U_{\theta}(s) \end{equation}
Let us extract the greedy (value-function) policy with respect to this utility:
\begin{equation} \pi(s) = \arg\max_{a} \qty(R(s,a) + \gamma \sum_{s'}^{} T(s'|s,a) U_{\theta}(s')) \end{equation}
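As a concrete illustration, here is a minimal Python sketch of this greedy policy extraction. The names `R`, `sample_next_states`, and `U_theta` are assumed callables (not from the text); the expectation over \(T(s'|s,a)\) is approximated by averaging over sampled next states.

```python
import numpy as np

def greedy_policy(s, actions, R, sample_next_states, U_theta, gamma=0.95, n_samples=20):
    """Pick the action maximizing R(s,a) + gamma * E[U_theta(s')],
    approximating the expectation over T(s'|s,a) with sampled next states."""
    best_action, best_value = None, -np.inf
    for a in actions:
        next_states = sample_next_states(s, a, n_samples)          # s' ~ T(.|s,a)
        value = R(s, a) + gamma * np.mean([U_theta(sp) for sp in next_states])
        if value > best_value:
            best_action, best_value = a, value
    return best_action
```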
We now create a finite sampling of our state space, which may be infinitely large (for instance, continuous):
\begin{equation} S \subseteq \mathcal{S} \end{equation}
where \(S\) is a set of discrete states \(\{s_1, \dots, s_{m}\}\).
Now, what next?
Generally:
Initialize \(U_{\theta}\), then loop until convergence:
- For each \(s_{i} \in S\), let \(u_{i} = \max_{a} \qty(R(s_{i},a) + \gamma \sum_{s'}^{} T(s'|s_{i},a) U_{\theta}(s'))\), the utility at each sampled discrete state \(s_{i}\)
- Then, fit \(\theta\) so that \(U_{\theta}(s_{i})\) is close to \(u_{i}\) (sketched in code below)
To handle \(T\): take a finite sampling of next states, or fit a function to the transition model.
BUT: Convergence is not guaranteed.
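A minimal sketch of the loop above (approximate / fitted value iteration), assuming two hypothetical helpers not given in the notes: `fit(S, u)` returns a \(\theta\) such that \(U_{\theta}(s_i) \approx u_i\), and `U_of(theta)` returns the callable \(U_{\theta}\). The expectation over \(T\) is approximated by sampling next states, as noted above.

```python
import numpy as np

def fitted_value_iteration(S, actions, R, sample_next_states, fit, U_of,
                           gamma=0.95, n_samples=20, n_iters=100):
    """Approximate value iteration over a finite sample S of a continuous state space."""
    theta = fit(S, np.zeros(len(S)))              # initialize U_theta (here: fit to zeros)
    for _ in range(n_iters):                      # "until convergence" -- not guaranteed!
        U_theta = U_of(theta)
        u = np.empty(len(S))
        for i, s in enumerate(S):                 # Bellman backup at each sampled state s_i
            u[i] = max(
                R(s, a) + gamma * np.mean([U_theta(sp)
                                           for sp in sample_next_states(s, a, n_samples)])
                for a in actions
            )
        theta = fit(S, u)                         # refit theta so U_theta(s_i) ~ u_i
    return theta
```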
There are two main approaches to this approximation:
global approximation
- linear regression: fit a best-fit line of state value vs. utility value
- polynomial fit: a best-fit polynomial, whereby \(U_{\theta}(s) = \theta^{T}\beta(s)\), where each \(\beta_{j}(s)=s^{j-1}\) (see the sketch after this list)
- a frigin neural network (train a model with parameters \(\theta\) that produces the utility for you: \(M_{\theta}(s) = U_{\theta}(s)\))
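For instance, the polynomial (global) fit can be done with ordinary least squares. This sketch assumes a scalar state, and the function names are illustrative; it plugs into the `fit` / `U_of` slots of the loop sketched earlier:

```python
import numpy as np

def poly_features(s, k=4):
    """beta(s) = [1, s, s^2, ..., s^(k-1)] for a scalar state s."""
    return np.array([s**j for j in range(k)])

def fit_poly(S, u, k=4):
    """Least-squares fit of theta so that theta^T beta(s_i) is close to u_i."""
    B = np.stack([poly_features(s, k) for s in S])     # m x k design matrix
    theta, *_ = np.linalg.lstsq(B, u, rcond=None)
    return theta

def U_of_poly(theta, k=4):
    """Return the callable U_theta(s) = theta^T beta(s)."""
    return lambda s: theta @ poly_features(s, k)
```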
local approximation
- make a sampling in your continuous state space to discretize it
- run any utility-computation method you’d like (policy evaluation or value iteration) over those samples to get a set of \(\theta_{i}\), the utility of being in each sampled discrete state \(s_{i}\)
- whenever you need to calculate \(U(s)\) at a particular state, estimate it from the nearby samples (see the sketch after this list)…
- linearly interpolate
- k nearest neighbor
- kernel smoothing
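For local approximation, once you have the sampled states \(s_i\) and their utilities \(\theta_i\), here are minimal estimators for two of the options above (assuming scalar states; the names are illustrative):

```python
import numpy as np

def U_interp(s, S, thetas):
    """1-D linear interpolation between sampled states (S assumed sorted ascending)."""
    return np.interp(s, S, thetas)

def U_knn(s, S, thetas, k=3):
    """k-nearest-neighbor estimate: average the utilities of the k closest samples."""
    nearest = np.argsort(np.abs(np.asarray(S) - s))[:k]
    return np.mean(np.asarray(thetas)[nearest])
```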