Memoryless policy search through fake determinism.
- uses a deterministic simulative function to estimate the value
- performs policy search with standard optimization methods
Primary contribution: transforming a stochastic POMDP into a deterministic simulative function; forgoes alpha vectors.
Suppose you have sampled \(m\) initial states; you can then search for the policy parameters \(\theta\) that maximize:
\begin{equation} \arg\max_{\theta} \tilde{V}(\theta) = \arg\max_{\theta} \frac{1}{m} \sum_{i=1}^{m} V_{\theta}(s_{i}) \end{equation}
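A minimal sketch of this objective in Python (the rollout value `toy_value`, the initial-state distribution, and the optimizer choice are all illustrative assumptions, not the paper's setup):

```python
import numpy as np
from scipy.optimize import minimize

def v_tilde(theta, initial_states, rollout_value):
    # Monte Carlo estimate: average the rollout return V_theta(s_i)
    # over the m sampled initial states s_1 ... s_m.
    return np.mean([rollout_value(theta, s0) for s0 in initial_states])

# Toy stand-in for a rollout return (an assumption for illustration):
toy_value = lambda theta, s0: -(theta[0] - s0) ** 2

initial_states = np.random.uniform(0, 1, size=5)   # m = 5 sampled initial states
res = minimize(lambda th: -v_tilde(th, initial_states, toy_value), x0=np.zeros(1))
print(res.x)   # theta maximizing the sampled objective (here, ~mean of initial_states)
```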
To actually ensure that the rollouts behind \(V\) have deterministic transitions…
deterministic simulative function
Typically, a generative model draws a random sample from the transition distribution to produce the next state. What we do instead is have a simulator which takes a RANDOM NUMBER as INPUT, along with the state and action, and DETERMINISTICALLY gives the next state.
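A minimal sketch of such a deterministic simulative function on a toy one-dimensional chain (the dynamics and `p_success` are illustrative assumptions): passing the uniform draw in, rather than sampling internally, amounts to evaluating the inverse CDF of the transition distribution.

```python
def det_step(s, a, r, p_success=0.7):
    # The uniform draw r in [0, 1) is an explicit INPUT rather than being
    # sampled inside the simulator, so the same (s, a, r) always yields the
    # same next state (inverse CDF of the transition distribution at r).
    if r < p_success:
        return s + a   # intended outcome, probability p_success
    return s - a       # slip outcome, probability 1 - p_success

# Replaying the same random number reproduces the same transition:
assert det_step(0, 1, 0.42) == det_step(0, 1, 0.42)
```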
Pegasus procedure
We augment the state:
\begin{equation} (s, r_{1}, r_{2}, \dots) \in S \times [0,1] \times [0,1] \times \cdots \end{equation}
meaning every augmented state is an original state paired with a sequence of random numbers between \(0\) and \(1\):
\begin{equation} (s, 0.91, 0.22, \dots) \end{equation}
At every transition, we consume one random number from the sequence, take an action, and feed both into our deterministic simulative function to obtain the next state.
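Under these assumptions, one augmented-state transition might look like the sketch below (reusing `det_step` from above; the tuple layout is illustrative):

```python
def augmented_step(aug_state, a):
    # aug_state = (s, (r1, r2, ...)): the base state plus its fixed sequence
    # of uniform draws. Each transition consumes exactly one draw.
    s, draws = aug_state
    r, rest = draws[0], draws[1:]
    return (det_step(s, a, r), rest)   # deterministic simulative function

state = (0, (0.91, 0.22, 0.57))
state = augmented_step(state, 1)   # consumes 0.91
state = augmented_step(state, 1)   # consumes 0.22
```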
determinism
The idea is that, if we have sampled enough initial states, the policy which maximizes the deterministic \(\tilde{V}\) will also (approximately) maximize the real \(V\).
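Putting the pieces together, a hedged end-to-end sketch of the idea: fix \(m\) scenarios (an initial state plus a frozen random sequence each), roll each one out deterministically under a parameterized policy, and hand the resulting deterministic objective to a standard optimizer. Everything below (the chain dynamics, reward, policy form, horizon) is a toy assumption, not the paper's setup.

```python
import numpy as np
from scipy.optimize import minimize

H, GAMMA = 20, 0.95

def det_step(s, a, r, p_success=0.7):
    # Same deterministic simulative function as above.
    return s + a if r < p_success else s - a

def rollout_return(theta, scenario):
    # One deterministic rollout: the scenario fixes both the initial state
    # and every random draw, so the return depends only on theta.
    s, draws = scenario
    total = 0.0
    for t in range(H):
        a = 1 if theta[0] * s + theta[1] > 0 else -1   # memoryless policy
        s = det_step(s, a, draws[t])
        total += (GAMMA ** t) * (-abs(s - 10))          # toy reward: stay near 10
    return total

rng = np.random.default_rng(0)
scenarios = [(float(rng.integers(0, 5)), rng.uniform(0, 1, size=H))
             for _ in range(8)]                         # m = 8 frozen scenarios

objective = lambda th: -np.mean([rollout_return(th, sc) for sc in scenarios])
res = minimize(objective, x0=np.array([0.1, 0.0]), method="Nelder-Mead")
print(res.x, -res.fun)
```

Because the scenarios are frozen, repeated evaluations of the objective at the same \(\theta\) return the same number, so the optimizer never fights Monte Carlo noise; that is the entire point of the trick.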