We begin with a policy, parameterized however you'd like, whose weights are randomly initialized. Then,
- We sample a local set of parameters: one perturbation of \(\pm \alpha\) along each coordinate of the parameter vector (for instance, for a parameter vector in 2-space, the four neighbors up, down, left, and right in parameter space), and use each new parameter vector to define a policy.
- We check each policy's utility via Monte Carlo policy evaluation.
- If any of the adjacent points are better than the current one, we move to the best of them.
- If none of the adjacent points are better, we halve the step size, \(\alpha = 0.5 \alpha\), and try again.
We continue until \(\alpha\) drops below some \(\epsilon\).
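The steps above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the names `local_policy_search`, `monte_carlo_utility`, and `toy_rollout` are hypothetical, and the toy rollout is a deterministic stand-in for actually running a policy in an environment.

```python
from typing import Callable, List, Sequence


def monte_carlo_utility(
    rollout: Callable[[Sequence[float]], float], m: int = 20
) -> Callable[[Sequence[float]], float]:
    """Wrap a rollout function into a utility estimate: the mean return of m rollouts.

    `rollout(theta)` is assumed to run the policy with parameters theta once
    and return its (possibly noisy) total return.
    """
    def utility(theta: Sequence[float]) -> float:
        return sum(rollout(theta) for _ in range(m)) / m
    return utility


def local_policy_search(
    theta: Sequence[float],
    utility: Callable[[Sequence[float]], float],
    alpha: float = 1.0,
    epsilon: float = 1e-3,
) -> List[float]:
    """Step +/- alpha along each coordinate, move to the best improving
    neighbor, and halve alpha when no neighbor improves."""
    theta = list(theta)
    best_u = utility(theta)
    while alpha >= epsilon:
        best_neighbor, best_neighbor_u = None, best_u
        for i in range(len(theta)):
            for sign in (1.0, -1.0):
                candidate = theta.copy()
                candidate[i] += sign * alpha
                u = utility(candidate)
                if u > best_neighbor_u:
                    best_neighbor, best_neighbor_u = candidate, u
        if best_neighbor is not None:
            # Some adjacent point is better: move there.
            theta, best_u = best_neighbor, best_neighbor_u
        else:
            # No adjacent point is better: shrink the step size.
            alpha *= 0.5
    return theta


def toy_rollout(theta: Sequence[float]) -> float:
    # Hypothetical stand-in for a real rollout: return peaks at theta = (1, -2).
    return -(theta[0] - 1.0) ** 2 - (theta[1] + 2.0) ** 2


if __name__ == "__main__":
    utility = monte_carlo_utility(toy_rollout, m=5)
    print(local_policy_search([0.0, 0.0], utility))
```

With a genuinely stochastic rollout, the utility estimates are noisy, so the comparison against the stored `best_u` can be misled by a lucky estimate; increasing `m` trades extra rollouts for a more reliable comparison.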
Note: if we have billions of parameters, this method becomes impractical, because every iteration needs a rollout-based utility estimate for each perturbed policy, and there are two perturbed policies per parameter.
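To make the cost concrete, here is a rough count. The symbols are introduced only for illustration: \(n\) is the number of parameters, \(m\) the number of rollouts averaged per utility estimate, and \(k\) the number of iterations of the loop.

\[
\text{total rollouts} \approx 2n \cdot m \cdot k
\]

For example, with \(n = 10^9\) and even a single rollout per estimate (\(m = 1\)), one iteration alone already needs \(2 \times 10^9\) rollouts.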