Ingredients:
- \(\mathcal{P}\) problem (states, transitions, etc.)
- \(\pi\) a Rollout Policy
- \(d\) depth (how many steps ahead to simulate); a larger depth gives a more accurate estimate but is slower
Use the greedy policy at each state, with the Rollout procedure serving as the estimate of the value function at any given state.
Rollout
Rollout works by hallucinating (simulating) a trajectory and accumulating its discounted reward.
For some state, Rollout Policy, and depth…
- let ret be 0; for i in range(depth)
- take an action \(a\) from the current state \(s\) following the Rollout Policy
- sample a next state (weighted by the action you took, meaning an instantiation of \(s' \sim T(\cdot \mid s, a)\)) and observe the reward \(r = R(s, a)\) from the current state
- ret += gamma^i * r; then advance to the sampled next state (\(s \leftarrow s'\))
- return ret
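The steps above can be sketched in Python. The toy chain MDP here (five states, actions ±1, noisy moves, reward only from the last state) is an illustrative assumption, not part of the original notes:

```python
import random

GAMMA = 0.9

def transition(s, a):
    """Sample s' ~ T(. | s, a) for a toy 5-state chain: the chosen move
    succeeds with probability 0.8, otherwise the state stays put."""
    if random.random() < 0.8:
        return max(0, min(4, s + a))
    return s

def reward(s, a):
    """R(s, a): reward 1.0 for acting from state 4, else 0 (illustrative)."""
    return 1.0 if s == 4 else 0.0

def uniform_random_policy(s):
    """Default Rollout Policy: pick an action uniformly at random."""
    return random.choice([-1, +1])

def rollout(s, rollout_policy, depth, gamma=GAMMA):
    """Estimate the value of s by simulating `depth` steps under the
    Rollout Policy and accumulating the discounted reward."""
    ret = 0.0
    for i in range(depth):
        a = rollout_policy(s)        # take action following the Rollout Policy
        r = reward(s, a)             # reward from the current state
        ret += gamma ** i * r
        s = transition(s, a)         # advance to a sampled next state
    return ret
```

Because the trajectory is sampled, each call returns one noisy estimate; averaging several rollouts per state reduces the variance.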
Rollout Policy
A Rollout Policy is a default policy used for lookahead. Usually this policy should be designed with domain knowledge; if not, we just fall back to a uniform random policy over actions.
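Putting the pieces together, greedy action selection with Rollout estimates might look like the sketch below. The toy chain MDP, the action set, and the sample count are all illustrative assumptions:

```python
import random

ACTIONS = [-1, +1]   # illustrative action set

def transition(s, a):
    """Sample s' ~ T(. | s, a): the move succeeds with probability 0.8."""
    if random.random() < 0.8:
        return max(0, min(4, s + a))
    return s

def reward(s, a):
    """R(s, a): reward 1.0 for acting from state 4, else 0 (illustrative)."""
    return 1.0 if s == 4 else 0.0

def rollout(s, policy, depth, gamma=0.9):
    """Discounted return of one simulated trajectory under the Rollout Policy."""
    ret = 0.0
    for i in range(depth):
        a = policy(s)
        ret += gamma ** i * reward(s, a)
        s = transition(s, a)
    return ret

def random_policy(s):
    """Uniform random Rollout Policy (the no-domain-knowledge default)."""
    return random.choice(ACTIONS)

def greedy_action(s, policy, depth, n_samples=50, gamma=0.9):
    """Greedy one-step lookahead: pick the action maximizing immediate
    reward plus the discounted rollout estimate of the sampled next state,
    averaged over n_samples trajectories to reduce variance."""
    def q_estimate(a):
        total = 0.0
        for _ in range(n_samples):
            s_next = transition(s, a)
            total += reward(s, a) + gamma * rollout(s_next, policy, depth)
        return total / n_samples
    return max(ACTIONS, key=q_estimate)
```

Averaging over `n_samples` rollouts per action is one simple way to make the greedy choice robust to the noise in individual trajectories.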