Policy iteration allows us to find an optimal policy.
- start with some initial policy \(\pi\) (this scheme converges to an optimal policy regardless of where you start)
- solve for \(U^{\pi}\), the utility of the current policy (policy evaluation)
- create a new policy \(\pi'\) by acting greedily with respect to \(U^{\pi}\), i.e. the value-function policy for \(U^{\pi}\) (policy improvement)
- repeat steps 2-3 until the policy no longer changes (see the sketch below)
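A minimal sketch of this loop for a tabular MDP, assuming a transition model `T` (indexed as `T[a][s, s']`), a reward table `R[s, a]`, and a discount `gamma`, none of which are defined above:

```python
import numpy as np

def policy_evaluation(policy, T, R, gamma):
    """Step 2: solve the linear system U = R_pi + gamma * T_pi @ U for the current policy."""
    n_states = R.shape[0]
    T_pi = np.array([T[policy[s]][s] for s in range(n_states)])  # transition rows under pi
    R_pi = np.array([R[s, policy[s]] for s in range(n_states)])  # rewards under pi
    return np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)

def policy_improvement(U, T, R, gamma):
    """Step 3: act greedily with respect to U (the value-function policy)."""
    n_states, n_actions = R.shape
    Q = np.array([[R[s, a] + gamma * T[a][s] @ U for a in range(n_actions)]
                  for s in range(n_states)])
    return Q.argmax(axis=1)

def policy_iteration(T, R, gamma, init_policy=None):
    n_states, _ = R.shape
    policy = np.zeros(n_states, dtype=int) if init_policy is None else init_policy
    while True:
        U = policy_evaluation(policy, T, R, gamma)
        new_policy = policy_improvement(U, T, R, gamma)
        if np.array_equal(new_policy, policy):  # no change: policy is optimal
            return policy, U
        policy = new_policy
```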
Since there are only finitely many policies, this will eventually converge.
At each step, the utility of the resulting policy is greater than or equal to that of the previous one, since we greedily choose “better” (or equivalent) actions as measured by the utility of the previous policy.
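To make the greedy step concrete, using standard MDP notation (reward \(R\), transition model \(T\), discount \(\gamma\)), the improved policy is

\[
\pi'(s) = \arg\max_{a} \left[ R(s, a) + \gamma \sum_{s'} T(s' \mid s, a)\, U^{\pi}(s') \right],
\]

so by construction the one-step lookahead under \(\pi'\) is at least as large as following \(\pi\) in every state, which is what gives \(U^{\pi'} \ge U^{\pi}\).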