For problems with a large possible future state space, we can’t just iterate over all states to get a value function for every state, and THEN go about using the greedy policy to perform actions.
Therefore, we employ a technique called receding horizon planning: plan from the current state out to a maximum horizon (depth) \(d\), figure out what the best SINGLE next action would be given that information for only this state, take it, and then replan from the resulting state.
Here are the main methods of doing this (short code sketches of each follow the list):
- Rollout with Lookahead: for each possible next action, sample a random trajectory (weighted by the transition probabilities) using some rollout policy, and use the discounted future reward you got as that action’s utility
- Forward Search: for each possible next action, recursively search through each possible next action until you hit the required depth, calculate the reward accumulated along the way, back the values up until you know which sequence of actions may be best, and then return the first action in that sequence
- Branch and Bound: the same algorithm as Forward Search, but you prune branches whose theoretical upper bound on the Q-value cannot beat the best value found so far
- Sparse Sampling: the same core algorithm as Forward Search, but instead of computing the expected utility over all next states for each action, you sample a set of possible next states and average their future utilities
- Monte Carlo Tree Search: run many simulations from the current state, choosing which actions to try with a Monte Carlo exploration bonus (e.g. UCB), and update discounted value estimates as you go
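A minimal sketch of Rollout with Lookahead, assuming an MDP object that exposes `actions(s)`, `sample_transition(s, a)` returning a `(next_state, reward)` pair, and a discount `gamma`, plus a caller-supplied rollout policy; these names are placeholders, not from a specific library.

```python
def rollout(mdp, s, policy, depth):
    """Estimate the utility of s by following `policy` for `depth` steps."""
    ret, discount = 0.0, 1.0
    for _ in range(depth):
        a = policy(s)
        s, r = mdp.sample_transition(s, a)   # sampled next state and reward
        ret += discount * r
        discount *= mdp.gamma
    return ret

def rollout_with_lookahead(mdp, s, policy, depth):
    """Try each action once, then roll out with `policy`; pick the best action."""
    best_a, best_u = None, float("-inf")
    for a in mdp.actions(s):
        s2, r = mdp.sample_transition(s, a)
        u = r + mdp.gamma * rollout(mdp, s2, policy, depth - 1)
        if u > best_u:
            best_a, best_u = a, u
    return best_a
```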
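Forward Search, under the same assumed interface plus a `transition(s, a)` method that enumerates `(probability, next_state, reward)` triples:

```python
def forward_search(mdp, s, depth):
    """Exhaustively expand every action to `depth`, backing up expected values.

    Returns (best_action, utility) for state s.
    """
    if depth <= 0:
        return None, 0.0
    best_a, best_u = None, float("-inf")
    for a in mdp.actions(s):
        u = 0.0
        for p, s2, r in mdp.transition(s, a):
            _, u_next = forward_search(mdp, s2, depth - 1)
            u += p * (r + mdp.gamma * u_next)
        if u > best_u:
            best_a, best_u = a, u
    return best_a, best_u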
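Branch and Bound is the same recursion with pruning added; the bound functions `u_lo` (a lower bound on a state's value, used at the depth cutoff) and `q_hi` (an upper bound on the Q-value) are assumed to be supplied by the caller.

```python
def branch_and_bound(mdp, s, depth, u_lo, q_hi):
    """Forward search that skips actions whose optimistic bound q_hi(s, a)
    cannot beat the best value found so far at this node."""
    if depth <= 0:
        return None, u_lo(s)
    best_a, best_u = None, float("-inf")
    # Expand the most promising actions first so pruning kicks in early.
    for a in sorted(mdp.actions(s), key=lambda a: q_hi(s, a), reverse=True):
        if q_hi(s, a) < best_u:
            break                      # no remaining action can do better
        u = 0.0
        for p, s2, r in mdp.transition(s, a):
            _, u_next = branch_and_bound(mdp, s2, depth - 1, u_lo, q_hi)
            u += p * (r + mdp.gamma * u_next)
        if u > best_u:
            best_a, best_u = a, u
    return best_a, best_u
```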
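Sparse Sampling swaps the exact expectation for an average over `m` sampled successor states per action:

```python
def sparse_sampling(mdp, s, depth, m):
    """Estimate action values by averaging m sampled successors per action
    instead of enumerating the full transition model."""
    if depth <= 0:
        return None, 0.0
    best_a, best_u = None, float("-inf")
    for a in mdp.actions(s):
        u = 0.0
        for _ in range(m):
            s2, r = mdp.sample_transition(s, a)
            _, u_next = sparse_sampling(mdp, s2, depth - 1, m)
            u += (r + mdp.gamma * u_next) / m
        if u > best_u:
            best_a, best_u = a, u
    return best_a, best_u
```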
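A minimal UCT-style sketch of Monte Carlo Tree Search; the exploration constant, the uniform-random rollout policy, and the requirement that states be hashable are all assumptions made here, not part of the notes.

```python
import math
import random
from collections import defaultdict

class MCTS:
    """UCT-style Monte Carlo tree search over the assumed MDP interface."""

    def __init__(self, mdp, depth, c=1.4, n_simulations=1000):
        self.mdp, self.depth, self.c, self.n_sim = mdp, depth, c, n_simulations
        self.N = defaultdict(int)     # visit counts for (state, action)
        self.Q = defaultdict(float)   # running action-value estimates
        self.expanded = set()         # states already added to the tree

    def choose_action(self, s):
        for _ in range(self.n_sim):
            self.simulate(s, self.depth)
        return max(self.mdp.actions(s), key=lambda a: self.Q[(s, a)])

    def simulate(self, s, depth):
        if depth <= 0:
            return 0.0
        if s not in self.expanded:
            # First visit: expand this node and estimate its value by rollout.
            self.expanded.add(s)
            return self.rollout(s, depth)
        actions = self.mdp.actions(s)
        a = max(actions, key=lambda a: self.ucb(s, a, actions))
        s2, r = self.mdp.sample_transition(s, a)
        q = r + self.mdp.gamma * self.simulate(s2, depth - 1)
        self.N[(s, a)] += 1
        self.Q[(s, a)] += (q - self.Q[(s, a)]) / self.N[(s, a)]
        return q

    def ucb(self, s, a, actions):
        # Exploration bonus: untried actions get priority, then UCB1.
        if self.N[(s, a)] == 0:
            return float("inf")
        n_total = sum(self.N[(s, b)] for b in actions)
        return self.Q[(s, a)] + self.c * math.sqrt(math.log(n_total) / self.N[(s, a)])

    def rollout(self, s, depth):
        ret, discount = 0.0, 1.0
        for _ in range(depth):
            a = random.choice(self.mdp.actions(s))
            s, r = self.mdp.sample_transition(s, a)
            ret += discount * r
            discount *= self.mdp.gamma
        return ret
```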
Additional Information
open-loop planning vs. closed-loop planning
open-loop planning
Instead of using the methods above, which all require state information about the future, open-loop planning uses an exogenously chosen sequence of actions and simply tries to:
Maximize: \(U(a_1, \dots, a_n)\)
where the choice of actions does not change regardless of which states are eventually visited.
For high-dimensional systems, where closed-loop planning is hard to do, this can work better.
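A minimal open-loop sketch in the "random shooting" style: sample a number of fixed action sequences, score each by simulating it from the current state, and keep the best one. The candidate count and the MDP interface are assumptions carried over from the sketches above.

```python
import random

def open_loop_plan(mdp, s0, horizon, n_candidates=100):
    """Pick a fixed action sequence by sampling candidates and scoring each
    with a simulated discounted return from s0."""
    best_seq, best_u = None, float("-inf")
    for _ in range(n_candidates):
        # The sequence is chosen up front and never adapted to visited states.
        seq = [random.choice(mdp.actions(s0)) for _ in range(horizon)]
        s, u, discount = s0, 0.0, 1.0
        for a in seq:
            s, r = mdp.sample_transition(s, a)
            u += discount * r
            discount *= mdp.gamma
        if u > best_u:
            best_seq, best_u = seq, u
    return best_seq
```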