Same core algorithm as Forward Search, but instead of computing an action's value by enumerating all possible next states, you draw \(m\) samples of the next state and reward for each action from a generative model and average the resulting estimates.
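A minimal sketch of this idea, under the assumption of a generative model `step(state, action)` that returns a sampled `(next_state, reward)` pair; the names `step`, `ACTIONS`, `GAMMA`, and the parameters `m` and `depth` are illustrative choices, not taken from the original notes:

```python
import random

ACTIONS = [0, 1]   # hypothetical discrete action set
GAMMA = 0.95       # discount factor

def step(state, action):
    """Hypothetical generative model: sample a (next_state, reward) pair."""
    next_state = state + random.choice([-1, 1]) * (action + 1)
    reward = -abs(next_state)
    return next_state, reward

def sparse_sampling(state, depth, m):
    """Return (best_action, estimated_value) for `state`.

    Unlike Forward Search, which enumerates every possible next state,
    each action's value is estimated by averaging over only m sampled
    (next_state, reward) pairs drawn from the generative model.
    """
    if depth == 0:
        return None, 0.0
    best_action, best_value = None, float("-inf")
    for action in ACTIONS:
        total = 0.0
        for _ in range(m):
            next_state, reward = step(state, action)
            _, future_value = sparse_sampling(next_state, depth - 1, m)
            total += reward + GAMMA * future_value
        value = total / m
        if value > best_value:
            best_action, best_value = action, value
    return best_action, best_value

# Usage: plan from state 0 with horizon 3, using m = 5 samples per action.
action, value = sparse_sampling(0, depth=3, m=5)
```

Because each node expands only \(m\) samples per action rather than every reachable state, the cost of the search no longer depends on the size of the state space, at the price of a noisier value estimate.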