Houjun Liu

BetaZero

Background

recall AlphaZero

Selection (UCB 1, or DTW, etc.)
Expansion (generate possible belief notes)
Simulation (if its a brand new node, Rollout, etc.)
Backpropegation (backpropegate your values up)

Key Idea

Remove the need for heuristics for MCTS—removing inductive bias

Approach

We keep the ol’ neural network:

\begin{equation} f_{\theta}(b_{t}) = (p_{t}, v_{t}) \end{equation}

Policy Evaluation

Do \(n\) episodes of MCTS, then use cross entropy to improve \(f\)

Ground truth policy

Action Selection

Uses Double Progressive Widening

Importantly, no need to use a heuristic (or worst yet random Rollouts) for action selection.

Difference vs. LetsDrive

LetsDrive uses DESPOT
BetaZero uses MCTS with belief states.