“Can we come up with a policy that, if not fast, at least reaches the goal?”
Background
Stochastic Shortest-Path
We start at an initial state, have a set of goal states, and want to reach one of the goals.
We can solve this with any of the following (a value-iteration sketch follows the list):
- value iteration
- simulate trajectories and update only the states actually reached: RTDP, LRTDP
- MBP, a fast symbolic (BDD-based) non-deterministic planner
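
As a concrete anchor for the first bullet, here is a minimal value-iteration sketch for a stochastic shortest-path problem. All of the names (`states`, `actions`, `transition`, `cost`, `goals`) are illustrative assumptions, not notation from the talk:

```python
# Minimal value-iteration sketch for a stochastic shortest-path problem.
# Interfaces are assumed: actions(s) -> list of actions,
# transition(s, a) -> list of (next_state, prob), cost(s, a) -> float.

def value_iteration(states, actions, transition, cost, goals, eps=1e-6):
    V = {s: 0.0 for s in states}              # expected cost-to-go estimates
    while True:
        delta = 0.0
        for s in states:
            if s in goals:                    # absorbing goals cost nothing
                continue
            # Bellman backup: expected cost of the best action from s.
            best = min(
                cost(s, a) + sum(p * V[s2] for s2, p in transition(s, a))
                for a in actions(s)
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:                       # stop once values have converged
            return V
```

RTDP and LRTDP perform the same Bellman backup, but apply it only to states sampled along simulated trajectories from the initial state, so states that are never reached are never updated.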
Problem
MDP + Goal States (a data-structure sketch follows the list)
- \(S\): set of states
- \(A\): actions
- \(P(s'|s,a)\): transition probabilities
- \(C(s,a)\): cost of taking action \(a\) in state \(s\)
- \(G\): absorbing goal states
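
One minimal way to hold this tuple in code (a sketch; the field names and typing choices are my assumptions, chosen to mirror the definition above):

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple

State = int      # illustrative: states indexed by integers
Action = str     # illustrative: actions named by strings

@dataclass(frozen=True)
class SSP:
    """Goal-oriented MDP (stochastic shortest-path problem)."""
    states: FrozenSet[State]                                           # S
    actions: Callable[[State], List[Action]]                           # A(s)
    transition: Callable[[State, Action], List[Tuple[State, float]]]   # P(s'|s,a)
    cost: Callable[[State, Action], float]                             # C(s,a)
    goals: FrozenSet[State]                                            # G
    initial: State                                                     # s0
```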
Approach
Combining LRTDP with MBP for anytime behavior
- run GPT (not the transformer; here, the “General Planning Tool”, an exact solver based on LRTDP)
- use the GPT policy for states that are solved or visited more than a threshold number of times
- use the MBP policy for all other states
- evaluate the combined policy to check for convergence (a sketch follows the quote below)
“use the GPT solution as much as possible, and when we have never visited a state along the search trajectories, we can use MBP to supplement the solution”
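
Below is a sketch, under the description above, of how the combined policy lookup and its Monte-Carlo evaluation might look. The interfaces (`gpt.policy`, `gpt.solved`, `mbp.policy`, `visit_counts`, and the `SSP` fields from the earlier sketch) are assumptions for illustration, not the authors' actual code:

```python
import random

def hybrid_action(state, gpt, mbp, visit_counts, threshold=20):
    """Pick an action for `state` from the combined policy.

    gpt.policy / mbp.policy: dicts mapping state -> action (assumed interfaces).
    gpt.solved: set of states that LRTDP has labeled solved.
    visit_counts: how often LRTDP's trajectories visited each state.
    """
    # Trust the GPT (LRTDP) action where the state is solved or well explored.
    if state in gpt.solved or visit_counts.get(state, 0) >= threshold:
        return gpt.policy[state]
    # Otherwise fall back to MBP's goal-reaching (but not cost-optimal) action.
    return mbp.policy[state]

def evaluate_policy(ssp, policy_fn, episodes=1000, horizon=500):
    """Monte-Carlo evaluation: average simulated cost to reach a goal."""
    total = 0.0
    for _ in range(episodes):
        s, run_cost = ssp.initial, 0.0
        for _ in range(horizon):
            if s in ssp.goals:
                break
            a = policy_fn(s)
            run_cost += ssp.cost(s, a)
            # Sample the next state according to P(s'|s,a).
            successors, probs = zip(*ssp.transition(s, a))
            s = random.choices(successors, weights=probs)[0]
        total += run_cost
    return total / episodes
```

For example, the combined policy could be evaluated with `evaluate_policy(ssp, lambda s: hybrid_action(s, gpt, mbp, counts))`, and re-evaluated as LRTDP solves more states to monitor convergence.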