SU-CS238V JAN072025

Alignment Problem

Autonomous systems will do exactly what we tell them to do, so we need to give them good instructions. This is the Alignment Problem.

imperfect objective: the objective is underspecified
imperfect model: the understanding of the world is underspecified
imperfect optimization: the problem was not solved correctly

Validation Framework

High level structure:

validation_algorithm(system, spec)

system

environment: state of the world, with transition model \(T(s'|s,a)\)
sensor: observation model \(O(o|s)\)
agent: policy \(\pi\qty(a | o)\)

example: inverted pendulum

state: \(\qty (\theta, \omega)\) of the pendulum
observation: \(O(o|s) = \mathcal{N}\qty (o|s,\Sigma)\), Gaussian noise
policy: consider the following proportional controller policy

\begin{equation} \pi \qty(a | o) = \begin{cases} 1, & \text{if } a = -15 \theta - 8 \omega \\ 0, & \text{otherwise} \end{cases} \end{equation}

environment: the transition model \(T(s'|s,a)\) is given by the physics of the pendulum
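
The three components of this example can be sketched as a closed simulation loop. This is a minimal sketch under stated assumptions, not the course's implementation: the controller is read as a = -15θ - 8ω applied to the observed angle and angular velocity, the noise covariance is taken to be isotropic, and the dynamics (Euler step, θ = 0 upright, and the values of `dt`, `g`, `length`, `mass`) are illustrative choices that do not come from the notes.

```python
import math
import random


def policy(obs):
    """Deterministic proportional controller, read as a = -15*theta - 8*omega."""
    theta, omega = obs
    return -15.0 * theta - 8.0 * omega


def sensor(state, sigma, rng):
    """Sample o ~ N(s, sigma^2 I): independent Gaussian noise on each component (assumed isotropic)."""
    return tuple(x + rng.gauss(0.0, sigma) for x in state)


def transition(state, action, dt=0.05, g=9.81, length=1.0, mass=1.0):
    """One Euler step of an assumed pendulum model; theta = 0 is the upright position."""
    theta, omega = state
    alpha = (g / length) * math.sin(theta) + action / (mass * length ** 2)
    omega = omega + alpha * dt
    theta = theta + omega * dt
    return (theta, omega)


def rollout(initial_state, steps=100, sigma=0.0, seed=0):
    """Run environment -> sensor -> agent -> environment and return the visited states."""
    rng = random.Random(seed)
    state = initial_state
    trajectory = [state]
    for _ in range(steps):
        obs = sensor(state, sigma, rng)      # sensor: noisy view of the state
        action = policy(obs)                 # agent: torque chosen from the observation
        state = transition(state, action)    # environment: physics step
        trajectory.append(state)
    return trajectory
```

Calling `rollout((0.2, 0.0), sigma=0.05)` returns a list of `(theta, omega)` pairs that a validation algorithm could then check against a specification.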

specification \(\psi\)

Rules of the system, such as “do not let the pendulum tip over”. Specifications are usually given in a formal specification language such as Linear Temporal Logic or Signal Temporal Logic.
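
As an illustration of how such a rule might be made formal, the sketch below checks an "always" property over a finite trace of angles. The formalization is an assumption: "tipping over" is taken to mean that |θ| reaches a hypothetical limit of π/4, giving the temporal formula G(|θ| < π/4). The `robustness` function returns the worst-case margin, in the style of Signal Temporal Logic quantitative semantics for this one formula; it is not a general LTL or STL evaluator.

```python
import math


def always_upright(thetas, limit=math.pi / 4):
    """G(|theta| < limit): true iff every angle in the finite trace stays within the limit."""
    return all(abs(theta) < limit for theta in thetas)


def robustness(thetas, limit=math.pi / 4):
    """Worst-case margin of G(|theta| < limit); positive means satisfied, negative means violated."""
    return min(limit - abs(theta) for theta in thetas)
```

For example, `always_upright([0.0, 0.1, -0.2])` holds, while a trace containing an angle of 1.0 radians violates the specification and has negative robustness.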

validation algorithm

Given a system and a specification as input, a validation algorithm produces output in one of the following forms:

Failure Analysis

Falsification: search for failures of a particular system
Failure distribution: characterize the distribution over failures
Failure probability estimation: estimate the probability of failures
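
The three kinds of failure analysis above can each be illustrated with plain Monte Carlo sampling. This is a baseline sketch, not necessarily the method the course uses: `simulate` is a hypothetical toy system (a scalar state driven toward zero by a controller that sees noisy observations), and the failure limit of 0.5 is an arbitrary illustrative choice.

```python
import random


def simulate(seed, steps=50, noise=0.3, gain=0.5):
    """Hypothetical toy system: a scalar state regulated toward 0 from noisy observations."""
    rng = random.Random(seed)
    x = 0.0
    trace = [x]
    for _ in range(steps):
        obs = x + rng.gauss(0.0, noise)  # noisy sensor reading
        x = x - gain * obs               # controller acts on the observation
        trace.append(x)
    return trace


def first_failure_time(trace, limit=0.5):
    """Return the first time step at which |x| >= limit, or None if the trace is safe."""
    for t, x in enumerate(trace):
        if abs(x) >= limit:
            return t
    return None


def analyze(n=2000):
    """Sample n rollouts and summarize the failures found."""
    times = [first_failure_time(simulate(seed)) for seed in range(n)]
    # Falsification: any single seed whose rollout violates the specification.
    counterexample = next((s for s, t in enumerate(times) if t is not None), None)
    # Failure distribution: here summarized by the sampled times of first failure.
    failure_times = [t for t in times if t is not None]
    # Failure probability estimation: fraction of sampled rollouts that fail.
    p_fail = len(failure_times) / n
    return counterexample, failure_times, p_fail
```

Direct sampling like this becomes inefficient when failures are rare, since most rollouts carry no information about the failure region.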

Formal Methods

“reachability”

Linear System reachability
Nonlinear system reachability
Discrete system reachability
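
For the discrete case, reachability can be illustrated as a graph search: collect every state reachable from the initial states and check that no failure state is among them. The sketch below is a minimal breadth-first version under assumptions; the three-state transition table is a hypothetical toy, not a model from the course.

```python
from collections import deque


def reachable_states(initial_states, successors):
    """Forward reachability by breadth-first search over a finite transition graph."""
    seen = set(initial_states)
    frontier = deque(initial_states)
    while frontier:
        state = frontier.popleft()
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


# Hypothetical toy system: each state maps to its possible next states.
TRANSITIONS = {
    "upright": ["upright", "leaning"],
    "leaning": ["upright", "leaning"],
    "fallen": ["fallen"],
}

reach = reachable_states(["upright"], lambda s: TRANSITIONS[s])
is_safe = "fallen" not in reach  # the failure state is never reached from "upright"
```

Unlike sampling, this search covers every possible transition, so a safe result holds for all behaviors of the modeled system rather than only the sampled ones.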

Other stuff

Explanation
Runtime Assurances