PGA extends controller gradient ascent to cover CPOMDPs
Notation
Recall the objective from controller gradient ascent, which we will modify for CPOMDPs. For the initial controller-state distribution \(\beta\) and utility \(\bold{u}_{\theta} = (\bold{I} - \gamma \bold{T}_{\theta})^{-1} \bold{r}_{\theta}\):
\begin{equation} \max_{\theta}\ \beta^{\top} (\bold{I} - \gamma \bold{T}_{\theta})^{-1} \bold{r}_{\theta} \end{equation}
subject to:
- \(\Psi\) remains a probability distribution over the action space \(A\) for each node
- \(\eta\) remains a probability distribution over the node set \(X\) for each \((x, a, o)\)
- and, new for CPOMDPs, \(\beta^{\top} (\bold{I} - \gamma \bold{T}_{\theta})^{-1} C_{i} \leq \epsilon_{i}\ \forall i\); that is, the expected discounted cost of each constraint \(C_{i} \in \bold{C}\) stays within its budget \(\epsilon_{i}\).
where
\begin{equation} T_{\theta}((x,s), (x',s')) = \sum_{a} \Psi(a | x) T(s'|s,a) \sum_{o} O(o|a,s') \eta(x'|x,a,o) \end{equation}
and
\begin{equation} R_{\theta}((x, s)) = \sum_{a} \Psi(a|x) R(s,a) \end{equation}
in which
- \(X\): the set of controller nodes (hidden, internal states)
- \(\Psi(a|x)\): probability of taking action \(a\) while in node \(x\)
- \(\eta(x'|x,a,o)\): probability of transitioning from node \(x\) to node \(x'\) after taking action \(a\) and observing \(o\)
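As a concrete sketch of the two equations above, here is how \(\bold{T}_{\theta}\) and \(\bold{r}_{\theta}\) could be assembled with NumPy. The array names and index orderings are assumptions for illustration, not part of the original formulation:

```python
import numpy as np

def assemble_T_and_r(T, O, R, Psi, eta):
    """Assemble T_theta over the joint (node, state) space, plus r_theta.

    Assumed (illustrative) shapes:
      T[s, a, s2]      -- state transition  T(s'|s,a)
      O[a, s2, o]      -- observation model O(o|a,s')
      R[s, a]          -- reward            R(s,a)
      Psi[x, a]        -- action selection  Psi(a|x)
      eta[x, a, o, x2] -- node transition   eta(x'|x,a,o)
    """
    X, S = Psi.shape[0], T.shape[0]
    # T_theta((x,s),(x',s')) = sum_a Psi(a|x) T(s'|s,a) sum_o O(o|a,s') eta(x'|x,a,o)
    T_theta = np.einsum('xa,sat,ato,xaoy->xsyt', Psi, T, O, eta)
    T_theta = T_theta.reshape(X * S, X * S)
    # r_theta((x,s)) = sum_a Psi(a|x) R(s,a)
    r_theta = np.einsum('xa,sa->xs', Psi, R).reshape(X * S)
    return T_theta, r_theta
```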
Optimization Formulation
We formulate the policy parameters \(\theta\) as one large stacked vector of the shape:
\begin{equation} \theta = \mqty(\Psi(a_0 | x_0) \\ \Psi(a_1 | x_0) \\ \dots \\ \Psi(a_n | x_N) \\ \eta(x_0 | x_0, a_0, o_0) \\ \eta(x_1 | x_0, a_0, o_0) \\ \dots \\ \eta(x_N | x_N, a_n, o_f)) \end{equation}
Let us define block-diagonal matrices \(J_{\Psi}\) and \(J_{\eta}\), whereby:
\begin{equation} J_{\Psi} = \mqty(\bold{1}_{\Psi}^{\top} & \dots & \bold{0} \\ & \dots & \\ \bold{0} & \dots &\bold{1}_{\Psi}^{\top} ) \end{equation}
where \(J_{\Psi} \in \mathbb{R}^{|X| \times (|A| \cdot |X|)}\) and each block \(\bold{1}_{\Psi} \in \mathbb{R}^{|A|}\) is a ones-vector whose length is the size of the action space. You can see how multiplying this matrix against the vector \(\mqty(\Psi(a_0|x_0) \\ \dots \\ \Psi(a_n|x_N))\) should yield a vector of all ones if each \(\Psi(\cdot | x_{i})\) is a probability distribution.
Similarly, we define \(J_{\eta}\) in the same fashion to sum the distribution over each \(\eta(\cdot | x, a, o)\).
Stacking the two yields a larger block-diagonal matrix
\begin{equation} J = \mqty(J_{\Psi} & 0 \\ 0 & J_{\eta}) \end{equation}
for which we desire \(J\theta = 1\) in order to verify that the probability distributions are valid.
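A minimal sketch of constructing \(J\), assuming the stacking order of \(\theta\) shown above (the \(\Psi\) entries first, with \(a\) varying fastest within each node, then the \(\eta\) entries with \(x'\) varying fastest within each \((x,a,o)\) group); the helper name is hypothetical:

```python
import numpy as np
from scipy.linalg import block_diag

def build_J(num_X, num_A, num_O):
    """Build J such that J @ theta == 1 exactly when every Psi(.|x)
    and every eta(.|x,a,o) block of theta sums to one."""
    # J_Psi: one row per node x, each row a block of ones of length |A|
    J_psi = np.kron(np.eye(num_X), np.ones((1, num_A)))
    # J_eta: one row per (x, a, o) triple, each row a block of ones of length |X|
    J_eta = np.kron(np.eye(num_X * num_A * num_O), np.ones((1, num_X)))
    return block_diag(J_psi, J_eta)
```

Checking `np.allclose(build_J(num_X, num_A, num_O) @ theta, 1.0)` is then a quick validity test for a parameter vector.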
Lastly, let us define \(\bold{Z} = (\bold{I} - \gamma \bold{T}_{\theta})\). For ease of notation in constructing this result, we declare:
\begin{equation} f(\theta) = \beta^{\top} \bold{Z}^{-1} \bold{r}_{\theta} \end{equation}
and
\begin{equation} h_{i}(\theta) = \beta^{\top} \bold{Z}^{-1} C_{i} \end{equation}
Finally, this allows us to formulate the problem as a nonlinear optimization problem:
\begin{align} \max_{\theta}\ &f(\theta) \\ \text{such that}\ &J\theta = 1 \\ & \theta \geq 0 \\ & h_{i}(\theta) \leq \epsilon_{i},\ \forall i \end{align}
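For concreteness, evaluating \(f(\theta)\) and each \(h_{i}(\theta)\) only requires linear solves against \(\bold{Z}\); here is a sketch assuming NumPy and a list `C` of per-constraint cost vectors (assumed inputs, not a fixed API):

```python
import numpy as np

def evaluate_f_and_h(beta, T_theta, r_theta, C, gamma):
    """Evaluate f(theta) = beta^T Z^{-1} r_theta and h_i(theta) = beta^T Z^{-1} C_i."""
    Z = np.eye(T_theta.shape[0]) - gamma * T_theta
    u = np.linalg.solve(Z, r_theta)                      # Z^{-1} r_theta
    f = beta @ u
    h = np.array([beta @ np.linalg.solve(Z, C_i) for C_i in C])
    return f, h
```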
Gradient Ascent Procedure
Note that the initial state information \(\beta\) is constant. Therefore, using the chain rule together with the identity \(\pdv{\bold{Z}^{-1}}{\theta_{i}} = -\bold{Z}^{-1} \pdv{\bold{Z}}{\theta_{i}} \bold{Z}^{-1}\), the gradient of the objective with respect to each component of \(\theta\) becomes:
\begin{equation} \pdv{f(\theta)}{\theta_{i}} = \beta^{\top} \qty[\bold{Z}^{-1} \qty( \pdv{\bold{r}_{\theta}}{\theta_{i}} - \pdv{\bold{Z}}{\theta_{i}} \bold{Z}^{-1}\bold{r}_{\theta})] \end{equation}
The derivatives of \(\bold{r}_{\theta}\) and \(\bold{Z}\) with respect to each component of \(\theta\) are given on p. 485 of Alg4DM.
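As a sketch of how this gradient can be computed, assuming the per-parameter partials of \(\bold{r}_{\theta}\) and \(\bold{Z}\) are already available as arrays (the names `dr_dtheta` and `dZ_dtheta` are hypothetical):

```python
import numpy as np

def grad_f(beta, Z, r_theta, dr_dtheta, dZ_dtheta):
    """Gradient of f(theta) = beta^T Z^{-1} r_theta, following the formula above."""
    u = np.linalg.solve(Z, r_theta)    # Z^{-1} r_theta
    w = np.linalg.solve(Z.T, beta)     # Z^{-T} beta, reused for every component
    return np.array([w @ (dr_dtheta[i] - dZ_dtheta[i] @ u)
                     for i in range(len(dr_dtheta))])
```

Solving once for \(\bold{Z}^{-\top}\beta\) and once for \(\bold{Z}^{-1}\bold{r}_{\theta}\) avoids forming any explicit inverse.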
As with all gradient ascent cases, each “step” takes the rough form of
\begin{equation} \xi = \theta + \alpha \nabla_{\theta} f(\theta) \end{equation}
however, in this implementation, the step size isn’t actually fixed. Instead, we do…
Golden Section Line Search
Instead of taking fixed step sizes toward the maximum, PGA uses golden section line search to dynamically choose how far to step along the gradient direction.
Line search algorithms are typically computationally heavy because they require evaluating the utility (here \(\bold{Z}^{-1} \bold{r}_{\theta}\), which involves inverting a matrix) many times.
Golden section line search addresses this with a divide-and-conquer scheme based on the golden ratio: it shrinks the bracketing interval while introducing only one new candidate point (and hence one new inversion) per iteration.
import numpy as np

def optimize(CPOMDP, theta, gamma, phi=(1 + 5**0.5) / 2, eps=1e-4):
    # C-POMDP spec
    f = CPOMDP.objective_function    # this is f(theta)
    T = CPOMDP.transition_matrix     # T_theta, as a function of theta
    R = CPOMDP.reward_vector         # r_theta, as a function of theta
    b = CPOMDP.initial_state_vector  # beta
    I = np.eye(len(b))
    # gradient of f at theta, as obtained above
    nabla = f(theta).grad(theta)
    # initialize the search bounds by splitting the search
    # interval (no step, full step) via the golden ratio
    a1, a2, a3, a4 = 0, 1 - (1/phi), (1/phi), 1
    # calculate the two interior policies and their utilities
    theta2, theta3 = theta + a2*nabla, theta + a3*nabla
    z2, z3 = np.linalg.inv(I - gamma*T(theta2)), np.linalg.inv(I - gamma*T(theta3))
    u2, u3 = z2 @ R(theta2), z3 @ R(theta3)
    # shrink the bracket until the middle bounds have converged
    while (a4 - a1) > eps*(abs(a2) + abs(a3)):
        # either raise the lower bound or lower the upper bound, depending on
        # which interior point we had been successfully maximizing
        if b.dot(u3) >= b.dot(u2):
            # search "upwards", moving the bottom bound up to a2
            a1, a2, a3 = a2, a3, a2 + (1/phi)*(a4 - a2)
            theta3, theta2 = theta + a3*nabla, theta3
            z3, z2 = np.linalg.inv(I - gamma*T(theta3)), z3
        else:
            # search "downwards", moving the top bound down to a3
            a4, a3, a2 = a3, a2, a3 - (1/phi)*(a3 - a1)
            theta2, theta3 = theta + a2*nabla, theta2
            z2, z3 = np.linalg.inv(I - gamma*T(theta2)), z2
        # recalculate utility vectors over belief
        u2, u3 = z2 @ R(theta2), z3 @ R(theta3)
    # return the average of our converged results
    return 0.5*(theta2 + theta3), 0.5*(u2 + u3)
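Note how each iteration only constructs one new candidate policy and one new \(\bold{Z}^{-1}\): the surviving interior point carries its inverse over from the previous iteration, which is what keeps the line search affordable despite the repeated utility evaluations.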
Naive Projection
Once we obtain a new set of parameters \(\xi\) from Golden Section Line Search, we can’t actually directly punch it into \(\theta\). This is because it is likely not going to satisfy the constraints that are given.
We can fix this naively with a non-linear programming formulation: we find the closest \(\theta\) to the computed value \(\xi\) by minimizing the squared L2 norm (sum of squared errors):
\begin{align} \min_{\theta}\ & \frac{1}{2} \| \xi - \theta \|^{2}_{2} \\ \text{such that}\ &J\theta = 1 \\ & \theta \geq 0 \\ & h_{i}(\theta) \leq \epsilon_{i},\ \forall i \end{align}
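For illustration, a direct version of this projection could be handed to a generic NLP solver such as SciPy's SLSQP. This is a sketch under assumed inputs (`h_funcs` as a list of callables \(h_{i}\), `eps` as the budgets), not part of the original method's implementation:

```python
import numpy as np
from scipy.optimize import minimize

def project_naive(xi, J, h_funcs, eps):
    """Find the closest theta to xi satisfying all of the original constraints."""
    constraints = [{'type': 'eq', 'fun': lambda th: J @ th - 1.0}]
    for h_i, eps_i in zip(h_funcs, eps):
        # 'ineq' means fun(theta) >= 0, i.e. h_i(theta) <= eps_i
        constraints.append({'type': 'ineq',
                            'fun': lambda th, h_i=h_i, e=eps_i: e - h_i(th)})
    result = minimize(lambda th: 0.5 * np.sum((xi - th) ** 2),
                      x0=np.clip(xi, 0.0, None),
                      bounds=[(0.0, None)] * len(xi),   # theta >= 0
                      constraints=constraints,
                      method='SLSQP')
    return result.x
```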
This is, for the most part, computationally expensive and must be solved anew at every gradient step; re-evaluating the nonlinear \(h_{i}\) for all \(i\) inside the solver is especially bad. So, instead of doing this, we formulate an approximate proxy objective.
Approximate Projection
What makes the program above hard is that the \(h_{i}\) don't have nice convex properties. To fix this, we perform a local linearization of each \(h_{i}\).
Specifically, let’s replace \(h_{i}\) with its local Taylor expansion.
For some step where we started at \(\theta_{k}\), if we want to evaluate a candidate next step \(\theta_{k+1}\), we can write:
\begin{equation} h_{i}(\theta_{k+1}) \approx h_{i}(\theta_{k}) + (\nabla_{\theta} h_{i}(\theta_{k}))^{\top}(\theta_{k+1}-\theta_{k}) \end{equation}
Using this decomposition into three parts (the original value of \(h_{i}\), its gradient with respect to the parameters, and the parameter difference), we can now replace the nonlinear \(h_{i}(\theta)\) constraints of the program with linear ones, as shown below.
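Substituting the approximation into \(h_{i}(\theta) \leq \epsilon_{i}\) and moving the constant terms to the right-hand side gives a linear inequality in \(\theta\):
\begin{equation} (\nabla_{\theta} h_{i}(\theta_{k}))^{\top} \theta \leq \epsilon_{i} - h_{i}(\theta_{k}) + (\nabla_{\theta} h_{i}(\theta_{k}))^{\top} \theta_{k} \end{equation}
which is exactly the form encoded by the bottom blocks of \(\bold{A}\) and \(\bold{b}\) below.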
Let’s define:
\begin{equation} \nabla_{\theta} \bold{h}(\theta) = \mqty[ (\nabla_{\theta} h_{1}(\theta))^{\top} \\ \dots \\ (\nabla_{\theta} h_{m}(\theta))^{\top}] \end{equation}
from which we write the block matrix
\begin{equation} \bold{A} = \mqty[-\bold{I}_{n} \\ \nabla_{\theta}\bold{h}(\theta_{k})] \end{equation}
where \(\bold{I}_{n} \in \mathbb{R}^{n \times n}\) is the identity matrix whose dimension \(n\) matches the length of \(\theta\), and vector:
\begin{equation} \bold{b} = \mqty[\bold{0}_{n} \\ \epsilon - \bold{h}(\theta_{k}) + \nabla\bold{h}(\theta_{k})\theta_{k}] \end{equation}
These definitions allow us to rewrite two of our objectives:
\begin{equation} \begin{cases} \theta \geq 0 \\ h_{i}(\theta) \leq \epsilon_{i},\ \forall i \end{cases} \end{equation}
turning them instead into simply \(\bold{A}\theta \leq \bold{b}\). The top half of \(\bold{A}\) is responsible for making sure that all elements of \(\theta\) are non-negative (specifically, \(-\theta \leq \bold{0}\)); the bottom half ensures that the linearized costs stay within their budgets.
These definitions turn two of the constraints of our original non-linear program into linear ones:
\begin{align} \min_{\theta}\ & \frac{1}{2} \| \xi - \theta \|^{2}_{2} \\ \text{such that}\ &J\theta = 1 \\ & \bold{A}\theta \leq \bold{b} \end{align}
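Since the objective is quadratic and every remaining constraint is linear, this is now a standard quadratic program. A sketch of solving it with SciPy's `trust-constr` solver (any QP solver would do; the inputs are assumed to come from the definitions above):

```python
import numpy as np
from scipy.optimize import minimize, LinearConstraint

def project_approx(xi, J, A, b):
    """Project xi onto { theta : J theta = 1, A theta <= b }."""
    constraints = [
        LinearConstraint(J, lb=1.0, ub=1.0),    # J theta = 1
        LinearConstraint(A, lb=-np.inf, ub=b),  # A theta <= b (covers theta >= 0)
    ]
    result = minimize(lambda th: 0.5 * np.sum((xi - th) ** 2),
                      x0=np.clip(xi, 0.0, None),
                      jac=lambda th: th - xi,
                      constraints=constraints,
                      method='trust-constr')
    return result.x
```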
and we are done.
Quick Tip
Recall that we have to apply the inverse of \(\bold{Z}\) many times throughout the computation of \(h\) and \(f\). For each setting of the policy parameters \(\theta\), you can cache an L-U (lower-triangular/upper-triangular) factorization of \(\bold{Z}\) and reuse the factors for every solve instead of inverting repeatedly. This ensures that you only factor \(\bold{Z}\) once per step.
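A sketch of this caching with SciPy's LU routines; note that a single transposed solve for \(\bold{Z}^{-\top}\beta\) then covers \(f\) and every \(h_{i}\) at once (the list `C` of constraint cost vectors is an assumed input):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def factor_Z(T_theta, gamma):
    """Factor Z = I - gamma * T_theta once for the current theta."""
    Z = np.eye(T_theta.shape[0]) - gamma * T_theta
    return lu_factor(Z)                          # cached (lu, piv) pair

def f_and_h_from_cache(beta, lu_piv, r_theta, C):
    """Evaluate f(theta) and every h_i(theta) from the cached LU factors."""
    w = lu_solve(lu_piv, beta, trans=1)          # w = Z^{-T} beta
    f = w @ r_theta                              # beta^T Z^{-1} r_theta
    h = np.array([w @ C_i for C_i in C])         # beta^T Z^{-1} C_i for each i
    return f, h
```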