Stochastic Methods
Stochastic methods use randomness strategically to escape local minima. They typically rely on pseudo-random number generation.
Noisy Descent
Gradient descent, but with a bit of noise injected into each step:
\begin{equation} \bold{x} = \bold{x} - \alpha \nabla f(\bold{x}) + \epsilon \end{equation}
where:
\begin{equation} \epsilon \sim \mathcal{N}(0, \lambda I) \end{equation}
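A minimal one-dimensional sketch of noisy descent, assuming a decaying noise scale so the iterates eventually settle (the objective, step size, and decay schedule below are illustrative choices, not from the text):

```python
import random

def noisy_descent(grad, x, alpha=0.1, lam=0.5, steps=200, decay=0.99):
    """Gradient descent with decaying Gaussian noise added to each step."""
    sigma = lam
    for _ in range(steps):
        # x <- x - alpha * grad(x) + eps,  eps ~ N(0, sigma^2)
        x = x - alpha * grad(x) + random.gauss(0.0, sigma)
        sigma *= decay  # shrink the noise so the iterates can converge
    return x

# Made-up objective: f(x) = x^2, so grad f(x) = 2x; minimum at 0.
random.seed(0)
x_star = noisy_descent(lambda x: 2 * x, x=5.0)
```

The early noise lets the iterate hop out of shallow basins; the decay is what keeps it from wandering forever.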
Mesh Adaptive Direct Search
This is like Generalized Pattern Search, but instead of a fixed positive spanning set, we randomize the directions of the spanning vectors at every single step; after randomizing, we expand or shrink the step size as needed.
Simulated Annealing
Simulated annealing perturbs the current point randomly and accepts worse points with a probability that shrinks as a “temperature” parameter cools, which lets the search escape local minima early on.
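A minimal sketch of simulated annealing, assuming a Gaussian proposal, a \(1/k\) cooling schedule, and the standard Metropolis acceptance rule \(e^{-\Delta/t}\) (the multimodal objective below is made up for illustration):

```python
import math
import random

def simulated_annealing(f, x, steps=2000, t0=1.0, sigma=0.5):
    """Minimize f by random perturbation, sometimes accepting worse points."""
    best = x
    for k in range(1, steps + 1):
        t = t0 / k                        # assumed cooling schedule: t = t0/k
        x_new = x + random.gauss(0.0, sigma)
        delta = f(x_new) - f(x)
        # always accept improvements; accept worse moves with prob e^{-delta/t}
        if delta <= 0 or random.random() < math.exp(-delta / t):
            x = x_new
        if f(x) < f(best):
            best = x
    return best

random.seed(1)
# Made-up multimodal objective: quadratic bowl with sinusoidal ripples.
x_star = simulated_annealing(lambda x: (x - 2) ** 2 + math.sin(5 * x), x=-3.0)
```

As the temperature drops the method behaves more and more like greedy local search.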
Cross Entropy Method
We can also use the Cross Entropy Method: choose an initial sampling distribution and draw samples from it; refit the distribution to the \(k\) best points, then sample again.
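A minimal sketch of the Cross Entropy Method with a one-dimensional Gaussian sampling distribution (sample count, elite size, iteration count, and the objective are assumptions for illustration):

```python
import random

def cross_entropy_method(f, mu, sigma, n_samples=50, n_elite=10, iters=30):
    """Refit a 1-D Gaussian sampling distribution to the k best samples."""
    for _ in range(iters):
        samples = [random.gauss(mu, sigma) for _ in range(n_samples)]
        elite = sorted(samples, key=f)[:n_elite]   # the k best points
        mu = sum(elite) / n_elite                  # refit the mean...
        var = sum((x - mu) ** 2 for x in elite) / n_elite
        sigma = max(var ** 0.5, 1e-6)              # ...and the spread
    return mu

random.seed(2)
# Made-up objective: quadratic with minimum at x = 3.
mu_star = cross_entropy_method(lambda x: (x - 3) ** 2, mu=0.0, sigma=5.0)
```

The distribution contracts around good regions; the floor on `sigma` just avoids degenerate refits.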
Natural Evolution
Natural Evolution is the Cross Entropy Method but more nuanced. Consider a design-point sampling distribution parameterized by \(\theta\); our objective is:
\begin{equation} \min_{\theta} \mathbb{E}_{x \sim p(\cdot | \theta)} [f(x)] \end{equation}
(“minimize the expected value of the objective over points sampled from that sampling distribution”)
To perform this minimization, we use gradient descent. Writing this out:
\begin{align} \nabla_{\theta}\mathbb{E} [f(x)] &=\nabla_{\theta} \int p(x|\theta) f(x) \dd{x} \\ &= \int \frac{p(x|\theta)}{p(x|\theta)} \nabla_{\theta} p(x|\theta) f(x) \dd{x} \\ &= \int p(x|\theta) \qty(\frac{1}{p(x|\theta)}\nabla_{\theta} p(x|\theta)) f(x) \dd{x} \\ &= \int p(x|\theta) \qty(\nabla_{\theta} \log p(x|\theta)) f(x) \dd{x} \\ &= \mathbb{E}_{x \sim p(\cdot | \theta)} \qty[f(x) \nabla_{\theta}\log p(x|\theta)] \\ &\approx \frac{1}{m} \sum_{i=1}^{m} f\qty(x^{(i)}) \nabla_{\theta} \log p\qty(x^{(i)}|\theta) \end{align}
where we used the “log derivative trick” (i.e. the chain rule with the log)
Notice: this works even for non-differentiable \(f\)!
Then we just do gradient descent on \(\theta\) with this estimate, instead of refitting to elite points like in the Cross Entropy Method.
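As a concrete sketch, take the sampling distribution to be \(\mathcal{N}(\theta, \sigma^2)\) with fixed \(\sigma\), so \(\nabla_{\theta} \log p(x|\theta) = (x - \theta)/\sigma^2\). The sample size, step size, and the (non-differentiable) objective below are illustrative assumptions:

```python
import random

def nes_step(f, theta, sigma=1.0, m=100, alpha=0.05):
    """One natural-evolution gradient step for a Gaussian N(theta, sigma^2).

    For this family, grad_theta log p(x|theta) = (x - theta) / sigma^2,
    so the Monte Carlo gradient estimate needs only evaluations of f.
    """
    xs = [random.gauss(theta, sigma) for _ in range(m)]
    grad = sum(f(x) * (x - theta) / sigma**2 for x in xs) / m
    return theta - alpha * grad   # descend the estimated gradient

random.seed(3)
theta = 0.0
# Made-up objective: abs(x - 2) is non-differentiable at its minimum x = 2,
# yet the log-derivative-trick gradient estimate still works.
for _ in range(300):
    theta = nes_step(lambda x: abs(x - 2), theta)
```

In practice one usually subtracts a baseline from \(f(x)\) to reduce the variance of this estimator; the bare version above is the one derived in the equations.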
Population Methods
Population methods optimize a cohort of design points instead of a single design point.
Typically, we first use a population method to find a generally good point, then fine-tune it via iterated descent; two main ideas:
- Lamarckian Learning: run a population method, then perform regular descent on each individual (the descended point replaces the individual)
- Baldwinian Learning: run the descent on each individual, but only use the resulting value to weight its fitness, without replacing the individual (this preserves diversity)
Genetic Algorithms
- every design is represented by a sequence of chromosomes (sometimes a binary representation is used, but real-valued vectors work well too)
- select the fittest individuals
- truncation selection: keep only the top performers, cutting off the rest
- tournament selection: select fittest out of \(k\) randomly chosen individuals
- roulette wheel selection: sample individuals with probability proportional to their fitness
- forming children (from the fittest)
- single-point crossover: randomly select a point within the chromosome sequence, then generate a child by taking the chunk before that point from one parent and the chunk after it from the other
- two-point crossover: select two points and swap only the chunk between them
- uniform crossover: choose each element from either parent with 50% probability
- population mutation
- randomly flip each bit/reinitialize some values
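Putting the steps above together, a hedged sketch of a genetic algorithm on real-valued chromosomes, using truncation selection, single-point crossover, and Gaussian mutation (population size, mutation rate, and the objective are made-up choices):

```python
import random

def genetic_algorithm(f, n_pop=40, n_dim=3, n_elite=10, iters=60, p_mut=0.1):
    """Truncation selection + single-point crossover + Gaussian mutation."""
    pop = [[random.uniform(-5, 5) for _ in range(n_dim)] for _ in range(n_pop)]
    for _ in range(iters):
        # truncation selection: keep only the fittest individuals as parents
        parents = sorted(pop, key=f)[:n_elite]
        children = []
        while len(children) < n_pop:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n_dim)       # single-point crossover
            child = a[:cut] + b[cut:]
            for i in range(n_dim):                 # mutation: perturb values
                if random.random() < p_mut:
                    child[i] += random.gauss(0.0, 0.5)
            children.append(child)
        pop = children
    return min(pop, key=f)

random.seed(4)
# Made-up objective: sum of squares, minimum at the origin.
best = genetic_algorithm(lambda v: sum(x * x for x in v))
```

Swapping in tournament or roulette wheel selection only changes how `parents` is chosen.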
Particle Swarm Optimization
for each individual in the population, we keep track of:
- current position
- current velocity
- best position so far by the current particle
- best position seen by all particles
\begin{equation} v^{(i)} = w v^{(i)} + c_1 r_1 \qty(x^{(i)}_{\text{your best}} - x^{(i)}) + c_2 r_2 \qty(x_{\text{global best}} - x^{(i)}) \end{equation}
\begin{equation} x^{(i)} = x^{(i)} + v^{(i)} \end{equation}
where \(w\) is an inertia weight, \(c_1\) and \(c_2\) control the pull toward the personal and global bests, and \(r_1, r_2 \sim U(0, 1)\) are drawn fresh at each update.
basically, this is momentum where each particle moves both toward the best point it has seen itself and a little toward the best point the entire swarm has seen
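A one-dimensional sketch of the update rules above (the inertia weight, pull coefficients, swarm size, and objective are illustrative assumptions):

```python
import random

def particle_swarm(f, n=20, iters=100, w=0.7, c1=1.5, c2=1.5):
    """1-D particle swarm: momentum plus pulls toward personal/global bests."""
    xs = [random.uniform(-10, 10) for _ in range(n)]   # current positions
    vs = [0.0] * n                                     # current velocities
    bests = list(xs)                   # best position so far, per particle
    g = min(bests, key=f)              # best position seen by all particles
    for _ in range(iters):
        for i in range(n):
            r1, r2 = random.random(), random.random()
            vs[i] = (w * vs[i]
                     + c1 * r1 * (bests[i] - xs[i])    # pull to personal best
                     + c2 * r2 * (g - xs[i]))          # pull to global best
            xs[i] += vs[i]
            if f(xs[i]) < f(bests[i]):
                bests[i] = xs[i]
                if f(xs[i]) < f(g):
                    g = xs[i]
    return g

random.seed(5)
# Made-up objective: quadratic with minimum at x = 1.
g_star = particle_swarm(lambda x: (x - 1) ** 2)
```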