Controllable
We want \(P(Y|X) = p\) for a target probability \(p\) of our choosing.
Fine-Grained Control
Ideally, instead of optimizing aggregate expected values, we want to tune specific outputs.
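To make this concrete, here is a minimal sketch of controlling one output probability as a loss term, assuming a PyTorch-style model; `logits` and `target_idx` are illustrative names, not part of any specific editing method:

```python
import torch
import torch.nn.functional as F

def probability_target_loss(logits: torch.Tensor, target_idx: int, p: float) -> torch.Tensor:
    """Penalize the gap between the model's P(Y|X) and a desired probability p.

    logits: the model's output scores over labels/tokens for a single input X.
    target_idx: the index of the output Y whose probability we want to control.
    """
    p_model = F.softmax(logits, dim=-1)[target_idx]
    # Squared error in probability space; a KL penalty would also work.
    return (p_model - p) ** 2
```

Minimizing this for a single \((X, Y)\) pair is the fine-grained version of control: it targets one output's probability rather than an expectation over the whole distribution.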
Success in Editing
Say we edit some model \(M\), specifying that a viper is a vertebrate.
Ideally, this should also update other related information (a checking sketch follows these lists):
- \(P\) (paraphrases): vipers are vertebrates
- \(E\) (logical entailments): a viper has a brain
And we shouldn’t touch:
- \(R\) (random, unrelated data): Chile is a country
- \(LN\) (local neutral data): a viper is venomous
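A minimal sketch of checking an edit against these four sets; `pre_prob` and `post_prob` are hypothetical interfaces that return the model's probability of judging a statement true before and after the edit:

```python
def evaluate_edit(pre_prob, post_prob, threshold=0.5, tol=0.05):
    """Score the edit "a viper is a vertebrate" against the four sets above."""
    # P and E: the edit should propagate, so the edited model should now agree.
    propagate = {
        "P": "vipers are vertebrates",  # paraphrase
        "E": "a viper has a brain",     # entailment
    }
    # R and LN: the edit should leave these beliefs unchanged.
    preserve = {
        "R": "Chile is a country",      # unrelated fact
        "LN": "a viper is venomous",    # local neutral fact
    }
    results = {k: post_prob(s) > threshold for k, s in propagate.items()}
    results.update(
        {k: abs(post_prob(s) - pre_prob(s)) < tol for k, s in preserve.items()}
    )
    return results
```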
Hypernetwork Weight Editing’s Drawbacks
- it is harder to fix errors than to create them
- it is harder to retain performance on local (neutral) data than on random data
- it is harder to generalize to entailed data than to paraphrases
- updates do improve consistency
Information Deletion
- “deleting information” from LLMs is ill-defined
- RLHF, SFT, etc. hide information rather than delete it
- this can be framed as model editing
High-Level Approach
- notice the threat information
- attempt to “delete” it
- evaluate the deletion
- try to extract the threat information again
- loop
We formalize this by saying that, for an adversarially targeted answer \(A\) to a question \(Q\), we want no element of the candidate output set \(C\), of size \(B\), to contain \(A\).
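A sketch of the whole loop, including that success criterion; `delete_fn` and `attack_fn` are hypothetical stand-ins for a concrete editing method and extraction attack:

```python
def deletion_succeeded(candidates, answer):
    """The criterion above: the answer A appears in none of the B candidates C."""
    return all(answer not in c for c in candidates)

def delete_and_verify(model, question, answer, delete_fn, attack_fn,
                      max_rounds=5, b=10):
    """Notice -> delete -> evaluate -> re-attack, looping until deletion holds."""
    for _ in range(max_rounds):
        model = delete_fn(model, question, answer)   # attempt to "delete" A
        candidates = attack_fn(model, question, b)   # try to extract A again (B candidates)
        if deletion_succeeded(candidates, answer):   # evaluate the deletion
            return model, True
    return model, False
```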
Formal guarantees don't hold up well in the world of LLMs.
Ideally, we balance attack success against the damage done to the rest of the model.
Supervision Gap Recovered
Measures the ratio between the success rate on “easy” supervision data and the success rate on “hard” supervision data.
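Read literally, this is a ratio of success rates; one possible formalization (the exact normalization is an assumption, not given in these notes):

\[
\text{SGR} = \frac{\text{success rate with ``easy'' supervision}}{\text{success rate with ``hard'' supervision}}
\]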