Big Idea
Motivation: Turing Test
Point of the Turing test: we can use language as a window into the underlying thought and cognition. However, language IS NOT thought.
Language is not thought
Good at language != good at thought => think of polished public speakers
Bad at language != bad at thought => where do language models fall?
LLM evals should separate language and thought (a minimal sketch follows this list):
- formal vs. functional linguistic competence (being good at speaking vs. using speech to do useful things)
- generalized world knowledge
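One way to make the separation concrete: score minimal pairs with a small causal LM and report formal vs. functional accuracy separately. Below is a minimal sketch using GPT-2 via Hugging Face transformers; the example items are illustrative placeholders, not stimuli from the paper.

```python
# Minimal-pair probe: does the model prefer the acceptable sentence?
# Sketch only; items are illustrative, not from the paper's benchmarks.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(text: str) -> float:
    """Sum of token log-probabilities of `text` under the model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Each position predicts the next token, so shift targets by one.
    logprobs = logits[:, :-1].log_softmax(-1)
    targets = ids[:, 1:]
    return logprobs.gather(2, targets.unsqueeze(-1)).sum().item()

# Formal competence: grammatical vs. ungrammatical (syntax only).
formal_pairs = [("The keys to the cabinet are on the table.",
                 "The keys to the cabinet is on the table.")]
# Functional competence: correct vs. incorrect reasoning/world state.
functional_pairs = [("Two plus two is four.",
                     "Two plus two is five.")]

for name, pairs in [("formal", formal_pairs), ("functional", functional_pairs)]:
    acc = sum(sentence_logprob(good) > sentence_logprob(bad)
              for good, bad in pairs) / len(pairs)
    print(f"{name} accuracy: {acc:.2f}")
```

Reporting the two accuracies separately is the point: a single blended score is exactly the language/thought conflation the notes warn about.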
Detour and motivation: cognitive science
- the brain's language network is localized to specific regions, and switching which language you use doesn't change which regions activate
- the language network shows little/no response during cognitively demanding non-linguistic tasks like math
Key example: aphasics can still think. So language and thought live in separate brain regions.
Formal and Functional Competence
Mahowald, Ivanova, et al.
Can we separately probe the parts of the network that handle core language skills (syntax and formal grammar) vs. "functional" skills (semantics and mathematical reasoning), and how do LLMs perform in each area?
Formal Competence
Unsurprisingly for us, but surprisingly for linguists: GPT is really good at grammar.
Functional Competence
LLMs can memorize answers seen in training, but they perform poorly on out-of-sample math/reasoning cases (a quick probe sketch below).
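One hedged way to probe memorization vs. generalization: compare accuracy on sums whose operands are small enough to appear verbatim in web text against sums with large, almost certainly unseen operands. `ask_model` is a hypothetical hook, not a real API; wire it to whatever LLM you are evaluating.

```python
# Probe: does arithmetic accuracy survive out-of-distribution operands?
# Sketch; `ask_model` is a hypothetical callable, replaced here by a dummy.
import random

def accuracy(ask_model, digits: int, n: int = 50) -> float:
    """Fraction of n random `digits`-digit additions answered correctly."""
    correct = 0
    for _ in range(n):
        a = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        b = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        reply = ask_model(f"What is {a} + {b}? Reply with the number only.")
        correct += reply.strip() == str(a + b)
    return correct / n

# Trivial stand-in "model" that always answers 42, so the sketch runs:
demo_model = lambda prompt: "42"

# 2-digit sums appear verbatim all over web text (memorizable); 8-digit
# sums almost certainly don't. A large accuracy gap suggests recall
# rather than arithmetic.
for d in (2, 8):
    print(f"{d}-digit accuracy: {accuracy(demo_model, d):.2f}")
```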
Generalized World Knowledge
Two types of knowledge
- Factual: Paris is the capital of France; Birds lay eggs
- Distributional: the sky is {blue, black, pink}; the first two are far more likely than the third
LLMs are very good at the factual side, which is again unsurprising. But for distributional knowledge: LLM embeddings do place similar colours close together and similar animals close together, yet LLMs suffer from reporting bias: we talk less about obvious things, and LLMs are trained only on what we talk about (probed in the sketch below).
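A sketch of the distributional probe: read GPT-2's next-token distribution after "The sky is" and compare the probabilities of the three colours from the list above. If reporting bias is at work, true-but-rarely-stated completions (e.g., "black" at night) should be underweighted relative to experience.

```python
# Distributional knowledge probe: P(colour | "The sky is") under GPT-2.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tok("The sky is", return_tensors="pt").input_ids
with torch.no_grad():
    next_probs = model(ids).logits[0, -1].softmax(-1)

for colour in [" blue", " black", " pink"]:
    # Take the first BPE token; with the leading space these common
    # colour words tokenize to a single GPT-2 token.
    tid = tok(colour)["input_ids"][0]
    print(colour.strip(), f"{next_probs[tid].item():.4f}")
```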
Question 1: does generalized world knowledge require language?
"The fox is chasing a planet." There is a logical failure here, even though the sentence is grammatical.
However, when people are shown such semantically incompatible events, the language network activates, but only weakly. Running the same test on aphasics still showed no difference in performance.
SO: Language network is recruited but not required for event semantics
Event knowledge evaluations:
- When the violation is essentially formal, LLMs do very well: "The laptop ate the teacher" (inanimate objects cannot eat, so word co-occurrence alone flags it)
- When the judgment is perceptual/world knowledge, LLMs do poorly: "The fox chased the rabbit" vs. "The rabbit chased the fox" (both are formally fine; only world knowledge about foxes and rabbits says which event is plausible). See the sketch after this list.
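The same log-probability trick from the earlier sketch can score both cases. Again a minimal sketch with GPT-2; the minimal pairs are illustrative reconstructions of the two bullets above, not the paper's stimuli.

```python
# Event-plausibility minimal pairs: compare sentence log-probs under GPT-2.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def logprob(text: str) -> float:
    """Sum of token log-probabilities of `text` under the model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        lp = model(ids).logits[:, :-1].log_softmax(-1)
    return lp.gather(2, ids[:, 1:].unsqueeze(-1)).sum().item()

pairs = [
    # Animacy violation: detectable from word co-occurrence alone.
    ("The teacher ate the sandwich.", "The laptop ate the teacher."),
    # Plausibility reversal: both sentences are formally fine; only
    # world knowledge says which event is likely.
    ("The fox chased the rabbit.", "The rabbit chased the fox."),
]
for plausible, implausible in pairs:
    print(plausible, "preferred:", logprob(plausible) > logprob(implausible))
```

The notes predict the first pair is easy and the second is hard; the point of the probe is that one aggregate score would hide that split.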
Question: what happens if you complete the recruitment of counterfactuals in-context, i.e., supply the counterfactual setup directly in the prompt?