Do LLMs learn foundational concepts required to build world models? (less than expected)
We address this question with EWoK (Elements of World Knowledge), a flexible cognition-inspired framework for testing knowledge across physical and social domains
We leveraged the cognitive science literature to select concepts from 11 domains of knowledge, from social interactions (HELP, HINDER, CHASE, EVADE) to spatial relations (LEFT, RIGHT, CLOSE, FAR)
We then designed *templates* probing knowledge of these concepts. Each template has 2 contexts (C1, C2) and 2 targets (T1, T2), such that T1 is plausible given C1 but not C2, and vice versa
This design rules out simply picking whichever option is more common in the training data
A template can be completed with different fillers, generating many unique *items*
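Roughly, a template plus fillers looks like this (an illustrative sketch, not our actual code; the template text, field names, and fillers below are made up):

```python
from itertools import product

# Illustrative sketch only: template text, field names, and fillers are made up,
# not the actual EWoK templates.
template = {
    "context1": "{agent1} is chasing {agent2}.",
    "context2": "{agent2} is chasing {agent1}.",
    "target1": "{agent1} wants to catch {agent2}.",
    "target2": "{agent2} wants to catch {agent1}.",
}
fillers = {"agent1": ["Maria", "Ali"], "agent2": ["Tom", "Noor"]}

# Each filler combination yields a unique item from the same template.
items = [
    {field: text.format(agent1=a1, agent2=a2) for field, text in template.items()}
    for a1, a2 in product(fillers["agent1"], fillers["agent2"])
]
print(len(items))  # 4 items from one template
```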
Using the EWoK framework, we generated the EWoK-core-1.0 dataset comprising 4,374 items. We then tested 20 open-weights LLMs (1.3B-70B params) and compared them with humans (12,480 measurements)
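For intuition, here is one way to score a minimal-pair item with off-the-shelf log-probabilities (a hedged sketch using `transformers`; the model choice and scoring details are illustrative, not necessarily our exact evaluation setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hedged sketch: score a minimal-pair item by summed target log-probability.
# Model choice ("gpt2") and scoring details are illustrative only.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def target_logprob(context: str, target: str) -> float:
    """Sum of log-probs of the target tokens given the context (approximate:
    assumes the context tokenizes identically alone and as a prefix)."""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    ids = tok(context + " " + target, return_tensors="pt").input_ids
    with torch.no_grad():
        logps = torch.log_softmax(model(ids).logits, dim=-1)
    # Logits at position i predict token i+1, so shift the slice by one.
    tgt = ids[0, ctx_len:]
    return logps[0, ctx_len - 1:-1].gather(-1, tgt.unsqueeze(-1)).sum().item()

def item_correct(item: dict) -> bool:
    """Count an item as correct if each target is more likely under its matching context."""
    return (target_logprob(item["context1"], item["target1"]) > target_logprob(item["context2"], item["target1"])
            and target_logprob(item["context2"], item["target2"]) > target_logprob(item["context1"], item["target2"]))
```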
Our easiest domain was social interactions (HELP, HINDER, CHASE, EVADE), followed by social properties (FRIENDLY, HOSTILE, HUMBLE, BOASTFUL)
Bars = LLMs; dots = dataset versions (different fillers). Chance = 0.5
Our hardest domains were physical relations (CONTAIN, SUPPORT, BLOCK) and spatial relations (LEFT, RIGHT, CLOSE, FAR). Human performance (shaded in gray; range = performance on different dataset versions) is also below ceiling, but still much better than the LLMs'
Overall, performance improves with newer & larger models, but not by much! Domains that are hard are mostly hard for all models (though not necessarily for humans)
See paper for additional controls and result nuggets
Lots of new places to go from here! More domains, targeted experiments on the effects of fillers/context-target links, locating domain knowledge inside LLMs…
We make the EWoK data generation framework available to all; feel free to build on it!
We spent a lot of time thinking about how to keep EWoK from immediately leaking into training data. Ultimately, the responsibility lies with you! Do not openly post the items/templates online, AND please report whenever you train/finetune a model on EWoK-generated items
The items used in the paper, EWoK-core-1.0, are available on HuggingFace:
https://huggingface.co/datasets/ewok-core/ewok-core-1.0
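A minimal loading sketch, assuming the standard `datasets` API (the split name below is an assumption, and access may require accepting the terms on the Hub; check the dataset card for the actual schema):

```python
from datasets import load_dataset

# Minimal sketch assuming the standard `datasets` API; the split name is an
# assumption -- check the dataset card for the actual splits/fields.
ewok = load_dataset("ewok-core/ewok-core-1.0")
print(ewok)             # available splits and fields
print(ewok["test"][0])  # first item (assumed split name)
```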
The code to replicate our results is here:
https://github.com/ewok-core/ewok-paper
Preprint here: https://arxiv.org/abs/2405.09605
Feedback welcome!