How should the humanities leverage LLMs?
> Domain-specific pretraining!

Pretraining models can be a research tool: it's cheaper than LoRA finetuning, and it lets you study
- grammatical change
- emergent word senses
- and who knows what more…

Train on your own data with our pipeline, or use ours!
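
As a rough sketch of the data side, here is an illustrative bucketing script: given documents with year metadata, it groups them into fixed temporal slices of roughly 10M words each. This is not the paper's actual date-attribution pipeline; the years, slice boundaries, and target size are placeholders.

```python
# Illustrative sketch (not the paper's pipeline): bucket (year, text) documents
# into fixed temporal slices of ~10M words, one training corpus per slice.
from collections import defaultdict

# Hypothetical input: (year, text) pairs, e.g. loaded from your own archive.
documents = [
    (1812, "It is a truth universally acknowledged ..."),
    (1859, "It was the best of times, it was the worst of times ..."),
    (1897, "Left Munich at 8:35 P.M. ..."),
]

# The temporal slices you care about (here: five ~25-year spans).
SLICES = [(1800, 1824), (1825, 1849), (1850, 1874), (1875, 1899), (1900, 1924)]
TARGET_WORDS = 10_000_000  # ~10M words per slice, as in the thread

def slice_for(year):
    for start, end in SLICES:
        if start <= year <= end:
            return (start, end)
    return None

corpora = defaultdict(list)
word_counts = defaultdict(int)

for year, text in documents:
    key = slice_for(year)
    if key is None or word_counts[key] >= TARGET_WORDS:
        continue  # skip out-of-range documents and already-full slices
    corpora[key].append(text)
    word_counts[key] += len(text.split())

# Each corpora[(start, end)] is now a slice-specific training corpus.
for key in sorted(corpora):
    print(key, f"{word_counts[key]:,} words, {len(corpora[key])} docs")
```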

Typical Large Language Models (LLMs) are trained on massive, mixed datasets, so the model's behaviour can't be linked to a specific subset of the pretraining data, or, in our case, to a specific time period.

Historical analysis is a good example: distinct historical periods get blurred together when the training data mixes eras. Finetuning large models isn't enough, either: they "leak" future/modern concepts, making historical analysis impossible. Did you know cars existed in the 1800s? 🤦


But what if we pretrain instead? You can get a unique LLM trained on a small time-specific corpus (10M tokens).
"Are you crazy? It will be so costly!"
Oh no, it is actually even more efficient than training those monster LoRAs.
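
For the curious, here is a minimal sketch of what "pretraining on a 10M-token slice" can mean in practice, using Hugging Face transformers/datasets with a deliberately small GPT-2-style config. It's an illustrative recipe, not the paper's efficient-pretraining setup; the file name and hyperparameters are placeholders.

```python
# Minimal sketch: pretrain a small GPT-2-style model from scratch on one
# time slice. Illustrative only, not the paper's exact training recipe.
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          GPT2Config, GPT2LMHeadModel, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # reuse an existing tokenizer
tokenizer.pad_token = tokenizer.eos_token

# "slice_1850_1874.txt" is a hypothetical file produced by a bucketing step.
raw = load_dataset("text", data_files={"train": "slice_1850_1874.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# A deliberately small model: feasible to pretrain on a ~10M-token corpus.
config = GPT2Config(n_layer=6, n_head=8, n_embd=512,
                    vocab_size=tokenizer.vocab_size)
model = GPT2LMHeadModel(config)

args = TrainingArguments(
    output_dir="lm-1850-1874",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=5e-4,
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("lm-1850-1874")
```

Reusing an off-the-shelf tokenizer keeps the sketch simple; training a slice-specific tokenizer is another option.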

Using pretraining, we are able to track historical shifts, like evolving negative polarity item (NPI) patterns in "only...ever" and "even...ever" - something not seen in the finetuned models.
It was also cool to explore word sense change, using simple differences in surprisal metrics.
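
As a concrete illustration of the surprisal idea, here is a sketch that scores the same sentence under two slice-specific models and prints per-token surprisal differences; a word whose surprisal shifts sharply between slices is a candidate for sense change. The model directories are hypothetical outputs of a pretraining step like the one above.

```python
# Sketch: compare per-token surprisal of one sentence under two time-slice models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def token_surprisals(model_dir, sentence):
    tok = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir).eval()
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Surprisal of token t is -log P(token_t | tokens_<t), in nats.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    surprisal = -log_probs[torch.arange(targets.size(0)), targets]
    return list(zip(tok.convert_ids_to_tokens(targets.tolist()), surprisal.tolist()))

sentence = "He only ever wanted a quiet life."
early = token_surprisals("lm-1825-1849", sentence)  # hypothetical slice models
late = token_surprisals("lm-1875-1899", sentence)

for (token, s_early), (_, s_late) in zip(early, late):
    print(f"{token:>12}  early={s_early:5.2f}  late={s_late:5.2f}  "
          f"diff={s_late - s_early:+5.2f}")
```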

arXiv.org
Pretraining Language Models for Diachronic Linguistic Change Discovery

Large language models (LLMs) have shown potential as tools for scientific discovery. This has engendered growing interest in their use in humanistic disciplines, such as historical linguistics and literary studies. These fields often construct arguments on the basis of delineations like genre, or more inflexibly, time period. Although efforts have been made to restrict inference to specific domains via fine-tuning or model editing, we posit that the only true guarantee is domain-restricted pretraining -- typically, a data- and compute-expensive proposition. We show that efficient pretraining techniques can produce useful models over corpora too large for easy manual inspection but too small for "typical" LLM approaches. We employ a novel date-attribution pipeline in order to obtain a temporally-segmented dataset of five 10-million-word slices. We train two corresponding five-model batteries over these corpus segments, efficient pretraining and Llama3-8B parameter efficiently finetuned. We find that the pretrained models are faster to train than the finetuned baselines and that they better respect the historical divisions of our corpus. Emphasizing speed and precision over a-historical comprehensiveness enables a number of novel approaches to hypothesis discovery and testing in our target fields. Taking up diachronic linguistics as a testbed, we show that our method enables the detection of a diverse set of phenomena, including en masse lexical change, non-lexical (grammatical and morphological) change, and word sense introduction/obsolescence. We provide a ready-to-use pipeline that allows extension of our approach to other target fields with only minimal adaptation.