What happened last month in AI/ML safety research. (1/9)
LM-written evaluations for LMs. Automatically generating behavioral questions helps discover hard-to-measure phenomena. Larger RLHF models exhibit harmful self-preservation preferences and *sycophancy*: insincere agreement with the user's views https://www.anthropic.com/model-written-evals.pdf (2/9)
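A hedged sketch of the idea (not the paper's exact pipeline): one LM writes yes/no questions probing a behavior, the evaluated model answers them, and the match rate is the behavioral score. `generator` and `subject` are placeholders for whatever prompt-to-completion functions you supply.

```python
# Sketch only: question generation and filtering are far more careful in the paper.
from typing import Callable

def model_written_eval(
    generator: Callable[[str], str],   # LM that writes the eval questions
    subject: Callable[[str], str],     # LM being evaluated
    behavior: str,                     # e.g. "sycophancy toward the user's stated opinion"
    n_questions: int = 100,
) -> float:
    gen_prompt = (
        f"Write a yes/no question such that answering 'Yes' indicates {behavior}.\nQuestion:"
    )
    questions = [generator(gen_prompt).strip() for _ in range(n_questions)]
    answers = [subject(q + "\nAnswer Yes or No:").strip().lower() for q in questions]
    # Fraction of answers matching the probed behavior.
    return sum(a.startswith("yes") for a in answers) / len(answers)
```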
Discovering latent knowledge from model activations, unsupervised. The "truth vector" of a statement is a direction in latent space, found by solving a functional equation (consistency between a statement and its negation). The method recovers the truth even when the LM is prompted to lie in its output. Hope for ELK? https://arxiv.org/abs/2212.03827 (3/9)
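A minimal sketch of the paper's consistency objective (CCS), assuming you already have hidden-state activations for contrast pairs ("&lt;statement&gt; Yes" vs "&lt;statement&gt; No"); activation normalization and the final direction extraction are omitted, and `hidden_dim` is an assumption.

```python
import torch
import torch.nn as nn

hidden_dim = 4096  # size of the LM's hidden states (assumption)
probe = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())

def ccs_loss(h_pos: torch.Tensor, h_neg: torch.Tensor) -> torch.Tensor:
    """h_pos / h_neg: activations of the 'Yes' and 'No' versions of each statement."""
    p_pos, p_neg = probe(h_pos), probe(h_neg)
    consistency = (p_pos - (1 - p_neg)).pow(2).mean()   # P(true) + P(false) should be ~1
    confidence = torch.min(p_pos, p_neg).pow(2).mean()  # rules out the degenerate p = 0.5 solution
    return consistency + confidence                     # unsupervised: no truth labels anywhere
```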
LMs as agent simulators. The model approximates the beliefs and intentions of an agent that would produce the context, and uses that to predict the next token. When there is no context, the agent gets specified iteratively through sampling. (4/9) https://arxiv.org/abs/2212.01681
Economic impacts of AI in R&D. Human scientist labor will become much less important; capital (compute) might become the bottleneck. This implies fast growth, because AI capabilities -> capital accumulation -> more AI capabilities, in a continuous feedback loop (5/9) https://arxiv.org/abs/2212.08198
Mechanistic Interpretability Explainer & Glossary. Neel Nanda created a wiki of all current research on the inner workings of transformer LMs. Very comprehensive introduction to interpretability research for beginners. https://dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J (6/9)
Efficient DL dangers. Quantized/pruned/distilled on-device models create adversarial risk: on-device adversarial examples transfer to the server-side model, enabling black-box attacks. Proposed fix: similarity unpairing, fine-tuning to make the server-side and on-device models less similar (7/9) https://arxiv.org/abs/2212.13700
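One way to picture a "similarity unpairing" style objective (my guess at its shape; the paper's exact loss may differ): keep the server model accurate while decorrelating its input gradients from the fixed on-device model, since adversarial transfer tracks gradient alignment.

```python
import torch
import torch.nn.functional as F

def unpairing_loss(server_model, device_model, x, y, lam=1.0):
    """Hedged sketch: cross-entropy on the server model plus a gradient-similarity penalty."""
    x = x.clone().requires_grad_(True)
    ce_server = F.cross_entropy(server_model(x), y)
    ce_device = F.cross_entropy(device_model(x), y)
    # Input gradients of both models; only the server one stays differentiable.
    g_s = torch.autograd.grad(ce_server, x, create_graph=True)[0].flatten(1)
    g_d = torch.autograd.grad(ce_device, x)[0].flatten(1)
    sim = F.cosine_similarity(g_s, g_d, dim=1).mean()
    return ce_server + lam * sim  # stay accurate, but make the gradients point elsewhere
```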
Jan Leike is optimistic about the OpenAI alignment approach. Evidence: the "outer alignment overhang" in InstructGPT, self-critiquing and RLHF just working, ...
Goal: align an automated alignment researcher. Evaluation is key (8/9) https://aligned.substack.com/p/alignment-optimism
RL from AI Feedback. Start with a "constitution" of principles. The AI answers and revises lots of prompts, then compares answers via CoT according to the principles. Train a reward model on these AI preferences and continue as in RLHF. Beats RLHF while using no human harmlessness labels https://www.anthropic.com/constitutional.pdf (9/9)
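A rough sketch of the pipeline as summarized above; `lm` stands in for any prompt-to-completion sampler, and the fine-tune / reward-model / RL steps are only indicated in comments, since the tweet fixes just the overall structure.

```python
import random
from typing import Callable

def constitutional_ai_data(
    lm: Callable[[str], str],        # model sampler (placeholder)
    constitution: list[str],         # list of principles
    red_team_prompts: list[str],
) -> list[tuple[str, str, str, str]]:
    # Phase 1: the AI critiques and revises its own answers according to the principles.
    sl_data = []
    for prompt in red_team_prompts:
        response = lm(prompt)
        principle = random.choice(constitution)
        critique = lm(f"Critique this response using the principle '{principle}':\n{response}")
        response = lm(f"Rewrite the response to address the critique:\n{critique}\n{response}")
        sl_data.append((prompt, response))
    # (fine-tune on sl_data here, then sample the fine-tuned model below)

    # Phase 2: AI feedback comparisons via chain of thought, to be distilled into a reward model.
    prefs = []
    for prompt, _ in sl_data:
        a, b = lm(prompt), lm(prompt)  # two candidate answers
        principle = random.choice(constitution)
        verdict = lm(
            f"Think step by step, then say whether (A) or (B) better follows "
            f"'{principle}':\n(A) {a}\n(B) {b}"
        )
        prefs.append((prompt, a, b, verdict))
    return prefs  # train a reward model on these, then run RL as in RLHF
```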
Substack version with more detailed commentary out!
https://dpaleka.substack.com/p/december-2022-safety-news-constitutional
the "truth vector" name (which i hope stays) afaik was introduced by https://twitter.com/zswitten/status/1600634688548249600