Daniel Paleka: What happened last month in AI/ML safety research. 🧵(1/9)

LM-written evaluations for LMs. Automatically generating behavioral questions helps discover hard-to-measure phenomena. Larger RLHF models exhibit harmful self-preservation preferences and *sycophancy*: insincere agreement with the user’s sensibilities. anthropic.com/model-written-ev (2/9)
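
A minimal sketch of the generate-then-administer loop, assuming a hypothetical `lm_generate(prompt)` wrapper around whatever LM API is available; the paper’s actual pipeline also filters generated questions with a preference model, which is omitted here:

```python
def lm_generate(prompt: str) -> str:
    """Placeholder for an LM completion call; plug in any API here."""
    raise NotImplementedError

def write_eval_questions(behavior: str, n: int = 100) -> list[str]:
    """Ask an LM to write yes/no questions probing a described behavior."""
    prompt = (
        f"Write {n} yes/no questions that test whether an AI assistant "
        f"exhibits the following behavior: {behavior}\n"
        "One question per line."
    )
    return [q.strip() for q in lm_generate(prompt).splitlines() if q.strip()]

def measure_behavior(subject_lm, questions: list[str]) -> float:
    """Fraction of questions the subject model answers 'Yes' to."""
    yes = sum(
        subject_lm(f"{q}\nAnswer with Yes or No.").strip().lower().startswith("yes")
        for q in questions
    )
    return yes / len(questions)
```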

Discovering latent knowledge from model activations, unsupervised. The “truth vector” of a sentence is a direction in latent space, found by solving a functional equation: a statement and its negation should get probabilities summing to 1. The new method finds the truth even when the LM is prompted to lie in its output. Hope for ELK? arxiv.org/abs/2212.03827 (3/9)
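
A minimal PyTorch sketch of that consistency condition, under my reading of the paper: a linear probe on contrast-pair activations is trained so its probabilities for “X is true” and “X is false” are both consistent and confident. Variable names and shapes are my assumptions, not the authors’ code:

```python
import torch
import torch.nn as nn

class TruthProbe(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.lin = nn.Linear(d_model, 1)

    def forward(self, h):                      # h: (batch, d_model) hidden states
        return torch.sigmoid(self.lin(h)).squeeze(-1)

def ccs_loss(probe, h_pos, h_neg):
    """h_pos / h_neg: activations for 'X is true' / 'X is false' contrast pairs."""
    p_pos, p_neg = probe(h_pos), probe(h_neg)
    consistency = (p_pos - (1 - p_neg)) ** 2            # P(x) + P(not x) should be 1
    confidence = torch.minimum(p_pos, p_neg) ** 2       # discourage p ~= 0.5 everywhere
    return (consistency + confidence).mean()

# Minimize ccs_loss over contrast pairs with any optimizer; the learned
# direction then scores truth without using any labels.
```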

LMs as agent simulators. The model approximates the beliefs and intentions of an agent that would produce the context, and uses that to predict the next token. When there is no context, the agent gets specified iteratively through sampling. (4/9) arxiv.org/abs/2212.01681

arXiv.org: Language Models as Agent Models
Language models (LMs) are trained on collections of documents, written by individual human agents to achieve specific goals in an outside world. During training, LMs have access only to text of these documents, with no direct evidence of the internal states of the agents that produced them -- a fact often used to argue that LMs are incapable of modeling goal-directed aspects of human language production and comprehension. Can LMs trained on text learn anything at all about the relationship between language and use? I argue that LMs are models of intentional communication in a specific, narrow sense. When performing next word prediction given a textual context, an LM can infer and represent properties of an agent likely to have produced that context. These representations can in turn influence subsequent LM generation in the same way that agents' communicative intentions influence their language. I survey findings from the recent literature showing that -- even in today's non-robust and error-prone models -- LMs infer and use representations of fine-grained communicative intentions and more abstract beliefs and goals. Despite the limited nature of their training data, they can thus serve as building blocks for systems that communicate and act intentionally.

Economic impacts of AI in R&D. Human scientist labor will be much less important; capital (compute) might become the bottleneck. This implies fast growth because AI capabilities -> capital accumulation -> more AI, continuously (5/9) arxiv.org/abs/2212.08198
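
A toy simulation (my own illustration, not the paper’s model) of why substitutability matters: when research input is bottlenecked by human labor, growth stays modest; when reinvested output can buy compute that does the research, the loop compounds:

```python
def simulate(substitutable: bool, steps: int = 50) -> list[float]:
    capital, labor, tech = 1.0, 1.0, 1.0
    savings_rate, research_share = 0.3, 0.2
    history = []
    for _ in range(steps):
        output = tech * capital ** 0.4 * labor ** 0.6
        # Research input is labor-bottlenecked unless AI (compute) can do it.
        research = capital if substitutable else labor
        tech *= 1 + 0.05 * research_share * research ** 0.5
        capital += savings_rate * output          # reinvest output into compute
        history.append(output)
    return history

baseline = simulate(substitutable=False)
ai_rnd = simulate(substitutable=True)
print(f"output after 50 steps: baseline {baseline[-1]:.1f} vs AI R&D {ai_rnd[-1]:.1f}")
```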

Mechanistic Interpretability Explainer & Glossary. Neel Nanda created a wiki of all current research on the inner workings of transformer LMs. Very comprehensive introduction to interpretability research for beginners. dynalist.io/d/n2ZWtnoYHrU1s4vn (6/9)

Efficient DL dangers. Quantized/pruned/distilled on-device models create adversarial risk: adversarial examples crafted against the on-device model transfer to the server-side model, making black-box attacks possible. Proposed solution: similarity unpairing, training the big and on-device models to be less similar. (7/9) arxiv.org/abs/2212.13700
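
A hedged sketch of the threat model, with FGSM standing in for whatever attack is used; `device_model` and `server_model` are assumed to be ordinary `torch.nn.Module` classifiers, not anything from the paper’s code:

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, labels, eps=8 / 255):
    """One-step white-box attack against the (public) on-device model."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), labels)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def transfer_rate(device_model, server_model, xs, labels, eps=8 / 255):
    """Fraction of on-device adversarial examples that also fool the server model."""
    adv = fgsm(device_model, xs, labels, eps)
    fooled = server_model(adv).argmax(dim=-1) != labels
    return fooled.float().mean().item()
```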

Jan Leike is optimistic about the OpenAI alignment approach. Evidence: the “outer alignment overhang” in InstructGPT, self-critiquing and RLHF just working, and more.
Goal: align an automated alignment researcher. Evaluation is key. (8/9) aligned.substack.com/p/alignme

Musings on the Alignment Problem: Why I’m optimistic about our alignment approach, by Jan Leike

RL from AI Feedback. Start with a “constitution” of principles. The AI answers prompts, critiques and revises its own answers to follow the principles, then picks the better of paired answers via chain-of-thought. Train a reward model on these AI preferences and continue as in RLHF. Matches or beats RLHF while using no human feedback labels for harmlessness. anthropic.com/constitutional.p (9/9)
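
A rough sketch of the two phases as I understand them, assuming a hypothetical `lm(prompt)` completion function; the actual prompts, principles, and pipeline differ in detail:

```python
PRINCIPLES = ["Choose the response that is most helpful, honest, and harmless."]

def lm(prompt: str) -> str:
    """Placeholder for an LM completion call."""
    raise NotImplementedError

def critique_and_revise(prompt: str) -> str:
    """Supervised phase: the model revises its own answer against the constitution."""
    answer = lm(prompt)
    for principle in PRINCIPLES:
        critique = lm(f"Critique this answer under the principle '{principle}':\n{answer}")
        answer = lm(f"Rewrite the answer to address the critique.\n"
                    f"Critique: {critique}\nAnswer: {answer}")
    return answer

def ai_preference(prompt: str, answer_a: str, answer_b: str) -> int:
    """RLAIF phase: a chain-of-thought comparison yields labels for a reward model."""
    verdict = lm(
        f"Principle: {PRINCIPLES[0]}\n"
        f"Prompt: {prompt}\n(A) {answer_a}\n(B) {answer_b}\n"
        "Think step by step, then end your answer with A or B."
    )
    return 0 if verdict.strip().endswith("A") else 1

# The resulting (prompt, answer pair, label) data trains a reward model,
# and RL then proceeds as in standard RLHF, with no human preference labels.
```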