

Last month, I wrote about participating in SERI MATS, an AI safety research fellowship in which experienced AI safety professionals mentor junior researchers to build research skills and potentially produce new findings.
Overall, I enjoyed the program, and it moved me further toward wanting to pursue AI safety research as a career. One of the most significant benefits of SERI MATS was working from the same office as other people doing AI safety research, which led to many fruitful and interesting discussions over meals and to opportunities for research collaborations. Interacting with more experienced researchers in both formal and informal settings also helped me build context on what people were working on and get high-quality critical feedback on early-stage ideas as well as current research. I recommend the program to other early-career researchers who want to get deeper into technical AI safety/alignment research.
I plan to continue the research I started during the program. In particular, I am pursuing the following questions and next steps:
Investigating the general properties of activation steering
Over the last two months, I, alongside a few other SERI MATS fellows, have run a series of experiments demonstrating that it is possible to predictably steer the behavior of language models by perturbing activations in the residual stream with a vector computed from the difference in mean activations between positive and negative examples of a target behavior.
I found that this works particularly well when the data is formatted as a multiple-choice question in which the only difference between positive and negative examples is the token corresponding to the answer (e.g., “A” or “B”). This seems to be because the format concentrates the model’s encoding of the behavior-relevant context at a single token position, so we can extract activations for steering vector generation at exactly that position.
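To make the extraction step concrete, here is a minimal sketch, assuming a HuggingFace-style model that returns hidden states; the helper name, layer indexing, and data format are illustrative rather than drawn from my actual code.

```python
import torch

@torch.no_grad()
def steering_vector(model, tokenizer, pairs, layer):
    """pairs: list of (positive, negative) prompt strings that are identical
    except for the final answer token (e.g. "A" vs "B")."""
    pos_acts, neg_acts = [], []
    for pos, neg in pairs:
        for text, store in ((pos, pos_acts), (neg, neg_acts)):
            ids = tokenizer(text, return_tensors="pt").input_ids
            hidden = model(ids, output_hidden_states=True).hidden_states[layer]
            store.append(hidden[0, -1, :])  # residual stream at the answer token
    # Difference in mean activations between positive and negative examples
    return torch.stack(pos_acts).mean(dim=0) - torch.stack(neg_acts).mean(dim=0)
```

The resulting vector can then be added to (or subtracted from) the residual stream at inference time to push generations toward or away from the behavior.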

There remain adjacent questions that I want to answer, including:
How does activation steering compare to finetuning biases and to low-rank finetuning of specific layer weights? (See the sketch after this list for what I mean by the latter.)
What happens if we scale the technique to a larger open-source RLHF model such as llama-2-70b-chat? Does it become easier or harder to elicit behaviors via activation steering?
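For reference, here is an illustrative sketch of the low-rank baseline mentioned above: a rank-r additive update to a single layer’s weight matrix, with the original weights frozen. The wrapped layer, rank, and initialization are placeholders rather than a finalized experimental setup; bias-only finetuning would instead unfreeze just the layer’s bias.

```python
import torch
import torch.nn as nn

class LowRankDelta(nn.Module):
    """Wraps one nn.Linear and adds a trainable rank-r update (W + B @ A)."""

    def __init__(self, linear: nn.Linear, rank: int = 4):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad = False  # freeze the original weights (and bias)
        self.A = nn.Parameter(torch.zeros(rank, linear.in_features))
        self.B = nn.Parameter(torch.randn(linear.out_features, rank) * 0.01)

    def forward(self, x):
        # Original layer output plus the low-rank correction
        return self.linear(x) + x @ self.A.T @ self.B.T
```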
Developing activation steering as a red-teaming technique
Based on my activation steering findings, I’ve proposed a red-teaming approach that, instead of generating prompts to make the model fail on some benchmark, linearly perturbs residual stream activations at a single layer.
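As a rough sketch of what this looks like in practice for a Llama-style HuggingFace model (the layer path, layer index, and multiplier below are illustrative and model-dependent), the perturbation can be injected with a forward hook during generation:

```python
def add_steering_hook(model, layer, vec, multiplier):
    """Adds multiplier * vec to the residual stream output of one decoder layer."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + multiplier * vec.to(device=hidden.device, dtype=hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return model.model.layers[layer].register_forward_hook(hook)

# Usage: compare completions with and without the perturbation on a benchmark.
# handle = add_steering_hook(model, layer=13, vec=steering_vec, multiplier=8.0)
# outputs = model.generate(**inputs, max_new_tokens=64)
# handle.remove()
```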
I plan to mentor a small cohort of students to work on developing this approach over the next few months as part of the Supervised Program for Alignment Research.
Creating an open-source influence function approximation library
Anthropic recently published the paper Studying Large Language Model Generalization with Influence Functions, where they developed a scalable technique for quantifying a particular training example's influence on a model’s weights/outputs.
This could help us study emergent behaviors and important properties of powerful AI systems, such as deceptiveness and power-seeking, by enabling us to better understand how a model generalizes its capabilities from the training data.
From Anthropic’s blogpost on this research:
By working with different models of size 810 million, 6.4 billion, 22 billion, and 52 billion parameters, we’ve identified influential training sequences for a variety of model outputs. Perhaps the most striking trend is that the patterns of generalization become more abstract with model scale. Consider, for instance, the influence query shown below, where a model expressed a desire not to be shut down. For the 810 million parameter model, the most influential sequences (i.e. the ones which our algorithm thinks would most increase the probability of giving this particular response) shared overlapping sequences of tokens (e.g. “continue existing”), but were otherwise irrelevant. For the 52 billion parameter model, the most influential sequences were more conceptually related, involving themes like survival instinct and humanlike emotions in AIs.
Therefore, I would be excited to see an open-source library that enables AI safety researchers to run similar influence function experiments on other models. I am working on putting together an initial version of something like this (work in progress code is here).
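To illustrate the quantity such a library would approximate, here is a crude sketch that replaces the inverse Hessian with a damped identity matrix; the paper’s method uses a far more sophisticated Kronecker-factored approximation, and the function names here are hypothetical.

```python
import torch

def flat_grad(loss, params):
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def influence(model, loss_fn, query_batch, train_example, damping=1e-3):
    """Approximates grad L(query)^T H^{-1} grad L(train), with H^{-1}
    replaced by (1/damping) * I purely for illustration."""
    params = [p for p in model.parameters() if p.requires_grad]
    g_query = flat_grad(loss_fn(model, query_batch), params)
    g_train = flat_grad(loss_fn(model, train_example), params)
    return (g_query @ g_train) / damping
```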
Scaling up Hessian analysis / basis-agnostic interpretability approaches
Throughout SERI MATS, I worked with Dmitry Vaintrob to make progress on a new approach to interpretability that relies on analyzing the loss basin of a trained model to find important directions in weight space that correspond to separable submodules of the network.
We want to scale the approach up to larger networks and attempt a harder challenge than our hybrid MNIST separation task, for instance the Happy Faces benchmark.
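To give a flavor of the kind of computation involved in scaling this up, here is a minimal sketch that estimates the top Hessian eigenvector in weight space using Hessian-vector products and power iteration, so the full Hessian is never materialized. At real scale one would use Lanczos or stochastic estimators instead, and all names here are illustrative.

```python
import torch

def hvp(loss_fn, params, vec):
    """Hessian-vector product via double backward (no explicit Hessian)."""
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat = torch.cat([g.reshape(-1) for g in grads])
    return torch.autograd.grad(flat @ vec, params)

def top_hessian_eigvec(loss_fn, params, iters=50):
    """Power iteration for the dominant Hessian eigenvalue/eigenvector."""
    n = sum(p.numel() for p in params)
    v = torch.randn(n, device=params[0].device, dtype=params[0].dtype)
    v = v / v.norm()
    eigval = None
    for _ in range(iters):
        hv = torch.cat([h.reshape(-1) for h in hvp(loss_fn, params, v)])
        eigval = (v @ hv).item()  # Rayleigh quotient estimate
        v = hv / hv.norm()
    return eigval, v
```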