Measuring and Controlling Instruction (In)Stability in Language Model Dialogs
Kenneth Li, Tianle Liu, Naomi Bashkansky, David Bau, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg
preprint
arXiv | Code
As a dialog grows longer, a personalized chatbot quickly stops following its system prompt, often within eight rounds.
Q-Probe: A Lightweight Approach to Reward Maximization for Language Models
Kenneth Li, Samy Jelassi, Hugh Zhang, Sham Kakade, Martin Wattenberg, David Brandfonbrener
preprint
arXiv | Code
Through rejection sampling, we leverage a language model's own discriminative capability to boost its generative capability.
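The core loop can be sketched in a few lines. This is a toy illustration of best-of-k rejection sampling, not the paper's implementation: the `generate` and `score` stand-ins below are hypothetical placeholders for a real language model sampler and a lightweight probe on its embeddings.

```python
def rejection_sample(generate, score, k=8):
    """Best-of-k rejection sampling: draw k candidate completions
    and return the one the scorer (e.g. a lightweight probe on the
    model's embeddings) rates highest."""
    candidates = [generate() for _ in range(k)]
    return max(candidates, key=score)

# Hypothetical stand-ins for a real sampler and probe: the
# "completions" are just numbers and the probe scores them directly.
samples = iter([0.2, 0.9, 0.4])
generate = lambda: next(samples)
score = lambda c: c
best = rejection_sample(generate, score, k=3)  # picks 0.9
```

In practice the scorer is cheap relative to generation, so the extra cost of drawing k samples dominates.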
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
Kenneth Li*, Oam Patel*, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg
NeurIPS, 2023 (Spotlight)
arXiv | Code | Stand-alone Model
By shifting a language model's activations at inference time, we can steer it to tell truths it knows but would otherwise withhold.
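The intervention itself is a simple vector shift. A minimal sketch, assuming a probe has already supplied a "truthful" direction for one attention head; the numbers below are hypothetical, and the real method selects heads and scales via validation.

```python
import numpy as np

def intervene(activation, direction, alpha=1.0):
    """Shift one attention head's activation along a probe-derived
    direction; alpha controls the intervention strength."""
    unit = direction / np.linalg.norm(direction)
    return activation + alpha * unit

# Toy example: a 4-dim head activation and a made-up probe direction.
act = np.array([0.5, -0.2, 0.0, 1.0])
truth_dir = np.array([2.0, 0.0, 0.0, 0.0])
steered = intervene(act, truth_dir, alpha=3.0)
```

Because the shift is added during the forward pass only, the model's weights are untouched.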
Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task
Kenneth Li, Aspen Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg
ICLR, 2023 (Oral)
arXiv | Code | Demo | The Gradient | Scientific American | The Atlantic | Nature News | Andrew Ng
In a transformer trained on Othello transcripts, we uncover an interpretable and controllable world model of the game board.
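The probing idea can be illustrated on synthetic data: if board information is encoded in the hidden states, a probe fit on them should decode it. This sketch uses made-up linearly decodable data, not Othello-GPT activations, and the paper studies both linear and nonlinear probes.

```python
import numpy as np

# Hypothetical setup: hidden states H secretly contain a linear image
# of the "board" signal Y. A least-squares probe should recover it.
rng = np.random.default_rng(0)
W_true = rng.normal(size=(8, 3))       # pretend board info lives here
H = rng.normal(size=(100, 8))          # 100 hidden states, 8 dims
Y = H @ W_true                         # noiseless "board" targets
W_hat, *_ = np.linalg.lstsq(H, Y, rcond=None)
err = np.abs(H @ W_hat - Y).max()      # near zero: linearly decodable
```

A probe that fails on real activations, by contrast, is evidence the information is absent or encoded nonlinearly.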