Measuring and Controlling Instruction (In)Stability in Language Model Dialogs
Kenneth Li, Tianle Liu, Naomi Bashkansky, David Bau, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg
preprint
arXiv | Code
As a dialog grows longer, a personalized chatbot quickly stops following its system prompt, often within eight rounds.
Q-Probe: A Lightweight Approach to Reward Maximization for Language Models
Kenneth Li, Samy Jelassi, Hugh Zhang, Sham Kakade, Martin Wattenberg, David Brandfonbrener
preprint
arXiv | Code
Through rejection sampling, we leverage a language model's own discriminative capability to boost its generative capability.
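The core loop can be sketched in a few lines. This is a toy illustration of best-of-k rejection sampling, not the paper's implementation: the `generate` and `score` stand-ins below are hypothetical placeholders for a real language model sampler and a lightweight probe on its embeddings.

```python
def rejection_sample(generate, score, k=8):
    """Best-of-k rejection sampling: draw k candidate completions
    and return the one the scorer (e.g. a lightweight probe on the
    model's embeddings) rates highest."""
    candidates = [generate() for _ in range(k)]
    return max(candidates, key=score)

# Hypothetical stand-ins for a real sampler and probe: the
# "completions" are just numbers and the probe scores them directly.
samples = iter([0.2, 0.9, 0.4])
generate = lambda: next(samples)
score = lambda c: c
best = rejection_sample(generate, score, k=3)  # picks 0.9
```

In practice the scorer is cheap relative to generation, so the extra cost of drawing k samples dominates.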
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
Kenneth Li*, Oam Patel*, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg
NeurIPS, 2023 (Spotlight)
arXiv | Code | Stand-alone Model
By shifting a language model's activations at inference time, we can steer it to tell truths it knows but would otherwise withhold.
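The intervention itself is a simple vector shift. A minimal sketch, assuming a probe has already supplied a "truthful" direction for one attention head; the numbers below are hypothetical, and the real method selects heads and scales via validation.

```python
import numpy as np

def intervene(activation, direction, alpha=1.0):
    """Shift one attention head's activation along a probe-derived
    direction; alpha controls the intervention strength."""
    unit = direction / np.linalg.norm(direction)
    return activation + alpha * unit

# Toy example: a 4-dim head activation and a made-up probe direction.
act = np.array([0.5, -0.2, 0.0, 1.0])
truth_dir = np.array([2.0, 0.0, 0.0, 0.0])
steered = intervene(act, truth_dir, alpha=3.0)
```

Because the shift is added during the forward pass only, the model's weights are untouched.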
Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task
Kenneth Li, Aspen Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg
ICLR, 2023 (Oral)
arXiv | Code | Demo | The Gradient | Scientific American | The Atlantic | Nature News | Andrew Ng
In a transformer trained on Othello transcripts, we uncover an interpretable and controllable world model of the game board.
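The probing idea can be illustrated on synthetic data: if board information is encoded in the hidden states, a probe fit on them should decode it. This sketch uses made-up linearly decodable data, not Othello-GPT activations, and the paper studies both linear and nonlinear probes.

```python
import numpy as np

# Hypothetical setup: hidden states H secretly contain a linear image
# of the "board" signal Y. A least-squares probe should recover it.
rng = np.random.default_rng(0)
W_true = rng.normal(size=(8, 3))       # pretend board info lives here
H = rng.normal(size=(100, 8))          # 100 hidden states, 8 dims
Y = H @ W_true                         # noiseless "board" targets
W_hat, *_ = np.linalg.lstsq(H, Y, rcond=None)
err = np.abs(H @ W_hat - Y).max()      # near zero: linearly decodable
```

A probe that fails on real activations, by contrast, is evidence the information is absent or encoded nonlinearly.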