What is "Unlearning"?
Making a model forget specific knowledge while keeping everything else it knows.
Why do this?
- Remove dangerous info (bioweapons, hacking)
- Delete private data (GDPR compliance)
- Remove copyrighted content
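One common recipe (not the only one) combines gradient ascent on a small "forget" set with ordinary training on a "retain" set, so the model loses the targeted knowledge without degrading elsewhere. A minimal sketch, assuming a Hugging Face causal LM; the loader/batch names and hyperparameters are illustrative, not from any specific paper:

```python
# Minimal unlearning sketch: gradient ascent on the forget set,
# gradient descent on the retain set to preserve everything else.
# Assumes forget_batch / retain_batch are tokenized dicts with
# input_ids, attention_mask, and labels (hypothetical names).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
retain_weight = 1.0  # how strongly to protect unrelated knowledge

def unlearning_step(forget_batch, retain_batch):
    optimizer.zero_grad()
    # Push loss UP on the data we want forgotten...
    forget_loss = model(**forget_batch).loss
    # ...while keeping loss LOW on the data we want to keep.
    retain_loss = model(**retain_batch).loss
    loss = -forget_loss + retain_weight * retain_loss
    loss.backward()
    optimizer.step()
    return forget_loss.item(), retain_loss.item()
```

Naive ascent like this tends to be unstable in practice; published methods add safeguards such as clamping the forget loss or penalizing divergence from the original model, but the basic structure is the same.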
What is "Unlearning"?
Making a model forget specific knowledge while keeping everything else it knows.
Why do this?
A summary of our conversation on understanding and building SAEs for LLM interpretability.
Neural networks like GPT are powerful but opaque. We'd like to understand them by looking at individual neurons, but there's a problem: single neurons don't correspond to single concepts. One neuron might activate for academic citations, English dialogue, HTTP requests, and Korean text all at once.
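Sparse autoencoders (SAEs) address this by re-expressing a layer's activations in a much wider dictionary of features, only a handful of which are active at a time, so each feature has a better chance of meaning exactly one thing. A minimal sketch, assuming residual-stream activations of width d_model; the dictionary size and L1 coefficient below are illustrative, not tied to any particular release:

```python
# Minimal sparse autoencoder (SAE) sketch for LLM activations.
# Encodes d_model-dim activations into a wider, mostly-zero feature
# vector, then reconstructs them; the L1 term enforces sparsity.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, d_features=16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # sparse codes
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(reconstruction, activations, features, l1_coeff=1e-3):
    mse = (reconstruction - activations).pow(2).mean()
    sparsity = features.abs().mean()  # drives most features to zero
    return mse + l1_coeff * sparsity
```

Each decoder column acts as a learned feature direction; interpretability work then asks which inputs make a given feature fire.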
After exhaustive experimentation across three models (Gemma-27B, GPT-OSS-20B, Llama-8B) and three SAE architectures (GemmaScope 2, Goodfire TopK, LlamaScope 32x), we have a clear answer:
Scalable oversight is the challenge of supervising AI systems that can produce work humans can't fully verify. This becomes a critical problem as AI approaches superhuman capabilities—if an AI can generate answers, code, or strategies too complex for any human to check, how do we know it's actually being helpful and honest rather than subtly deceptive or wrong? The field has emerged as one of the central problems in AI alignment, with multiple major labs developing complementary approaches. As of early 2025, some techniques (like Constitutional AI) are already deployed in production, while others (like debate and weak-to-strong generalization) show promising experimental results but face fundamental open questions about whether they'll scale to truly superhuman systems.
Think about how we currently train AI to be helpful and safe. The standard approach, RLHF (Reinforcement Learning from Human Feedback), relies on human evaluators comparing the model's outputs and rewarding the responses they prefer.
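The part humans directly supervise is the reward model, which is trained on those pairwise comparisons. A minimal sketch of that step (a Bradley-Terry style loss; `reward_model`, `chosen_batch`, and `rejected_batch` are placeholder names, not from these notes):

```python
# Sketch of the reward-model step in RLHF: humans compare two
# responses, and the model learns to score the chosen one higher.
# `reward_model` stands in for any network mapping a (prompt,
# response) encoding to a scalar score.
import torch
import torch.nn.functional as F

def preference_loss(reward_model, chosen_batch, rejected_batch):
    r_chosen = reward_model(chosen_batch)      # scalar score per example
    r_rejected = reward_model(rejected_batch)
    # Bradley-Terry: maximize P(chosen preferred) = sigmoid(r_c - r_r)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

This works only as long as human raters can reliably tell which response is better, which is exactly the assumption that breaks down for superhuman outputs.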
What I Did (For MATS Application)
The Problem
AI systems might learn to fake being helpful — acting nice when watched, but planning to misbehave later. Like an employee who's perfect when the boss is around, but slacks off otherwise. How do you catch that?
The Old Approach
```python
#!/usr/bin/env python3
"""
Experiment 7: Full 50k Feature Sweep
=====================================
Sweep ALL features, not just top 8 by correlation.
Find the true needles in the haystack.
Time estimate: ~8-10 hours on H100
Cost estimate: ~$20-25 on cloud GPU rental
"""
```
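The body of the script isn't included above; a minimal sketch of what the sweep itself might look like, assuming cached SAE feature activations and a per-example binary behavior label (both array names are hypothetical):

```python
# Hypothetical sweep body: correlate every SAE feature with a
# per-example behavior label instead of only the top-8 candidates.
# Assumes feature_acts is (n_examples, n_features) and labels is
# (n_examples,) with values in {0, 1}; both names are placeholders.
import numpy as np

def sweep_all_features(feature_acts: np.ndarray, labels: np.ndarray, top_k: int = 50):
    acts = feature_acts - feature_acts.mean(axis=0)
    labs = labels - labels.mean()
    # Pearson correlation of each feature column with the label.
    cov = acts.T @ labs / len(labs)
    denom = acts.std(axis=0) * labs.std() + 1e-8
    corr = cov / denom
    ranked = np.argsort(-np.abs(corr))
    return ranked[:top_k], corr[ranked[:top_k]]
```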