What is "Unlearning"?
Making a model forget specific knowledge while keeping everything else it knows.
Why do this?
- Remove dangerous info (bioweapons, hacking)
- Delete private data (GDPR compliance)
- Remove copyrighted content
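One common recipe (not the only one) combines gradient ascent on a small "forget" set with ordinary training on a "retain" set, so the model loses the targeted knowledge without degrading elsewhere. A minimal sketch, assuming a Hugging Face causal LM; the loader/batch names and hyperparameters are illustrative, not from any specific paper:

```python
# Minimal unlearning sketch: gradient ascent on the forget set,
# gradient descent on the retain set to preserve everything else.
# Assumes forget_batch / retain_batch are tokenized dicts with
# input_ids, attention_mask, and labels (hypothetical names).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
retain_weight = 1.0  # how strongly to protect unrelated knowledge

def unlearning_step(forget_batch, retain_batch):
    optimizer.zero_grad()
    # Push loss UP on the data we want forgotten...
    forget_loss = model(**forget_batch).loss
    # ...while keeping loss LOW on the data we want to keep.
    retain_loss = model(**retain_batch).loss
    loss = -forget_loss + retain_weight * retain_loss
    loss.backward()
    optimizer.step()
    return forget_loss.item(), retain_loss.item()
```

Naive ascent like this tends to be unstable in practice; published methods add safeguards such as clamping the forget loss or penalizing divergence from the original model, but the basic structure is the same.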
What is "Unlearning"?
Making a model forget specific knowledge while keeping everything else it knows.
Why do this?
A summary of our conversation on understanding and building SAEs for LLM interpretability.
Neural networks like GPT are powerful but opaque. We'd like to understand them by looking at individual neurons, but there's a problem: single neurons don't correspond to single concepts. One neuron might activate for academic citations, English dialogue, HTTP requests, and Korean text all at once.
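Sparse autoencoders (SAEs) address this by re-expressing a layer's activations in a much wider dictionary of features, only a handful of which are active at a time, so each feature has a better chance of meaning exactly one thing. A minimal sketch, assuming residual-stream activations of width d_model; the dictionary size and L1 coefficient below are illustrative, not tied to any particular release:

```python
# Minimal sparse autoencoder (SAE) sketch for LLM activations.
# Encodes d_model-dim activations into a wider, mostly-zero feature
# vector, then reconstructs them; the L1 term enforces sparsity.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, d_features=16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # sparse codes
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(reconstruction, activations, features, l1_coeff=1e-3):
    mse = (reconstruction - activations).pow(2).mean()
    sparsity = features.abs().mean()  # drives most features to zero
    return mse + l1_coeff * sparsity
```

Each decoder column acts as a learned feature direction; interpretability work then asks which inputs make a given feature fire.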
After exhaustive experimentation across three models (Gemma-27B, GPT-OSS-20B, Llama-8B) and three SAE architectures (GemmaScope 2, Goodfire TopK, LlamaScope 32x), we have a clear answer:
Scalable oversight is the challenge of supervising AI systems that can produce work humans can't fully verify. This becomes a critical problem as AI approaches superhuman capabilities—if an AI can generate answers, code, or strategies too complex for any human to check, how do we know it's actually being helpful and honest rather than subtly deceptive or wrong? The field has emerged as one of the central problems in AI alignment, with multiple major labs developing complementary approaches. As of early 2025, some techniques (like Constitutional AI) are already deployed in production, while others (like debate and weak-to-strong generalization) show promising experimental results but face fundamental open questions about whether they'll scale to truly superhuman systems.
Think about how we currently train AI to be helpful and safe. The standard approach, RLHF (Reinforcement Learning from Human Feedback), relies on human evaluators comparing the model's outputs and rewarding the responses they prefer.
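The part humans directly supervise is the reward model, which is trained on those pairwise comparisons. A minimal sketch of that step (a Bradley-Terry style loss; `reward_model`, `chosen_batch`, and `rejected_batch` are placeholder names, not from these notes):

```python
# Sketch of the reward-model step in RLHF: humans compare two
# responses, and the model learns to score the chosen one higher.
# `reward_model` stands in for any network mapping a (prompt,
# response) encoding to a scalar score.
import torch
import torch.nn.functional as F

def preference_loss(reward_model, chosen_batch, rejected_batch):
    r_chosen = reward_model(chosen_batch)      # scalar score per example
    r_rejected = reward_model(rejected_batch)
    # Bradley-Terry: maximize P(chosen preferred) = sigmoid(r_c - r_r)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

This works only as long as human raters can reliably tell which response is better, which is exactly the assumption that breaks down for superhuman outputs.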
What I Did (For MATS Application)
The Problem
AI systems might learn to fake being helpful — acting nice when watched, but planning to misbehave later. Like an employee who's perfect when the boss is around, but slacks off otherwise. How do you catch that?
The Old Approach
```python
#!/usr/bin/env python3
"""
Experiment 7: Full 50k Feature Sweep
=====================================
Sweep ALL features, not just top 8 by correlation.
Find the true needles in the haystack.
Time estimate: ~8-10 hours on H100
Cost estimate: ~$20-25 on cloud GPU rental
"""
```
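The body of the script isn't included above; a minimal sketch of what the sweep itself might look like, assuming cached SAE feature activations and a per-example binary behavior label (both array names are hypothetical):

```python
# Hypothetical sweep body: correlate every SAE feature with a
# per-example behavior label instead of only the top-8 candidates.
# Assumes feature_acts is (n_examples, n_features) and labels is
# (n_examples,) with values in {0, 1}; both names are placeholders.
import numpy as np

def sweep_all_features(feature_acts: np.ndarray, labels: np.ndarray, top_k: int = 50):
    acts = feature_acts - feature_acts.mean(axis=0)
    labs = labels - labels.mean()
    # Pearson correlation of each feature column with the label.
    cov = acts.T @ labs / len(labs)
    denom = acts.std(axis=0) * labs.std() + 1e-8
    corr = cov / denom
    ranked = np.argsort(-np.abs(corr))
    return ranked[:top_k], corr[ranked[:top_k]]
```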