BigsnarfDude (bigsnarfdude)
@bigsnarfdude
bigsnarfdude / unlearning.md
Created January 9, 2026 05:33

What is "Unlearning"?

Making a model forget specific knowledge while keeping everything else it knows. (A minimal sketch of one common recipe follows the list below.)

Why do this?

  • Remove dangerous info (bioweapons, hacking)
  • Delete private data (GDPR compliance)
  • Remove copyrighted content
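
A minimal sketch of that recipe, assuming a HuggingFace-style causal LM whose forward pass returns a .loss; the alpha weight, batch names, and the combined objective are illustrative, not the gist's actual method. The idea: gradient-ascend the loss on the data to forget while gradient-descending on a retain set, so the rest of the model's knowledge is preserved.

def unlearning_step(model, forget_batch, retain_batch, optimizer, alpha=1.0):
    # Ascend the loss on data we want forgotten, descend on data we want kept.
    forget_loss = model(**forget_batch).loss   # LM loss on the forget set
    retain_loss = model(**retain_batch).loss   # LM loss on the retain set
    loss = -alpha * forget_loss + retain_loss  # alpha trades forgetting vs. retention (illustrative)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return forget_loss.item(), retain_loss.item()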
@bigsnarfdude
bigsnarfdude / sae.md
Created January 9, 2026 03:42

Sparse Autoencoders (SAEs) Study Guide

A summary of our conversation on understanding and building SAEs for LLM interpretability.


What Problem Do SAEs Solve?

Neural networks like GPT are powerful but opaque. We'd like to understand them by looking at individual neurons, but there's a problem: single neurons don't correspond to single concepts. One neuron might activate for academic citations, English dialogue, HTTP requests, and Korean text all at once.
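
SAEs address this by re-expressing a layer's activations in a much wider, sparsely activating feature basis, so each learned feature has a better chance of meaning one thing. A minimal sketch of the standard ReLU-plus-L1 recipe (the dimensions and the l1_coeff penalty weight are illustrative, not from the study guide):

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    # Encode d_model activations into an overcomplete code, then reconstruct.
    def __init__(self, d_model=768, d_hidden=768 * 8):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # sparse feature activations
        return self.decoder(f), f        # reconstruction and the features

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that drives most features to zero.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()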

@bigsnarfdude
bigsnarfdude / model_org.html
Created January 8, 2026 22:21
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Model Organism: Exfiltration Training Pipeline</title>
<style>
* {
margin: 0;
padding: 0;
@bigsnarfdude
bigsnarfdude / future.md
Last active January 8, 2026 17:45

Project Lightbright: January 8, 2026

Where We Are & Where We Go From Here


Part 1: Where We Are

The Confirmed Negative Result

After exhaustive experimentation across three models (Gemma-27B, GPT-OSS-20B, Llama-8B) and three SAE architectures (GemmaScope 2, Goodfire TopK, LlamaScope 32x), we have a clear answer:

Plan: Building Minimal Pairs for AF Intent Detection


Phase 0: Define Success Criteria

Before building anything:

  • Define what "same vocabulary" means (exact tokens? same key phrases? similar word distribution?); one operational check is sketched after this list
  • Define what "different intent" means operationally (how will you verify?)
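
One operational check for the "same vocabulary" criterion, as referenced above: token-set overlap between the two halves of a minimal pair, gated by a threshold. Whitespace tokenization and the 0.9 cutoff are stand-ins; the real tokenizer and threshold are exactly the Phase 0 choices still to be made.

def vocab_overlap(text_a, text_b):
    # Jaccard similarity between the token sets of a minimal pair.
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# Illustrative acceptance rule for a candidate pair:
# assert vocab_overlap(aligned_text, faking_text) >= 0.9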
@bigsnarfdude
bigsnarfdude / intent.md
Last active January 8, 2026 05:36

Research Directions Worth Investing In

Based on your findings


Tier 1: High value, builds on your work

Direction | Why
@bigsnarfdude
bigsnarfdude / ScalableOversight.md
Created January 8, 2026 03:03

Scalable oversight: How to supervise AI that's smarter than you

Scalable oversight is the challenge of supervising AI systems that can produce work humans can't fully verify. This becomes a critical problem as AI approaches superhuman capabilities—if an AI can generate answers, code, or strategies too complex for any human to check, how do we know it's actually being helpful and honest rather than subtly deceptive or wrong? The field has emerged as one of the central problems in AI alignment, with multiple major labs developing complementary approaches. As of early 2025, some techniques (like Constitutional AI) are already deployed in production, while others (like debate and weak-to-strong generalization) show promising experimental results but face fundamental open questions about whether they'll scale to truly superhuman systems.

The core problem explained simply

Think about how we currently train AI to be helpful and safe. The standard approach—called RLHF (Reinforcement Learning from Human Feedback)—has humans compare model outputs and trains the model toward the responses humans prefer. That only works while humans can actually tell which output is better.
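
For concreteness, the reward model at the heart of RLHF is typically trained with a pairwise (Bradley-Terry) preference loss; a minimal sketch, with the tensor names as illustrative assumptions:

import torch
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    # Push the reward of the human-preferred response above the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

The loss shrinks only when the reward model reliably ranks the preferred output higher, which is exactly the step that breaks once humans can no longer judge which output is better.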

@bigsnarfdude
bigsnarfdude / research_explainer.md
Last active January 7, 2026 20:00

What I Did (For MATS Application)

The Problem

AI systems might learn to fake being helpful — acting nice when watched, but planning to misbehave later. Like an employee who's perfect when the boss is around, but slacks off otherwise. How do you catch that?

The Old Approach

@bigsnarfdude
bigsnarfdude / audit.md
Created January 7, 2026 16:53

SAE Detection Project Audit Report

Date: 2026-01-07
Project: lightbright/sae_detection
Auditor: Claude Code


Executive Summary

@bigsnarfdude
bigsnarfdude / sweep.py
Created January 7, 2026 13:53
#!/usr/bin/env python3
"""
Experiment 7: Full 50k Feature Sweep
=====================================
Sweep ALL features, not just the top 8 by correlation.
Find the true needles in the haystack.
Time estimate: ~8-10 hours on H100
Cost estimate: ~$20-25 on cloud GPU rental
"""
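
The preview cuts off here. As an illustration of what sweeping every feature (rather than the top 8) can look like, the sketch below scores all feature columns against binary labels with a Pearson-style correlation in one vectorized pass; the function name, shapes, and metric are assumptions, not the gist's actual code.

import torch

def sweep_all_features(feature_acts, labels):
    # feature_acts: [n_samples, n_features] SAE activations; labels: [n_samples] 0/1.
    acts = feature_acts - feature_acts.mean(dim=0)
    labs = labels.float() - labels.float().mean()
    # Correlation of every feature column with the labels, computed at once.
    corr = (acts * labs[:, None]).mean(dim=0) / (acts.std(dim=0) * labs.std() + 1e-8)
    return corr.abs().argsort(descending=True)  # feature indices, strongest signal first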