activation-steering

Official code for "Activation Steering for Accent Adaptation in Speech Foundation Models" (Interspeech 2026). Parameter-free accent adaptation via mean-shift steering vectors — no weight updates, consistent WER reductions across 8 accents.

speech-recognition whisper asr interspeech accent-adaptation representation-engineering activation-steering qwen2-audio

Updated Mar 17, 2026
Python

dmis-lab / ASGuard

[ICLR 2026] ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

guard safety jailbreaking iclr interpretability activation-steering iclr2026

Updated Sep 30, 2025
Python

levashi / reprobe

Phase-aware LLM activation steering and linear probing. A memory-efficient, practical implementation of Representation Engineering (RepE) for safety research.

transformers pytorch ai-safety mechanistic-interpretability llm-safety representation-engineering activation-steering linear-probes

Updated Mar 25, 2026
Python

sharanya-dasgupta001 / ARREST

Accepted at the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2026.

ai-safety adversarial-learning distribution-shift llm hallucination-mitigation activation-steering

Updated Jan 18, 2026
Python

Gugan22 / feature-crystallisation-poc

Probing whether reasoning can be structurally crystallised into a small LLM via cyclic domain training + gap phases. Testing emergence at 300M parameters from scratch. SAE-based interpretability. Pruning crossover as depth measurement. Open collaboration.

collaboration pytorch transformer language-model sparse-autoencoders ai-safety open-research continual-learning feature-analysis mechanistic-interpretability chromadb small-language-model activation-steering custom-llm moral-reasoning emergent-capabilities neural-pruning feature-crystallization gap-phase-training

Updated Mar 15, 2026
Python

mc9625 / activation-steering-experiments

Activation steering toolkit for Llama 3.2 3B — inject sensory-constructed vectors into model activations to alter processing dispositions. Web UI + API. Runs locally on consumer hardware.

language-models interpretability ai-art llm activation-steering

Updated Feb 24, 2026
Python

iamfaham / llm_steering

An experimental comparison of prompt-based behavioral steering and activation steering in LLMs.

prompt steering-behaviors steering llms activation-steering

Updated Feb 13, 2026
Jupyter Notebook

aygp-dr / qwen3-steering

Qwen3-0.6B activation steering: style vectors, lens contamination eval, CPRR methodology

transformer style-transfer property-based-testing literate-programming superposition mechanistic-interpretability llm-evaluation small-language-models representation-engineering activation-steering qwen3 steering-vectors actadd cprr conceptual-lens-drift

Updated Mar 26, 2026
Python

asc-lep-ius / tame-swarm

Bio-inspired multi-scale competency architecture for LLMs. VCG auction routing (Mixture of Bidders) and activation steering for cognitive homeostasis, grounded in Michael Levin's TAME framework.

gradio vcg-auction tame fastapi llm llm-training activation-steering