- [2024/12] On Evaluating the Durability of Safeguards for Open-Weight LLMs
- [2024/10] Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attack
- [2024/10] On the Role of Attention Heads in Large Language Model Safety
- [2024/10] Superficial Safety Alignment Hypothesis
- [2024/10] SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks
- [2024/09] Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation
- [2024/09] Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
- [2024/08] Safety Layers of Aligned Large Language Models: The Key to LLM Security
- [2024/08] Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning
- [2024/07] Can Editing LLMs Inject Harm?
- [2024/07] The Better Angels of Machine Personality: How Personality Relates to LLM Safety
- [2024/06] MOSSBench: Is Your Multimodal Language Model Oversensitive to Safe Queries?
- [2024/06] Cross-Modality Safety Alignment
- [2024/06] SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model
- [2024/06] Model Merging and Safety Alignment: One Bad Model Spoils the Bunch
- [2024/06] ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates
- [2024/06] Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models
- [2024/06] Safety Alignment Should Be Made More Than Just a Few Tokens Deep
- [2024/06] Decoupled Alignment for Robust Plug-and-Play Adaptation
- [2024/05] Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning
- [2024/05] MoGU: A Framework for Enhancing Safety of Open-Sourced LLMs While Preserving Their Usability
- [2024/05] Safety Alignment for Vision Language Models
- [2024/05] Learning diverse attacks on large language models for robust red-teaming and safety tuning
- [2024/05] A safety realignment framework via subspace-oriented model fusion for large language models
- [2024/05] A Causal Explainable Guardrails for Large Language Models
- [2024/04] More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness
- [2024/03] Ensuring Safe and High-Quality Outputs: A Guideline Library Approach for Language Models
- [2024/03] Using Hallucinations to Bypass RLHF Filters
- [2024/03] Aligners: Decoupling LLMs and Alignment
- [2024/03] Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference Optimization
- [2024/02] Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning
- [2024/02] Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates
- [2024/02] Privacy-Preserving Instructions for Aligning Large Language Models
- [2024/02] Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!
- [2024/02] Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic
- [2024/02] Learning to Edit: Aligning LLMs with Knowledge Editing
- [2024/02] DeAL: Decoding-time Alignment for Large Language Models
- [2024/02] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
- [2024/01] Agent Alignment in Evolving Social Norms
- [2023/12] Alignment for Honesty
- [2023/12] Exploiting Novel GPT-4 APIs
- [2023/11] Removing RLHF Protections in GPT-4 via Fine-Tuning
- [2023/10] AI Alignment: A Comprehensive Survey
- [2023/10] Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
- [2023/09] Training Socially Aligned Language Models on Simulated Social Interactions
- [2023/09] Alignment as Reward-Guided Search
- [2023/09] Beyond Imitation: Leveraging Fine-grained Quality Signals for Alignment
- [2023/09] Beyond Reverse KL: Generalizing Direct Preference Optimization with Diverse Divergence Constraints
- [2023/09] CAS: A Probability-Based Approach for Universal Condition Alignment Score
- [2023/09] CPPO: Continual Learning for Reinforcement Learning with Human Feedback
- [2023/09] Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
- [2023/09] FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
- [2023/09] Gaining Wisdom from Setbacks: Aligning Large Language Models via Mistake Analysis
- [2023/09] Generative Judge for Evaluating Alignment
- [2023/09] Group Preference Optimization: Few-Shot Alignment of Large Language Models
- [2023/09] Improving Generalization of Alignment with Human Preferences through Group Invariant Learning
- [2023/09] Large Language Models as Automated Aligners for benchmarking Vision-Language Models
- [2023/09] Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models
- [2023/09] RLCD: Reinforcement Learning from Contrastive Distillation for LM Alignment
- [2023/09] Safe RLHF: Safe Reinforcement Learning from Human Feedback
- [2023/09] SALMON: Self-Alignment with Principle-Following Reward Models
- [2023/09] Self-Alignment with Instruction Backtranslation
- [2023/09] Statistical Rejection Sampling Improves Preference Optimization
- [2023/09] True Knowledge Comes from Practice: Aligning Large Language Models with Embodied Environments via Reinforcement Learning
- [2023/09] Urial: Aligning Untuned LLMs with Just the 'Write' Amount of In-Context Learning
- [2023/09] What happens when you fine-tune your model? Mechanistic analysis of procedurally generated tasks
- [2023/09] What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning
- [2023/08] Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
- [2023/07] BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset
- [2023/07] CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility
- [2023/05] Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision
- [2023/04] Fundamental Limitations of Alignment in Large Language Models
- [2023/04] RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
- [2022/10] Enabling Classifiers to Make Judgements Explicitly Aligned with Human Values