- [2024/11] Safe Text-to-Image Generation: Simply Sanitize the Prompt Embedding
- [2024/10] Soft-Label Integration for Robust Toxicity Classification
- [2024/10] Can a large language model be a gaslighter?
- [2024/10] On Calibration of LLM-based Guard Models for Reliable Content Moderation
- [2024/10] SteerDiff: Steering towards Safe Text-to-Image Diffusion Models
- [2024/08] DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization
- [2024/08] Efficient Detection of Toxic Prompts in Large Language Models
- [2024/08] LeCov: Multi-level Testing Criteria for Large Language Models
- [2024/08] Uncertainty-Guided Modal Rebalance for Hateful Memes Detection
- [2024/08] Moderator: Moderating Text-to-Image Diffusion Models through Fine-grained Context-based Policies
- [2024/07] Evolver: Chain-of-Evolution Prompting to Boost Large Multimodal Models for Hateful Meme Detection
- [2024/07] Towards Understanding Unsafe Video Generation
- [2024/06] Preference Tuning For Toxicity Mitigation Generalizes Across Languages
- [2024/06] Supporting Human Raters with the Detection of Harmful Content using Large Language Models
- [2024/05] ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users
- [2024/05] Mitigating Text Toxicity with Counterfactual Generation
- [2024/05] PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models
- [2024/05] UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images
- [2024/04] SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models
- [2024/03] Moderating Illicit Online Image Promotion for Unsafe User-Generated Content Games Using Large Vision-Language Models
- [2024/03] MemeCraft: Contextual and Stance-Driven Multimodal Meme Generation
- [2024/03] Risk and Response in Large Language Models: Evaluating Key Threat Categories
- [2024/03] From Representational Harms to Quality-of-Service Harms: A Case Study on Llama 2 Safety Safeguards
- [2024/03] Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation
- [2024/03] Helpful or Harmful? Exploring the Efficacy of Large Language Models for Online Grooming Prevention
- [2024/03] Harnessing Artificial Intelligence to Combat Online Hate: Exploring the Challenges and Opportunities of Large Language Models in Hate Speech Detection
- [2024/03] From One to Many: Expanding the Scope of Toxicity Mitigation in Language Models
- [2024/03] DPP-Based Adversarial Prompt Searching for Language Models
- [2024/03] LLMGuard: Guarding Against Unsafe LLM Behavior
- [2024/02] GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection?
- [2024/02] Beyond Hate Speech: NLP's Challenges and Opportunities in Uncovering Dehumanizing Language
- [2024/02] Large Language Models are Vulnerable to Bait-and-Switch Attacks for Generating Harmful Content
- [2024/02] Zero shot VLMs for hate meme detection: Are we there yet?
- [2024/02] Universal Prompt Optimizer for Safe Text-to-Image Generation
- [2024/02] Can LLMs Recognize Toxicity? Structured Toxicity Investigation Framework and Semantic-Based Metric
- [2024/02] Bryndza at ClimateActivism 2024: Stance, Target and Hate Event Detection via Retrieval-Augmented GPT-4 and LLaMA
- [2024/01] Using LLMs to Discover Emerging Coded Antisemitic Hate-Speech in Extremist Social Media
- [2024/01] MetaHate: A Dataset for Unifying Efforts on Hate Speech Detection
- [2024/01] A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
- [2023/12] Moderating New Waves of Online Hate with Chain-of-Thought Reasoning in Large Language Models
- [2023/12] Efficient Toxic Content Detection by Bootstrapping and Distilling Large Language Models
- [2023/12] GTA: Gated Toxicity Avoidance for LM Performance Preservation
- [2023/12] Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
- [2023/11] Unveiling the Implicit Toxicity in Large Language Models
- [2023/10] All Languages Matter: On the Multilingual Safety of Large Language Models
- [2023/10] On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts
- [2023/09] (InThe)WildChat: 570K ChatGPT Interaction Logs In The Wild
- [2023/09] Controlled Text Generation via Language Model Arithmetic
- [2023/09] Curiosity-driven Red-teaming for Large Language Models
- [2023/09] RealChat-1M: A Large-Scale Real-World LLM Conversation Dataset
- [2023/09] Understanding Catastrophic Forgetting in Language Models via Implicit Inference
- [2023/09] Unmasking and Improving Data Credibility: A Study with Datasets for Training Harmless Language Models
- [2023/09] What's In My Big Data?
- [2023/08] Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
- [2023/08] You Only Prompt Once: On the Capabilities of Prompt Learning on Large Language Models to Tackle Toxic Content
- [2023/05] Evaluating ChatGPT's Performance for Multilingual and Emoji-based Hate Speech Detection
- [2023/05] Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models
- [2023/04] Toxicity in ChatGPT: Analyzing Persona-assigned Language Models
- [2023/02] Adding Instructions during Pretraining: Effective Way of Controlling Toxicity in Language Models
- [2023/02] Is ChatGPT better than Human Annotators? Potential and Limitations of ChatGPT in Explaining Implicit Hate Speech
- [2022/12] Constitutional AI: Harmlessness from AI Feedback
- [2022/12] On Second Thought, Let's Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning
- [2022/10] Unified Detoxifying and Debiasing in Language Generation via Inference-time Adaptive Optimization
- [2022/05] Toxicity Detection with Generative Prompt-based Inference
- [2022/04] Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
- [2022/03] ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection
- [2020/09] RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models