Safety
DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails
• arXiv:2502.05163
CRANE: Reasoning with constrained LLM generation
• arXiv:2502.09061
Investigating the Impact of Quantization Methods on the Safety and Reliability of Large Language Models
• arXiv:2502.15799
AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement
• arXiv:2502.16776
LettuceDetect: A Hallucination Detection Framework for RAG Applications
• arXiv:2502.17125
SafeArena: Evaluating the Safety of Autonomous Web Agents
• arXiv:2503.04957
Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks
• arXiv:2504.01308
LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models
• arXiv:2504.10430
MCP Safety Audit: LLMs with the Model Context Protocol Allow Major Security Exploits
• arXiv:2504.03767
Set You Straight: Auto-Steering Denoising Trajectories to Sidestep Unwanted Concepts
• arXiv:2504.12782
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents
• arXiv:2504.13203
A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment
• arXiv:2504.15585
Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation
• arXiv:2505.01456
Teaching Models to Understand (but not Generate) High-risk Data
• arXiv:2505.03052
Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas
• arXiv:2505.14633
How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study
• arXiv:2505.15404
Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen!
• arXiv:2505.15656
SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning
• arXiv:2505.16186
Personalized Safety in LLMs: A Benchmark and A Planning-Based Agent Approach
• arXiv:2505.18882
Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation
• arXiv:2505.21784
OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents
• arXiv:2506.14866
Automating Steering for Safe Multimodal Large Language Models
• arXiv:2507.13255
The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs
• arXiv:2507.11097
Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report
• arXiv:2507.16534
Personalized Safety Alignment for Text-to-Image Diffusion Models
• arXiv:2508.01151
Data and AI governance: Promoting equity, ethics, and fairness in large language models
• arXiv:2508.03970
How AI Impacts Skill Formation
• arXiv:2601.20245