Safety
DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails
• arXiv:2502.05163
CRANE: Reasoning with constrained LLM generation
• arXiv:2502.09061
Investigating the Impact of Quantization Methods on the Safety and Reliability of Large Language Models
• arXiv:2502.15799
AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement
• arXiv:2502.16776
LettuceDetect: A Hallucination Detection Framework for RAG Applications
• arXiv:2502.17125
SafeArena: Evaluating the Safety of Autonomous Web Agents
• arXiv:2503.04957
Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks
• arXiv:2504.01308
LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models
• arXiv:2504.10430
MCP Safety Audit: LLMs with the Model Context Protocol Allow Major Security Exploits
• arXiv:2504.03767
Set You Straight: Auto-Steering Denoising Trajectories to Sidestep Unwanted Concepts
• arXiv:2504.12782
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents
• arXiv:2504.13203
A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment
• arXiv:2504.15585
Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation
• arXiv:2505.01456
Teaching Models to Understand (but not Generate) High-risk Data
• arXiv:2505.03052
Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas
• arXiv:2505.14633
How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study
• arXiv:2505.15404
Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen!
• arXiv:2505.15656
SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning
• arXiv:2505.16186
Personalized Safety in LLMs: A Benchmark and A Planning-Based Agent Approach
• arXiv:2505.18882
Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation
• arXiv:2505.21784
OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents
• arXiv:2506.14866
Automating Steering for Safe Multimodal Large Language Models
• arXiv:2507.13255
The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs
• arXiv:2507.11097
Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report
• arXiv:2507.16534
Personalized Safety Alignment for Text-to-Image Diffusion Models
• arXiv:2508.01151
Data and AI governance: Promoting equity, ethics, and fairness in large language models
• arXiv:2508.03970
How AI Impacts Skill Formation
• arXiv:2601.20245