Today I am releasing 105 open-source models for Personally Identifiable Information (PII) detection in French, German, and Italian.
All Apache 2.0 licensed. Free for commercial use. No restrictions.
Performance:
- French: 97.97% F1 (top model)
- German: 97.61% F1 (top model)
- Italian: 97.28% F1 (top model)

All top-10 models per language exceed 96% F1.
Coverage:
- 55+ PII entity types per language
- Native ID formats: NSS (French), Sozialversicherungsnummer (German), Codice Fiscale (Italian)
- Language-specific address, phone, and name patterns
European healthcare operates in European languages. Clinical notes, patient records, and medical documents are generated in French, German, Italian, and other languages.
Effective de-identification requires:
- Native language understanding, not translation
- Local ID format recognition: each country has unique patterns
- Cultural context awareness: names, addresses, and formats vary

These models deliver production-ready accuracy without requiring data to leave your infrastructure or language.
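To make "local ID format recognition" concrete, here is a minimal sketch of the surface patterns behind the three national IDs named above. It is an illustration only, not how the released models work: the regexes are simplified (no checksum validation, no handling of spaced or hyphenated variants), and `ID_PATTERNS` and `find_ids` are hypothetical names introduced for this example.

```python
import re

# Simplified surface patterns (assumption: compact, unspaced notation).
# Real identifiers also carry check digits that this sketch does not verify.
ID_PATTERNS = {
    # French NSS: sex digit, YYMM, department (2A/2B for Corsica),
    # 6 digits for commune + order number, 2-digit key = 15 characters.
    "fr_nss": re.compile(r"\b[12]\d{4}(?:\d{2}|2[AB])\d{6}\d{2}\b"),
    # German Sozialversicherungsnummer: 2-digit area, 6-digit birth date,
    # 1 letter, 2-digit serial, 1 check digit = 12 characters.
    "de_svn": re.compile(r"\b\d{8}[A-Z]\d{3}\b"),
    # Italian Codice Fiscale: 6 letters, 2 digits, month letter, 2 digits,
    # 1 letter + 3 digits (place code), 1 control letter = 16 characters.
    "it_cf": re.compile(r"\b[A-Z]{6}\d{2}[A-EHLMPR-T]\d{2}[A-Z]\d{3}[A-Z]\b"),
}


def find_ids(text):
    """Return (label, value) pairs for every pattern hit, in text order."""
    hits = []
    for label, pattern in ID_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(hits)]


if __name__ == "__main__":
    # Made-up sample values that only satisfy the surface format.
    sample = "NSS 185057800608436, SVN 12010188M011, CF RSSMRA85T10A562S"
    for label, value in find_ids(sample):
        print(label, value)
```

Patterns like these catch only the well-formed cases, which is why a model with native language understanding is needed for everything around them: names, addresses, and IDs written in free-form clinical text.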
HIPAA & GDPR Compliance

Built for US and European privacy regulations:
- On-premise deployment: Process data locally with zero external dependencies
- Data sovereignty: No API calls, no cloud services, no cross-border transfers
- Air-gapped capable: Deploy in fully isolated environments if required
- Regulatory-grade accuracy: Supporting Expert Determination standards

HIPAA and GDPR compliance across languages, without compliance gaps.
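As a rough illustration of the local de-identification step, the sketch below replaces detected spans with placeholders using nothing but the standard library, so no data leaves the machine. The `(start, end, label)` span format and the `redact` helper are assumptions made for this example, not the documented output of the released models.

```python
from typing import List, Tuple

# Hypothetical span format: (start, end, label) character offsets,
# as a detector might return them. Not the models' documented schema.
Span = Tuple[int, int, str]


def redact(text: str, spans: List[Span]) -> str:
    """Replace each detected span with a [LABEL] placeholder."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            # Skip spans that overlap one already redacted.
            continue
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


if __name__ == "__main__":
    note = "Patient Anna Rossi, CF RSSMRA85T10A562S."
    name, cf = "Anna Rossi", "RSSMRA85T10A562S"
    spans = [
        (note.index(cf), note.index(cf) + len(cf), "ID"),
        (note.index(name), note.index(name) + len(name), "NAME"),
    ]
    print(redact(note, spans))
```

Placeholders keep the note readable for downstream research use; swapping in surrogate values instead would be a different design choice with its own trade-offs.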
Use Cases
- Hospital EHR systems: Automated patient record de-identification
- Clinical research: Multilingual dataset preparation for studies
- Insurance companies: Claims processing across
🚨 Day 8/8: OpenMed Medical Reasoning Dataset Release - THE GRAND FINALE
Today I complete my 8-day release series with Medical-Reasoning-SFT-Mega, the largest open medical reasoning dataset, combining 7 state-of-the-art AI models with fair-distribution deduplication.
> Quality ≈ 3–4B dense, yet faster than Qwen3-1.7B
> MoE designed to run on phones/laptops (llama.cpp / vLLM)
> Pre-trained on 12T tokens → strong math/code/IF
Liquid just released two VLMs, with 450M and 1.6B params!

They're super fast and leverage SigLIP2 NaFlex encoders to handle native resolutions without distortion. They're ideal for on-device deployment in constrained environments like phones.

They're available today on Hugging Face, with inference and fine-tuning Colab notebooks.
🧬 Breaking news in Clinical AI: Introducing the OpenMed NER Model Discovery App on Hugging Face 🔬
OpenMed is back! 🔥 Finding the right biomedical NER model just became as precise as a PCR assay!
I'm thrilled to unveil my comprehensive OpenMed Named Entity Recognition Model Discovery App that puts 384 specialized biomedical AI models at your fingertips.
🎯 Why This Matters in Healthcare AI: Traditional clinical text mining required hours of manual model evaluation. My Discovery App instantly connects researchers, clinicians, and data scientists with the exact NER models they need for their biomedical entity extraction tasks.
🔬 What You Can Discover:
✅ Pharmacological Models - Extract "chemical compounds", "drug interactions", and "pharmaceutical" entities from clinical notes
✅ Genomics & Proteomics - Identify "DNA sequences", "RNA transcripts", "gene variants", "protein complexes", and "cell lines"
✅ Pathology & Disease Detection - Recognize "pathological formations", "cancer types", and "disease entities" in medical literature
✅ Anatomical Recognition - Map "anatomical systems", "tissue types", "organ structures", and "cellular components"
✅ Clinical Entity Extraction - Detect "organism species", "amino acids", "protein families", and "multi-tissue structures"
💡 Advanced Features:
🔍 Intelligent Entity Search - Find models by specific biomedical entities (e.g., "Show me models detecting CHEM + DNA + Protein")
🏥 Domain-Specific Filtering - Browse by Oncology, Pharmacology, Genomics, Pathology, Hematology, and more
📊 Model Architecture Insights - Compare BERT, RoBERTa, and DeBERTa implementations
⚡ Real-Time Search - Auto-filtering as you type, no search buttons needed
🎨 Clinical-Grade UI - Beautiful, intuitive interface designed for medical professionals
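For a feel of what an entity search like "CHEM + DNA + Protein" boils down to, here is a minimal sketch of subset filtering over a model catalog. The catalog entries, model names, and the `find_models` helper are hypothetical placeholders for illustration; they are not real OpenMed identifiers and not the app's actual implementation.

```python
from typing import Dict, List, Set

# Hypothetical catalog: placeholder model names mapped to the entity
# labels each one detects. None of these are real model identifiers.
CATALOG: Dict[str, Set[str]] = {
    "example/ner-chem": {"CHEM"},
    "example/ner-genomics": {"DNA", "RNA", "PROTEIN"},
    "example/ner-multi": {"CHEM", "DNA", "PROTEIN", "CELL_LINE"},
}


def find_models(required: Set[str], catalog: Dict[str, Set[str]] = CATALOG) -> List[str]:
    """Return the models whose label set covers every requested entity."""
    wanted = {label.upper() for label in required}
    return sorted(name for name, labels in catalog.items() if wanted <= labels)


if __name__ == "__main__":
    print(find_models({"CHEM", "DNA", "Protein"}))
```

Requiring all requested entities (a subset test) matches the "CHEM + DNA + Protein" phrasing; an any-of search would use set intersection instead.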
Ready to revolutionize your biomedical NLP pipeline?
Super excited to launch Hugging Face Sheets: Spreadsheets meet AI and unstructured data.
A few months ago, we started imagining new ways to build and transform datasets with the latest open-source models.
Today, I'm thrilled to introduce our first step in this direction.
In a nutshell:
📁 Effortlessly run prompts and models over your data.
🌐 Agentic search for accuracy and real-time information.
🖼️ Familiar, minimalistic interface for interacting with data.
🎯 Human feedback 2.0: Your input directly improves generated data.
💯 Access hundreds of open models and leading inference providers.
I updated the LLM Scientist roadmap and added a ton of new information and references. It covers training, datasets, evaluation, quantization, and new trends like test-time compute scaling.
The LLM Course has been incredibly popular (41.3k stars!) and I've been touched to receive many, many messages about how it helped people in their careers.
I know how difficult this stuff can be, so I'm super proud of the impact it had. I want to keep updating it in 2025, especially with the LLM Engineer roadmap.