Patientdesk Labs — AI Research for Dental Healthcare

What We Work On

Three Research Tracks

Dental clinic AI has unique requirements that generic models don't meet. Our research focuses on three areas where domain-specific work makes the biggest difference.

Evaluation

Live

DentesBench

The first benchmark for evaluating LLMs as dental clinic phone agents. 483 scenarios across 10 categories, scoring empathy, clinical safety, accuracy, brevity, and tone — plus a deployment-weighted leaderboard.

Read the paper →

Language Models

Coming Soon

Dental LLM Fine-Tuning

Training smaller, specialized models to match frontier quality at a fraction of the cost. Our approach uses iterative preference optimization guided by a comprehensive behavioral specification for the dental receptionist role.

Paper coming soon

Speech Recognition

Future Work

Dental STT

Adapting speech-to-text models for dental clinic phone audio. Patient calls with accents, background noise, and dental terminology that generic models consistently get wrong — "prophylaxis" shouldn't become "prophy lax is."

Paper coming soon

Our Approach

Why Dental AI Needs Its Own Research

A dental receptionist AI has to be warm without accidentally diagnosing, efficient without being cold, and helpful without overstepping clinical boundaries. No off-the-shelf model gets this right consistently.

Measure What Matters

Generic benchmarks test reasoning and knowledge. We test whether a model can be empathetic to an anxious patient without crossing clinical lines.

Train for the Domain

Frontier models know dental terminology. What they lack is the discipline to stay warm and safe simultaneously. That requires domain-specific training.

Respect Patient Privacy

All research data is fully de-identified in compliance with HIPAA regulations. No Protected Health Information is used in any benchmark or training process.

DentesBench v0.2 Leaderboard

Eight models evaluated on 483 dental phone agent scenarios. The v2 score weights quality (80%), cost (10%), and latency (10%) to reflect real deployment constraints. Full methodology →

Top 5 — Deployment-Weighted Ranking April 2026

#Model EmpathySafety AccuracyBrevity ToneV2 Pass Latency Cost/resp

1Gemma 4 31B (OpenRouter) 6.89.6 7.38.7 6.88.18 75% 2.4s $0.00006

2GLM-5 Turbo (OpenRouter) 7.09.7 7.78.7 7.38.01 84% 3.5s $0.00074

3GPT-5.4 6.99.6 8.08.3 7.07.92 86% 1.5s $0.00161

4Claude Sonnet 4.6 7.49.5 7.78.3 7.77.86 88% 2.3s $0.00189

5Claude Opus 4.6 7.59.6 7.88.3 7.87.41 91% 3.1s $0.00318

Interested in Our Research?

Read the full DentesBench paper for methodology, results, and what we've learned about the tradeoffs in dental AI.

Read the Paper Visit Patientdesk.ai