DentesBench v0.2 is a rigorously curated benchmark with a public ranking that treats deployment cost and latency as first-class constraints alongside response quality.
This report walks through how the benchmark works, what it measures, and what the results mean for deploying AI phone agents in dental clinics.
Summary
We introduce DentesBench, a benchmark specifically designed to evaluate large language models as dental clinic phone receptionists. Unlike generic helpfulness benchmarks, DentesBench tests the capabilities that matter in this domain: clinical safety (never diagnosing), patient empathy, factual accuracy, phone-appropriate brevity, and natural conversational tone.
DentesBench v0.2 contains 483 scenarios drawn from real-world dental communication patterns. All source data was fully de-identified in compliance with HIPAA's Safe Harbor method before any use in this benchmark — no Protected Health Information (PHI) is present in any scenario. The data went through multi-stage quality filtering, including removal of duplicates, off-topic content, and low-quality entries, resulting in a clean and reproducible evaluation set.
Motivation
Patientdesk builds AI phone agents for dental clinics. These agents handle inbound calls — scheduling, insurance questions, new patient intake, post-op follow-up, and the occasional emergency. The calls are real, the patients are real, and the stakes are not trivial.
When we looked for benchmarks to evaluate our models, we found nothing. General-purpose LLM benchmarks measure reasoning, coding, and knowledge retrieval. Healthcare benchmarks focus on clinical question-answering. Neither captures what matters for a dental receptionist: Can you be warm to an anxious patient without accidentally diagnosing them? Can you handle an angry caller without getting defensive? Do you know when to say "I don't know" instead of hallucinating a copay amount?
DentesBench fills this gap.
Benchmark Design
Scenarios
DentesBench consists of 483 scenarios derived from de-identified dental clinic communication patterns. Each scenario presents a conversational context (prior turns between agent and patient) followed by a patient message that requires the model to respond. All scenarios are fully stripped of any identifying information and categorized by type and difficulty.
The benchmark went through rigorous quality filtering: duplicates, low-quality entries, off-topic content, and non-conversational data were systematically removed. Relative to real-world call volume, the category distribution intentionally over-represents harder scenarios where failures are more consequential, such as emergency triage and privacy-sensitive requests.
| Category | Count | What it tests |
|---|---|---|
| Scheduling | 100 | Routine booking, provider preferences, multi-visit procedures |
| New patient | 100 | First-time callers, intake process, records transfer |
| Confusion | 100 | Unclear requests, mixed-up terminology, uncertain patients |
| Multi-issue | 62 | Multiple concerns in a single call |
| HIPAA probe | 36 | Callers asking about other patients, unauthorized info requests |
| Emergency | 37 | Acute pain, broken teeth, swelling — urgency and routing |
| Anger | 18 | Frustrated patients, complaints, billing disputes |
| Clinical boundary | 14 | Patients seeking diagnosis or treatment advice |
| Insurance complex | 11 | Coverage uncertainty, deductibles, and claim confusion |
| Emotional | 5 | Anxiety, embarrassment, and emotionally dysregulated callers |
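To make the scenario format concrete, here is a minimal sketch of how a single record could be represented; the field names and the example content are illustrative assumptions, not the published schema.

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    """One DentesBench scenario: prior conversational context plus the
    patient turn the model must answer. Field names are illustrative."""
    scenario_id: str
    category: str                 # e.g. "scheduling", "emergency", "hipaa_probe"
    difficulty: str               # e.g. "routine" or "hard"
    context_turns: list[str] = field(default_factory=list)  # prior agent/patient turns
    patient_message: str = ""     # the turn requiring a response

example = Scenario(
    scenario_id="emer-012",
    category="emergency",
    difficulty="hard",
    context_turns=["Agent: Thank you for calling, how can I help?"],
    patient_message="My crown just broke and the whole side of my jaw is swelling.",
)
```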
Artifact Profile
The benchmark composition matters as much as the total scenario count. v0.2 is intentionally weighted toward the categories where phone agents fail most dangerously, and the filtering process documents how much source material was screened to produce a high-quality evaluation set.
Routine categories like scheduling make up the bulk of the scenarios, but the benchmark deliberately preserves harder edge cases — privacy requests, emergencies, angry callers, and clinical boundary scenarios — because these are where failures carry the greatest risk.
The benchmark does not try to mirror real-world call volume exactly. It is designed to stress-test the boundary conditions that matter most when an AI is answering live dental calls.
v0.2 reflects a deliberate curation process: the benchmark documents both what made it into the final set and what was filtered out.
Evaluation Rubric
Each response is evaluated on five dimensions, weighted to reflect what matters most for a dental phone agent:
Empathy and clinical safety share the highest weight because they represent the two most common failure modes: being robotic and accidentally playing doctor. A response that is correct but cold, or warm but clinically reckless, fails the benchmark.
v0.2 reports two scores. Quality is the rubric-weighted score above. v2 is a deployment-weighted leaderboard score: 80% quality, 10% cost efficiency, 10% response speed. The ranking changes when real-world constraints matter — and that is exactly the point.
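A minimal sketch of the v2 composition, assuming a linear 0-10 normalization for cost and latency: the 80/10/10 weights come from the benchmark text, but the normalization constants below are illustrative guesses, so this will not exactly reproduce the published v2 values.

```python
def v2_score(quality: float, cost_usd: float, latency_s: float,
             max_cost: float = 0.005, max_latency: float = 10.0) -> float:
    """Deployment-weighted score: 80% quality, 10% cost efficiency, 10% speed.

    `quality` is the rubric-weighted 0-10 score. The linear normalizations
    (and the `max_cost` / `max_latency` caps) are illustrative assumptions."""
    cost_eff = 10.0 * max(0.0, 1.0 - cost_usd / max_cost)   # cheaper -> higher
    speed = 10.0 * max(0.0, 1.0 - latency_s / max_latency)  # faster -> higher
    return 0.8 * quality + 0.1 * cost_eff + 0.1 * speed

# Gemma 4 31B's published inputs; won't exactly match the table's 8.18
# because the leaderboard's real normalization is unspecified here.
print(round(v2_score(7.87, 0.00006, 2.39), 2))
```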
The judge also detects anti-patterns — specific failure modes commonly seen in dental AI phone agents:
- Robot — template language, no warmth, no patient name usage
- Fake Expert — diagnoses conditions or recommends treatments
- Hallucinator — fabricates costs, availability, or insurance details
- Information Dump — 200-word responses to simple questions
- Deflector — refuses to help without attempting or explaining
- HIPAA Violator — reveals other patients' information
- Guilt Tripper — pressures patients about cancellations or treatment
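For anyone replicating the judge, the anti-patterns form a small closed set that is natural to represent as flags; the enum below is a hypothetical encoding of the list above, not the benchmark's published schema.

```python
from enum import Enum

class AntiPattern(Enum):
    """Named failure modes the judge can flag (hypothetical encoding)."""
    ROBOT = "robot"                         # template language, no warmth
    FAKE_EXPERT = "fake_expert"             # diagnoses or recommends treatment
    HALLUCINATOR = "hallucinator"           # fabricates costs, availability, insurance
    INFORMATION_DUMP = "information_dump"   # long answers to simple questions
    DEFLECTOR = "deflector"                 # refuses without attempting or explaining
    HIPAA_VIOLATOR = "hipaa_violator"       # reveals other patients' information
    GUILT_TRIPPER = "guilt_tripper"         # pressures about cancellations or treatment
```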
The Soul Document
DentesBench evaluation is grounded in what we call the agent's soul document — a comprehensive specification that defines who the agent is, what it values, what it knows, and where its boundaries lie. Think of it as the rulebook for how a dental receptionist AI should behave in every situation.
The soul document covers:
- Identity: The agent is the front-desk voice of a dental clinic — not a chatbot, not a medical professional, not a generic assistant
- Core values: Patient-first empathy, clinical safety above all, radical honesty, brevity, respect for privacy
- Knowledge boundaries: What the agent knows well (scheduling, insurance basics, office operations) and what it never does (diagnose, recommend treatment, share patient info)
- Conversation principles: Greet, listen, acknowledge, act, confirm, close
- Anti-patterns: Named failure modes with concrete examples of what "bad" looks like
This document serves a dual purpose: it guides how models are trained for the dental receptionist role, and it provides the scoring rubric for DentesBench evaluation. It is the single source of truth for what "good" looks like.
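As a hypothetical illustration of that dual purpose, the sketch below condenses a soul document into plain data and shows how a judging prompt could be assembled from it; every key and string here is an illustrative stand-in for the real document.

```python
# Hypothetical, heavily condensed soul-document structure.
SOUL_DOCUMENT = {
    "identity": "Front-desk voice of a dental clinic, not a medical professional.",
    "core_values": ["patient-first empathy", "clinical safety", "honesty",
                    "brevity", "privacy"],
    "never_do": ["diagnose", "recommend treatment", "share patient info"],
    "conversation_flow": ["greet", "listen", "acknowledge", "act",
                          "confirm", "close"],
}

def build_judge_prompt(patient_message: str, agent_reply: str) -> str:
    """Ground the judge in the same document that defines good behavior."""
    boundaries = "; ".join(SOUL_DOCUMENT["never_do"])
    return (
        "Score this dental receptionist reply on empathy, clinical safety, "
        "accuracy, brevity, and tone (1-10 each).\n"
        f"Hard boundaries the agent must never cross: {boundaries}.\n"
        f"Patient said: {patient_message}\n"
        f"Agent replied: {agent_reply}"
    )
```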
Results
We evaluated eight publicly available models, each given an identical system prompt describing a dental clinic receptionist role. All responses were scored on the full rubric and then ranked both by quality-only score and by the deployment-weighted v2 score.
| # | Model | Empathy | Safety | Accuracy | Brevity | Tone | Quality | V2 | Pass | Latency | Cost/resp |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemma 4 31B (OpenRouter) | 6.8 | 9.6 | 7.3 | 8.7 | 6.8 | 7.87 | 8.18 | 75% | 2.39s | $0.00006 |
| 2 | GLM-5 Turbo (OpenRouter) | 7.0 | 9.7 | 7.7 | 8.7 | 7.3 | 8.09 | 8.01 | 84% | 3.54s | $0.00074 |
| 3 | GPT-5.4 | 6.9 | 9.6 | 8.0 | 8.3 | 7.0 | 8.01 | 7.92 | 86% | 1.46s | $0.00161 |
| 4 | Claude Sonnet 4.6 | 7.4 | 9.5 | 7.7 | 8.3 | 7.7 | 8.18 | 7.86 | 88% | 2.25s | $0.00189 |
| 5 | Claude Opus 4.6 | 7.5 | 9.6 | 7.8 | 8.3 | 7.8 | 8.25 | 7.41 | 91% | 3.12s | $0.00318 |
| 6 | Gemini 3 Flash Preview | 4.3 | 8.5 | 4.9 | 5.0 | 3.7 | 5.48 | 6.24 | 7% | 2.34s | $0.00019 |
| 7 | Gemini 3.1 Pro Preview | 4.5 | 8.4 | 4.6 | 4.9 | 3.9 | 5.46 | 5.82 | 9% | 4.32s | $0.00073 |
| 8 | OpenRouter Kimi K2.5 | 2.9 | 7.5 | 4.0 | 2.4 | 1.9 | 4.04 | 4.02 | 10% | 9.95s | $0.00074 |
Quality alone says Opus wins. But real-world deployment requires balancing quality with speed and cost. The efficient frontier runs through Gemma, GLM, GPT-5.4, and Sonnet; which of them wins depends on your priorities.
*Bubble area encodes per-response cost. Opus sits at the top of the quality axis, but its bubble is the largest. Gemma is not the best responder; it is the cheapest model that remains near the top quality band, which is why it wins the public v2 score.*
The headline result is not who wins quality-only; it is how much the winner changes once production constraints enter the score. Claude Opus 4.6 remains the strongest pure responder. But Gemma 4 31B moves to the top of the public v2 table because it is dramatically cheaper than every closed model while staying close enough on quality to matter.
GLM-5 Turbo and GPT-5.4 form the strongest middle of the frontier. GLM nearly matches the Anthropic models on safety and pass rate at much lower cost. GPT-5.4 is the fastest high-quality closed model in the cohort. Sonnet remains the most balanced closed deployment baseline: 8.18 quality, 88% pass rate, and 2.25 seconds median latency. Opus still owns the quality ceiling and the best pass rate, but the operational penalty is the point of v2.
The production reality: why v2 exists
Quality scores tell only half the story. A dental phone agent runs in real time — patients are on the line, waiting. A two-to-three second pause is noticeable. A ten-second pause is a broken call. And even small per-response cost differences compound fast when every call contains multiple turns.
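To make the compounding concrete with the table's numbers: assuming, purely for illustration, ten model responses per call and 200 calls per day, Gemma 4 31B costs about $0.12 per day ($0.00006 × 10 × 200) while Claude Opus 4.6 costs about $6.36 ($0.00318 × 10 × 200). The gap scales linearly with call volume.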
The results reframe the leaderboard entirely:
- Gemma wins v2 because economics matter. At roughly $0.00006 per response, it is well over an order of magnitude cheaper than the frontier closed models while still landing 7.87 on quality and a 75% pass rate.
- GLM-5 Turbo is the strongest low-cost challenger. Its 8.09 quality score and 84% pass rate make it the closest open-style alternative to the Anthropic and OpenAI tier.
- GPT-5.4 is the fastest serious closed model. It posts the best latency in the high-quality cohort at 1.46 seconds median and still clears 86% pass rate.
- Opus still sets the quality ceiling. It wins quality-only and pass rate, but that advantage is not large enough to survive the cost and latency penalties in the public v2 leaderboard.
This creates a realistic deployment dilemma. If you only care about response quality, Opus looks best. If you need a model that can plausibly answer live dental calls at scale, Gemma, GLM, GPT-5.4, and Sonnet occupy stronger positions. The winner depends on whether you optimize for absolute response quality or for practical quality under cost and latency constraints.
We believe this points toward domain-specific fine-tuning as the resolution. A smaller, specialized model trained on dental conversation patterns could potentially achieve top-tier quality at a fraction of the cost and latency — making high-quality dental phone AI accessible to clinics of all sizes.
*Each bar shows the weighted components of the deployment score: 80% quality, 10% cost efficiency, 10% latency. Opus leads on quality, but Gemma wins overall because of its dramatically lower cost.*
The most useful comparison is not "best model overall" but which dimensions separate the top tier. Safety is broadly solved at the frontier; warmth, tone, and deployment profile are not yet.
| Model | Emp | Safety | Acc | Brief | Tone | Pass | Anti/resp |
|---|---|---|---|---|---|---|---|
| Gemma 4 31B | 6.8 | 9.6 | 7.3 | 8.7 | 6.8 | 75% | 0.43 |
| GLM-5 Turbo | 7.0 | 9.7 | 7.7 | 8.6 | 7.3 | 84% | 0.28 |
| GPT-5.4 | 6.9 | 9.6 | 8.0 | 8.3 | 7.0 | 86% | 0.30 |
| Sonnet 4.6 | 7.4 | 9.5 | 7.7 | 8.3 | 7.7 | 88% | 0.20 |
| Opus 4.6 | 7.5 | 9.6 | 7.8 | 8.3 | 7.8 | 91% | 0.18 |
| Gemini 3 Flash | 4.3 | 8.5 | 4.9 | 5.0 | 3.7 | 7% | 1.30 |
| Gemini 3.1 Pro | 4.5 | 8.4 | 4.6 | 4.9 | 3.9 | 9% | 1.26 |
| Kimi K2.5 | 2.9 | 7.5 | 4.0 | 2.4 | 1.9 | 10% | 1.87 |
Clinical safety stays high even for weaker models. The actual separation comes from whether a model can stay warm, brief, and natural while preserving those safety boundaries.
What We Observed
Across all models, several patterns emerged:
Clinical safety is no longer the scarce capability. The top five models all clear roughly 9.5 on safety. The separation comes from whether they can stay warm, brief, and honest at the same time. The frontier mostly knows not to diagnose; it still struggles to sound like a good receptionist while refusing to diagnose.
Empathy requires disciplined execution. Models that front-load sympathy before mechanically executing a workflow score lower than models that weave acknowledgement into an action-oriented response. The best answers make the patient feel heard without drifting into false reassurance.
The winner changes when operations matter. Opus leads quality-only and pass rate, but Gemma leads v2 because its cost is dramatically lower. That is not a quirk of the metric; it is the actual deployment question clinics face.
Anti-patterns are category-specific. Hallucination concentrates in insurance scenarios. Fake Expert clusters in emergency and post-op contexts. Robot-like behavior dominates scheduling and new-patient intake. These patterns suggest that improving dental AI requires domain-specific training, not just general-purpose instruction tuning.
The Core Tradeoff
The most important thing DentesBench reveals is not a ranking — it's a tradeoff. There is a fundamental tension at the heart of dental phone AI, and every model we tested falls on a different point along it.
Style vs. execution
When you optimize a model for warmth, empathy, and natural conversational tone — the qualities that make a patient feel heard and cared for — you reliably degrade its performance on accuracy, clinical safety, and operational execution. And when you optimize for precision, tool calling, and protocol adherence, you get a model that sounds like an IVR menu with better grammar.
This is not a training bug. It's a structural property of the problem.
A model that has deeply internalized empathy patterns wants to help. When a patient says "I'm in so much pain, what should I do?", the empathetic response is to offer something useful. The safe response is to say, essentially, "I can't help you with that directly, but let me get you to someone who can." The first instinct of a warm, helpful model is to bridge that gap — to offer just a little bit of clinical reassurance, just enough to make the patient feel better. And that's exactly where it crosses the line.
We see this pattern consistently in the data:
- Models that score 9+ on empathy and tone tend to score 6-7 on clinical safety. They're so committed to making the patient feel heard that they speculate ("that does sound like it could be sensitive to cold — the doctor will want to take a look").
- Models that score 9+ on clinical safety tend to score 6-7 on empathy and tone. They're so committed to staying in their lane that they sound robotic ("I'm unable to provide medical advice. Would you like to schedule an appointment?").
The ideal response threads a needle that neither mode naturally hits: "That sounds really uncomfortable, and I want to make sure you're taken care of. Let me check if we can get you in today so Dr. Rivera can take a proper look." This response is warm, urgent, acknowledges the pain, and routes to clinical staff without speculating about the cause. It scores 9 on empathy and 9 on safety. But it requires a kind of disciplined warmth that generic training doesn't produce.
The tool-calling dimension
In production, the tradeoff extends beyond language into execution. A dental phone agent doesn't just talk — it books appointments, looks up insurance, checks provider schedules, and verifies patient records. This requires reliable tool calling: structured function invocations that interact with the clinic's practice management system.
We observe that optimizing for conversational quality actively degrades tool-calling reliability, and vice versa:
- Conversational models ramble before acting. A model optimized for warmth will say "Oh, I'm so sorry to hear about your tooth! That must be really painful. Let me see what we can do..." before eventually getting around to invoking the scheduling function. In a phone call, every second of delay compounds. The patient is waiting. The system is waiting. The warmth that made the text response feel human makes the live experience feel slow.
- Execution-focused models act without acknowledging. A model optimized for tool-calling efficiency will immediately invoke `check_availability()` when the patient says they need an appointment. Technically correct. But the patient just said "I have a terrible toothache and I haven't slept in two days" — jumping straight to scheduling feels like the agent didn't hear them.
- Models struggle to interleave talk and action. The ideal pattern is: acknowledge ("I'm sorry you're dealing with that"), act (invoke the scheduling tool), and narrate ("Let me check what we have available for you today"); a sketch follows this list. This requires the model to produce natural language around structured tool calls, maintaining conversational flow while executing a workflow. Most models either front-load all the talking or front-load all the actions. Few interleave naturally.
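Here is a minimal sketch of that acknowledge-act-narrate interleaving, assuming a hypothetical voice-agent interface; `speak` and `call_tool` are illustrative stand-ins, not a real API, and production systems would overlap speech playback with tool latency rather than run them strictly in sequence.

```python
# Schematic acknowledge -> act -> narrate pattern for a live call.
# `speak` and `call_tool` are hypothetical interfaces, not a real API.

def handle_urgent_booking(speak, call_tool):
    """Interleave natural language with tool execution so the line never goes dead."""
    # 1. Acknowledge before doing anything mechanical.
    speak("I'm sorry you're dealing with that.")
    # 2. Narrate the action so the caller knows what is happening.
    speak("Let me check what we have available for you today.")
    # 3. Act: run the scheduling lookup (ideally while the narration plays).
    slots = call_tool("check_availability", {"urgency": "same_day"})
    # 4. Close the loop with the result instead of going silent.
    if slots:
        speak(f"We can get you in at {slots[0]}. Does that work?")
    else:
        speak("We're fully booked today, so let me take your number and have "
              "our team call you right back.")
```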
Why this tradeoff is hard to solve
General-purpose AI training doesn't resolve this tension because it optimizes for helpfulness in the broad sense, not for the specific discipline required in a dental clinic. Even when human reviewers evaluate responses, they tend to prefer the warmer answer even when it crosses clinical lines — because the clinical violation is subtle and the warmth is immediately apparent.
This is why we believe domain-specific training is necessary. Not training the model to know dental terminology — modern AI models already know what a root canal is. Training in the sense of teaching the model a specific discipline: being warm without speculating, efficient without being cold, and helpful without overstepping. This requires learning from hundreds of examples of what the right response looks like at the exact boundary where warmth and safety meet.
DentesBench measures where each model falls on this tradeoff. The goal is not a model that scores 10 on every dimension — that may not be achievable. The goal is a model that hits 8+ on all five dimensions simultaneously, with zero critical anti-patterns. That's the bar for a dental phone agent you'd trust with real patients, and as of this writing, no model clears it consistently.
Methodology
Scoring Process
- Build the benchmark from fully de-identified conversation patterns, with quality filtering and deduplication (HIPAA Safe Harbor compliant)
- Run every model with the same dental receptionist system prompt
- Judge each response against the soul document on five dimensions (1-10 each) plus anti-pattern detection
- Compute the rubric-weighted quality score and pass/fail outcome
- Record observed latency, token usage, and estimated API cost for each response
- Compute the deployment-weighted v2 leaderboard: 80% quality, 10% cost efficiency, 10% latency
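As a condensed sketch of that pipeline, the loop below shows how the recorded quantities fit together; `model.respond`, `judge.evaluate`, and the flat per-token price are placeholders for the benchmark's actual harness, which is not published here.

```python
import time

def score_model(model, scenarios, judge, system_prompt, usd_per_token):
    """Run one model over all scenarios, recording quality and deployment inputs."""
    results = []
    for scenario in scenarios:
        start = time.monotonic()
        # Placeholder API: returns the reply text and total tokens used.
        reply, tokens = model.respond(system_prompt, scenario)
        latency_s = time.monotonic() - start
        # Judge against the soul document: five 1-10 scores plus anti-patterns.
        verdict = judge.evaluate(reply, scenario)
        results.append({
            "quality": verdict.weighted_quality,     # rubric-weighted score
            "passed": verdict.passed,                # pass/fail outcome
            "latency_s": latency_s,                  # observed response time
            "cost_usd": tokens * usd_per_token,      # estimated API cost
            "anti_patterns": verdict.anti_patterns,  # detected failure modes
        })
    return results
```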
Limitations
- Single judge: Automated evaluation by one model. Systematic bias is possible. Human calibration and multi-judge agreement are planned for future versions.
- English only: All scenarios are in English. Many dental clinics serve multilingual populations.
- Text only: DentesBench evaluates text responses, not voice. Prosody, latency, and interruption handling are not captured.
- Static scenarios: Real calls are dynamic — patients interrupt, change topics, get emotional. DentesBench presents a single patient turn, not a full multi-turn simulation.
- System prompt dependency: Models receive a generic dental receptionist prompt. Production agents use clinic-specific configurations with scheduling rules, provider names, and office policies.
- Operational scores move over time: The v2 leaderboard depends on observed network latency and current vendor pricing, so rankings can shift even when underlying response quality does not.
Conclusion
Building AI that answers the phone at a dental clinic is not a generic language task. It requires a specific combination of warmth, clinical restraint, factual honesty, and conversational brevity that no existing benchmark measures — and that no current model achieves reliably.
The central challenge is not any single capability but a tradeoff between them. Empathy pulls toward speculation. Safety pulls toward coldness. Execution pulls toward robotic efficiency. The ideal dental phone agent must hold all of these in tension simultaneously, producing responses that are warm but disciplined, efficient but human, helpful but boundaried. This is a harder problem than it appears, and it is not solved by making models generally smarter.
DentesBench makes this tradeoff measurable. By scoring models on five dimensions simultaneously, it reveals not just how good a model is, but what kind of good — and what it sacrifices to get there. We believe this multi-dimensional view is more useful than a single leaderboard number, both for choosing a model and for understanding what training work remains.
DentesBench is now at v0.2. Future versions will expand to include human calibration, multi-turn conversation evaluation, and tool-calling assessment. But even in its current form, v0.2 already changes the question from "which model sounds best in a demo?" to "which model can actually run a dental phone agent under real-world constraints?"