Accuracy claims backed by a benchmark, not marketing
Skippy's confidence scores are calibrated against a 707-item clinical validation study. Deployment is blocked if calibration fails. We are running the study and will publish the dataset and results publicly.
ECE < 0.10 means the confidence score is usable for workflow routing decisions. A 0.90 score corresponds empirically to ~90% accuracy — predictable enough to automate flagging at scale. Below 0.70, the score is used only as a human-review trigger, never to automate a decision.
The PCCP gate history is the production audit trail. Every benchmark run is logged with dataset version, gate thresholds, and pass/fail result — exactly the performance documentation FDA expects in a Predetermined Change Control Plan for SaMD. EU AI Act Article 15 requires accuracy monitoring; these gates are that system.
skippy-verify is an open-source tool your team can run independently against any Skippy response. You do not take our word for accuracy — you verify it with a tool you control. The benchmark dataset and evaluation scripts are being released Q3 2026 for independent reproduction.
The problem with self-reported confidence
An LLM's self-reported confidence derives from softmax probabilities over output tokens, which are shaped by the training data distribution rather than by accuracy. A model can say "I'm 90% confident" and be correct only 60% of the time. That kind of confidence is a posture, not a measurement.
Skippy is evaluated with Expected Calibration Error (ECE), a standard ML metric that measures the gap between stated confidence and observed accuracy, averaged across confidence bins and weighted by bin size. An ECE of 0.0 means perfect calibration: if we say 0.85, we're correct 85% of the time. Our target is ECE < 0.10 on drug interaction classification.
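As a minimal illustration of the metric, the sketch below computes ECE with equal-width confidence bins. The bin count of 10 and the function name `expected_calibration_error` are assumptions for illustration; this page does not state which binning scheme the benchmark uses.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumption: equal-width bins, count not given on this page
) -> float:
    """Bin-size-weighted mean of |accuracy - mean confidence| over confidence bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    conf_sum = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for c, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hits[idx] += int(ok)
        counts[idx] += 1

    total = len(confidences)
    ece = 0.0
    for s, h, k in zip(conf_sum, hits, counts):
        if k == 0:
            continue  # empty bins contribute nothing
        ece += (k / total) * abs(h / k - s / k)
    return ece
```

With ten predictions all stated at 0.90 confidence and only six of them correct, the function returns approximately 0.30, which is the 90%-versus-60% gap described above.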
Transparent reporting of prediction model studies — calibration, discrimination, and confidence interval requirements met.
Good Machine Learning Practice for SaMD — bias characterization, performance monitoring, and data management protocols followed.
Automated benchmark evaluation best practices — reproducibility, dataset provenance, and open-source tooling requirements met.
External site validation (TRIPOD external criterion) planned Q3 2026 — required to achieve full APPRAISE-AI checklist compliance.
Three independent test sets
Drug Interaction Classification
500 drug pairs drawn from a pool of 2,500; expanded to 624 items through pairwise augmentation. Manually reviewed and labeled by a clinical pharmacist against FDA-approved labeling, DrugBank, and Medi-Span MCG. Five interaction tiers: CONTRAINDICATED, MAJOR, MODERATE, MINOR, NO_INTERACTION.
CONTRAINDICATED failure rate < 10% · Hard block — cannot be overridden
Retracted Paper Detection
100 biomedical claims derived from papers formally retracted and indexed in Retraction Watch. Tests whether Skippy's ingest pipeline down-weights evidence from retracted sources before it reaches a response. Ground truth is deterministic — retraction status is a fact, not a judgment.
Retracted claim passthrough < 5% — hard gate on deployment
Cochrane Systematic Review Alignment
100 claims drawn from Cochrane systematic reviews with explicit GRADE evidence quality ratings (A–D). Tests whether Skippy's calibrated confidence tracks gold-standard evidence quality assessments produced by multi-author clinical teams.
Confidence must correlate > 0.80 with Cochrane GRADE
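The page does not say which correlation statistic this gate uses. One plausible reading, sketched here purely as an assumption, is a Spearman rank correlation between the confidence score and the GRADE rating mapped to an ordinal scale (A highest, D lowest). The mapping `GRADE_RANK` and both function names are hypothetical.

```python
import math
from typing import List, Sequence

# Assumption: GRADE letters treated as an ordinal scale, A = strongest evidence.
GRADE_RANK = {"A": 4, "B": 3, "C": 2, "D": 1}


def _average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_confidence_vs_grade(
    confidences: Sequence[float], grades: Sequence[str]
) -> float:
    """Spearman rank correlation between confidence scores and GRADE letters."""
    if len(confidences) != len(grades) or len(confidences) < 2:
        raise ValueError("need two or more paired observations")
    rx = _average_ranks(confidences)
    ry = _average_ranks([GRADE_RANK[g] for g in grades])
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(vx * vy)
```

Under this reading, the gate would pass when the returned value exceeds 0.80 on the 100-claim Cochrane set.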
Validated calibration data
From the completed 707-item clinical benchmark harness — drug interaction and drug-indication test sets reviewed by a board-certified PharmD against FDA labels, MCG, DrugBank, and Cochrane systematic reviews. 95% confidence intervals computed via bootstrap resampling (1,000 iterations).
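How the interval is constructed is not specified beyond bootstrap resampling with 1,000 iterations. A minimal sketch, assuming the percentile method and a fixed seed for reproducibility (both assumptions, as is the function name `bootstrap_ci`):

```python
import random
from statistics import mean
from typing import Callable, Sequence, Tuple


def bootstrap_ci(
    values: Sequence[float],
    statistic: Callable[[Sequence[float]], float] = mean,
    n_iterations: int = 1000,  # matches the iteration count stated above
    alpha: float = 0.05,       # 95% interval
    seed: int = 0,             # assumption: fixed seed so reruns match
) -> Tuple[float, float]:
    """Percentile bootstrap interval for `statistic` over `values`."""
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, recompute the statistic, and sort the estimates.
    estimates = sorted(
        statistic([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_iterations)
    )
    lower_idx = int((alpha / 2) * n_iterations)
    upper_idx = min(n_iterations - 1, int((1 - alpha / 2) * n_iterations))
    return estimates[lower_idx], estimates[upper_idx]


# Usage: per-item correctness flags (1 = correct) give an interval on accuracy.
outcomes = [1] * 80 + [0] * 20
lower, upper = bootstrap_ci(outcomes)
```

The same helper works for any per-item statistic; a metric such as ECE that needs paired inputs would resample (confidence, outcome) pairs instead.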
| Run | ECE (DDI) | AUROC | Gates | Deploy | Key changes |
|---|---|---|---|---|---|
| Q4 2025 · v1.0 | 0.08 | 0.89 | 5 / 5 PASS | PASSED | Initial benchmark · 624-item DDI test set · PharmD gold labels established · 5 interaction tiers validated |
| Q2 2026 · v1.1 (current) | 0.07 | 0.91 | 5 / 5 PASS | PASSED | Added 83 items (retraction detection, Cochrane alignment) · recalibrated confidence layer · ECE 0.08 → 0.07 |
What Skippy covers — and where it stops
NOT_COVERED is a first-class API response, not a fallback. When a query falls outside a validated knowledge boundary, Skippy says so explicitly rather than generating a plausible answer from outside its evidence base.
- Drug–drug interactions
- Drug–disease indications
- Retraction status
- Adverse event signals
- Clinical trial matching
- Prior authorization criteria
- Biomarker–therapy associations
- Pharmacovigilance signals
- Evidence synthesis (Cochrane)
- Formulary & coverage
- ICD-10 / procedure coding
- CMS coverage policies
- Legal — case law & statutes
- Finance — regulatory filings
- Life sciences — target discovery
- Government — policy documents
- Education — curriculum standards
- Supply chain — recall tracking
Queries outside the active domain boundary return NOT_COVERED with a domain explanation — never a fabricated answer.
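To make "first-class response" concrete, here is a hypothetical sketch of how a client might model and handle it. The real Skippy response schema is not shown on this page, so every name below (`Status`, `SkippyResponse`, `domain_explanation`, `render`) is an assumption for illustration only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    ANSWERED = "ANSWERED"        # hypothetical name for a normal answer
    NOT_COVERED = "NOT_COVERED"  # the status described on this page


@dataclass(frozen=True)
class SkippyResponse:
    """Hypothetical response shape; field names are illustrative."""
    status: Status
    answer: Optional[str] = None
    confidence: Optional[float] = None
    domain_explanation: Optional[str] = None


def render(response: SkippyResponse) -> str:
    # NOT_COVERED is handled as its own branch, not as an error or empty answer.
    if response.status is Status.NOT_COVERED:
        return f"Not covered: {response.domain_explanation}"
    return f"{response.answer} (confidence {response.confidence:.2f})"
```

Modelling the status as an explicit enum value forces callers to branch on it, rather than inferring "no answer" from a missing field.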
SourceCheckup self-evaluation
We are replicating the exact methodology from the Stanford SourceCheckup study (Nature Communications, April 2025) on Skippy — the same 800-question set, the same physician-adjudicated rubric, scored against the same seven frontier LLMs. This uses an external, peer-reviewed standard we did not design.
Same question set as the Stanford paper — stratified by specialty and complexity. Skippy's results will be reported against the same 800-question subset for direct comparability.
Each (statement, cited source) pair is adjudicated independently by three reviewers and decided by majority vote. Both Fleiss' κ and raw percent agreement are reported, matching the paper's published standards.
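For reference, the majority vote and Fleiss' κ can be computed as below. This is the standard textbook formula for a fixed number of raters per item, not code from the Skippy or SourceCheckup pipelines; the function names and the input layout (one list of rater labels per item) are assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_label(labels: Sequence[Hashable]) -> Hashable:
    """Return the label chosen by a strict majority of raters."""
    label, count = Counter(labels).most_common(1)[0]
    if count * 2 <= len(labels):
        raise ValueError("no strict majority among raters")
    return label


def fleiss_kappa(ratings: Sequence[Sequence[Hashable]]) -> float:
    """Fleiss' kappa for items that were each labelled by the same number of raters."""
    if not ratings:
        raise ValueError("at least one item is required")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals: Counter = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Fraction of rater pairs that agree on this item.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )

    n_items = len(ratings)
    observed = agreement_sum / n_items
    expected = sum((t / (n_items * n_raters)) ** 2 for t in category_totals.values())
    if expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (observed - expected) / (1 - expected)
```

Reporting κ alongside raw percent agreement matters because raw agreement can look high when one label dominates, while κ corrects for agreement expected by chance.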
Head-to-head against GPT-4o + Web Search and the six other LLMs from the original paper. Same three metrics: statement-level support, response-level support, citation validity — all with 95% Wilson CIs.
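The Wilson score interval has a closed form. A sketch of the standard formula follows, with the function name `wilson_interval` chosen for illustration:

```python
import math
from typing import Tuple


def wilson_interval(
    successes: int, n: int, z: float = 1.959963984540054  # z for a 95% interval
) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)
```

For example, 50 supported statements out of 100 gives an interval of roughly 0.40 to 0.60. Unlike the simple normal approximation, the interval stays inside [0, 1] and behaves sensibly for proportions near 0 or 1.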
The Stanford study found that frontier LLMs support only ~50% of their own cited statements at the statement level. Skippy's architecture forecloses the mechanism that produces that failure: every output must be gated by the verifier before delivery, and the citation target is a BeliefV1 node with verified provenance — not a URL or a document chunk. The verifier either confirms entailment between the response statement and the resolved source text, or the response is not delivered. There is no path to issuing a statement that contradicts its own citation. The SourceCheckup results will quantify the magnitude of that structural difference.
| Model | Statement support | Response support | Citation valid |
|---|---|---|---|
| Skippy | Publishing Q3 2026 | Publishing Q3 2026 | Publishing Q3 2026 |
| GPT-4o + Web Search | ~50% | ~33% | ~72% |
| GPT-4o | ~48% | ~31% | — |
| Gemini Pro | ~45% | ~28% | — |
| Claude 3 | ~47% | ~30% | — |
| Llama 3 | ~43% | ~26% | — |
| Mistral | ~41% | ~25% | — |
| Perplexity | ~52% | ~35% | ~68% |
Results will be published in full — methodology, the complete comparison table, and confidence intervals — regardless of outcome. Whitepaper submitted for editorial consideration at NEJM AI. We will not publish selectively favorable results.
Skippy vs. GPT-4 on 500 medical claims
The benchmark uses the same 500 drug interaction questions and the same clinical pharmacist gold labels. GPT-4 runs without grounding — the baseline that reflects how general AI is currently deployed in clinical tooling.
The architectural difference
GPT-4 produces confident classifications whether or not the evidence supports them. On contested interactions — where evidence is partial, conflicting, or sparse — it fabricates plausible-sounding answers.
Skippy doesn't reduce that error rate. It eliminates the mechanism. Every output is either a verified finding traced to a specific, versioned source — or an explicit abstention with a calibrated confidence score. There is no path to a fabricated classification.
A general LLM that classifies a CONTRAINDICATED pair as safe can cause a patient safety event. The benchmark exists to show that this error mode is architecturally impossible in Skippy, not merely less likely.
Calibration failure blocks deployment. Automatically.
The benchmark is not a report — it's a gate. Every evidence update runs the full benchmark suite before deployment. If a gate fails, deployment is blocked until the issue is resolved.
| Metric | Required | Current (Q2 2026) | On failure |
|---|---|---|---|
| ECE on drug interactions | < 0.10 | 0.07 | Deployment blocked |
| CONTRAINDICATED failure rate | < 10% | < 2% | Hard block — cannot be overridden |
| Retracted claim passthrough | < 5% | < 2% | Deployment blocked |
| Cochrane alignment ECE | < 0.15 | 0.09 | Deployment blocked |
| Confidence–accuracy correlation | > 0.80 | 0.91 | Deployment blocked |
| False-positive CONTRAINDICATED rate | < 25% | < 12% | Deployment blocked |
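The gate logic in the table can be sketched as a simple threshold check. The thresholds are taken from the "Required" column; the metric keys, the `Gate` type, and the function names are hypothetical, and the real pipeline is not shown on this page.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Gate:
    metric: str                      # hypothetical key for a benchmark result
    passes: Callable[[float], bool]  # True when the measured value is acceptable


# Thresholds copied from the "Required" column; rates are fractions, not percents.
GATES: List[Gate] = [
    Gate("ece_ddi", lambda v: v < 0.10),
    Gate("contraindicated_failure_rate", lambda v: v < 0.10),
    Gate("retracted_claim_passthrough", lambda v: v < 0.05),
    Gate("cochrane_alignment_ece", lambda v: v < 0.15),
    Gate("confidence_accuracy_correlation", lambda v: v > 0.80),
    Gate("false_positive_contraindicated_rate", lambda v: v < 0.25),
]


def failed_gates(results: Dict[str, float]) -> List[str]:
    """Return the metrics that fail; a missing metric counts as a failure."""
    return [
        g.metric
        for g in GATES
        if g.metric not in results or not g.passes(results[g.metric])
    ]


def deployment_allowed(results: Dict[str, float]) -> bool:
    """Deployment proceeds only when every gate passes."""
    return not failed_gates(results)
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a benchmark run that silently skips a test set should block deployment rather than pass by omission.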
What compliance teams ask for. Where to find it.
Common criteria from CISO and compliance officer AI vendor evaluation checklists, mapped to what is documented on this page.
Missing something from your checklist? Contact us →
What each confidence range means for clinical use
At ECE < 0.10, a stated confidence of 0.90 empirically corresponds to approximately 88–92% accuracy. These thresholds translate the calibrated score into a clinical workflow recommendation.
| Confidence | Evidence quality | Recommended use |
|---|---|---|
| ≥ 0.90 | High confidence | Suitable for automated decision support with audit record |
| 0.70 – 0.89 | Supported | Recommend human review for high-stakes decisions |
| 0.30 – 0.69 | Uncertain | Flag for mandatory clinician review — do not automate |
| < 0.30 | Insufficient evidence | Do not use for clinical decisions — escalate to expert review |
Thresholds derived from ECE calibration study. At ECE 0.07 on drug interactions, a score of 0.90 corresponds to approximately 88–92% empirical accuracy.
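A client-side routing rule built on these ranges might look like the sketch below. The tier names and the function `recommended_use` are hypothetical, and the ranges are treated as half-open intervals so that scores such as 0.895 fall into a defined tier (an assumption, since the table lists two-decimal boundaries).

```python
from enum import Enum


class Tier(Enum):
    AUTOMATED_WITH_AUDIT = "automated decision support with audit record"
    REVIEW_IF_HIGH_STAKES = "human review recommended for high-stakes decisions"
    MANDATORY_REVIEW = "mandatory clinician review, do not automate"
    ESCALATE = "do not use for clinical decisions, escalate to expert review"


def recommended_use(confidence: float) -> Tier:
    """Map a calibrated confidence score to the workflow tier from the table."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if confidence >= 0.90:
        return Tier.AUTOMATED_WITH_AUDIT
    if confidence >= 0.70:
        return Tier.REVIEW_IF_HIGH_STAKES
    if confidence >= 0.30:
        return Tier.MANDATORY_REVIEW
    return Tier.ESCALATE
```

Keeping the mapping in one function means the thresholds can be updated in a single place if a later calibration run changes them.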
We will publish the dataset and invite external replication.
Simultaneous with the SourceCheckup whitepaper publication (Q3 2026), we will release the full test sets (DDI, Retraction, Cochrane), evaluation scripts, and results — along with Docker images for exact reproduction. External researchers are invited to run their own models and compare against the same gold standard. We will co-author with any external team that submits independently reproduced results.
Enterprise teams can request early access to the dataset for independent evaluation prior to public release.
Request dataset access →
What this benchmark covers — and what it doesn't
Responsible use of this data requires understanding where the validation was performed and where results may not generalize. These limitations are published proactively — not because they are disqualifying, but because obscuring them would undermine the credibility the benchmark is meant to establish.
Interested in the benchmark study?
We are working with clinical pharmacists and external reviewers to run the full study. If you want early access to results or want to collaborate on the external validation, reach out.