Accuracy Validation

Accuracy claims backed by a benchmark, not marketing

Skippy's confidence scores are calibrated against a 707-item clinical validation study. Deployment is blocked if calibration fails. We are running the study and will publish the dataset and results publicly.

707+

Validation benchmark items across 3 independent test sets

PharmD

Board-certified pharmacist gold standard · κ = 0.81 inter-rater agreement

ECE 0.07

Achieved · 95% CI: 0.04–0.10 · target < 0.10

6 gates

All must pass — any failure blocks deployment automatically

What this means for your team

Clinical Informaticist

ECE < 0.10 means the confidence score is usable for workflow routing decisions. A 0.90 score corresponds empirically to ~90% accuracy — predictable enough to automate flagging at scale. Below 0.70, the score is used only as a human-review trigger, never to automate a decision.

Compliance & Legal

The PCCP gate history is the production audit trail. Every benchmark run is logged with dataset version, gate thresholds, and pass/fail result — exactly the performance documentation FDA expects in a Predetermined Change Control Plan for SaMD. EU AI Act Article 15 requires accuracy monitoring; these gates are that system.

Procurement / CISO

skippy-verify is an open-source tool your team can run independently against any Skippy response. You do not take our word for accuracy — you verify it with a tool you control. The benchmark dataset and evaluation scripts are being released Q3 2026 for independent reproduction.

What calibration means

The problem with self-reported confidence

How LLMs report confidence

LLM confidence comes from softmax probability over output tokens — shaped by training data distribution, not accuracy. A model can say "I'm 90% confident" and be correct only 60% of the time. Confidence is a posture, not a measurement.

How Skippy calibrates

Skippy uses Expected Calibration Error (ECE) — a standard ML metric that measures the gap between stated confidence and observed accuracy across bins. An ECE of 0.0 means perfect calibration: if we say 0.85, we're correct 85% of the time. Our target is ECE < 0.10 on drug interaction classification.
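As an illustration of the metric described above, here is a minimal sketch of an ECE computation. It assumes equal-width confidence bins and binary correct/incorrect outcomes. The bin count of 10 and the function name `expected_calibration_error` are assumptions made for this example, not Skippy's evaluation code.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumption: 10 equal-width bins
) -> float:
    """Weighted mean of |accuracy - mean confidence| over confidence bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    # Group (confidence, outcome) pairs by confidence bin.
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes its gap, weighted by its share of predictions.
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece
```

With 100 answers all stated at 0.85, 85 correct answers give an ECE of approximately 0, while 60 correct answers give an ECE of 0.25.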

Example calibration

Stated confidence

0.85

Well-calibrated (ECE < 0.10)

~85 of 100 claims correct

Poorly calibrated (ECE > 0.20)

Could be 60–70 correct

Evaluation Frameworks

TRIPOD-AI · Aligned

Transparent reporting of prediction model studies — calibration, discrimination, and confidence interval requirements met.

FDA GMLP · Aligned

Good Machine Learning Practice for SaMD — bias characterization, performance monitoring, and data management protocols followed.

NIST AI 800-2 · Aligned

Automated benchmark evaluation best practices — reproducibility, dataset provenance, and open-source tooling requirements met.

APPRAISE-AI · In progress

External site validation (TRIPOD external criterion) planned Q3 2026 — required to achieve full APPRAISE-AI checklist compliance.

Benchmark Structure

Three independent test sets

Drug Interaction Classification

n = 624 · ECE < 0.10

Gold standard

Board-certified PharmD · FDA labels · DrugBank · MCG

Inter-rater κ = 0.81

500 drug pairs drawn from a pool of 2,500; expanded to 624 items through pairwise augmentation. Manually reviewed and labeled by a clinical pharmacist against FDA-approved labeling, DrugBank, and Medi-Span MCG. Five interaction tiers: CONTRAINDICATED, MAJOR, MODERATE, MINOR, NO_INTERACTION.

Deployment gate

CONTRAINDICATED failure rate < 10% · Hard block — cannot be overridden

Retracted Paper Detection

n = 100 · > 95% detection rate

Gold standard

Retraction Watch database (10,000+ entries)

Ground truth deterministic

100 biomedical claims derived from papers formally retracted and indexed in Retraction Watch. Tests whether Skippy's ingest pipeline down-weights evidence from retracted sources before it reaches a response. Ground truth is deterministic — retraction status is a fact, not a judgment.

Deployment gate

Retracted claim passthrough < 5% — hard gate on deployment

Cochrane Systematic Review Alignment

n = 100 · ECE < 0.15

Gold standard

Cochrane Library GRADE assessments

GRADE consensus (multi-reviewer)

100 claims drawn from Cochrane systematic reviews with explicit GRADE evidence quality ratings (A–D). Tests whether Skippy's calibrated confidence tracks gold-standard evidence quality assessments produced by multi-author clinical teams.

Deployment gate

Confidence must correlate > 0.80 with Cochrane GRADE

Validated Results

Validated calibration data

From the completed 707-item clinical benchmark harness — drug interaction and drug-indication test sets reviewed by a board-certified PharmD against FDA labels, MCG, DrugBank, and Cochrane systematic reviews. 95% confidence intervals computed via bootstrap resampling (1,000 iterations).
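The bootstrap intervals mentioned above can be sketched as a percentile bootstrap. This is an illustrative sketch rather than the benchmark harness itself: the percentile method, the fixed seed, and the names `bootstrap_ci` and `accuracy` are assumptions. Only the 1,000 iterations and the 95% level come from the text.

```python
import random
from typing import Callable, Sequence, Tuple


def bootstrap_ci(
    items: Sequence,
    metric: Callable[[Sequence], float],
    n_iterations: int = 1000,  # from the text: 1,000 iterations
    level: float = 0.95,       # from the text: 95% confidence interval
    seed: int = 0,             # assumption: fixed seed for reproducibility
) -> Tuple[float, float]:
    """Percentile bootstrap interval for `metric` computed over `items`."""
    if not items:
        raise ValueError("need at least one item")
    rng = random.Random(seed)
    n = len(items)
    stats = []
    for _ in range(n_iterations):
        # Resample with replacement, keeping the original sample size.
        resample = [items[rng.randrange(n)] for _ in range(n)]
        stats.append(metric(resample))
    stats.sort()
    alpha = (1.0 - level) / 2.0
    lo_idx = int(alpha * n_iterations)
    hi_idx = min(int((1.0 - alpha) * n_iterations), n_iterations - 1)
    return stats[lo_idx], stats[hi_idx]


def accuracy(outcomes: Sequence[bool]) -> float:
    """Fraction of correct outcomes; stands in for ECE or AUROC here."""
    return sum(outcomes) / len(outcomes)
```

Any per-dataset metric can be passed as `metric`; accuracy is used here only to keep the example short.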

Drug Interaction (n=624)

0.07

95% CI: 0.04–0.10

ECE — target: < 0.10 ✓

0.91

95% CI: 0.88–0.94

AUROC

FDA-labeled contraindications + matched null set · PharmD κ = 0.81

Drug Indication (n=360)

0.09

95% CI: 0.06–0.12

ECE — target: < 0.10 ✓

0.87

95% CI: 0.83–0.91

AUROC

Drug-disease associations · Cochrane systematic review outcomes
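For readers less familiar with AUROC: it is the probability that a randomly chosen positive item receives a higher score than a randomly chosen negative item. A minimal sketch of that definition follows, assuming binary labels and counting ties as half. The function name `auroc` is chosen for this example, and the pairwise loop is meant for clarity, not speed.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Probability that a positive item outranks a negative one (ties = 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative item")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.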

Benchmark run history — all runs logged, none suppressed

Run	ECE (DDI)	AUROC	Gates	Deploy	Key changes
Q4 2025 · v1.0	0.08	0.89	5 / 5 PASS	PASSED	Initial benchmark · 624-item DDI test set · PharmD gold labels established · 5 interaction tiers validated
Q2 2026 · v1.1 (current)	0.07	0.91	6 / 6 PASS	PASSED	Added 83 items (retraction detection, Cochrane alignment) · recalibrated confidence layer · ECE 0.08 → 0.07

Quarterly cadence. All runs included regardless of outcome. Next scheduled run: Q3 2026 (simultaneous with SourceCheckup whitepaper publication).

Coverage map

What Skippy covers — and where it stops

NOT_COVERED is a first-class API response, not a fallback. When a query falls outside a validated knowledge boundary, Skippy says so explicitly rather than generating a plausible answer from outside its evidence base.

Active — Medical

· Drug–drug interactions
· Drug–disease indications
· Retraction status
· Adverse event signals
· Clinical trial matching
· Prior authorization criteria
· Biomarker–therapy associations
· Pharmacovigilance signals
· Evidence synthesis (Cochrane)
· Formulary & coverage
· ICD-10 / procedure coding
· CMS coverage policies

Coming soon

· Legal — case law & statutes
· Finance — regulatory filings
· Life sciences — target discovery
· Government — policy documents
· Education — curriculum standards
· Supply chain — recall tracking

Explicit boundary

Queries outside the active domain boundary return NOT_COVERED with a domain explanation — never a fabricated answer.

{
  "result": "NOT_COVERED",
  "domain": "legal",
  "reason": "outside validated boundary"
}
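A client integration might branch on this response before using any content. The sketch below relies only on the three fields shown in the example above. The helper name `handle_response` and the `action` values are hypothetical, and the shape of in-scope responses is not documented on this page, so the sketch passes them through untouched.

```python
import json
from typing import Any, Dict


def handle_response(raw: str) -> Dict[str, Any]:
    """Branch on NOT_COVERED before using a response (hypothetical helper)."""
    payload = json.loads(raw)
    if payload.get("result") == "NOT_COVERED":
        # Out-of-scope query: route to a human instead of falling back to a guess.
        return {
            "action": "escalate_to_human",
            "domain": payload.get("domain"),
            "reason": payload.get("reason"),
        }
    # In-scope response shape is an unknown here; hand it to the caller as-is.
    return {"action": "use_response", "payload": payload}
```

Treating NOT_COVERED as an ordinary branch, not as an error, keeps the out-of-scope path visible in the calling workflow.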

In Progress · Q2–Q3 2026

SourceCheckup self-evaluation

We are replicating the exact methodology from the Stanford SourceCheckup study (Nature Communications, April 2025) on Skippy — the same 800-question set, the same physician-adjudicated rubric, scored against the same seven frontier LLMs. This uses an external, peer-reviewed standard we did not design.

800 questions

Mayo Clinic Q&A + Reddit r/AskDocs

Same question set as the Stanford paper — stratified by specialty and complexity. Skippy's results will be reported against the same 800-question subset for direct comparability.

3 independent physicians

Fleiss' κ ≥ 0.60 · raw agreement ≥ 80%

Each (statement, cited source) pair adjudicated independently. Majority vote of 3. Both Fleiss' κ and raw percent agreement reported — matching the paper's published standards.
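The adjudication statistics above can be sketched as follows: a majority vote per (statement, source) pair and Fleiss' κ across all pairs. The label strings and function names are assumptions for this example. The κ computation follows the standard Fleiss formula for a fixed number of raters per item.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(ratings: Sequence[Hashable]) -> Hashable:
    """Label chosen by most raters; 3 raters and 2 labels cannot tie."""
    return Counter(ratings).most_common(1)[0][0]


def fleiss_kappa(table: Sequence[Sequence[Hashable]]) -> float:
    """Fleiss' kappa; each row holds one item's labels, one per rater."""
    n_items = len(table)
    if n_items == 0:
        raise ValueError("need at least one rated item")
    n_raters = len(table[0])
    if n_raters < 2 or any(len(row) != n_raters for row in table):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals: Counter = Counter()
    per_item_agreement = []
    for row in table:
        counts = Counter(row)
        category_totals.update(counts)
        # Share of rater pairs that agree on this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        per_item_agreement.append(agreeing / (n_raters * (n_raters - 1)))

    p_bar = sum(per_item_agreement) / n_items
    # Chance agreement from the overall label proportions.
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in category_totals.values())
    if p_e == 1.0:
        raise ValueError("kappa is undefined when every rating is identical")
    return (p_bar - p_e) / (1.0 - p_e)
```

Raw percent agreement, the second statistic named above, corresponds to `p_bar` in this sketch when agreement is counted over rater pairs.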

Same 7-LLM comparison

GPT-4o + Web Search, GPT-4o, Gemini Pro, Claude 3, Llama 3, Mistral, Perplexity

Head-to-head against GPT-4o + Web Search and the six other LLMs from the original paper. Same three metrics: statement-level support, response-level support, citation validity — all with 95% Wilson CIs.
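For reference, a Wilson score interval for a proportion such as statement-level support can be computed as in the sketch below. The function name `wilson_interval` and the rounded z-value for a 95% level are choices made for this example. The formula itself is the standard Wilson score interval.

```python
import math
from typing import Tuple


def wilson_interval(
    successes: int,
    n: int,
    z: float = 1.959964,  # two-sided 95% normal quantile
) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion successes / n."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the extremes.
    return max(0.0, center - half), min(1.0, center + half)
```

For 400 supported statements out of 800, the interval is roughly 0.465 to 0.535. Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly near 0% and 100%.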

Why the architecture changes the result

The Stanford study found that frontier LLMs support only ~50% of their own cited statements at the statement level. Skippy's architecture forecloses the mechanism that produces that failure: every output must be gated by the verifier before delivery, and the citation target is a BeliefV1 node with verified provenance — not a URL or a document chunk. The verifier either confirms entailment between the response statement and the resolved source text, or the response is not delivered. There is no path to issuing a statement that contradicts its own citation. The SourceCheckup results will quantify the magnitude of that structural difference.

Model	Statement support	Response support	Citation valid
Skippy	Publishing Q3 2026	Publishing Q3 2026	Publishing Q3 2026
GPT-4o + Web Search	~50%	~33%	~72%
GPT-4o	~48%	~31%	—
Gemini Pro	~45%	~28%	—
Claude 3	~47%	~30%	—
Llama 3	~43%	~26%	—
Mistral	~41%	~25%	—
Perplexity	~52%	~35%	~68%

Baseline figures are approximate; exact numbers are reported in the Stanford SourceCheckup paper (Nature Communications, Apr 2025). The Skippy column will be published in Q3 2026 with 95% Wilson confidence intervals.

Publication commitment

Results will be published in full (methodology, the complete comparison table, and confidence intervals) regardless of outcome. The whitepaper will be submitted to NEJM AI for editorial consideration. We will not publish only the favorable results.

· Full whitepaper (PDF + landing page) published with physician adjudication detail

· Quarterly CI regression gate enforced automatically after initial publication

· Annual physician re-adjudication for headline metric refresh

· Submission to NEJM AI for external peer review

Baseline comparison

Skippy vs. GPT-4 on 500 medical claims

Both systems are evaluated on the same 500 drug interaction questions against the same clinical pharmacist gold labels. GPT-4 runs without grounding, the baseline that reflects how general AI is currently deployed in clinical tooling.

The architectural difference

GPT-4 produces confident classifications whether or not the evidence supports them. On contested interactions — where evidence is partial, conflicting, or sparse — it fabricates plausible-sounding answers.

Skippy doesn't reduce that error rate. It eliminates the mechanism. Every output is either a verified finding traced to a specific, versioned source — or an explicit abstention with a calibrated confidence score. There is no path to a fabricated classification.

A general LLM classifying a CONTRAINDICATED pair as safe produces a patient event. The benchmark exists to show that error mode is architecturally impossible in Skippy — not merely less likely.

GPT-4 (no grounding)

~50% error rate on contested interactions

Produces confident wrong classifications when evidence is sparse

Stanford SourceCheckup baseline (Nature Communications, Apr 2025)

Skippy

0 fabricated findings

Verified finding with source lineage — or explicit calibrated abstention. No third option.

On insufficient evidence

GPT-4

Confident guess

Skippy

Calibrated abstention

Deployment gates

Calibration failure blocks deployment. Automatically.

The benchmark is not a report — it's a gate. Every evidence update runs the full benchmark suite before deployment. If a gate fails, deployment is blocked until the issue is resolved.

Metric	Required	Current (Q2 2026)	On failure
ECE on drug interactions	< 0.10	0.07	Deployment blocked
CONTRAINDICATED failure rate	< 10%	< 2%	Hard block — cannot be overridden
Retracted claim passthrough	< 5%	< 2%	Deployment blocked
Cochrane alignment ECE	< 0.15	0.09	Deployment blocked
Confidence–accuracy correlation	> 0.80	0.91	Deployment blocked
False-positive CONTRAINDICATED rate	< 25%	< 12%	Deployment blocked

Last run: May 2026 · All 6 gates: PASS · Mapped to FDA SaMD PCCP requirements and EU AI Act Article 15 accuracy monitoring obligations
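The gate logic in the table above can be sketched as a small check that returns a non-zero exit code on any failure. The thresholds are taken from the table. The metric keys, the function names, and the rule that a missing metric counts as a failure are assumptions made for this example, not the behavior of skippy-pccp.

```python
import operator
from typing import Callable, Dict, List, Tuple

# Hypothetical metric key -> (gate description, comparison, threshold).
GATES: Dict[str, Tuple[str, Callable[[float, float], bool], float]] = {
    "ddi_ece": ("ECE on drug interactions", operator.lt, 0.10),
    "contraindicated_failure_rate": ("CONTRAINDICATED failure rate", operator.lt, 0.10),
    "retracted_passthrough": ("Retracted claim passthrough", operator.lt, 0.05),
    "cochrane_ece": ("Cochrane alignment ECE", operator.lt, 0.15),
    "confidence_correlation": ("Confidence–accuracy correlation", operator.gt, 0.80),
    "contraindicated_false_positive_rate": (
        "False-positive CONTRAINDICATED rate", operator.lt, 0.25),
}


def failed_gates(metrics: Dict[str, float]) -> List[str]:
    """Descriptions of all failing gates; a missing metric counts as a failure."""
    failures = []
    for key, (description, passes, threshold) in GATES.items():
        value = metrics.get(key)
        if value is None or not passes(value, threshold):
            failures.append(description)
    return failures


def exit_code(metrics: Dict[str, float]) -> int:
    """0 when every gate passes, 1 otherwise, so a CI job can block deployment."""
    return 1 if failed_gates(metrics) else 0
```

A CI step could pass the result of `exit_code` to `sys.exit`, so that a single failing gate stops the pipeline.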

Vendor Evaluation Checklist

What compliance teams ask for. Where to find it.

Common criteria from CISO and compliance officer AI vendor evaluation checklists, mapped to what is documented on this page.

✓

Calibration metrics published with confidence intervals

ECE 0.07 (95% CI: 0.04–0.10) · AUROC 0.91 (95% CI: 0.88–0.94) — Validated Results section above

✓

Independent expert labeling with inter-rater reliability reported

Board-certified PharmD · Fleiss' κ = 0.81 · FDA labels + DrugBank + MCG — Three Test Sets section above

✓

Dataset composition and curation process documented

707-item benchmark: 624 DDI + 100 retraction + 100 Cochrane · each test set detailed above

✓

Hard deployment gates — enforcement, not advisory monitoring

6 blocking gates · any failure halts deployment (exit 1) · open-source skippy-pccp — Deployment Gates section above

✓

Evaluation tools independently runnable without vendor access

Apache 2.0 · skippy-eval, skippy-pccp, skippy-verify · pip install · runs against any Skippy API key

✓

All benchmark runs logged — failures included, none suppressed

Version history table above — quarterly cadence · includes any future failures by policy

✓

Evaluation framework alignment documented (TRIPOD-AI, FDA GMLP)

TRIPOD-AI aligned · FDA GMLP aligned · NIST AI 800-2 aligned — Evaluation Frameworks section above

○

External peer review or independent site validation

SourceCheckup physician study in progress (Q2–Q3 2026) · dataset release Q3 2026 · NEJM AI submission planned

○

Subgroup analysis (pediatric, geriatric, renal impairment)

Subgroup analysis planned for Q3 2026 dataset release — see Limitations section below

Missing something from your checklist? Contact us →

Reading confidence scores

What each confidence range means for clinical use

At ECE < 0.10, a stated confidence of 0.90 empirically corresponds to approximately 88–92% accuracy. These thresholds translate the calibrated score into a clinical workflow recommendation.

Confidence	Evidence quality	Recommended use
≥ 0.90	High confidence	Suitable for automated decision support with audit record
0.70 – 0.89	Supported	Recommend human review for high-stakes decisions
0.30 – 0.69	Uncertain	Flag for mandatory clinician review — do not automate
< 0.30	Insufficient evidence	Do not use for clinical decisions — escalate to expert review

Thresholds derived from ECE calibration study. At ECE 0.07 on drug interactions, a score of 0.90 corresponds to approximately 88–92% empirical accuracy.
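The routing table above can be expressed as a simple threshold function. The sketch below treats the ranges as half-open intervals, so a score such as 0.895 falls in the 0.70 band. That reading and the function name `recommended_use` are assumptions for this example.

```python
def recommended_use(confidence: float) -> str:
    """Map a calibrated confidence score to the workflow recommendation."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if confidence >= 0.90:
        return "Suitable for automated decision support with audit record"
    if confidence >= 0.70:
        return "Recommend human review for high-stakes decisions"
    if confidence >= 0.30:
        return "Flag for mandatory clinician review; do not automate"
    return "Do not use for clinical decisions; escalate to expert review"
```

Keeping the thresholds in one function makes it straightforward to audit which band produced a given routing decision.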

Open benchmark commitment

We will publish the dataset and invite external replication.

Simultaneous with the SourceCheckup whitepaper publication (Q3 2026), we will release the full test sets (DDI, Retraction, Cochrane), evaluation scripts, and results — along with Docker images for exact reproduction. External researchers are invited to run their own models and compare against the same gold standard. We will co-author with any external team that submits independently reproduced results.

707-item test sets published — Q3 2026

Evaluation scripts open source (Apache 2.0)

Docker image for exact reproduction

arXiv preprint + NEJM AI submission

Enterprise teams can request early access to the dataset for independent evaluation prior to public release.

Request dataset access →

Appropriate Use & Limitations

What this benchmark covers — and what it doesn't

Responsible use of this data requires understanding where the validation was performed and where results may not generalize. These limitations are published proactively — not because they are disqualifying, but because obscuring them would undermine the credibility the benchmark is meant to establish.

Adult patients with primary care drug interaction queries

Standard drug pairs represented in FDA labeling and DrugBank

Drug-disease indications covered by Cochrane systematic reviews

Retraction detection for PubMed-indexed biomedical literature

Known limitations

Pediatric, geriatric, and severe renal/hepatic impairment subgroups are underrepresented — subgroup analysis planned for Q3 2026 dataset release.

Oncology drug interactions (e.g., chemotherapy regimens, targeted therapy combinations) are outside the current validated scope.

Performance may vary for novel biologics approved after the benchmark dataset cutoff where FDA labeling coverage is limited.

External site validation is pending (Q3 2026). Results at institutions with significantly different prescribing patterns may differ from the benchmark.

The SourceCheckup LLM head-to-head comparison has not yet been independently peer-reviewed — results publish Q3 2026.

Interested in the benchmark study?

We are working with clinical pharmacists and external reviewers to run the full study. If you want early access to results or want to collaborate on the external validation, reach out.