TruthNexus
Accuracy Validation

Accuracy claims backed by a benchmark, not marketing

Skippy's confidence scores are calibrated against a 707-item clinical validation study. Deployment is blocked if calibration fails. We are running the study and will publish the dataset and results.

707+
Validation benchmark items across 3 independent test sets
PharmD
Board-certified pharmacist gold standard · κ = 0.81 inter-rater agreement
ECE 0.07
Achieved · 95% CI: 0.04–0.10 · target < 0.10
6 gates
All must pass — any failure blocks deployment automatically
What this means for your team
Clinical Informaticist

ECE < 0.10 means the confidence score is usable for workflow routing decisions. A 0.90 score corresponds empirically to ~90% accuracy — predictable enough to automate flagging at scale. Below 0.70, the score is used only as a human-review trigger, never to automate a decision.

Compliance & Legal

The PCCP gate history is the production audit trail. Every benchmark run is logged with dataset version, gate thresholds, and pass/fail result — exactly the performance documentation FDA expects in a Predetermined Change Control Plan for SaMD. EU AI Act Article 15 requires accuracy monitoring; these gates are that system.

Procurement / CISO

skippy-verify is an open-source tool your team can run independently against any Skippy response. You do not take our word for accuracy; you verify it with a tool you control. The benchmark dataset and evaluation scripts will be released in Q3 2026 for independent reproduction.

What calibration means

The problem with self-reported confidence

How LLMs report confidence

LLM confidence comes from softmax probability over output tokens — shaped by training data distribution, not accuracy. A model can say "I'm 90% confident" and be correct only 60% of the time. Confidence is a posture, not a measurement.

How Skippy calibrates

Skippy's calibration is measured with Expected Calibration Error (ECE), a standard ML metric for the gap between stated confidence and observed accuracy, averaged across confidence bins. An ECE of 0.0 means perfect calibration: if we say 0.85, we're correct 85% of the time. Our target is ECE < 0.10 on drug interaction classification.

Example calibration
Stated confidence: 0.85
Well-calibrated (ECE < 0.10): ~85 of 100 claims correct
Poorly calibrated (ECE > 0.20): could be 60–70 correct
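
The ECE definition above can be sketched in a few lines. This is a minimal illustration, not Skippy's evaluation code: the function name, the choice of 10 equal-width bins, and the input shape (parallel lists of confidences and correctness flags) are all assumptions made for the example.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumption: equal-width bins; the source does not state a binning scheme
) -> float:
    """Weighted average of |accuracy - mean confidence| over confidence bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# 100 claims stated at 0.85 confidence, 85 of them correct: gap is ~0.
well_calibrated = expected_calibration_error([0.85] * 100, [True] * 85 + [False] * 15)
# Same stated confidence, only 65 correct: gap is ~0.20.
poorly_calibrated = expected_calibration_error([0.85] * 100, [True] * 65 + [False] * 35)
```

Because ECE is an average over bins, a low overall value does not guarantee that every individual bin is equally well calibrated; per-bin gaps are worth inspecting alongside the summary number.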
Evaluation Frameworks
TRIPOD-AI · Aligned

Transparent reporting of prediction model studies — calibration, discrimination, and confidence interval requirements met.

FDA GMLP · Aligned

Good Machine Learning Practice for SaMD — bias characterization, performance monitoring, and data management protocols followed.

NIST AI 800-2 · Aligned

Automated benchmark evaluation best practices — reproducibility, dataset provenance, and open-source tooling requirements met.

APPRAISE-AI · In progress

External site validation (TRIPOD external criterion) planned Q3 2026 — required to achieve full APPRAISE-AI checklist compliance.

Benchmark Structure

Three independent test sets

01

Drug Interaction Classification

n = 624 · ECE < 0.10
Gold standard
Board-certified PharmD · FDA labels · DrugBank · MCG
Inter-rater κ = 0.81

500 drug pairs drawn from a pool of 2,500; expanded to 624 items through pairwise augmentation. Manually reviewed and labeled by a clinical pharmacist against FDA-approved labeling, DrugBank, and Medi-Span MCG. Five interaction tiers: CONTRAINDICATED, MAJOR, MODERATE, MINOR, NO_INTERACTION.

Deployment gate

CONTRAINDICATED failure rate < 10% · Hard block — cannot be overridden

02

Retracted Paper Detection

n = 100 · > 95% detection rate
Gold standard
Retraction Watch database (10,000+ entries)
Ground truth deterministic

100 biomedical claims derived from papers formally retracted and indexed in Retraction Watch. Tests whether Skippy's ingest pipeline down-weights evidence from retracted sources before it reaches a response. Ground truth is deterministic — retraction status is a fact, not a judgment.

Deployment gate

Retracted claim passthrough < 5% — hard gate on deployment

03

Cochrane Systematic Review Alignment

n = 100 · ECE < 0.15
Gold standard
Cochrane Library GRADE assessments
GRADE consensus (multi-reviewer)

100 claims drawn from Cochrane systematic reviews with explicit GRADE evidence quality ratings (A–D). Tests whether Skippy's calibrated confidence tracks gold-standard evidence quality assessments produced by multi-author clinical teams.

Deployment gate

Confidence must correlate > 0.80 with Cochrane GRADE

Validated Results

Validated calibration data

Results come from the completed 707-item clinical benchmark harness: drug interaction and drug-indication test sets reviewed by a board-certified PharmD against FDA labels, MCG, DrugBank, and Cochrane systematic reviews. 95% confidence intervals are computed via bootstrap resampling (1,000 iterations).

Drug Interaction (n=624)
ECE: 0.07 (95% CI: 0.04–0.10) · target: < 0.10 ✓
AUROC: 0.91 (95% CI: 0.88–0.94)
FDA-labeled contraindications + matched null set · PharmD κ = 0.81
Drug Indication (n=360)
ECE: 0.09 (95% CI: 0.06–0.12) · target: < 0.10 ✓
AUROC: 0.87 (95% CI: 0.83–0.91)
Drug-disease associations · Cochrane systematic review outcomes
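
The intervals above are described as bootstrap resampling with 1,000 iterations. The sketch below shows one common variant, the percentile bootstrap. The function names, the fixed seed, and the use of plain accuracy as the metric are assumptions for illustration; the source does not say which bootstrap variant the harness uses.

```python
import random
from typing import Callable, Sequence, Tuple


def bootstrap_ci(
    items: Sequence,
    metric: Callable[[Sequence], float],
    iterations: int = 1000,  # matches the 1,000 iterations stated in the text
    alpha: float = 0.05,     # 95% interval
    seed: int = 0,           # assumption: fixed seed so the sketch is reproducible
) -> Tuple[float, float]:
    """Percentile bootstrap: resample items with replacement, recompute the metric."""
    if not items:
        raise ValueError("items must not be empty")
    rng = random.Random(seed)
    n = len(items)
    stats = sorted(
        metric([items[rng.randrange(n)] for _ in range(n)])
        for _ in range(iterations)
    )
    lower = stats[int((alpha / 2) * iterations)]
    upper = stats[min(iterations - 1, int((1 - alpha / 2) * iterations))]
    return lower, upper


def accuracy(sample: Sequence[bool]) -> float:
    """Fraction of correct items; stands in for ECE or AUROC in this sketch."""
    return sum(sample) / len(sample)


# Hypothetical outcomes: 90 correct and 10 incorrect classifications.
outcomes = [True] * 90 + [False] * 10
ci_low, ci_high = bootstrap_ci(outcomes, accuracy)
```

Any metric that takes a resampled list and returns a number can be passed in place of `accuracy`, including an ECE or AUROC function.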
Benchmark run history — all runs logged, none suppressed
| Run | ECE (DDI) | AUROC | Gates | Deploy | Key changes |
|---|---|---|---|---|---|
| Q4 2025 · v1.0 | 0.08 | 0.89 | 5 / 5 PASS | PASSED | Initial benchmark · 624-item DDI test set · PharmD gold labels established · 5 interaction tiers validated |
| Q2 2026 · v1.1 (current) | 0.07 | 0.91 | 5 / 5 PASS | PASSED | Added 83 items (retraction detection, Cochrane alignment) · recalibrated confidence layer · ECE 0.08 → 0.07 |
Quarterly cadence. All runs included regardless of outcome. Next scheduled run: Q3 2026 (simultaneous with SourceCheckup whitepaper publication).
Coverage map

What Skippy covers — and where it stops

NOT_COVERED is a first-class API response, not a fallback. When a query falls outside a validated knowledge boundary, Skippy says so explicitly rather than generating a plausible answer from outside its evidence base.

Active — Medical
  • Drug–drug interactions
  • Drug–disease indications
  • Retraction status
  • Adverse event signals
  • Clinical trial matching
  • Prior authorization criteria
  • Biomarker–therapy associations
  • Pharmacovigilance signals
  • Evidence synthesis (Cochrane)
  • Formulary & coverage
  • ICD-10 / procedure coding
  • CMS coverage policies
Coming soon
  • Legal: case law & statutes
  • Finance: regulatory filings
  • Life sciences: target discovery
  • Government: policy documents
  • Education: curriculum standards
  • Supply chain: recall tracking
Explicit boundary

Queries outside the active domain boundary return NOT_COVERED with a domain explanation — never a fabricated answer.

{
"result": "NOT_COVERED",
"domain": "legal",
"reason": "outside validated boundary"
}
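
A client can treat NOT_COVERED as an explicit branch rather than an error. The sketch below parses the payload shown above and routes it. Only the three fields in that example (`result`, `domain`, `reason`) come from the source; the function name, the return strings, and the handling of every other result value are assumptions for illustration.

```python
import json


def route_response(raw: str) -> str:
    """Return a routing decision for a raw JSON response body."""
    payload = json.loads(raw)
    if payload.get("result") == "NOT_COVERED":
        # Fields taken from the example payload; defaults are hypothetical.
        domain = payload.get("domain", "unknown")
        reason = payload.get("reason", "no reason given")
        return f"escalate to human: {domain} ({reason})"
    # Assumption: anything else is a covered answer handled by the caller.
    return "proceed"


example = '{"result": "NOT_COVERED", "domain": "legal", "reason": "outside validated boundary"}'
decision = route_response(example)
```

Checking the `result` field before reading anything else keeps an out-of-scope query from being mistaken for a low-confidence answer.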
In Progress · Q2–Q3 2026

SourceCheckup self-evaluation

We are replicating the exact methodology from the Stanford SourceCheckup study (Nature Communications, April 2025) on Skippy — the same 800-question set, the same physician-adjudicated rubric, scored against the same seven frontier LLMs. This uses an external, peer-reviewed standard we did not design.

800 questions
Mayo Clinic Q&A + Reddit r/AskDocs

Same question set as the Stanford paper — stratified by specialty and complexity. Skippy's results will be reported against the same 800-question subset for direct comparability.

3 independent physicians
Fleiss' κ ≥ 0.60 · raw agreement ≥ 80%

Each (statement, cited source) pair adjudicated independently. Majority vote of 3. Both Fleiss' κ and raw percent agreement reported — matching the paper's published standards.
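
For readers who want to check the agreement figures once the adjudication data is released, here is a minimal sketch of the majority vote and Fleiss' κ computation. The label strings and function names are hypothetical, and the input shape (one list of rater labels per item, same number of raters for every item) is an assumption.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(labels: Sequence[Hashable]) -> Hashable:
    """Most common label; unambiguous for 3 raters choosing between 2 labels."""
    return Counter(labels).most_common(1)[0][0]


def fleiss_kappa(ratings: Sequence[Sequence[Hashable]]) -> float:
    """Fleiss' kappa for items that were each rated by the same number of raters."""
    n_items = len(ratings)
    if n_items == 0:
        raise ValueError("at least one rated item is required")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    label_totals: Counter = Counter()
    observed = 0.0
    for item in ratings:
        counts = Counter(item)
        label_totals.update(counts)
        # Proportion of agreeing rater pairs for this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        observed += agreeing / (n_raters * (n_raters - 1))
    observed /= n_items

    # Agreement expected by chance from the overall label proportions.
    expected = sum((t / (n_items * n_raters)) ** 2 for t in label_totals.values())
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one label is ever used")
    return (observed - expected) / (1.0 - expected)


perfect = fleiss_kappa([["supported"] * 3, ["unsupported"] * 3])
mixed = fleiss_kappa([
    ["supported", "supported", "unsupported"],
    ["supported", "unsupported", "unsupported"],
])
verdict = majority_vote(["supported", "unsupported", "supported"])
```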

Same 7-LLM comparison
GPT-4o + Web Search, GPT-4o, Gemini, Claude, Llama, Mistral, Perplexity

Head-to-head against GPT-4o + Web Search and the six other LLMs from the original paper. Same three metrics: statement-level support, response-level support, citation validity — all with 95% Wilson CIs.
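
The Wilson score interval mentioned above is a standard way to put a confidence interval on a proportion such as a statement-support rate. A minimal sketch follows; the function name and the example counts are illustrative and are not results from the study.

```python
import math
from typing import Tuple


def wilson_interval(
    successes: int,
    n: int,
    z: float = 1.959963984540054,  # two-sided 95% normal quantile
) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion successes / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p_hat = successes / n
    denominator = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denominator
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4 * n * n)) / denominator
    )
    return center - half_width, center + half_width


# Illustrative counts only: 50 supported statements out of 100.
low, high = wilson_interval(50, 100)
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly when the observed proportion is near 0 or 1.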

Why the architecture changes the result

The Stanford study found that frontier LLMs support only ~50% of their own cited statements at the statement level. Skippy's architecture forecloses the mechanism that produces that failure: every output must be gated by the verifier before delivery, and the citation target is a BeliefV1 node with verified provenance — not a URL or a document chunk. The verifier either confirms entailment between the response statement and the resolved source text, or the response is not delivered. There is no path to issuing a statement that contradicts its own citation. The SourceCheckup results will quantify the magnitude of that structural difference.
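
The deliver-or-withhold behavior described above can be pictured as an all-or-nothing gate. The sketch below is a conceptual illustration only, not Skippy's verifier: `CitedStatement`, `gate_response`, and the toy `substring_entails` check are hypothetical names, and a real entailment check would be far more involved than substring matching.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class CitedStatement:
    text: str         # statement in the draft response
    source_text: str  # resolved text of the cited source (hypothetical field)


def gate_response(
    statements: Sequence[CitedStatement],
    entails: Callable[[str, str], bool],
) -> Optional[List[str]]:
    """Return the statements only if every one is entailed by its source."""
    if not statements:
        return None  # nothing verifiable, so nothing is delivered
    for statement in statements:
        if not entails(statement.text, statement.source_text):
            return None  # a single unverified statement withholds the response
    return [statement.text for statement in statements]


def substring_entails(claim: str, source: str) -> bool:
    """Toy stand-in for an entailment check, used only to exercise the gate."""
    return claim.lower() in source.lower()


supported = CitedStatement(
    "warfarin interacts with aspirin",
    "Label text: Warfarin interacts with aspirin and other NSAIDs.",
)
unsupported = CitedStatement("warfarin is safe with aspirin", supported.source_text)
delivered = gate_response([supported], substring_entails)
withheld = gate_response([supported, unsupported], substring_entails)
```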

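| Model | Statement support | Response support | Citation valid |
|---|---|---|---|
| Skippy | Publishing Q3 2026 | Publishing Q3 2026 | Publishing Q3 2026 |
| GPT-4o + Web Search | ~50% | ~33% | ~72% |
| GPT-4o | ~48% | ~31% | |
| Gemini Pro | ~45% | ~28% | |
| Claude 3 | ~47% | ~30% | |
| Llama 3 | ~43% | ~26% | |
| Mistral | ~41% | ~25% | |
| Perplexity | ~52% | ~35% | ~68% |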
Baseline figures are approximate; exact numbers are in the Stanford SourceCheckup paper (Nature Communications, Apr 2025). The Skippy row publishes in Q3 2026 with 95% Wilson confidence intervals.
Publication commitment

Results will be published in full (methodology, the complete comparison table, and confidence intervals) regardless of outcome. The whitepaper will be submitted for editorial consideration at NEJM AI. We will not publish selectively favorable results.

  • Full whitepaper (PDF + landing page) published with physician adjudication detail
  • Quarterly CI regression gate enforced automatically after initial publication
  • Annual physician re-adjudication for headline metric refresh
  • Submitted to NEJM AI for external peer review
Baseline comparison

Skippy vs. GPT-4 on 500 medical claims

The benchmark uses the same 500 drug interaction questions and the same clinical pharmacist gold labels. GPT-4 runs without grounding — the baseline that reflects how general AI is currently deployed in clinical tooling.

The architectural difference

GPT-4 produces confident classifications whether or not the evidence supports them. On contested interactions — where evidence is partial, conflicting, or sparse — it fabricates plausible-sounding answers.

Skippy doesn't reduce that error rate. It eliminates the mechanism. Every output is either a verified finding traced to a specific, versioned source — or an explicit abstention with a calibrated confidence score. There is no path to a fabricated classification.

A general LLM classifying a CONTRAINDICATED pair as safe produces a patient event. The benchmark exists to show that error mode is architecturally impossible in Skippy — not merely less likely.

GPT-4 (no grounding)
~50% error rate on contested interactions
Produces confident wrong classifications when evidence is sparse
Stanford SourceCheckup baseline (Nature Communications, Apr 2025)
Skippy
0 fabricated findings
Verified finding with source lineage, or explicit calibrated abstention. No third option.
On insufficient evidence
GPT-4: Confident guess
Skippy: Calibrated abstention
Deployment gates

Calibration failure blocks deployment. Automatically.

The benchmark is not a report — it's a gate. Every evidence update runs the full benchmark suite before deployment. If a gate fails, deployment is blocked until the issue is resolved.

| Metric | Required | Current (Q2 2026) | On failure |
|---|---|---|---|
| ECE on drug interactions | < 0.10 | 0.07 | Deployment blocked |
| CONTRAINDICATED failure rate | < 10% | < 2% | Hard block (cannot be overridden) |
| Retracted claim passthrough | < 5% | < 2% | Deployment blocked |
| Cochrane alignment ECE | < 0.15 | 0.09 | Deployment blocked |
| Confidence–accuracy correlation | > 0.80 | 0.91 | Deployment blocked |
| False-positive CONTRAINDICATED rate | < 25% | < 12% | Deployment blocked |
Last run: May 2026 · All 6 gates: PASS · Mapped to FDA SaMD PCCP requirements and EU AI Act Article 15 accuracy monitoring obligations
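
The six thresholds in the table can be expressed as a small gate check that returns a non-zero exit code on any failure. This is a sketch under stated assumptions, not the skippy-pccp implementation: the metric keys, the `Gate` class, and the function names are hypothetical, and the three example values marked as placeholders are not reported results (the page gives only upper bounds for them).

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Gate:
    name: str                                # hypothetical metric key
    passes: Callable[[float, float], bool]   # comparison of value against threshold
    threshold: float


# Thresholds copied from the "Required" column; rates are written as fractions.
GATES = [
    Gate("ece_ddi", operator.lt, 0.10),
    Gate("contraindicated_failure_rate", operator.lt, 0.10),
    Gate("retracted_claim_passthrough", operator.lt, 0.05),
    Gate("cochrane_alignment_ece", operator.lt, 0.15),
    Gate("confidence_accuracy_correlation", operator.gt, 0.80),
    Gate("false_positive_contraindicated_rate", operator.lt, 0.25),
]


def failed_gates(metrics: Dict[str, float]) -> List[str]:
    """Names of gates that fail; a missing metric counts as a failure."""
    return [
        gate.name
        for gate in GATES
        if gate.name not in metrics
        or not gate.passes(metrics[gate.name], gate.threshold)
    ]


def exit_code(metrics: Dict[str, float]) -> int:
    """0 when every gate passes, 1 when any gate fails (blocks deployment)."""
    return 1 if failed_gates(metrics) else 0


example_metrics = {
    "ece_ddi": 0.07,
    "contraindicated_failure_rate": 0.01,         # placeholder, reported only as "< 2%"
    "retracted_claim_passthrough": 0.01,          # placeholder, reported only as "< 2%"
    "cochrane_alignment_ece": 0.09,
    "confidence_accuracy_correlation": 0.91,
    "false_positive_contraindicated_rate": 0.10,  # placeholder, reported only as "< 12%"
}
status = exit_code(example_metrics)
```

In a CI job, the return value would typically be passed to `sys.exit` so that a failing gate stops the pipeline.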
Vendor Evaluation Checklist

What compliance teams ask for. Where to find it.

Common criteria from CISO and compliance officer AI vendor evaluation checklists, mapped to what is documented on this page.

Calibration metrics published with confidence intervals
ECE 0.07 (95% CI: 0.04–0.10) · AUROC 0.91 (95% CI: 0.88–0.94) — Validated Results section above
Independent expert labeling with inter-rater reliability reported
Board-certified PharmD · Fleiss' κ = 0.81 · FDA labels + DrugBank + MCG — Three Test Sets section above
Dataset composition and curation process documented
707-item benchmark: 624 DDI + 100 retraction + 100 Cochrane · each test set detailed above
Hard deployment gates — enforcement, not advisory monitoring
6 blocking gates · any failure halts deployment (exit 1) · open-source skippy-pccp — Deployment Gates section above
Evaluation tools independently runnable without vendor access
Apache 2.0 · skippy-eval, skippy-pccp, skippy-verify · pip install · runs against any Skippy API key
All benchmark runs logged — failures included, none suppressed
Version history table above — quarterly cadence · includes any future failures by policy
Evaluation framework alignment documented (TRIPOD-AI, FDA GMLP)
TRIPOD-AI aligned · FDA GMLP aligned · NIST AI 800-2 aligned — Evaluation Frameworks section above
External peer review or independent site validation
SourceCheckup physician study in progress (Q2–Q3 2026) · dataset release Q3 2026 · NEJM AI submission planned
Subgroup analysis (pediatric, geriatric, renal impairment)
Subgroup analysis planned for Q3 2026 dataset release — see Limitations section below


Reading confidence scores

What each confidence range means for clinical use

At ECE < 0.10, a stated confidence of 0.90 empirically corresponds to approximately 88–92% accuracy. These thresholds translate the calibrated score into a clinical workflow recommendation.

| Confidence | Evidence quality | Recommended use |
|---|---|---|
| ≥ 0.90 | High confidence | Suitable for automated decision support with audit record |
| 0.70 – 0.89 | Supported | Recommend human review for high-stakes decisions |
| 0.30 – 0.69 | Uncertain | Flag for mandatory clinician review; do not automate |
| < 0.30 | Insufficient evidence | Do not use for clinical decisions; escalate to expert review |

Thresholds derived from ECE calibration study. At ECE 0.07 on drug interactions, a score of 0.90 corresponds to approximately 88–92% empirical accuracy.
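
A consuming system might encode the bands above as a single routing function. The sketch below is one way to do that; the function name and the returned labels are hypothetical. The table leaves small gaps between bands (for example, 0.895), so the sketch assumes each band's lower bound is inclusive and extends up to the next band.

```python
def recommended_use(confidence: float) -> str:
    """Map a calibrated confidence score to a workflow recommendation."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if confidence >= 0.90:
        return "automated_decision_support_with_audit_record"
    if confidence >= 0.70:
        return "human_review_for_high_stakes_decisions"
    if confidence >= 0.30:
        return "mandatory_clinician_review"
    return "escalate_to_expert_review"


routing_examples = {score: recommended_use(score) for score in (0.95, 0.75, 0.50, 0.10)}
```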

Open benchmark commitment

We will publish the dataset and invite external replication.

Simultaneous with the SourceCheckup whitepaper publication (Q3 2026), we will release the full test sets (DDI, Retraction, Cochrane), evaluation scripts, and results — along with Docker images for exact reproduction. External researchers are invited to run their own models and compare against the same gold standard. We will co-author with any external team that submits independently reproduced results.

707-item test sets published — Q3 2026
Evaluation scripts open source (Apache 2.0)
Docker image for exact reproduction
ArXiv preprint + NEJM AI submission

Enterprise teams can request early access to the dataset for independent evaluation prior to public release.

Appropriate Use & Limitations

What this benchmark covers — and what it doesn't

Responsible use of this data requires understanding where the validation was performed and where results may not generalize. These limitations are published proactively — not because they are disqualifying, but because obscuring them would undermine the credibility the benchmark is meant to establish.

Validated scope
Adult patients with primary care drug interaction queries
Standard drug pairs represented in FDA labeling and DrugBank
Drug-disease indications covered by Cochrane systematic reviews
Retraction detection for PubMed-indexed biomedical literature
Known limitations
Pediatric, geriatric, and severe renal/hepatic impairment subgroups are underrepresented — subgroup analysis planned for Q3 2026 dataset release.
Oncology drug interactions (e.g., chemotherapy regimens, targeted therapy combinations) are outside the current validated scope.
Performance may vary for novel biologics approved after the benchmark dataset cutoff where FDA labeling coverage is limited.
External site validation is pending (Q3 2026). Results at institutions with significantly different prescribing patterns may differ from the benchmark.
The SourceCheckup LLM head-to-head comparison has not yet been independently peer-reviewed — results publish Q3 2026.

Interested in the benchmark study?

We are working with clinical pharmacists and external reviewers to run the full study. If you want early access to results or want to collaborate on the external validation, reach out.