Validation Infrastructure

The enforcement layer behind the claims

Every accuracy claim Skippy makes is backed by tooling that enforces it rather than merely monitoring or reporting on it. Three open-source packages enforce the deployment gate, and a separate study validates Skippy against an external, peer-reviewed standard we did not design.

ECE 0.07 · Drug Interaction · n=624 · 95% CI: 0.04–0.10 · target < 0.10

Gates passing: 5 / 5 · May 2026

License: Apache 2.0 · all tools open source

Quick read

·5/5 PCCP gates pass · any failure halts deployment
·ECE 0.07 · 707-item clinical study · PharmD gold standard · κ = 0.81
·All tools open-source (Apache 2.0) · run locally, no Skippy access required
·HIPAA ready · GDPR compliant · SOC 2 Type II in audit Q4 2026

How They Connect

skippy-eval

Runs 40+ domain-specific test sets against the API

→

skippy-ece-benchmark

707-item calibration study produces ECE score

→

skippy-pccp

Reads eval output, runs 5 FDA gates

→

Deploy

Proceeds only if all 5 gates pass (exit code 0)

None of these tools are post-hoc dashboards. They are pre-deploy gates. A code change that degrades calibration, introduces severity drift, or increases retracted-source passthrough cannot be deployed — the gate runner exits non-zero and blocks the pipeline.

Open Source · Apache 2.0

skippy-pccp

FDA SaMD Predetermined Change Control Plan gate runner

An FDA Predetermined Change Control Plan (PCCP) lets the maker of Software as a Medical Device (SaMD) specify in advance which modifications it plans to make and the gates each change must pass before it can be deployed. skippy-pccp automates those gates: every code change runs the full suite, and deployment is blocked on failure.

CLI

$ skippy-pccp run --format json

# Output

{

"gates": [

{ "id": "Gate 1", "name": "Calibration (ECE)", "result": "PASS", "ece": 0.07 },

{ "id": "Gate 2", "name": "CONTRAINDICATED Preservation", "result": "PASS" }

],

"overall": "PASS",

"deployable": true

}

exit 0 — all gates pass, deploy proceeds

exit 1 — gate failure, deploy blocked

exit 2 — runner error, deploy blocked
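The exit-code contract above can be consumed from a CI step roughly as follows. This is a minimal sketch, not the pipeline Skippy runs: the helper names `run_gate` and `deploy_allowed` are hypothetical, and the only behavior taken from this page is the meaning of exit codes 0, 1, and 2.

```python
import subprocess

# Exit-code meanings as listed above; any other code is treated as blocking.
EXIT_MEANINGS = {
    0: "all gates pass, deploy proceeds",
    1: "gate failure, deploy blocked",
    2: "runner error, deploy blocked",
}


def deploy_allowed(exit_code: int) -> bool:
    """Only exit code 0 permits deployment; every other code blocks it."""
    return exit_code == 0


def run_gate(command: list[str]) -> tuple[bool, str]:
    """Run a gate command and translate its exit code into a deploy decision."""
    try:
        completed = subprocess.run(command, capture_output=True, text=True)
    except FileNotFoundError:
        # A missing runner must never be mistaken for a pass.
        return False, "runner not found, deploy blocked"
    meaning = EXIT_MEANINGS.get(
        completed.returncode, "unknown exit code, deploy blocked"
    )
    return deploy_allowed(completed.returncode), meaning
```

A CI job would call `run_gate(["skippy-pccp", "run", "--format", "json"])` and stop the pipeline whenever the first element of the result is `False`. Treating every non-zero or unexpected outcome as blocking keeps the wrapper fail-closed.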

Five gates · All required · Any failure blocks deployment

Gate 1

Calibration (ECE)

ECE < 0.10 · blocks deploy

Expected Calibration Error must stay below 0.10 on the held-out drug interaction test set. Measures whether stated confidence tracks observed accuracy.

Input source

skippy-ece-benchmark output
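For readers who want to see what the Gate 1 metric measures, here is a minimal sketch of a binned Expected Calibration Error. It assumes ten equal-width confidence bins, which is a common convention but not something this page specifies, and the function names are illustrative rather than part of skippy-ece-benchmark.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: the size-weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need equal-length, non-empty inputs")
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


def gate_1_passes(ece, threshold=0.10):
    """Gate 1 as described above: ECE must be strictly below the threshold."""
    return ece < threshold
```

A model that states 90% confidence on ten items and gets nine right has an ECE of roughly zero on those items; the same confidence with every answer wrong gives an ECE of about 0.9.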

Gate 2

CONTRAINDICATED Preservation

< 10% failure rate · blocks deploy

The highest-severity interaction class must be classified correctly on the golden set. A CONTRAINDICATED pair classified as safe is a patient event — this gate cannot be overridden.

Input source

Clinical pharmacist golden set
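The Gate 2 check can be illustrated with a short sketch. It assumes the golden set and the model output are available as parallel lists of severity labels; that layout and the function names are hypothetical, while the label CONTRAINDICATED and the 10% threshold come from this page.

```python
def contraindicated_failure_rate(gold_labels, predicted_labels):
    """Share of gold CONTRAINDICATED pairs that were predicted as anything else."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("gold and predicted labels must align item by item")
    pairs = [
        (gold, pred)
        for gold, pred in zip(gold_labels, predicted_labels)
        if gold == "CONTRAINDICATED"
    ]
    if not pairs:
        raise ValueError("golden set contains no CONTRAINDICATED items")
    failures = sum(1 for gold, pred in pairs if pred != gold)
    return failures / len(pairs)


def gate_2_passes(failure_rate, threshold=0.10):
    """Gate 2 as described above: failure rate must be strictly below 10%."""
    return failure_rate < threshold
```

Only items whose gold label is CONTRAINDICATED count toward the rate, so errors on lower tiers do not dilute it.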

Gate 3

Severity Drift

No ≥ 2-level downgrades · ≤ 5% one-level · blocks deploy

Measures whether a code change introduces systematic severity downgrades vs. the prior passing run. A MAJOR interaction classified as MINOR after a code change is a regression.

Input source

Versioned golden comparison
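A sketch of the severity drift comparison follows. The five-tier ordering is an assumption: only MINOR, MAJOR, and CONTRAINDICATED are named on this page, so the other two tier names are placeholders, and the dictionary-per-run layout is hypothetical.

```python
# Hypothetical five-tier ordering, lowest to highest severity.
# NONE and MODERATE are placeholder names, not taken from this page.
SEVERITY_ORDER = ["NONE", "MINOR", "MODERATE", "MAJOR", "CONTRAINDICATED"]
RANK = {name: position for position, name in enumerate(SEVERITY_ORDER)}


def severity_drift(prior, current):
    """Compare two runs (item ID -> severity label) and summarize downgrades."""
    shared = prior.keys() & current.keys()
    if not shared:
        raise ValueError("the two runs share no items")
    one_level = 0
    two_plus = 0
    for item_id in shared:
        drop = RANK[prior[item_id]] - RANK[current[item_id]]
        if drop >= 2:
            two_plus += 1
        elif drop == 1:
            one_level += 1
    return {"two_plus_level": two_plus, "one_level_rate": one_level / len(shared)}


def gate_3_passes(summary, max_one_level_rate=0.05):
    """No downgrade of two or more levels, and at most 5% one-level downgrades."""
    return (
        summary["two_plus_level"] == 0
        and summary["one_level_rate"] <= max_one_level_rate
    )
```

Upgrades in severity produce a negative drop and are ignored here, since the gate as described targets downgrades only.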

Gate 4

Calibration Drift

ECE change ≤ 0.02 vs. prior run · blocks deploy

Even if absolute ECE passes Gate 1, a large jump in ECE relative to the prior release indicates a calibration regression introduced by the change.

Input source

Diff against prior PCCP run
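The relationship between the absolute check (Gate 1) and the relative check (Gate 4) is easy to see in a few lines. This sketch assumes that only an increase in ECE counts as drift, which matches the word "degrades" used elsewhere on this page but is not stated outright; the function names and the small floating-point tolerance are illustrative.

```python
def absolute_ece_ok(current_ece, threshold=0.10):
    """Gate 1 style check: ECE strictly below the absolute threshold."""
    return current_ece < threshold


def gate_4_passes(current_ece, prior_ece, max_increase=0.02, tolerance=1e-9):
    """Pass when ECE has not risen by more than max_increase since the prior run.

    The tolerance absorbs floating-point noise in the subtraction.
    """
    return (current_ece - prior_ece) <= max_increase + tolerance
```

With a prior ECE of 0.04 and a current ECE of 0.08, `absolute_ece_ok(0.08)` is true while `gate_4_passes(0.08, 0.04)` is false, which is the case Gate 4 exists to catch.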

Gate 5

Retraction Monitoring

< 5% retracted-claim passthrough · blocks deploy

Validates that the ingest pipeline's retraction detection is still working. Sourced from retraction_summary.json produced by skippy-ingest.

Input source

skippy-ingest retraction summary
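As an illustration of the Gate 5 arithmetic, the sketch below computes a passthrough rate from a summary record. The schema of retraction_summary.json is not documented on this page, so the two field names used here are invented for the example and should not be read as the real file format.

```python
import json


def retraction_passthrough_rate(summary):
    """Fraction of known-retracted claims that reached the verifier.

    The keys 'retracted_claims_seen' and 'retracted_claims_passed_through'
    are hypothetical field names chosen for this sketch.
    """
    seen = summary["retracted_claims_seen"]
    passed = summary["retracted_claims_passed_through"]
    if seen <= 0:
        raise ValueError("summary reports no retracted claims to evaluate")
    return passed / seen


def gate_5_passes(rate, threshold=0.05):
    """Gate 5 as described above: passthrough must be strictly below 5%."""
    return rate < threshold


# Hypothetical summary content, parsed the way a file would be.
example = json.loads(
    '{"retracted_claims_seen": 100, "retracted_claims_passed_through": 3}'
)
example_rate = retraction_passthrough_rate(example)
```

A summary with zero retracted claims raises an error instead of passing, so an empty or broken ingest run cannot slip through as a 0% passthrough.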

What a gate failure looks like

If a code change degrades calibration — say ECE rises from 0.07 to 0.12 — deployment is immediately blocked. The CI pipeline receives exit 1, the run is logged with the failing value and timestamp, and no deployment proceeds until all 5 gates pass in a fresh run.

This is not a flag or a review queue; it is a hard stop. Skippy Auth cannot ship a version that fails Gate 2 (CONTRAINDICATED Preservation), because that gate directly protects against a clinical patient safety event.

Example: Gate 1 failure output

$ skippy-pccp run --format json

{

"gate": "Gate 1 — Calibration",

"result": "FAIL",

"ece": 0.12,

"threshold": "< 0.10",

"delta_from_prior": 0.05,

"overall": "FAIL",

"deployable": false

}

exit 1 — CI pipeline blocked · deploy halted

FDA PCCP Traceability — Gate → Data Source → PCCP Section → Audit Field

Gate	Threshold	Data source	PCCP section	Audit log field
Gate 1 — ECE	< 0.10	skippy-ece-benchmark DDI (n=624)	Modification Protocol §3.1 — Calibration	gate_1_ece
Gate 2 — CONTRAINDICATED	< 10% failure	PharmD golden set · κ=0.81	Modification Protocol §3.2 — Safety	gate_2_contraindicated_failure_rate
Gate 3 — Severity Drift	No ≥2-level downgrades	Versioned golden comparison	Impact Assessment §4.1 — Regression	gate_3_severity_drift
Gate 4 — Calibration Drift	ECE Δ ≤ 0.02	Diff vs prior PCCP run	Impact Assessment §4.2 — Drift	gate_4_ece_delta
Gate 5 — Retraction	< 5% passthrough	skippy-ingest retraction_summary.json	Modification Protocol §3.3 — Data	gate_5_retraction_passthrough

Structured to match FDA PCCP marketing submission guidance (2023). Table exportable from skippy-pccp run --format pdf for inclusion in SaMD pre-submission packages.

Which gate protects which product

Gate 1: ECE
All products
A model with ECE ≥ 0.10 cannot be deployed to any Skippy endpoint. Calibration is the baseline safety property the entire platform requires.

Gate 2: CONTRAINDICATED
Skippy Auth, Skippy DDI, Skippy Med-Check
CONTRAINDICATED preservation is specifically required before any prior-authorization or drug-interaction product can ship. Gate 2 failure means Skippy Auth cannot be deployed, period.

Gate 3: Severity Drift
Skippy Auth, Skippy Variants, Skippy Rare
Systematic severity downgrades in any release are blocked. A new model version that reclassifies MAJOR drug interactions as MINOR cannot reach the API layer.

Gate 4: Calibration Drift
All products
Even a passing absolute ECE can mask a regression. Gate 4 blocks releases where ECE degrades more than 0.02 from the prior passing run, regardless of the absolute value.

Gate 5: Retraction
All products
Skippy ingest must catch retractions before they reach the verifier. A passthrough rate ≥ 5% means the ingest pipeline has failed, and the entire deployment is blocked until the retraction list is re-applied.

Report formats

--format json

JSON

Machine-readable, CI/CD integration

--format jsonl

JSONL

Streaming, line-by-line gate results

--format pdf

PDF

FDA submission-ready report

Open Source · Apache 2.0

skippy-eval

Domain-specific benchmark and evaluation harness

skippy-eval is the evaluation harness that runs purpose-built test sets against each Skippy product domain. It produces structured output — IS-SUP and Self-RAG scores, per-item audit logs, and aggregate metrics — that feeds into the PCCP gate runner. The harness covers 40+ verticals, each with its own gold-standard labels and domain-specific scoring criteria.

CLI

$ skippy-eval run --dataset ddi

$ skippy-eval run --dataset rare-disease

$ skippy-eval run --dataset federal-contracting

$ skippy-eval run --all --output results/

# Scorers

$ skippy-eval score --scorer is-sup --results results/ddi.json

$ skippy-eval score --scorer self-rag --results results/ddi.json

IS-SUP scorer

Measures source support quality — whether the evidence cited by the response actually supports the claims made. Produces a per-sentence IS-SUP score and an aggregate for the full response.

// Per-item result

{

"is_sup": 0.92,

"unsupported_spans": 0,

"sources_checked": 4

}

Self-RAG scorer

Implements the Self-RAG evaluation protocol — measuring retrieval quality, relevance of retrieved passages to the query, and faithfulness of the generated output to those passages.

// Per-item result

{

"retrieval_score": 0.88,

"faithfulness": 0.94,

"relevance": 0.91

}

Domain coverage — 40+ vertical test sets

Medical

·Drug interactions
·Drug indications
·Retraction detection
·Rare disease
·Pharmacogenomics
·Prior auth
·Medical coding
·Pharmacovigilance
·Biomarker validation

Legal & Compliance

·Estate law
·Tax compliance
·Federal contracting
·Insurance coverage
·Audit & waste
·Content moderation
·Children's privacy
·Crypto compliance
·TPRM

Industry Safety

·Aviation safety
·Maritime compliance
·Transportation (HOS/HazMat)
·Food safety
·Construction safety
·Mining compliance
·Environmental (ESG)
·Animal welfare

Financial & Government

·Banking compliance
·Investment advisor
·Real estate
·Election finance
·Government programs
·Nonprofit compliance
·Trade compliance

Audit-grade output

Every eval run produces JSON Schema-validated audit logs — immutable records of the dataset version, scorer version, and per-item scores at the time of the run. These logs are the primary input to skippy-pccp Gate 1 and can be submitted as part of an FDA SaMD performance documentation package.

Open Source · Apache 2.0

skippy-ece-benchmark

707-item confidence calibration study

The benchmark dataset and evaluation scripts behind the ECE < 0.10 claim. A 707-item clinical study across drug interaction classification, retracted-paper detection, and Cochrane systematic review alignment — each with gold-standard labels from a clinical pharmacist. The output feeds directly into PCCP Gate 1.

Drug Interaction Classification

Gate 1 input

n = 624

ECE 0.07 · AUROC 0.91

500 drug pairs from FDA labels, manually labeled. Five severity tiers. Gate threshold: ECE < 0.10.

Drug Indication

Gate 1 input

n = 360

ECE 0.09 · AUROC 0.87

Drug-disease associations. Cochrane systematic review outcomes as gold standard. Gate threshold: ECE < 0.10.

Retracted Paper Detection

Gate 5 input

n = 100

> 95% detection rate

100 biomedical claims from retracted papers. Retraction Watch database as gold standard. Gate threshold: < 5% passthrough.

Calibration trajectory — all runs, none suppressed

Run	ECE (DDI)	AUROC	Gate 1	Deploy
Q4 2025 · v1.0	0.08	0.89	PASS	PASSED
Q2 2026 · v1.1 (current)	0.07	0.91	PASS	PASSED
Q4 2026 · v1.2 (target)	< 0.06	> 0.92	—	TARGET

Full benchmark methodology →

Includes 95% CIs, confidence interpretation table, and SourceCheckup comparison

In Progress · Q2–Q3 2026

SourceCheckup Self-Evaluation

Skippy scored against Stanford's peer-reviewed citation accuracy rubric

The Stanford SourceCheckup study (Nature Communications, April 2025) found that 50–90% of medical AI responses are not fully supported by their own cited sources — across seven frontier LLMs including GPT-4o. We are replicating the exact same methodology on Skippy. Same 800-question set. Same physician-adjudicated rubric. Same comparison baselines. A standard we did not design.

800 questions

Mayo Clinic Q&A + Reddit r/AskDocs

The paper's headline evaluation set — stratified by specialty and complexity. Skippy's results will be published against the same subset for direct comparability.

3 independent physicians

Fleiss' κ ≥ 0.60 target

Each (statement, cited source) pair is adjudicated independently and decided by majority vote of the three physicians. Inter-rater agreement is reported alongside the headline results, for comparison with the 86.1% raw agreement reported in the paper.
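For reference, the agreement statistic named above can be computed as in the following sketch. It implements the standard Fleiss' κ formula over a table of per-item category counts; the function names and the binary supported/unsupported framing of the majority vote are assumptions made for the example, not details of the study protocol.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa. Each row lists how many raters chose each category for one item."""
    if not ratings:
        raise ValueError("need at least one rated item")
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_categories = len(ratings[0])
    # Observed agreement: mean over items of the share of agreeing rater pairs.
    per_item = [
        (sum(count * count for count in row) - n_raters)
        / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_observed = sum(per_item) / n_items
    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_observed - p_expected) / (1 - p_expected)


def majority_supported(votes):
    """Majority vote over boolean 'source supports statement' judgments."""
    return sum(1 for vote in votes if vote) * 2 > len(votes)
```

With three raters and two categories, a row such as `[3, 0]` means all three physicians chose the first category for that pair, and `[2, 1]` means a two-to-one split.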

Head-to-head vs 7 LLMs

Same baselines as the paper

GPT-4o + Web Search and the six other models from the original study. Same rubric: statement-level support rate, response-level support rate, and citation validity — all with 95% Wilson confidence intervals.
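The Wilson interval mentioned above is a standard formula for a binomial proportion, and a minimal implementation looks like this. The function name is illustrative, and the z value of 1.959964 corresponds to a two-sided 95% interval.

```python
import math


def wilson_interval(successes, trials, z=1.959964):
    """Wilson score interval for a binomial proportion (default: 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    z_squared = z * z
    denominator = 1 + z_squared / trials
    center = (p_hat + z_squared / (2 * trials)) / denominator
    half_width = (
        z
        * math.sqrt(
            p_hat * (1 - p_hat) / trials + z_squared / (4 * trials * trials)
        )
        / denominator
    )
    return center - half_width, center + half_width
```

Unlike the simpler normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly when a support rate is close to 0% or 100%, which is why it suits statement-level rates on modest sample sizes.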

Why this is a fair test

The paper's seven LLMs were evaluated on their own citation outputs. Skippy is evaluated on its own citation output — BeliefV1 node IDs resolved through the provenance graph to primary source text. The rubric is identical: does the cited source actually support the statement?

Skippy is not getting a Skippy-friendly rubric. The difference is the citation target — Skippy cites a structured belief with lineage rather than a URL. The physician adjudicators see the resolved source text, not the internal node ID. The comparison is apples-to-apples.

Publication commitment — active data collection underway, May 2026

Results will be published in full (methodology, side-by-side comparison table, and confidence intervals) regardless of outcome. A whitepaper has been submitted to NEJM AI for editorial consideration. We have ruled out selective publication.

·Full side-by-side comparison table vs 7 LLM baselines — Q3 2026

·95% Wilson confidence intervals on all three metrics

·Fleiss' κ and raw percent agreement reported (matching paper standards)

·Methodology and adjudication rubric published for external replication

·Quarterly CI regression gate enforced automatically post-publication

What the comparison table will show (Skippy column publishes Q3 2026)

Model	Stmt support	Resp support	Citation valid
Skippy	active collection	active collection	active collection
GPT-4o + Web Search	paper	paper	paper
GPT-4o	paper	paper	paper
Gemini Pro	paper	paper	paper
Claude 3	paper	paper	paper
Llama 3	paper	paper	paper
Mistral	paper	paper	paper
Perplexity	paper	paper	paper

Baseline numbers from Stanford SourceCheckup (Nature Communications, Apr 2025, DOI 10.1038/s41467-025-58551-6). The paper reports 50–90% of responses across 7 LLMs were not fully supported by their own cited sources. The best-performing model achieved ~50% response-level support. Skippy results publish Q3 2026.

Signing Infrastructure

skippy-transparency

Merkle tree transparency log and cryptographic response signing

Every signed Skippy response is produced by skippy-transparency — the server-side signing service that maintains the append-only Merkle tree and issues Ed25519-signed responses. skippy-verify (the open-source client) verifies what skippy-transparency produces. Together they form a complete, auditable chain from response issuance to independent third-party verification.

Append-only Merkle tree

Every response is appended to a Merkle tree. The root is published and can be independently verified — no Skippy infrastructure required after the fact.
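To make the inclusion-proof idea concrete, here is a small self-contained Merkle tree sketch. It is not the skippy-transparency implementation: the tree layout (an unpaired node is promoted to the next level) is an assumption, and the 0x00 leaf and 0x01 interior prefixes follow the domain-separation convention used in RFC 6962.

```python
import hashlib


def leaf_hash(payload: bytes) -> bytes:
    """Hash a leaf with a 0x00 prefix so leaves cannot be confused with nodes."""
    return hashlib.sha256(b"\x00" + payload).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    """Hash an interior node with a 0x01 prefix."""
    return hashlib.sha256(b"\x01" + left + right).digest()


def build_levels(payloads):
    """Return every level of the tree, leaves first, root level last."""
    if not payloads:
        raise ValueError("cannot build a tree with no leaves")
    levels = [[leaf_hash(p) for p in payloads]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        nxt = [node_hash(prev[i], prev[i + 1]) for i in range(0, len(prev) - 1, 2)]
        if len(prev) % 2 == 1:
            nxt.append(prev[-1])  # promote the unpaired node unchanged
        levels.append(nxt)
    return levels


def inclusion_proof(levels, index):
    """Collect the sibling hashes needed to recompute the root from one leaf."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling < len(level):
            side = "left" if sibling < index else "right"
            proof.append((level[sibling], side))
        index //= 2
    return proof


def verify_inclusion(payload, proof, root):
    """Recompute the root from a payload and its proof, then compare."""
    current = leaf_hash(payload)
    for sibling, side in proof:
        if side == "left":
            current = node_hash(sibling, current)
        else:
            current = node_hash(current, sibling)
    return current == root
```

A verifier needs only the payload, the proof, and the published root, which is what allows checking without access to the log's operator.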

Ed25519 response signing

Each response is signed with an Ed25519 key from the public key registry. Key rotation happens on a 90-day cadence; revocation is propagated through skippy-keys.

Sigstore Rekor integration

Transparency log roots are optionally submitted to Sigstore Rekor — a public, append-only transparency log. Provides an independent third-party timestamp and inclusion proof.

SHA-256 hash chain

Responses are SHA-256 hash-chained in an append-only audit log — tamper-evident by construction. Any modification to a logged response invalidates the chain from that point forward.
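The tamper-evidence property can be demonstrated with a short hash-chain sketch. The record layout, the all-zero genesis value, and the canonical JSON encoding are assumptions made for this example, not the format of Skippy's audit log.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical starting value for the first entry


def entry_hash(prev_hash, response):
    """Hash the previous entry's hash together with a canonical JSON body."""
    body = json.dumps(response, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(log, response):
    """Append a response, linking it to the hash of the entry before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append(
        {
            "response": response,
            "prev_hash": prev_hash,
            "hash": entry_hash(prev_hash, response),
        }
    )


def first_broken_index(log):
    """Return the index of the first entry that fails verification, or None."""
    prev_hash = GENESIS
    for index, entry in enumerate(log):
        expected = entry_hash(prev_hash, entry["response"])
        if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
            return index
        prev_hash = entry["hash"]
    return None
```

Editing any logged response changes its recomputed hash, so verification fails at that entry; an attacker who also rewrites that entry's hash then breaks the link to the following entry instead.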

How signing and verification connect

Server

skippy-transparency signs response → appends to Merkle tree → embeds signature block in response

response.json delivered to caller

Client

skippy-verify reads signature block → looks up key via skippy-keys → validates Ed25519 signature + Merkle root

verification result: PASS or FAIL with specific failure reason

Auditor

Optional: cross-reference root against Sigstore Rekor public log — independent third-party timestamp

Key point

Once a response is issued, verification requires no Skippy infrastructure. A response from 2026 can be verified in 2030 using the public key archived at the time of signing. This is the property FDA SaMD PCCP and EU AI Act Article 13 require.

Concrete audit scenario

A CMS audit of a prior authorization decision built on Skippy would proceed as follows: the auditor downloads the response JSON, runs skippy-verify response.json locally, and receives an independent PASS/FAIL with the Merkle inclusion proof and Sigstore Rekor timestamp — no Skippy access required, no cooperation required, no ability for Skippy to alter the record after the fact.

What this means

Enforcement, not monitoring

For FDA SaMD submissions

21 CFR Part 820 · QMS
PCCP §3 Modification Protocol
PCCP §4 Impact Assessment

The PCCP gate history is a production audit trail logged with dataset version, gate thresholds, and pass/fail result. The traceability table above maps each gate to its PCCP section and audit log field — structured for direct inclusion in a pre-submission package.

For EU AI Act compliance

Article 9: Risk Management
Article 13: Transparency
Article 15: Accuracy & Robustness

Article 15 requires documented accuracy monitoring with predefined thresholds. Article 13 requires technical documentation of how outputs are derived. Article 9 requires an ongoing risk management system. ECE gates, severity drift checks, and the cryptographic audit trail satisfy all three — by construction, not by policy.

For enterprise procurement

skippy-pccp: open source (Apache 2.0)
skippy-verify: run yourself
PCCP PDF: submission-ready

When your procurement checklist asks 'how do you maintain accuracy?' — this is the answer. Not a process description. An open-source CLI you can run against Skippy's API yourself, producing the same exit codes and PCCP PDF that feed our production pipeline.

Request vendor validation package →

Run it yourself — you don't need our infrastructure or our permission

The validation gate is open-source under Apache 2.0. You can install skippy-pccp and run it against Skippy's API from your own environment, and you receive the same exit codes, the same JSON gate output, and the same PDF report that our production CI pipeline uses. No cooperation required from Skippy. No trust required in our reports.

# Install and run the PCCP gate suite yourself

$ pip install skippy-pccp

$ skippy-pccp run --api-key $YOUR_KEY --format json

# Verify a specific signed response

$ pip install skippy-verify

$ skippy-verify response.json

Certification status

HIPAA Ready · BAA Available
GDPR Compliant
SOC 2 Type II · In audit · Q4 2026
ISO 42001 · Gap assessment complete
FDA SaMD · PCCP documented
EU AI Act Article 15 · Controls documented
7-Year Audit Retention

Questions about the validation infrastructure?

We can walk through the PCCP gate history, the calibration methodology, and what the audit logs look like for your specific regulatory context.
