The enforcement layer behind the claims
Every accuracy claim Skippy makes is backed by tooling that enforces it, rather than merely monitoring or reporting on it. Three open-source packages enforce the deployment gate; a fourth component, an evaluation study, validates Skippy against an external, peer-reviewed standard we did not design.
None of these tools are post-hoc dashboards. They are pre-deploy gates. A code change that degrades calibration, introduces severity drift, or increases retracted-source passthrough cannot be deployed — the gate runner exits non-zero and blocks the pipeline.
skippy-pccp
FDA SaMD Predetermined Change Control Plan gate runner
An FDA Predetermined Change Control Plan (PCCP) for Software as a Medical Device (SaMD) specifies, in advance, which changes a manufacturer may make and the checks each change must pass before it is deployed. skippy-pccp automates those checks: every code change runs the full suite, and deployment is blocked on failure.
Calibration (ECE)
Expected Calibration Error must stay below 0.10 on the held-out drug interaction test set. Measures whether stated confidence tracks observed accuracy.
CONTRAINDICATED Preservation
The highest-severity interaction class must be classified correctly on the golden set. A CONTRAINDICATED pair classified as safe is a patient event — this gate cannot be overridden.
Severity Drift
Measures whether a code change introduces systematic severity downgrades vs. the prior passing run. A MAJOR interaction classified as MINOR after a code change is a regression.
Calibration Drift
Even if absolute ECE passes Gate 1, a large jump in ECE relative to the prior release indicates a calibration regression introduced by the change.
Retraction Monitoring
Validates that the ingest pipeline's retraction detection is still working. Sourced from retraction_summary.json produced by skippy-ingest.
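As an illustration of what Gate 1 measures, here is a minimal sketch of Expected Calibration Error using equal-width confidence bins. The function name, the 10-bin default, and the input shape are assumptions made for this example; they are not taken from skippy-pccp or skippy-ece-benchmark.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumed bin count; the benchmark's binning is not stated
) -> float:
    """Weighted mean of |accuracy - mean confidence| over equal-width bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")

    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(accuracy - mean_conf)
    return ece
```

Under this definition, a model that states 80% confidence but is right only half the time in that bin contributes a gap of 0.30, well above the 0.10 gate threshold.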
If a code change degrades calibration — say ECE rises from 0.07 to 0.12 — deployment is immediately blocked. The CI pipeline receives exit 1, the run is logged with the failing value and timestamp, and no deployment proceeds until all 5 gates pass in a fresh run.
This is not a flag or a review queue — it is a hard stop. Skippy Auth cannot ship a version that downgrades Gate 2 (CONTRAINDICATED Preservation) because that gate directly protects against a clinical patient safety event.
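The severity drift check can be pictured as a comparison of labels between the prior passing run and the current run. The sketch below is hypothetical: the tier names NONE and MODERATE, the tier ordering, and the pair-ID format are assumptions, since only CONTRAINDICATED, MAJOR, and MINOR are named on this page.

```python
from typing import Dict, List, Tuple

# Assumed ordering from lowest to highest severity (five tiers).
SEVERITY_ORDER = ["NONE", "MINOR", "MODERATE", "MAJOR", "CONTRAINDICATED"]
RANK = {name: i for i, name in enumerate(SEVERITY_ORDER)}


def find_downgrades(
    prior: Dict[str, str],
    current: Dict[str, str],
    min_levels: int = 2,
) -> List[Tuple[str, str, str]]:
    """Return (pair_id, old_label, new_label) for each downgrade of at least min_levels tiers."""
    flagged = []
    for pair_id, old_label in prior.items():
        new_label = current.get(pair_id)
        if new_label is None:
            continue  # pair not present in the current run; not compared here
        drop = RANK[old_label] - RANK[new_label]
        if drop >= min_levels:
            flagged.append((pair_id, old_label, new_label))
    return sorted(flagged)
```

With this ordering, a MAJOR pair relabeled as MINOR is a two-level downgrade and would be flagged, while MAJOR to MODERATE would not.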
| Gate | Threshold | Data source | PCCP section | Audit log field |
|---|---|---|---|---|
| Gate 1 — ECE | < 0.10 | skippy-ece-benchmark DDI (n=624) | Modification Protocol §3.1 — Calibration | gate_1_ece |
| Gate 2 — CONTRAINDICATED | < 10% failure | PharmD golden set · κ=0.81 | Modification Protocol §3.2 — Safety | gate_2_contraindicated_failure_rate |
| Gate 3 — Severity Drift | No ≥2-level downgrades | Versioned golden comparison | Impact Assessment §4.1 — Regression | gate_3_severity_drift |
| Gate 4 — Calibration Drift | ECE Δ ≤ 0.02 | Diff vs prior PCCP run | Impact Assessment §4.2 — Drift | gate_4_ece_delta |
| Gate 5 — Retraction | < 5% passthrough | skippy-ingest retraction_summary.json | Modification Protocol §3.3 — Data | gate_5_retraction_passthrough |
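To make the table concrete, the following sketch checks a metrics dictionary against the five thresholds and returns a CI-style exit code. It is not the skippy-pccp implementation. The field names come from the audit log column above; the units (rates as fractions, severity drift as a count of flagged downgrades) and the function names are assumptions.

```python
import sys
from typing import Callable, Dict, List, Tuple

# (audit log field, human-readable threshold, pass predicate)
GATES: List[Tuple[str, str, Callable[[float], bool]]] = [
    ("gate_1_ece", "< 0.10", lambda v: v < 0.10),
    ("gate_2_contraindicated_failure_rate", "< 10%", lambda v: v < 0.10),
    ("gate_3_severity_drift", "no >=2-level downgrades", lambda v: v == 0),
    ("gate_4_ece_delta", "<= 0.02", lambda v: v <= 0.02),
    ("gate_5_retraction_passthrough", "< 5%", lambda v: v < 0.05),
]


def evaluate_gates(metrics: Dict[str, float]) -> List[str]:
    """Return the audit log fields of every gate that fails; a missing metric counts as a failure."""
    failures = []
    for field, _threshold, passes in GATES:
        if field not in metrics or not passes(metrics[field]):
            failures.append(field)
    return failures


def run(metrics: Dict[str, float]) -> int:
    """Return 0 if all gates pass, 1 otherwise, and report failures on stderr."""
    thresholds = {field: threshold for field, threshold, _ in GATES}
    failures = evaluate_gates(metrics)
    for field in failures:
        value = metrics.get(field, "missing")
        print(f"FAIL {field}: value={value} threshold {thresholds[field]}", file=sys.stderr)
    return 1 if failures else 0
```

A CI wrapper would typically end with `sys.exit(run(metrics))`, so that any failing gate stops the pipeline.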
skippy-eval
Domain-specific benchmark and evaluation harness
skippy-eval is the evaluation harness that runs purpose-built test sets against each Skippy product domain. It produces structured output — IS-SUP and Self-RAG scores, per-item audit logs, and aggregate metrics — that feeds into the PCCP gate runner. The harness covers 40+ verticals, each with its own gold-standard labels and domain-specific scoring criteria.
IS-SUP: measures source support quality, that is, whether the evidence cited by the response actually supports the claims made. Produces a per-sentence IS-SUP score and an aggregate for the full response.
Self-RAG: implements the Self-RAG evaluation protocol, measuring retrieval quality, relevance of retrieved passages to the query, and faithfulness of the generated output to those passages.
- Drug interactions
- Drug indications
- Retraction detection
- Rare disease
- Pharmacogenomics
- Prior auth
- Medical coding
- Pharmacovigilance
- Biomarker validation
- Estate law
- Tax compliance
- Federal contracting
- Insurance coverage
- Audit & waste
- Content moderation
- Children's privacy
- Crypto compliance
- TPRM
- Aviation safety
- Maritime compliance
- Transportation (HOS/HazMat)
- Food safety
- Construction safety
- Mining compliance
- Environmental (ESG)
- Animal welfare
- Banking compliance
- Investment advisor
- Real estate
- Election finance
- Government programs
- Nonprofit compliance
- Trade compliance
Every eval run produces JSON Schema-validated audit logs — immutable records of the dataset version, scorer version, and per-item scores at the time of the run. These logs are the primary input to skippy-pccp Gate 1 and can be submitted as part of an FDA SaMD performance documentation package.
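One way to picture such a record is a frozen data structure with a deterministic serialization. The sketch below is hypothetical: the class name, field names, and the SHA-256 digest are assumptions added for illustration, and JSON Schema validation is omitted because the schema itself is not shown here.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class AuditRecord:
    dataset_version: str
    scorer_version: str
    item_scores: Tuple[Tuple[str, float], ...]  # (item_id, score) pairs

    def to_json(self) -> str:
        """Serialize with sorted keys so the same record always yields the same bytes."""
        payload = {
            "dataset_version": self.dataset_version,
            "scorer_version": self.scorer_version,
            "item_scores": [
                {"item_id": item_id, "score": score}
                for item_id, score in self.item_scores
            ],
        }
        return json.dumps(payload, sort_keys=True, separators=(",", ":"))

    def digest(self) -> str:
        """SHA-256 of the serialized record; any change to a score changes the digest."""
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()
```

Storing the digest next to the record gives a reviewer a cheap way to confirm that a log has not been edited since the run.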
skippy-ece-benchmark
707-item confidence calibration study
The benchmark dataset and evaluation scripts behind the ECE < 0.10 claim. A 707-item clinical study across drug interaction classification, retracted-paper detection, and Cochrane systematic review alignment — each with gold-standard labels from a clinical pharmacist. The output feeds directly into PCCP Gate 1.
Drug interaction classification: 500 drug pairs from FDA labels, manually labeled across five severity tiers. Gate threshold: ECE < 0.10.
Cochrane systematic review alignment: drug-disease associations, with Cochrane systematic review outcomes as the gold standard. Gate threshold: ECE < 0.10.
Retracted-paper detection: 100 biomedical claims from retracted papers, with the Retraction Watch database as the gold standard. Gate threshold: < 5% passthrough.
| Run | ECE (DDI) | AUROC | Gate 1 | Deploy |
|---|---|---|---|---|
| Q4 2025 · v1.0 | 0.08 | 0.89 | PASS | PASSED |
| Q2 2026 · v1.1 (current) | 0.07 | 0.91 | PASS | PASSED |
| Q4 2026 · v1.2 (target) | < 0.06 | > 0.92 | — | TARGET |
SourceCheckup Self-Evaluation
Skippy scored against Stanford's peer-reviewed citation accuracy rubric
The Stanford SourceCheckup study (Nature Communications, April 2025) evaluated seven frontier LLMs, including GPT-4o, and found that 50–90% of medical AI responses are not fully supported by their own cited sources. We are replicating that methodology on Skippy: the same 800-question set, the same physician-adjudicated rubric, and the same comparison baselines. It is a standard we did not design.
Question set: the paper's headline evaluation set, stratified by specialty and complexity. Skippy's results will be published against the same subset for direct comparability.
Physician adjudication: each (statement, cited source) pair is adjudicated independently and decided by a majority vote of three. Inter-rater agreement is reported alongside the headline results, for comparison with the 86.1% raw agreement achieved in the paper.
Baselines: GPT-4o + Web Search and the six other models from the original study, scored on the same rubric: statement-level support rate, response-level support rate, and citation validity, all with 95% Wilson confidence intervals.
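The two statistical steps named above, majority vote and the Wilson interval, are standard and can be sketched briefly. The function names and the boolean vote encoding are assumptions for this example; the study's own analysis scripts may differ.

```python
import math
from typing import Sequence, Tuple


def majority_vote(votes: Sequence[bool]) -> bool:
    """True if more than half of an odd number of adjudicators marked the pair as supported."""
    if len(votes) % 2 == 0:
        raise ValueError("an odd number of votes is required")
    return sum(votes) > len(votes) / 2


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion; the default z gives a 95% interval."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half_width), min(1.0, centre + half_width))
```

For example, `wilson_interval(50, 100)` gives roughly (0.404, 0.596). Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] even when the observed rate is near 0% or 100%.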
The paper's seven LLMs were evaluated on their own citation outputs. Skippy is evaluated on its own citation output — BeliefV1 node IDs resolved through the provenance graph to primary source text. The rubric is identical: does the cited source actually support the statement?
Skippy is not getting a Skippy-friendly rubric. The difference is the citation target — Skippy cites a structured belief with lineage rather than a URL. The physician adjudicators see the resolved source text, not the internal node ID. The comparison is apples-to-apples.
Results will be published in full, including methodology, a side-by-side comparison table, and confidence intervals, regardless of outcome. A whitepaper has been submitted to NEJM AI for editorial consideration. We will not publish selectively.
| Model | Stmt support | Resp support | Citation valid |
|---|---|---|---|
| Skippy | active collection | active collection | active collection |
| GPT-4o + Web Search | paper | paper | paper |
| GPT-4o | paper | paper | paper |
| Gemini Pro | paper | paper | paper |
| Claude 3 | paper | paper | paper |
| Llama 3 | paper | paper | paper |
| Mistral | paper | paper | paper |
| Perplexity | paper | paper | paper |
Baseline numbers from Stanford SourceCheckup (Nature Communications, Apr 2025, DOI 10.1038/s41467-025-58551-6). The paper reports 50–90% of responses across 7 LLMs were not fully supported by their own cited sources. The best-performing model achieved ~50% response-level support. Skippy results publish Q3 2026.
skippy-transparency
Merkle tree transparency log and cryptographic response signing
Every signed Skippy response is produced by skippy-transparency — the server-side signing service that maintains the append-only Merkle tree and issues Ed25519-signed responses. skippy-verify (the open-source client) verifies what skippy-transparency produces. Together they form a complete, auditable chain from response issuance to independent third-party verification.
Once a response is issued, verification requires no Skippy infrastructure. A response from 2026 can be verified in 2030 using the public key archived at the time of signing. This is the property FDA SaMD PCCP and EU AI Act Article 13 require.
A CMS audit of a prior authorization decision built on Skippy would proceed as follows: the auditor downloads the response JSON, runs skippy-verify response.json locally, and receives an independent PASS/FAIL with the Merkle inclusion proof and Sigstore Rekor timestamp — no Skippy access required, no cooperation required, no ability for Skippy to alter the record after the fact.
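The Merkle inclusion proof in that audit flow rests on a simple idea: a verifier recomputes the tree root from the response and a short list of sibling hashes. The sketch below is a generic illustration using domain-separated SHA-256; it is not the skippy-verify implementation or its wire format, and Ed25519 signature checking is omitted because it is not available in the Python standard library. All function names are hypothetical.

```python
import hashlib
from typing import List, Tuple


def _leaf_hash(data: bytes) -> bytes:
    return hashlib.sha256(b"\x00" + data).digest()  # 0x00 prefix marks a leaf


def _node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()  # 0x01 prefix marks an inner node


def build_levels(leaves: List[bytes]) -> List[List[bytes]]:
    """Build every tree level, leaf hashes first; an unpaired node is promoted unchanged."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [_leaf_hash(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                nxt.append(_node_hash(level[i], level[i + 1]))
            else:
                nxt.append(level[i])
        level = nxt
        levels.append(level)
    return levels


def inclusion_proof(levels: List[List[bytes]], index: int) -> List[Tuple[str, bytes]]:
    """Collect (side, sibling_hash) pairs from the leaf up to the root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling < len(level):
            side = "L" if sibling < index else "R"
            proof.append((side, level[sibling]))
        index //= 2
    return proof


def verify_inclusion(leaf: bytes, proof: List[Tuple[str, bytes]], root: bytes) -> bool:
    """Recompute the root from the leaf and its proof, then compare with the published root."""
    current = _leaf_hash(leaf)
    for side, sibling in proof:
        current = _node_hash(sibling, current) if side == "L" else _node_hash(current, sibling)
    return current == root
```

Verification needs only the response bytes, the proof, and the published root, which is why it can run offline on an auditor's machine.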
Enforcement, not monitoring
The PCCP gate history is a production audit trail logged with dataset version, gate thresholds, and pass/fail result. The traceability table above maps each gate to its PCCP section and audit log field — structured for direct inclusion in a pre-submission package.
Under the EU AI Act, Article 15 requires documented accuracy monitoring with predefined thresholds, Article 13 requires technical documentation of how outputs are derived, and Article 9 requires an ongoing risk management system. ECE gates, severity drift checks, and the cryptographic audit trail satisfy all three by construction, not by policy.
When your procurement checklist asks 'how do you maintain accuracy?' — this is the answer. Not a process description. An open-source CLI you can run against Skippy's API yourself, producing the same exit codes and PCCP PDF that feed our production pipeline.
The validation gate is open-source under Apache 2.0. You can install skippy-pccp and run it against Skippy's API from your own environment; you receive the same exit codes, the same JSON gate output, and the same PDF report that our production CI pipeline uses. No cooperation required from Skippy. No trust required in our reports.
Questions about the validation infrastructure?
We can walk through the PCCP gate history, the calibration methodology, and what the audit logs look like for your specific regulatory context.