TruthNexus
Validation Infrastructure

The enforcement layer behind the claims

Every accuracy claim Skippy makes is backed by tooling that enforces it rather than merely monitoring or reporting on it. Three open-source packages enforce the deployment gate. Separately, an independent study validates Skippy against an external, peer-reviewed standard we did not design.

0.07
ECE — Drug Interaction · n=624
95% CI: 0.04–0.10 · target < 0.10
5 / 5
Gates passing · May 2026
Apache 2.0
All tools open source
Quick read
· 5/5 PCCP gates pass · any failure halts deployment
· ECE 0.07 · 707-item clinical study · PharmD gold standard · κ = 0.81
· All tools open-source (Apache 2.0) · run locally, no Skippy access required
· HIPAA ready · GDPR compliant · SOC 2 Type II in audit Q4 2026
How They Connect
skippy-eval
Runs 40+ domain-specific test sets against the API
skippy-ece-benchmark
707-item calibration study produces ECE score
skippy-pccp
Reads eval output, runs 5 FDA gates
Deploy
Proceeds only if all 5 gates pass (exit code 0)

None of these tools are post-hoc dashboards. They are pre-deploy gates. A code change that degrades calibration, introduces severity drift, or increases retracted-source passthrough cannot be deployed — the gate runner exits non-zero and blocks the pipeline.

Open Source · Apache 2.0

skippy-pccp

FDA SaMD Predetermined Change Control Plan gate runner

FDA guidance lets the manufacturer of Software as a Medical Device (SaMD) include a Predetermined Change Control Plan (PCCP) in its marketing submission. The plan specifies the modifications the manufacturer intends to make and the acceptance criteria each change must meet before it is deployed. skippy-pccp automates those criteria as gates: every code change runs the full suite, and deployment is blocked on failure.

CLI
$ skippy-pccp run --format json
# Output
{
  "gates": [
    { "id": "Gate 1", "name": "Calibration (ECE)", "result": "PASS", "ece": 0.07 },
    { "id": "Gate 2", "name": "CONTRAINDICATED Preservation", "result": "PASS" }
  ],
  "overall": "PASS",
  "deployable": true
}
exit 0 — all gates pass, deploy proceeds
exit 1 — gate failure, deploy blocked
exit 2 — runner error, deploy blocked
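A CI step could consume these exit codes along the following lines. This is a minimal sketch, not part of skippy-pccp: the function names `interpret_exit_code` and `run_gates` are hypothetical, and the only things taken from the text above are the three documented exit codes and the `skippy-pccp run --format json` invocation.

```python
import subprocess

# Exit codes as documented above; any other value is treated as a block.
EXIT_MEANINGS = {
    0: "all gates pass, deploy proceeds",
    1: "gate failure, deploy blocked",
    2: "runner error, deploy blocked",
}


def interpret_exit_code(code):
    """Return (deployable, reason). Only exit code 0 permits deployment."""
    reason = EXIT_MEANINGS.get(code, f"unexpected exit code {code}, deploy blocked")
    return code == 0, reason


def run_gates(command=("skippy-pccp", "run", "--format", "json")):
    """Run the gate command and fail closed if it cannot be started."""
    try:
        completed = subprocess.run(list(command), capture_output=True, text=True)
    except OSError as exc:  # for example, the CLI is not installed on this runner
        return False, f"could not start gate runner: {exc}"
    return interpret_exit_code(completed.returncode)
```

The wrapper fails closed: a missing binary or an unknown exit code is treated the same as a gate failure, which matches the "any failure blocks deployment" rule.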
Five gates · All required · Any failure blocks deployment
Gate 1

Calibration (ECE)

ECE < 0.10 · blocks deploy

Expected Calibration Error must stay below 0.10 on the held-out drug interaction test set. Measures whether stated confidence tracks observed accuracy.

Input source
skippy-ece-benchmark output
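For readers unfamiliar with the metric, Expected Calibration Error is commonly computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence in each bin, weighted by bin size. The sketch below uses ten equal-width bins and binary correct/incorrect labels; both are assumptions, since the binning used by skippy-ece-benchmark is not stated here, and the function names are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE: sum over bins of weight * |accuracy - mean confidence|."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes in the last bin rather than overflowing.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, 1.0 if ok else 0.0))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


def gate_1_passes(ece, threshold=0.10):
    """Gate 1 as stated above: ECE must be strictly below the threshold."""
    return ece < threshold
```

A model that says 0.8 on four items and gets two right has a 0.3 gap in that bin, so a set made up only of such items would fail the gate.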
Gate 2

CONTRAINDICATED Preservation

< 10% failure rate · blocks deploy

The highest-severity interaction class must be classified correctly on the golden set. A CONTRAINDICATED pair classified as safe is a patient event — this gate cannot be overridden.

Input source
Clinical pharmacist golden set
Gate 3

Severity Drift

No ≥ 2-level downgrades · ≤ 5% one-level · blocks deploy

Measures whether a code change introduces systematic severity downgrades vs. the prior passing run. A MAJOR interaction classified as MINOR after a code change is a regression.

Input source
Versioned golden comparison
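The Gate 3 thresholds can be illustrated with a small comparison of two runs. This is a sketch under stated assumptions: the tier list is hypothetical (the text names CONTRAINDICATED, MAJOR and MINOR; MODERATE and NONE stand in for the remaining two of the five tiers), and the input format, a mapping from item id to severity label, is invented for the example.

```python
# Hypothetical tier order, most to least severe. MODERATE and NONE are
# placeholders; only CONTRAINDICATED, MAJOR and MINOR appear in the text.
SEVERITY_ORDER = ["CONTRAINDICATED", "MAJOR", "MODERATE", "MINOR", "NONE"]
RANK = {name: position for position, name in enumerate(SEVERITY_ORDER)}


def severity_drift(prior, current, max_one_level_rate=0.05):
    """Compare two runs keyed by item id and apply the Gate 3 thresholds."""
    shared = sorted(set(prior) & set(current))
    if not shared:
        raise ValueError("no items in common between the two runs")
    one_level = 0
    multi_level = 0
    for item in shared:
        # A positive drop means the current run is less severe than the prior run.
        drop = RANK[current[item]] - RANK[prior[item]]
        if drop >= 2:
            multi_level += 1
        elif drop == 1:
            one_level += 1
    one_level_rate = one_level / len(shared)
    return {
        "multi_level_downgrades": multi_level,
        "one_level_rate": one_level_rate,
        "passed": multi_level == 0 and one_level_rate <= max_one_level_rate,
    }
```

Upgrades (a pair moving to a more severe tier) are not counted, because the gate as described targets downgrades only.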
Gate 4

Calibration Drift

ECE change ≤ 0.02 vs. prior run · blocks deploy

Even if absolute ECE passes Gate 1, a large jump in ECE relative to the prior release indicates a calibration regression introduced by the change.

Input source
Diff against prior PCCP run
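How the absolute and relative checks interact is easiest to see side by side. The sketch below assumes that "ECE change" means an increase relative to the prior run (improvements never fail the gate); the function name and return shape are hypothetical.

```python
def check_calibration(current_ece, prior_ece, absolute_limit=0.10, max_increase=0.02):
    """Apply Gate 1 (absolute) and Gate 4 (relative to the prior run) together."""
    # Round the difference so floating point noise cannot flip the result at
    # the boundary (0.10 - 0.08 is not exactly 0.02 in binary floating point).
    delta = round(current_ece - prior_ece, 6)
    return {
        "delta": delta,
        "gate_1": current_ece < absolute_limit,
        "gate_4": delta <= max_increase,
    }
```

With a prior ECE of 0.07, a new ECE of 0.095 passes Gate 1 but fails Gate 4, which is the case Gate 4 exists to catch.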
Gate 5

Retraction Monitoring

< 5% retracted-claim passthrough · blocks deploy

Validates that the ingest pipeline's retraction detection is still working. Sourced from retraction_summary.json produced by skippy-ingest.

Input source
skippy-ingest retraction summary
What a gate failure looks like

If a code change degrades calibration — say ECE rises from 0.07 to 0.12 — deployment is immediately blocked. The CI pipeline receives exit 1, the run is logged with the failing value and timestamp, and no deployment proceeds until all 5 gates pass in a fresh run.

This is not a flag or a review queue; it is a hard stop. Skippy Auth cannot ship a version that fails Gate 2 (CONTRAINDICATED Preservation), because that gate directly protects against a patient safety event.

Example: Gate 1 failure output
$ skippy-pccp run --format json
{
  "gates": [
    {
      "id": "Gate 1",
      "name": "Calibration (ECE)",
      "result": "FAIL",
      "ece": 0.12,
      "threshold": "< 0.10",
      "delta_from_prior": 0.05
    }
  ],
  "overall": "FAIL",
  "deployable": false
}
exit 1 — CI pipeline blocked · deploy halted
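A pipeline that wants more detail than the exit code can read the JSON report. The sketch below assumes the report has the shape of the passing example shown earlier (a "gates" list whose entries carry "id" and "result", plus top-level "overall" and "deployable" fields); the helper names are hypothetical and the real schema may contain more fields.

```python
import json


def failing_gates(report_text):
    """Return the ids of gates whose result is anything other than PASS."""
    report = json.loads(report_text)
    return [
        gate.get("id", "?")
        for gate in report.get("gates", [])
        if gate.get("result") != "PASS"
    ]


def is_deployable(report_text):
    """Fail closed: deploy only when every field agrees that the run passed."""
    try:
        report = json.loads(report_text)
    except json.JSONDecodeError:
        return False
    if not isinstance(report, dict):
        return False
    return (
        report.get("overall") == "PASS"
        and report.get("deployable") is True
        and bool(report.get("gates"))
        and not failing_gates(report_text)
    )
```

Checking all three signals guards against a truncated or hand-edited report in which the summary fields and the per-gate results disagree.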
FDA PCCP Traceability — Gate → Data Source → PCCP Section → Audit Field
Gate | Threshold | Data source | PCCP section | Audit log field
Gate 1: ECE | < 0.10 | skippy-ece-benchmark DDI (n=624) | Modification Protocol §3.1: Calibration | gate_1_ece
Gate 2: CONTRAINDICATED | < 10% failure | PharmD golden set · κ=0.81 | Modification Protocol §3.2: Safety | gate_2_contraindicated_failure_rate
Gate 3: Severity Drift | No ≥2-level downgrades | Versioned golden comparison | Impact Assessment §4.1: Regression | gate_3_severity_drift
Gate 4: Calibration Drift | ECE Δ ≤ 0.02 | Diff vs prior PCCP run | Impact Assessment §4.2: Drift | gate_4_ece_delta
Gate 5: Retraction | < 5% passthrough | skippy-ingest retraction_summary.json | Modification Protocol §3.3: Data | gate_5_retraction_passthrough
Structured to match FDA PCCP marketing submission guidance (2023). Table exportable from skippy-pccp run --format pdf for inclusion in SaMD pre-submission packages.
Which gate protects which product
Gate | Products | Rationale
Gate 1: ECE | All products | A model with ECE ≥ 0.10 cannot be deployed to any Skippy endpoint. Calibration is the baseline safety property the entire platform requires.
Gate 2: CONTRAINDICATED | Skippy Auth, Skippy DDI, Skippy Med-Check | CONTRAINDICATED preservation is specifically required before any prior-authorization or drug-interaction product can ship. A Gate 2 failure means Skippy Auth cannot be deployed, period.
Gate 3: Severity Drift | Skippy Auth, Skippy Variants, Skippy Rare | Systematic severity downgrades in any release are blocked. A new model version that reclassifies MAJOR drug interactions as MINOR cannot reach the API layer.
Gate 4: Calibration Drift | All products | Even a passing absolute ECE can mask a regression. Gate 4 blocks releases where ECE degrades more than 0.02 from the prior passing run, regardless of the absolute value.
Gate 5: Retraction | All products | Skippy ingest must catch retractions before they reach the verifier. A passthrough rate ≥ 5% means the ingest pipeline has failed, and the entire deployment is blocked until the retraction list is re-applied.
Report formats
--format json | JSON | Machine-readable, CI/CD integration
--format jsonl | JSONL | Streaming, line-by-line gate results
--format pdf | PDF | FDA submission-ready report
Open Source · Apache 2.0

skippy-eval

Domain-specific benchmark and evaluation harness

skippy-eval is the evaluation harness that runs purpose-built test sets against each Skippy product domain. It produces structured output — IS-SUP and Self-RAG scores, per-item audit logs, and aggregate metrics — that feeds into the PCCP gate runner. The harness covers 40+ verticals, each with its own gold-standard labels and domain-specific scoring criteria.

CLI
$ skippy-eval run --dataset ddi
$ skippy-eval run --dataset rare-disease
$ skippy-eval run --dataset federal-contracting
$ skippy-eval run --all --output results/
# Scorers
$ skippy-eval score --scorer is-sup --results results/ddi.json
$ skippy-eval score --scorer self-rag --results results/ddi.json
IS-SUP scorer

Measures source support quality — whether the evidence cited by the response actually supports the claims made. Produces a per-sentence IS-SUP score and an aggregate for the full response.

// Per-item result
{
"is_sup": 0.92,
"unsupported_spans": 0,
"sources_checked": 4
}
Self-RAG scorer

Implements the Self-RAG evaluation protocol — measuring retrieval quality, relevance of retrieved passages to the query, and faithfulness of the generated output to those passages.

// Per-item result
{
"retrieval_score": 0.88,
"faithfulness": 0.94,
"relevance": 0.91
}
Domain coverage — 40+ vertical test sets
Medical
  • Drug interactions
  • Drug indications
  • Retraction detection
  • Rare disease
  • Pharmacogenomics
  • Prior auth
  • Medical coding
  • Pharmacovigilance
  • Biomarker validation
Legal & Compliance
  • Estate law
  • Tax compliance
  • Federal contracting
  • Insurance coverage
  • Audit & waste
  • Content moderation
  • Children's privacy
  • Crypto compliance
  • TPRM
Industry Safety
  • Aviation safety
  • Maritime compliance
  • Transportation (HOS/HazMat)
  • Food safety
  • Construction safety
  • Mining compliance
  • Environmental (ESG)
  • Animal welfare
Financial & Government
  • Banking compliance
  • Investment advisor
  • Real estate
  • Election finance
  • Government programs
  • Nonprofit compliance
  • Trade compliance
Audit-grade output

Every eval run produces JSON Schema-validated audit logs — immutable records of the dataset version, scorer version, and per-item scores at the time of the run. These logs are the primary input to skippy-pccp Gate 1 and can be submitted as part of an FDA SaMD performance documentation package.

Open Source · Apache 2.0

skippy-ece-benchmark

707-item confidence calibration study

The benchmark dataset and evaluation scripts behind the ECE < 0.10 claim. A 707-item clinical study across drug interaction classification, retracted-paper detection, and Cochrane systematic review alignment — each with gold-standard labels from a clinical pharmacist. The output feeds directly into PCCP Gate 1.

Drug Interaction Classification
Gate 1 input
n = 624
ECE 0.07 · AUROC 0.91

500 drug pairs from FDA labels, manually labeled. Five severity tiers. Gate threshold: ECE < 0.10.

Drug Indication
Gate 1 input
n = 360
ECE 0.09 · AUROC 0.87

Drug-disease associations. Cochrane systematic review outcomes as gold standard. Gate threshold: ECE < 0.10.

Retracted Paper Detection
Gate 5 input
n = 100
> 95% detection rate

100 biomedical claims from retracted papers. Retraction Watch database as gold standard. Gate threshold: < 5% passthrough.

Calibration trajectory — all runs, none suppressed
Run | ECE (DDI) | AUROC | Gate 1 | Deploy
Q4 2025 · v1.0 | 0.08 | 0.89 | PASS | PASSED
Q2 2026 · v1.1 (current) | 0.07 | 0.91 | PASS | PASSED
Q4 2026 · v1.2 (target) | < 0.06 | > 0.92 | TARGET |
Full benchmark methodology →
Includes 95% CIs, confidence interpretation table, and SourceCheckup comparison
In Progress · Q2–Q3 2026

SourceCheckup Self-Evaluation

Skippy scored against Stanford's peer-reviewed citation accuracy rubric

The Stanford SourceCheckup study (Nature Communications, April 2025) found that 50–90% of medical AI responses are not fully supported by their own cited sources — across seven frontier LLMs including GPT-4o. We are replicating the exact same methodology on Skippy. Same 800-question set. Same physician-adjudicated rubric. Same comparison baselines. A standard we did not design.

800 questions
Mayo Clinic Q&A + Reddit r/AskDocs

The paper's headline evaluation set — stratified by specialty and complexity. Skippy's results will be published against the same subset for direct comparability.

3 independent physicians
Fleiss' κ ≥ 0.60 target

Each (statement, cited source) pair adjudicated independently. Majority vote of 3. Inter-rater agreement reported alongside the headline results — matching the paper's achieved 86.1% raw agreement.
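The two statistics named here, majority vote and Fleiss' κ, are standard and can be computed in a few lines. The sketch below is a generic implementation, not the study's analysis code; the label strings and the input layout (one list of rater labels per adjudicated pair) are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(labels):
    """Most common label for one item. With 3 raters and 2 labels there are no ties."""
    return Counter(labels).most_common(1)[0][0]


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same number of raters."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Proportion of rater pairs that agree on this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += agreeing / (n_raters * (n_raters - 1))
    observed = agreement_sum / n_items
    expected = sum(
        (total / (n_items * n_raters)) ** 2 for total in category_totals.values()
    )
    if expected == 1.0:
        # Only one category was ever used; kappa is undefined, and this
        # sketch chooses to report it as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Raw percent agreement and κ answer different questions: κ discounts the agreement expected by chance, which is why both are reported.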

Head-to-head vs 7 LLMs
Same baselines as the paper

GPT-4o + Web Search and the six other models from the original study. Same rubric: statement-level support rate, response-level support rate, and citation validity — all with 95% Wilson confidence intervals.
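The Wilson interval mentioned here is a standard confidence interval for a proportion such as a support rate. A minimal sketch follows; the function name is hypothetical and the default z value corresponds to a two-sided 95% interval.

```python
import math


def wilson_interval(successes, trials, z=1.959963984540054):
    """Wilson score interval for a binomial proportion (default z gives 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    denom = 1 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)
    )
    # Clamp to [0, 1] to absorb floating point noise at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)
```

For 50 supported responses out of 100 this gives an interval of roughly 0.40 to 0.60. Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly when the observed rate is near 0 or 1.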

Why this is a fair test

The paper's seven LLMs were evaluated on their own citation outputs. Skippy is evaluated on its own citation output — BeliefV1 node IDs resolved through the provenance graph to primary source text. The rubric is identical: does the cited source actually support the statement?

Skippy is not getting a Skippy-friendly rubric. The difference is the citation target — Skippy cites a structured belief with lineage rather than a URL. The physician adjudicators see the resolved source text, not the internal node ID. The comparison is apples-to-apples.

Publication commitment — active data collection underway, May 2026

Results will be published in full, including methodology, side-by-side comparison table, and confidence intervals, regardless of outcome. A whitepaper has been submitted to NEJM AI for editorial consideration. We will not publish selectively.

·Full side-by-side comparison table vs 7 LLM baselines — Q3 2026
·95% Wilson confidence intervals on all three metrics
·Fleiss' κ and raw percent agreement reported (matching paper standards)
·Methodology and adjudication rubric published for external replication
·Quarterly CI regression gate enforced automatically post-publication
What the comparison table will show (Skippy column publishes Q3 2026)
Model | Stmt support | Resp support | Citation valid
Skippy | active collection | active collection | active collection
GPT-4o + Web Search | paper | paper | paper
GPT-4o | paper | paper | paper
Gemini Pro | paper | paper | paper
Claude 3 | paper | paper | paper
Llama 3 | paper | paper | paper
Mistral | paper | paper | paper
Perplexity | paper | paper | paper

Baseline numbers from Stanford SourceCheckup (Nature Communications, Apr 2025, DOI 10.1038/s41467-025-58551-6). The paper reports 50–90% of responses across 7 LLMs were not fully supported by their own cited sources. The best-performing model achieved ~50% response-level support. Skippy results publish Q3 2026.

Signing Infrastructure

skippy-transparency

Merkle tree transparency log and cryptographic response signing

Every signed Skippy response is produced by skippy-transparency — the server-side signing service that maintains the append-only Merkle tree and issues Ed25519-signed responses. skippy-verify (the open-source client) verifies what skippy-transparency produces. Together they form a complete, auditable chain from response issuance to independent third-party verification.

Append-only Merkle tree
Every response is appended to a Merkle tree. The root is published and can be independently verified — no Skippy infrastructure required after the fact.
Ed25519 response signing
Each response is signed with an Ed25519 key from the public key registry. Key rotation happens on a 90-day cadence; revocation is propagated through skippy-keys.
Sigstore Rekor integration
Transparency log roots are optionally submitted to Sigstore Rekor — a public, append-only transparency log. Provides an independent third-party timestamp and inclusion proof.
SHA-256 hash chain
Responses are SHA-256 hash-chained in an append-only audit log — tamper-evident by construction. Any modification to a logged response invalidates the chain from that point forward.
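The tamper-evidence property of a hash chain can be demonstrated in a few lines. This is an illustrative sketch only: the entry layout, the all-zero genesis value, and the JSON canonicalisation are assumptions, not the skippy-transparency log format.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical starting value for the first entry


def entry_hash(previous_hash, payload):
    """Hash the previous entry's hash together with a canonical form of the payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((previous_hash + canonical).encode("utf-8")).hexdigest()


def append(log, payload):
    """Append a payload, linking it to the hash of the entry before it."""
    previous = log[-1]["hash"] if log else GENESIS
    log.append({"payload": payload, "hash": entry_hash(previous, payload)})


def first_broken_index(log):
    """Return the index of the first entry whose hash does not verify, or None."""
    previous = GENESIS
    for index, entry in enumerate(log):
        if entry["hash"] != entry_hash(previous, entry["payload"]):
            return index
        previous = entry["hash"]
    return None
```

Editing a logged payload makes its stored hash wrong; recomputing that hash to hide the edit then breaks the link to the next entry, so the inconsistency moves forward but cannot be removed without rewriting every later entry.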
How signing and verification connect
Server
skippy-transparency signs response → appends to Merkle tree → embeds signature block in response
response.json delivered to caller
Client
skippy-verify reads signature block → looks up key via skippy-keys → validates Ed25519 signature + Merkle root
verification result: PASS or FAIL with specific failure reason
Auditor
Optional: cross-reference root against Sigstore Rekor public log — independent third-party timestamp
Key point

Once a response is issued, verification requires no Skippy infrastructure. A response from 2026 can be verified in 2030 using the public key archived at the time of signing. This is the property FDA SaMD PCCP and EU AI Act Article 13 require.

Concrete audit scenario

A CMS audit of a prior authorization decision built on Skippy would proceed as follows: the auditor downloads the response JSON, runs skippy-verify response.json locally, and receives an independent PASS/FAIL with the Merkle inclusion proof and Sigstore Rekor timestamp — no Skippy access required, no cooperation required, no ability for Skippy to alter the record after the fact.
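To make the Merkle inclusion proof in this scenario concrete, the sketch below builds a small tree and checks that one leaf is included under a known root using only the sibling hashes. It is a simplified illustration: the tree shape (an unpaired node is promoted unchanged), the leaf and node prefixes, and all function names are assumptions, not the proof format that skippy-verify consumes.

```python
import hashlib


def _leaf(data):
    # Distinct prefixes keep leaf hashes and interior hashes from colliding.
    return hashlib.sha256(b"\x00" + data).digest()


def _node(left, right):
    return hashlib.sha256(b"\x01" + left + right).digest()


def build_levels(leaves):
    """Return all tree levels, leaf hashes first, root level last."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = [_leaf(item) for item in leaves]
    levels = [level]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                nxt.append(_node(level[i], level[i + 1]))
            else:
                nxt.append(level[i])  # an unpaired node is promoted unchanged
        level = nxt
        levels.append(level)
    return levels


def inclusion_proof(levels, index):
    """Sibling hashes from leaf to root, each tagged with the sibling's side."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling < len(level):
            side = "left" if sibling < index else "right"
            proof.append((side, level[sibling]))
        index //= 2
    return proof


def verify_inclusion(data, proof, root):
    """Recompute the root from one leaf and its proof, then compare."""
    current = _leaf(data)
    for side, sibling in proof:
        current = _node(sibling, current) if side == "left" else _node(current, sibling)
    return current == root
```

The verifier needs only the response bytes, the proof, and the published root, which is why the check can run without access to the log operator.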

What this means

Enforcement, not monitoring

For FDA SaMD submissions
21 CFR Part 820 · QMS | PCCP §3 Modification Protocol | PCCP §4 Impact Assessment

The PCCP gate history is a production audit trail logged with dataset version, gate thresholds, and pass/fail result. The traceability table above maps each gate to its PCCP section and audit log field — structured for direct inclusion in a pre-submission package.

For EU AI Act compliance
Article 9: Risk Management | Article 13: Transparency | Article 15: Accuracy & Robustness

Article 15 requires documented accuracy monitoring with predefined thresholds. Article 13 requires technical documentation of how outputs are derived. Article 9 requires an ongoing risk management system. ECE gates, severity drift checks, and the cryptographic audit trail satisfy all three — by construction, not by policy.

For enterprise procurement
skippy-pccp: open source (Apache 2.0) | skippy-verify: run yourself | PCCP PDF: submission-ready

When your procurement checklist asks 'how do you maintain accuracy?' — this is the answer. Not a process description. An open-source CLI you can run against Skippy's API yourself, producing the same exit codes and PCCP PDF that feed our production pipeline.

Request vendor validation package →
Run it yourself — you don't need our infrastructure or our permission

The validation gate is open-source under Apache 2.0. You can install skippy-pccp and run it against Skippy's API from your own environment. You receive the same exit codes, the same JSON gate output, and the same PDF report that our production CI pipeline uses. No cooperation required from Skippy. No trust required in our reports.

# Install and run the PCCP gate suite yourself
$ pip install skippy-pccp
$ skippy-pccp run --api-key $YOUR_KEY --format json
# Verify a specific signed response
$ pip install skippy-verify
$ skippy-verify response.json
Certification status
HIPAA Ready · BAA Available | GDPR Compliant | SOC 2 Type II · In audit · Q4 2026 | ISO 42001 · Gap assessment complete | FDA SaMD · PCCP documented | EU AI Act Article 15 · Controls documented | 7-Year Audit Retention

Questions about the validation infrastructure?

We can walk through the PCCP gate history, the calibration methodology, and what the audit logs look like for your specific regulatory context.