The architecture behind the claim
Three things that distinguish Skippy from every other AI system in this space — and why having one without the others gets you something weaker than what already ships.
Verifier-gated output
Every response passes through a verifier before it reaches the user. If the output cannot be grounded in evidence, it is rejected — not hedged, not softened, rejected. The verifier adds less than 5ms of latency and produces a cryptographically versioned audit record on every call.
Most AI systems run a post-hoc confidence score or add a disclaimer. Skippy's verifier is upstream of the response, not downstream. It is the difference between a safety gate and a warning label.
Calibrated confidence
Every finding in Skippy carries a confidence score validated against held-out data — not generated by the model itself. Skippy distinguishes convergent evidence from contested findings from sparse data, and surfaces that distinction explicitly in every output.
LLM confidence is self-reported: the model says it's confident because that's what the training signal rewarded. Skippy's confidence is a measurement, not a posture.
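One common way to measure calibration of this kind is expected calibration error (ECE): bin held-out predictions by stated confidence, then compare each bin's average confidence with its observed accuracy. The sketch below is illustrative only. The function name, the ten equal-width bins, and the input format are assumptions, not Skippy's actual validation code.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted mean gap between stated confidence and observed accuracy."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        raise ValueError("need at least one held-out example")

    # Group held-out examples into equal-width confidence bins.
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes its gap, weighted by its share of the examples.
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

A score near zero means stated confidence tracks observed accuracy. A model that says 0.9 but is right only half the time scores about 0.4.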
Auditable to the source
The citation target in Skippy is a verified, versioned finding — not a URL, not a document, not a paragraph chunk. Each finding carries its source lineage, confidence tier, and the evidence that produced it. Every downstream product inherits this audit trail by construction.
Frontier systems cite documents. Stanford SourceCheckup (Nature Communications, Apr 2025) found that 50–90% of medical AI responses are not fully supported by their own cited sources. Citing a verified finding is a different claim from citing a document that might support it.
What happens when you call the API
Every API call follows the same deterministic path. No request bypasses the verifier. No response is issued without an audit record.
If the verifier fires a critical or error violation, the request is blocked at that stage — no response is issued, the error is returned to the caller, and the failure is logged with the specific violation code and evidence that triggered it.
What happens when evidence is absent, contested, or uncertain
Most AI systems express uncertainty as hedging language. Skippy expresses it as structured output — the same contract every time.
| Scenario | General AI | Skippy |
|---|---|---|
| Evidence is absent | Hallucination or "I'm not sure" with no precision | Returns NOT_COVERED — explicit knowledge boundary, no confabulation |
| Evidence is contested | Model picks a side or averages conflicting signals | Returns both positions with quantified confidence divergence and source lineage for each side |
| Confidence is low | "May", "might", "could" — hedges that vary by phrasing | Returns evidence_pattern: "sparse" with calibrated confidence score below threshold |
| Any call is made | Log entry, if any | Immutable audit record: cryptographic hash, sources used, confidence at time of call |
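Because the uncertainty arrives as fields rather than phrasing, a caller can branch on it directly. A minimal sketch, assuming the response is available as a dictionary: `NOT_COVERED` and `evidence_pattern: "sparse"` come from the table above, while the `status` key, the `"contested"` value, the returned action names, and the `route_response` helper are hypothetical names chosen for illustration.

```python
from typing import Any


def route_response(response: dict[str, Any]) -> str:
    """Decide how a client might treat a response, based on structured fields."""
    # Explicit knowledge boundary: there is no answer to show, so escalate.
    if response.get("status") == "NOT_COVERED":
        return "escalate_to_human"

    pattern = response.get("evidence_pattern")
    if pattern == "contested":
        # Present every position with its own confidence instead of picking one.
        return "show_all_positions"
    if pattern == "sparse":
        return "show_with_low_confidence_notice"
    return "show_answer"
```

The point of the design is that the branch conditions are stable values, so client behavior does not depend on how a sentence happened to be worded.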
What Skippy cites vs. what everyone else cites
The citation target is a verified, versioned finding — not a document, not a paragraph, not a URL. Every finding carries its confidence score, source lineage, and the evidence that produced it.
This distinction matters for regulatory submissions, clinical audit trails, and any workflow where "the AI cited a source" is not sufficient — you need to know what the source actually supports.
What the verifier actually checks
"Verifier-gated" is not a single binary test. The verifier evaluates every cited span in the output against 11 specific failure modes, each with a severity level. Critical and error violations block the response. Warnings are written to the audit record. Info observations are logged only.
| Code | Severity | Trigger |
|---|---|---|
| V001 | error | Unsupported claim — cited evidence does not entail the output span |
| V002 | error | Uncited claim — output span makes a claim with no citation |
| V003 | warning | Irrelevant citation — evidence is technically accurate but irrelevant to the question |
| V004 | warning | Low-utility response — accurate and supported but does not address the user's decision |
| V005 | critical | Finding not found — cited finding ID does not resolve |
| V006 | critical | Retracted source — cited primary source carries a retraction notice |
| V007 | warning | Outdated lineage — source evidence older than domain freshness threshold |
| V008 | info | Contested evidence — cited finding's evidence pattern is contested; flagged for user awareness |
| V009 | warning | Below-threshold confidence — confidence score below minimum for the response context |
| V010 | critical | PII in response — output contains apparent patient-level identifiers |
| V011 | critical | Safety gate bypassed — a required safety check was not applied |
How each severity is handled:

| Severity | Effect |
|---|---|
| critical | Request blocked. Error returned to caller. Violation code and span logged. |
| error | Request blocked. Structured error response with code and evidence span. |
| warning | Request proceeds. Flag written to audit record. Not surfaced in end-user output. |
| info | Request proceeds. Observation logged only. No flag in output or audit record. |
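The severity rules above amount to a small dispatch: any critical or error violation blocks the response, and everything else lets it through. A minimal sketch, using the severity assignments from the taxonomy table. The `GateResult` type and the `gate` function are hypothetical and are not the `skippy-verifier` API.

```python
from dataclasses import dataclass, field

# Severities that stop a response, per the rules described above.
BLOCKING = {"critical", "error"}

# Severity for each violation code, copied from the taxonomy table.
SEVERITY = {
    "V001": "error", "V002": "error", "V003": "warning", "V004": "warning",
    "V005": "critical", "V006": "critical", "V007": "warning", "V008": "info",
    "V009": "warning", "V010": "critical", "V011": "critical",
}


@dataclass
class GateResult:
    allowed: bool
    blocking_codes: list[str] = field(default_factory=list)
    audit_flags: list[str] = field(default_factory=list)


def gate(violation_codes: list[str]) -> GateResult:
    """Block on critical or error violations; record warnings as audit flags."""
    result = GateResult(allowed=True)
    for code in violation_codes:
        severity = SEVERITY[code]  # a KeyError on an unknown code is deliberate
        if severity in BLOCKING:
            result.allowed = False
            result.blocking_codes.append(code)
        elif severity == "warning":
            result.audit_flags.append(code)
        # "info" observations add no flag here.
    return result
```

Collecting every blocking code, rather than stopping at the first, keeps the error returned to the caller complete.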
The verifier integration harness, audit-log JSON Schema, and violation taxonomy are available now as skippy-verifier — Apache 2.0, pluggable NLI backends, installable today.
What every product actually cites
The citation target is the difference. Citing a document is not the same as citing a verified finding. One passes a retrieval test. The other passes a regulatory audit.
| Product | What it cites | Evidence verified? | Lineage exposed? |
|---|---|---|---|
| GPT-4o / OpenAI | Document chunks returned by search | No | None |
| Gemini / Google | Web pages, document snippets | No | None |
| Claude / Anthropic | Retrieved documents, URLs | No | None |
| Perplexity | Web search results, page URLs | No | None |
| OpenEvidence | Indexed medical articles | Retrieval only — no cross-validation | None |
| Microsoft GraphRAG | Graph-extracted text chunks | No | None |
| Atropos Alexandria | Real-world evidence records | No — no calibration published | None |
| Skippy | Verified findings — not documents | Yes — calibrated against held-out data | Full: source ID, version date, confidence |
Assessed against public documentation as of May 2026.
Zero-retention. Audit-log only.
Skippy does not retain request content. Every audit record contains only metadata — no patient data, no query text, no response payload — hash-signed and stored for traceability.
Request content is not stored after the response is issued. No training on customer data.
Hash, timestamp, violation codes, finding IDs, confidence tier — no payload content.
Encrypted in transit and at rest. Audit logs are hash-signed and append-only.
BAA available. FedRAMP authorization in progress. Contact us for documentation.
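A metadata-only, tamper-evident record of the kind described above could look like the following. This is a sketch under stated assumptions: the field names, the use of SHA-256, and the chaining of each record to its predecessor's hash are illustrative choices, not the published audit-log schema. It also only hashes. A real signature would need a key, which is omitted here.

```python
import hashlib
import json
from datetime import datetime, timezone


def _digest(body: dict) -> str:
    # Hash a canonical serialization so the same body always gives the same digest.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_audit_record(
    prev_hash: str,
    violation_codes: list[str],
    finding_ids: list[str],
    confidence_tier: str,
) -> dict:
    """Build a record holding only metadata, linked to the previous record."""
    body = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "violation_codes": violation_codes,
        "finding_ids": finding_ids,
        "confidence_tier": confidence_tier,
        "prev_hash": prev_hash,
    }
    return {**body, "hash": _digest(body)}


def verify_chain(records: list[dict], genesis_hash: str = "0" * 64) -> bool:
    """Recompute every hash and check each record points at its predecessor."""
    expected_prev = genesis_hash
    for record in records:
        body = {k: v for k, v in record.items() if k != "hash"}
        if record["prev_hash"] != expected_prev or record["hash"] != _digest(body):
            return False
        expected_prev = record["hash"]
    return True
```

Chaining is what makes an append-only log checkable after the fact: altering or removing any earlier record changes a hash that a later record depends on.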
Three ways to integrate
Skippy is designed to fit existing workflows — not replace them. Choose the pattern that matches your deployment context.
POST to /v1/ground/verify inline with your existing response pipeline. Verdict + audit_id returned in the same call. The full call adds ~12ms end to end.
POST /v1/ground/verify
Authorization: Bearer <key>
Submit arrays of output spans for batch verification. Results returned as a structured report with per-span verdict, confidence, and lineage. No latency constraint.
POST /v1/ground/batch
{ spans: [...], context: ... }
Implements the CDS Hooks 2.0 spec. Drop into existing EHR hook endpoints without custom middleware. FHIR-compatible response format.
hook: patient-view
fhirServer: <endpoint>
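For the inline pattern, the request can be assembled with the Python standard library alone. A minimal sketch: the path `/v1/ground/verify` and the bearer-token header come from the snippet above, while the base URL `https://api.example.com`, the JSON body shape, and the `build_verify_request` helper are placeholders, since the request schema is not given here.

```python
import json
import urllib.request


def build_verify_request(
    base_url: str, api_key: str, spans: list[str]
) -> urllib.request.Request:
    """Assemble (but do not send) a POST to the inline verification endpoint."""
    payload = json.dumps({"spans": spans}).encode("utf-8")  # body shape is assumed
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/ground/verify",
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


request = build_verify_request(
    "https://api.example.com", "YOUR_KEY", ["An example output span."]
)
# To send it: urllib.request.urlopen(request, timeout=5)
```

Building the request separately from sending it keeps the example runnable offline and makes the headers easy to inspect in tests.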
Verification at production latency
The verifier adds less than 5ms. The full response — evidence retrieval, confidence scoring, source lineage, and audit record — averages under 12ms end-to-end.
Structured ontologies + empirical evidence
Skippy's medical knowledge is grounded in two distinct layers that reinforce each other. Formal ontology relationships, drawn from OBO Foundry ontologies, define what concepts mean and how they relate. Empirical evidence from literature and clinical data defines what the research shows. A query can walk both simultaneously, producing answers that neither layer could produce alone.
Symbolic AI projects like Cyc attempted the first layer, a rigorous structured knowledge base, but without the empirical grounding that makes such a base useful in practice. Skippy has both.
Formal biological process, molecular function, and cellular component relationships. Over 250K structured edges used in genomics and drug-mechanism queries.
Standardized disease classification with is-a and related-to hierarchies. Aligns with DisGeNET and PrimeKG upstream sources already in the evidence layer.
Phenotype-to-disease mappings with formal axioms. Enables rare-disease reasoning and Orphanet-linked clinical workflows.
Two epistemic regimes. One growing evidence base.
Most knowledge systems have one way findings enter the evidence base: ingest more data. Skippy has two. Every finding carries an epistemology property that records how it was produced — and the evidence base can be queried across both regimes simultaneously.
A claim supported by both a clinical trial result and a first-principles derivation from mechanism is a stronger claim than either alone. Skippy surfaces that distinction explicitly.
Produced by the 24-agent ingestion pipeline from 169.4M ingested source documents. Grounded in observed data — clinical trials, adverse event reports, genomic studies, systematic reviews. Confidence is calibrated against held-out data (ECE < 0.10).
Produced by the growth-era pipeline from axioms and formal definitions — first-principles reasoning that doesn't require a study to exist. 608 knowledge phases are ingest-ready today across mathematics, logic, pharmacology foundations, and biomedical ontologies.
When a finding is supported by both empirical evidence and a derivational reasoning chain, it is marked "both" — the strongest epistemic standing in the evidence base. Queries can filter to dual-grounded claims only, returning only what evidence and reason agree on.
The conductor runs as a background process — no human in the loop for knowledge acquisition. Every write is schema-validated. Every new finding runs through the same PCCP gates as the existing corpus before deployment.
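Filtering to dual-grounded claims, as described above, reduces to selecting on the epistemology property. A minimal sketch: the value `"both"` comes from the text, while the `Finding` class, its other fields, and the `"empirical"` and `"derivational"` labels are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    finding_id: str
    claim: str
    epistemology: str  # assumed values: "empirical", "derivational", "both"


def dual_grounded(findings: list[Finding]) -> list[Finding]:
    """Keep only findings supported by both evidence and derivation."""
    return [f for f in findings if f.epistemology == "both"]
```

Given three findings tagged `"empirical"`, `"derivational"`, and `"both"`, only the last one is returned.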
Want to go deeper?
We can walk through the architecture, the verifier integration, and what it means for your specific use case.