The Problem

Evidence-grade AI requires a different architecture

Large language models hallucinate with confidence in high-stakes domains. This is not a capability gap that will close with larger models or better prompting. It is architectural — a property of how transformer training works. The only solution is a different foundation.

The empirical evidence

“50–90% of medical AI responses are not fully supported by their own cited sources — sometimes contradicted by them.”

Stanford SourceCheckup, Nature Communications, April 2025 · 800 questions · ~58,000 statement-source pairs · 7 frontier LLMs including GPT-4o

Skippy: independent validation
ECE 0.07 calibration score · threshold < 0.10
κ = 0.81 PharmD inter-rater agreement
5 / 5 PCCP gates pass
Full methodology →

Five Structural Gaps

What LLMs and citation-adjacent systems cannot provide — by design

These are not bugs to be fixed in the next model release. They are properties of how transformer training works. Scaling does not fix them. RAG mitigates but does not solve them.

No provenance

When an LLM says "Drug X is contraindicated in renal failure," there is no traceable path to a source. The claim came from blended training data — millions of documents, indistinguishably mixed. There is no source document ID to show a regulator.

No lineage walk — even for systems that cite

RAG pipelines and paper-citation systems (OpenEvidence, ClinicalKey AI, ChatGPT for Healthcare) can surface a document reference. But citation is not lineage. Lineage means: follow the path from the specific output claim, through the evaluation steps, to the originating source — with a confidence score at every hop. No citation-adjacent system stores this path. The retrieval context exists only for the duration of the query and is discarded. Skippy's evidence lineage record is built during ingestion and persists — it exists before the query arrives.
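To make the distinction between a citation and a lineage walk concrete, here is a minimal sketch of what a persisted lineage record could look like. It is an illustration under assumptions, not Skippy's actual schema: `LineageNode`, `walk_lineage`, the node kinds, and every ID and score below are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LineageNode:
    """One hop in a hypothetical lineage record."""
    node_id: str
    kind: str                  # "claim", "evaluation", or "source"
    confidence: float          # confidence recorded for this hop
    parent_id: Optional[str]   # next hop toward the source; None at the source


def walk_lineage(store: dict[str, LineageNode], start_id: str) -> list[LineageNode]:
    """Follow parent links from an output claim back to its originating source."""
    path: list[LineageNode] = []
    seen: set[str] = set()
    current: Optional[str] = start_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in lineage at {current}")
        seen.add(current)
        node = store[current]
        path.append(node)
        current = node.parent_id
    return path


# Illustrative data only: these IDs and scores are made up.
store = {
    "claim-1": LineageNode("claim-1", "claim", 0.91, "eval-1"),
    "eval-1": LineageNode("eval-1", "evaluation", 0.94, "src-1"),
    "src-1": LineageNode("src-1", "source", 1.0, None),
}
path = walk_lineage(store, "claim-1")
for node in path:
    print(node.kind, node.node_id, node.confidence)
```

The point of the sketch is that the path is stored data that exists before any query, so walking it does not depend on a retrieval context that has already been discarded.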

No calibration

LLM confidence is expressed through hedging language, not evidence. A claim backed by 1,000 randomized trials and a claim backed by one case report sound equally confident. There is no way to tell them apart from the output.
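For readers unfamiliar with the ECE figure quoted earlier, the sketch below shows the standard equal-width-bin definition of expected calibration error: the weighted average gap between stated confidence and observed accuracy. The function name, the bin count, and the sample data are illustrative assumptions. This is not the validation code behind the published score.

```python
def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, correct in zip(confidences, outcomes):
        if not 0.0 <= conf <= 1.0:
            raise ValueError(f"confidence out of range: {conf}")
        # A confidence of exactly 1.0 is clamped into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, correct))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Made-up example: four claims stated at 0.9 (three correct), two at 0.6 (one correct).
confidences = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
outcomes = [True, True, True, False, True, False]
ece = expected_calibration_error(confidences, outcomes)
print(f"ECE = {ece:.3f}")
```

A score near zero means stated confidence tracks how often claims turn out to be right, which is exactly what hedging language cannot guarantee.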

No conflict detection

When Source A and Source B disagree, the LLM averages them. The contradiction is erased during training. Do SGLT2 inhibitors help or hurt? The answer depends on which papers dominated the training mix, and nothing in the output reveals that dependence.
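By contrast, a system that stores evidence as discrete records can keep both sides of a disagreement visible. The following is a minimal sketch under assumptions: the `Evidence` type, the `summarize_conflicts` function, and the placeholder claims and source IDs are all hypothetical and do not describe any real API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    claim: str       # normalized claim text (normalization itself is assumed)
    source_id: str   # placeholder for an identifier such as a PMID or DOI
    supports: bool   # True if the source supports the claim, False if it contradicts it


def summarize_conflicts(evidence: list[Evidence]) -> dict[str, dict]:
    """Group evidence by claim and report both sides instead of averaging them."""
    grouped = defaultdict(lambda: {"supporting": [], "contradicting": []})
    for item in evidence:
        side = "supporting" if item.supports else "contradicting"
        grouped[item.claim][side].append(item.source_id)
    summary = {}
    for claim, sides in grouped.items():
        contested = bool(sides["supporting"]) and bool(sides["contradicting"])
        summary[claim] = {**sides, "contested": contested}
    return summary


# Placeholder data only.
evidence = [
    Evidence("drug X improves outcome Y", "source-A", True),
    Evidence("drug X improves outcome Y", "source-B", False),
    Evidence("drug Z improves outcome Y", "source-C", True),
]
summary = summarize_conflicts(evidence)
print(summary["drug X improves outcome Y"])
```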

No knowledge boundary

LLMs have no structural "I don't know" state. They generate plausible text past the edge of their knowledge, silently. In clinical contexts, confident hallucination is more dangerous than an honest refusal.
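A structural "I don't know" state can be as simple as a return type that makes the absence of evidence a distinct value rather than generated text. The sketch below assumes hypothetical `Covered` and `NotCovered` types and a dictionary-backed lookup. It illustrates the idea only.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Covered:
    finding_id: str
    statement: str
    confidence: float


@dataclass(frozen=True)
class NotCovered:
    query: str
    reason: str = "no verified finding matches this query"


def answer(findings: dict[str, Covered], query: str) -> Union[Covered, NotCovered]:
    """Return a stored finding, or an explicit NotCovered value. Never a guess."""
    found = findings.get(query)
    return found if found is not None else NotCovered(query)


# Placeholder data only.
findings = {"topic-a": Covered("f-1", "placeholder statement about topic A", 0.9)}
hit = answer(findings, "topic-a")
miss = answer(findings, "topic-b")
print(type(hit).__name__, type(miss).__name__)
```

Because the caller must handle two distinct types, the boundary cannot be crossed silently.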

Why scaling doesn't fix it

These are not accuracy problems. A model that is 99.9% accurate still fails a regulatory audit, because the question regulators ask is not “how accurate is this?” — it is “show me your evidence chain.” LLMs have no evidence chain to show. Provenance, calibrated confidence, conflict detection, and knowledge boundaries require a different architecture — not a bigger model.

The Audit Test

What a regulator actually asks

When FDA, CMS, or a health system legal team audits an AI-generated recommendation, these are the questions they ask. Neither general AI nor citation-adjacent systems (OpenEvidence, RAG pipelines) can answer the first and most important one.

Regulatory question	LLM / General AI	Citation-adjacent (RAG / OpenEvidence)	Skippy
Walk me from this output back to its source document	Cannot provide	Paper reference only — no traversable path	✓ evidence lineage record: walk any finding to PMID, DOI, or URL in a single traversal
Show me the source document and version date for this specific claim	Cannot provide	Article-level ID only; no claim-level mapping	✓ Full source lineage on every finding
How many independent sources support this?	Cannot provide	Cannot provide	✓ Evidence count and pattern in every response
Are there sources that contradict this?	Cannot provide	Blended — contradiction not flagged	✓ Contradictions explicitly surfaced, not averaged away
Where does your knowledge end on this topic?	Silent hallucination	Low-confidence guess or refusal	✓ "Not covered" is a first-class API response
What did your system believe on a specific prior date?	Cannot provide	Cannot provide	✓ Every finding is versioned and retrievable
Is this confidence score calibrated against evidence?	Token probability only	None	✓ ECE-validated confidence score per finding
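The "specific prior date" row implies point-in-time retrieval over versioned findings. One common way to implement that is a sorted version history with a binary search, sketched below. `FindingVersion`, `finding_as_of`, and the sample history are hypothetical names and data used for illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class FindingVersion:
    valid_from: date
    statement: str
    confidence: float


def finding_as_of(versions: list[FindingVersion], when: date) -> Optional[FindingVersion]:
    """Return the version in effect on `when`, or None if no version existed yet.

    `versions` must be sorted by `valid_from` in ascending order.
    """
    starts = [v.valid_from for v in versions]
    index = bisect_right(starts, when)
    return versions[index - 1] if index else None


# Placeholder history for a single finding.
history = [
    FindingVersion(date(2024, 1, 10), "placeholder statement, first version", 0.70),
    FindingVersion(date(2024, 9, 1), "placeholder statement, revised", 0.85),
]
print(finding_as_of(history, date(2024, 6, 1)))
```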

Regulatory Deadlines

The window for non-compliant AI is closing

Three regulatory frameworks now require what LLMs structurally cannot provide. These are not future risks — they are active enforcement timelines.

EU AI Act

Enforcing · August 2026

High-risk AI systems in healthcare must provide technical documentation of the knowledge base, how outputs are derived, and audit records for every decision. Article 13 requires these systems to be transparent enough that deployers can interpret each output and use it appropriately.

CMS Prior Authorization Rule

Active · January 2027 full compliance

CMS explicitly addressed AI-driven PA in the rule preamble: "Where AI or algorithms are used to support PA decisions, the rationale must be traceable to specific clinical criteria that are publicly available and versioned."

FDA SaMD AI/ML Action Plan

Active · Ongoing

Software as a Medical Device must support independent clinician review of the evidence basis for each recommendation. A system that cannot show its evidence chain cannot satisfy the CDS exemption.

Skippy's compliance posture

Framework	What regulators require	Skippy status
EU AI Act Art. 13	Technical documentation, audit trail per decision	✓ 5/5 PCCP gates pass · full audit log per call
CMS PA Rule	Traceable rationale to versioned clinical criteria	✓ Source ID + version date on every recommendation
FDA SaMD CDS	Clinician-reviewable evidence basis	✓ PMID/DOI-level lineage exposed via API

Full security & compliance documentation →

The Architecture Argument

Lineage cannot live in weights

A model that learns from a document absorbs its signal into billions of floating-point weights. That document's identity — who published it, when, what it claimed — is gone. The model cannot show you what it knows, because "what it knows" is not stored as facts. It is stored as statistical tendencies across a multi-billion-parameter matrix.

Frontier labs recognized this and bolted retrieval on top of generation. RAG improves citation frequency — but retrieval-time grounding still relies on the model to judge relevance, extract the claim, and decide whether the source actually supports it. The same 50–90% unsupported citation rate applies.

Skippy does not bolt retrieval onto generation. It maintains a structured evidence base where every finding exists as a discrete, versioned object with provenance, confidence, and source lineage baked in at ingest time — not reconstructed at query time. The LLM calls Skippy for ground truth. It does not supply it.

How provenance works

1. Source document is ingested and assigned a permanent, versioned ID.

2. Specific claims are extracted and linked to the source record: linked, not embedded.

3. Confidence is calibrated against held-out data at ingest time, not generated at query time.

4. The finding exists as a queryable object. Every downstream response inherits its lineage.
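The steps above can be sketched as a small in-memory evidence base. Everything here is an assumption made for illustration: the class and field names, the hash-derived source ID, and the placeholder DOI are hypothetical, and the confidence values are taken as given from a separate calibration step that is not shown.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SourceRecord:
    source_id: str      # permanent ID, here derived from the external reference
    version: int
    version_date: date
    external_ref: str   # e.g. a PMID, DOI, or URL


@dataclass(frozen=True)
class Finding:
    finding_id: str
    claim: str
    source_id: str      # explicit link to the source record, not an embedding
    source_version: int
    confidence: float   # assumed output of a separate calibration step


class EvidenceBase:
    def __init__(self) -> None:
        self.sources: dict[tuple[str, int], SourceRecord] = {}
        self.findings: dict[str, Finding] = {}

    def ingest(self, external_ref: str, version_date: date,
               claims: list[tuple[str, float]]) -> list[Finding]:
        """Register a source version and link each (claim, confidence) pair to it."""
        source_id = "src-" + hashlib.sha256(external_ref.encode()).hexdigest()[:12]
        version = 1 + sum(1 for (sid, _) in self.sources if sid == source_id)
        self.sources[(source_id, version)] = SourceRecord(
            source_id, version, version_date, external_ref)
        created = []
        for n, (text, confidence) in enumerate(claims, start=1):
            finding = Finding(f"{source_id}-v{version}-f{n}", text,
                              source_id, version, confidence)
            self.findings[finding.finding_id] = finding
            created.append(finding)
        return created

    def source_of(self, finding_id: str) -> SourceRecord:
        """Walk from a finding back to the exact source version it was linked to."""
        finding = self.findings[finding_id]
        return self.sources[(finding.source_id, finding.source_version)]


base = EvidenceBase()
ref = "doi:10.0000/placeholder"  # made-up identifier
first = base.ingest(ref, date(2024, 1, 10), [("placeholder claim", 0.8)])
second = base.ingest(ref, date(2024, 9, 1), [("placeholder claim, revised", 0.9)])
print(base.source_of(second[0].finding_id))
```

Re-ingesting the same reference creates a new version under the same permanent ID, so older findings keep pointing at the version they were extracted from.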

Competitive Asymmetry

Why frontier labs can't replicate this

The instinct is: "GPT-5 will solve this." It won't. The constraints are not about capability — they are about what frontier labs are optimized to do.

Dimension	Frontier lab	Skippy
Domain scope	Everything, everywhere — breadth is the product	One domain done to depth before the next
Compute scale	Maximize capability per token — any topic, any user	Maximize verifiability per domain — specific regulated use cases
Provenance priority	Attribution is a feature request, not a core constraint	Provenance is the product — every finding has a traceable lineage
Regulatory context	Consumer and enterprise general use — regulators are secondary	FDA, CMS, EU AI Act compliance is the primary design constraint
Curated data utility	Curated sources are a small fraction — web scale dominates	500K+ structured ontology concepts (Gene Ontology, Disease Ontology, HPO) plus authoritative literature — formal structure and empirical evidence reinforce each other
Knowledge boundary	Trained to be helpful — admitting uncertainty conflicts with objective	"Not covered" is a first-class response — honest limits are a hard contract

Auditable by design

The verifier is open source under Apache 2.0. Every one of the 11 violation codes (V001–V011) is documented and testable. Regulated buyers can audit the verification logic independently rather than taking a vendor's word for it.

skippy-verifier on GitHub →

Zero PHI retention

No patient data stored after the request resolves. Audit records contain only metadata — hash, violation codes, finding IDs, confidence tier. No payload. No PHI at rest.
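A metadata-only audit record of the kind described here might look like the following sketch. The `AuditRecord` type and `make_audit_record` function are hypothetical, the violation code is used without assuming its meaning, and the sample payload is a placeholder.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuditRecord:
    payload_hash: str                  # SHA-256 of the request payload, not the payload
    violation_codes: tuple[str, ...]
    finding_ids: tuple[str, ...]
    confidence_tier: str               # e.g. "convergent", "contested", or "sparse"


def make_audit_record(payload: bytes, violation_codes, finding_ids,
                      confidence_tier: str) -> AuditRecord:
    """Build an audit record that keeps a digest of the payload and drops the payload."""
    return AuditRecord(
        payload_hash=hashlib.sha256(payload).hexdigest(),
        violation_codes=tuple(violation_codes),
        finding_ids=tuple(finding_ids),
        confidence_tier=confidence_tier,
    )


record = make_audit_record(b"placeholder request body", ["V003"], ["f-1"], "convergent")
serialized = json.dumps(asdict(record))
print(serialized)
```

One design caveat: a plain hash of a short or predictable payload can be guessed by brute force, so a production design would likely use a keyed hash (for example HMAC) instead. That choice is outside what this page specifies.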

~12ms avg response · 4,500 req/s max throughput

The asymmetry is structural. A frontier lab that narrows to regulatory-grade medical evidence verification stops being a frontier lab — it stops serving the use cases its business depends on. Skippy is purpose-built for the cases that require verifiability, not the cases that reward breadth.

The Answer

What Skippy is built to do

Skippy is a verifiable AI platform — every output gated by a verifier, every claim traced to a verified finding with calibrated confidence and full source lineage. It doesn't replace LLMs. It provides the evidence ground truth that LLMs call.

The LLM owns the interface. Skippy owns the ground truth. The citation target is a verified finding — not a URL, not a document chunk, not a paper. A specific, versioned, calibrated claim that can be walked back to its sources.

See how it works · Explore Medical →

Verifier-gated output

Ungrounded output is rejected as a hard contract before delivery. Not filtered — rejected.
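The difference between filtering and rejecting can be shown in a few lines. In this sketch, which assumes hypothetical `DraftSentence`, `UngroundedOutputError`, and `gate` names, a single ungrounded sentence causes the whole draft to be refused rather than silently trimmed.

```python
from dataclasses import dataclass
from typing import Optional


class UngroundedOutputError(Exception):
    """Raised when a draft contains a sentence with no verified finding behind it."""


@dataclass(frozen=True)
class DraftSentence:
    text: str
    finding_id: Optional[str]   # None means the sentence cites nothing


def gate(draft: list[DraftSentence], known_finding_ids: set[str]) -> list[str]:
    """Deliver the draft only if every sentence cites a known finding."""
    for sentence in draft:
        if sentence.finding_id is None or sentence.finding_id not in known_finding_ids:
            raise UngroundedOutputError(f"rejected, ungrounded sentence: {sentence.text!r}")
    return [sentence.text for sentence in draft]


known = {"f-1", "f-2"}
delivered = gate([DraftSentence("grounded sentence", "f-1")], known)
print(delivered)
```

Raising an exception instead of dropping the offending sentence means the caller cannot receive a partially grounded answer by accident.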

Calibrated confidence

Every finding carries a confidence score validated against held-out data. Convergent, contested, or sparse — the distinction is explicit.

Full source lineage

Every claim is traceable to the evidence that supports it. Source document ID, version date, evidence count, and contradictions.

First-class knowledge boundary

"Not covered" is a real API response. Skippy is honest about what it doesn't know — and refuses to fabricate past its knowledge boundary.

Next Step by Role

Where to go from here

CTO / Technical Lead

How does the verifier work, and what's the actual latency cost?

Architecture deep-dive → · Benchmark methodology →

Compliance / Legal

Does this satisfy EU AI Act Article 13 and CMS PA traceability?

Security & compliance → · Validation methodology →

Clinical / Medical Lead

What does this look like in practice for clinical workflows?

Medical domain → · Embedded CDS →

Ready to see the difference?

We work with health systems, life sciences companies, clinical AI developers, and anyone else who needs AI that can show its evidence chain.

Request a Demo · Technical architecture →