Open Source · AI Governance · AI Evaluation

Why We Are Open-Sourcing Clinical AI Evaluation

ChartR Team

The Stakes Are Too High for Proprietary Evaluation

Clinical AI agents are entering healthcare at an unprecedented pace. They are generating recommendations, automating documentation, predicting outcomes, and communicating directly with patients. The question is no longer whether to evaluate these systems — it is who gets to define what "safe," "appropriate," and "reliable" mean.

Right now, the answer is fragmented. Each vendor evaluates their own system against their own benchmarks, using their own criteria, behind their own walls. The result is a patchwork of proprietary evaluation standards that no one outside the vendor can independently verify, compare, or challenge.

This is not good enough for medicine. When a clinical AI agent influences a treatment decision, every stakeholder — the clinician, the patient, the health system, and the regulator — deserves to know how that system was evaluated, what standards it was measured against, and whether those standards are rigorous. Proprietary evaluation makes that impossible.

We are open-sourcing ChartR's clinical AI evaluation and governance layer because we believe evaluation standards for clinical AI must be open, inspectable, and improvable by the entire community.

Why Open Standards Matter

Healthcare has a long history of open standards that improved care. HL7 and FHIR transformed health data interoperability. ICD, CPT, SNOMED, and LOINC created shared vocabularies that allow systems to communicate. These standards work because they are public, inspectable, and continuously refined by the community.

Clinical AI evaluation needs the same treatment. Four principles drive this conviction.

Interoperability

Any clinical AI agent — regardless of who built it — should be evaluable against a common set of standards. Today, comparing the safety and effectiveness of two different clinical AI systems is nearly impossible because each is evaluated on different terms. Open evaluation standards create a shared language for measuring what matters: clinical appropriateness, safety, fairness, explainability, and reliability.

Without interoperability, health systems are locked into trusting each vendor's self-reported evaluation — a dynamic that does not serve clinicians or patients.
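
To make the idea concrete, here is a minimal sketch, in Python, of what a vendor-agnostic evaluation contract could look like. Every name here (ClinicalAgent, EvaluationResult, score_dimension, the dimension list) is our illustrative assumption, not ChartR's published API — the point is only that one shared interface lets any vendor's agent be scored on the same terms.

```python
# A minimal, hypothetical sketch of a vendor-agnostic evaluation contract.
# None of these names come from ChartR's published code; they illustrate
# the idea of one shared interface that any clinical agent can satisfy.
from dataclasses import dataclass
from typing import Protocol

DIMENSIONS = (
    "clinical_appropriateness",
    "safety",
    "fairness",
    "explainability",
    "reliability",
)

@dataclass
class EvaluationResult:
    dimension: str   # one of DIMENSIONS
    score: float     # normalized to [0.0, 1.0]
    evidence: str    # which scenario produced this score

class ClinicalAgent(Protocol):
    """The only thing a vendor must expose: a way to answer a scenario."""
    def respond(self, scenario: str) -> str: ...

def score_dimension(answer: str, scenario: str, dimension: str) -> EvaluationResult:
    # Placeholder scoring: a real framework would apply open, versioned
    # clinical rules here. This stub only records that the check ran.
    passed = bool(answer.strip())
    return EvaluationResult(
        dimension=dimension,
        score=1.0 if passed else 0.0,
        evidence=f"scenario={scenario!r}",
    )

def evaluate(agent: ClinicalAgent, scenarios: list[str]) -> list[EvaluationResult]:
    """Run any agent against the same open scenarios and dimensions,
    so results from different vendors are directly comparable."""
    return [
        score_dimension(agent.respond(s), s, dim)
        for s in scenarios
        for dim in DIMENSIONS
    ]
```

Because the scenarios and dimensions are shared rather than vendor-defined, two agents run through this kind of harness produce results a health system can actually compare side by side.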

Trust

Clinicians are being asked to incorporate AI-driven recommendations into their clinical judgment. That requires trust. And trust requires transparency — not just in what the AI recommends, but in how it was evaluated.

Proprietary evaluation creates a trust deficit. When the criteria, benchmarks, and governance processes are hidden behind vendor walls, clinicians and health systems have no way to independently assess whether the evaluation was rigorous. Open evaluation frameworks make the standards visible, so trust can be earned rather than assumed.

Collective Improvement

No single organization has the breadth of clinical knowledge, the diversity of patient populations, or the range of edge cases required to build a truly comprehensive evaluation framework. Open-source evaluation gets better faster because the entire community — clinicians, researchers, health systems, regulators, and patient advocates — can contribute scenarios, refine criteria, and surface blind spots.

The complexity of medicine demands collective intelligence. A closed evaluation framework, no matter how well-designed, will always have gaps that an open community would catch.

Regulatory Alignment

Regulators are converging on transparency and standardization. The NIST AI Risk Management Framework identifies seven characteristics of trustworthy AI: valid and reliable, safe, secure and resilient, accountable and transparent, explainable and interpretable, privacy-enhanced, and fair with harmful bias managed. The FDA is increasingly focused on auditability and post-market surveillance for AI-enabled medical devices.

Open-source evaluation frameworks are naturally aligned with these regulatory expectations. They provide the transparency, reproducibility, and version-controlled governance that regulators require — without each organization having to build these capabilities from scratch.

What We Are Open-Sourcing

ChartR is making the following components of our clinical AI evaluation infrastructure available as open source; a brief sketch of what these artifacts could look like follows the list:

  • Clinical rules and ontology framework — the structured knowledge base for evaluating clinical appropriateness, safety, and guideline alignment across clinical domains
  • Evaluation benchmarks and test suites — curated scenarios for measuring agent performance across the dimensions that matter: hallucination detection, clinical appropriateness, safety, fairness, robustness, explainability, and financial/regulatory compliance
  • Governance structure — the versioning, review, and update processes that ensure evaluation criteria evolve with clinical evidence and regulatory requirements
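
As a rough illustration of how such artifacts can be expressed as plain, inspectable data rather than vendor-internal logic, here is a hypothetical rule and benchmark scenario. The structure and every field name are our assumptions for illustration, not the actual published schema.

```python
# Illustrative only: field names and values are hypothetical, meant to show
# how an open, versioned rule and a benchmark scenario could be expressed
# as plain data that anyone can inspect, diff, and review.

rule = {
    "id": "example-rule-001",          # stable identifier for citation and diffing
    "version": "2.1.0",                # semantic versioning makes updates auditable
    "domain": "example clinical domain",
    "statement": "Recommendations must cite the guideline they rely on.",
    "dimension": "explainability",     # which evaluation dimension this rule feeds
    "review": {
        "status": "approved",
        "last_reviewed": "2025-01-01", # governance: rules are re-reviewed on a cycle
    },
}

scenario = {
    "id": "example-scenario-001",
    "rule_ids": ["example-rule-001"],  # scenarios reference the rules they test
    "prompt": "A synthetic patient case designed to probe guideline citation.",
    "expected_behavior": "Agent cites the relevant guideline or defers to a clinician.",
}
```

Keeping rules and scenarios in this kind of version-controlled, declarative form is also what makes the governance story work: every change to an evaluation criterion leaves a reviewable trail.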

The insight behind this decision is straightforward: the same rules framework we built to evaluate and improve human provider performance applies directly to evaluating AI agents. Clinical quality is clinical quality — whether the provider is a physician, a nurse, or an autonomous agent. The standards should be the same, and they should be open.
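
A toy illustration of that symmetry, again with invented names and a deliberately simplistic check: the same rule runs unchanged whether the input came from a clinician or an agent.

```python
def check_guideline_citation(note: str) -> bool:
    """Toy rule check: flags text that makes a treatment recommendation
    without citing any guideline. Purely illustrative, not a real rule."""
    recommends = "recommend" in note.lower()
    cites = "guideline" in note.lower() or "per " in note.lower()
    return (not recommends) or cites

physician_note = "Recommend beta-blocker per AHA guideline."
agent_output = "Recommend beta-blocker."

print(check_guideline_citation(physician_note))  # True: citation present
print(check_guideline_citation(agent_output))    # False: same rule, same standard
```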

A Growing Movement

We are not alone in this conviction. The Coalition for Health AI (CHAI) is maintaining open-source testing and evaluation frameworks for specific clinical use cases. The PRIMARY-AI project is building international consensus on AI evaluation standards across more than 100 organizations. Microsoft released an open-source Healthcare AI Model Evaluator to help organizations benchmark AI systems using their own data and metrics.

This is a growing movement — and it needs more contributors. The more organizations that adopt, refine, and contribute to open evaluation standards, the stronger the standards become for everyone.

An Invitation

Open-source evaluation is not just a philosophy. It is a practical necessity for an industry where the stakes are measured in patient outcomes.

As clinical AI agents become more capable and more autonomous, the evaluation layer must keep pace. We are making our framework available so that any clinical AI agent, regardless of who built it, can be evaluated against clear, open, community-maintained standards. Health systems should be able to compare agents on equal terms. Regulators should be able to inspect evaluation criteria. Clinicians should be able to trust that the AI assisting them has been rigorously and transparently assessed.

The goal is not to control evaluation. It is to make it universal.