A new national standard for safe, scalable AI in health care

Figure: Qualitative vs. quantitative evaluation of LLMs. Credit: Journal of the American Medical Informatics Association (2025). DOI: 10.1093/jamia/ocaf023

Duke University School of Medicine researchers have developed two pioneering frameworks designed to evaluate the performance, safety, and reliability of large language models in health care.

Published in npj Digital Medicine and the Journal of the American Medical Informatics Association (JAMIA), these studies offer a new approach to ensuring that AI systems used in clinical settings meet the highest standards of quality and accountability.

As large language models become increasingly embedded in medical practice—generating clinical notes, summarizing conversations, and assisting with patient communications—health systems are grappling with how to assess these technologies in ways that are both rigorous and scalable. The Duke-led studies, directed by Chuan Hong, Ph.D., assistant professor in Duke's Department of Biostatistics and Bioinformatics, aim to fill that gap.

The npj Digital Medicine study introduces SCRIBE, a structured evaluation framework for ambient digital scribing tools—AI systems that generate clinical documentation from real-time patient-provider conversations. SCRIBE draws on expert clinical reviews, automated scoring methods, and simulated edge-case testing to evaluate how well these tools perform across dimensions such as accuracy, fairness, coherence, and resilience.

“Ambient AI holds real promise in reducing documentation workload for clinicians,” Hong said. “But thoughtful evaluation is essential. Without it, we risk implementing tools that might unintentionally introduce bias, omit critical information, or diminish the quality of care. SCRIBE is designed to help prevent that.”

A second, related study in JAMIA applies a complementary framework to assess large language models used by the Epic electronic medical record platform to draft replies to patient messages. The research compares clinician feedback with automated metrics to evaluate aspects such as clarity, completeness, and safety.

While the study found strong performance in tone and readability, it also revealed gaps in the completeness of responses—emphasizing the importance of continuous evaluation in practice.

“This work helps close the distance between innovative algorithms and real-world clinical value,” said Michael Pencina, Ph.D., chief data scientist at Duke Health and co-author of both studies. “We are showing what it takes to implement AI responsibly, and how rigorous evaluation must be part of the technology’s life cycle, not an afterthought.”

Together, these frameworks form a foundation for responsible AI adoption in health care. They give clinical leaders, developers, and regulators the tools to assess AI models before deployment and monitor their performance over time—ensuring they support care delivery without compromising safety or trust.

More information:
Haoyuan Wang et al, An evaluation framework for ambient digital scribing tools in clinical applications, npj Digital Medicine (2025). DOI: 10.1038/s41746-025-01622-1

Chuan Hong et al, Application of unified health large language model evaluation framework to In-Basket message replies: bridging qualitative and quantitative assessments, Journal of the American Medical Informatics Association (2025). DOI: 10.1093/jamia/ocaf023

Provided by Duke University

Citation: A new national standard for safe, scalable AI in health care (2025, June 23), retrieved 24 June 2025 from


