Choi, E. et al. Retain: an interpretable predictive model for healthcare using reverse time attention mechanism. In Proc. Advances in Neural Information Processing Systems. Vol. 29 (Curran Associates, Inc., 2016).
Wang, Z., Wu, Z., Agarwal, D. & Sun, J. MedCLIP: Contrastive learning from unpaired medical images and text. In Proc. 2022 Conference on Empirical Methods in Natural Language Processing. 3876–3887 (Association for Computational Linguistics, 2022).
Wang, Z. & Sun, J. PromptEHR: conditional electronic healthcare records generation with prompt learning. In Proc. Conference on Empirical Methods in Natural Language Processing 2873–2885 (Association for Computational Linguistics, 2022).
Moor, M. et al. Foundation models for generalist medical artificial intelligence. Nature 616, 259–265 (2023).
Brown, T. et al. Language models are few-shot learners. In Proc. Advances in Neural Information Processing Systems. Vol. 33, 1877–1901 (Curran Associates, Inc., 2020).
Wan, P. et al. Outpatient reception via collaboration between nurses and a large language model: a randomized controlled trial. Nat. Med. 30, 2878–2885 (2024).
Yang, J. et al. Harnessing the power of LLMs in practice: a survey on ChatGPT and beyond. ACM Trans. Knowl. Discov. Data 18, 1–32 (2024).
Tian, S. et al. Opportunities and challenges for ChatGPT and large language models in biomedicine and health. Brief. Bioinform. 25, bbad493 (2024).
Au Yeung, J. et al. AI chatbots not yet ready for clinical use. Front. Digit. Health 5, 1161098 (2023).
Sarraju, A. et al. Appropriateness of cardiovascular disease prevention recommendations obtained from a popular online chat-based artificial intelligence model. JAMA 329, 842–844 (2023).
Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proc. Advances in Neural Information Processing Systems. Vol. 33, 9459–9474 (Curran Associates, Inc., 2020).
Xiong, G., Jin, Q., Lu, Z. & Zhang, A. Benchmarking retrieval-augmented generation for medicine. In Proc. Findings of the Association for Computational Linguistics: ACL 2024 (eds Ku, L.-W., Martins, A. & Srikumar, V.) 6233–6251 (Association for Computational Linguistics, Bangkok, Thailand, 2024).
Nakano, R. et al. WebGPT: Browser-assisted question-answering with human feedback. Preprint at arXiv (2021).
Jin, Q. et al. AgentMD: empowering language agents for risk prediction with large-scale clinical tool learning. Preprint at arXiv (2024).
Nori, H. et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. Preprint at arXiv (2023).
Zaharia, M. et al. The shift from models to compound AI systems. BAIR Blog (2024).
Lin, J., Xu, H., Wang, Z., Wang, S. & Sun, J. Panacea: a foundation model for clinical trial search, summarization, design, and recruitment. Preprint at arXiv (2024).
Wang, H. et al. Towards adapting open-source large language models for expert-level clinical note generation. Preprint at arXiv (2024).
Khattab, O. et al. DSPy: compiling declarative language model calls into self-improving pipelines. In Proc. R0-FoMo: Robustness of Few-shot and Zero-shot Learning in Large Foundation Models (2023).
Boiko, D. A., MacKnight, R., Kline, B. & Gomes, G. Autonomous chemical research with large language models. Nature 624, 570–578 (2023).
Touvron, H. et al. Llama 2: Open foundation and fine-tuned chat models. Preprint at arXiv (2023).
Chen, Z. et al. MEDITRON-70b: scaling medical pretraining for large language models. Preprint at arXiv (2023).
Jiang, A. Q. et al. Mixtral of experts. Preprint at arXiv (2024).
Wei, J. et al. Finetuned language models are zero-shot learners. In Proc. International Conference on Learning Representations. OpenReview.net (2021).
Li, X. L. & Liang, P. Prefix-Tuning: optimizing continuous prompts for generation. In Proc. 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. Vol. 1: Long Papers, 4582–4597 (Association for Computational Linguistics (ACL), 2021).
Hu, E. J. et al. LoRA: low-rank adaptation of large language models. In Proc. International Conference on Learning Representations. OpenReview.net (2023).
Ouyang, L. et al. Training language models to follow instructions with human feedback. In Proc. Advances in Neural Information Processing Systems. Vol. 35, 27730–27744 (Curran Associates, Inc., 2022).
Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. In Proc. Advances in Neural Information Processing Systems. Vol. 35, 24824–24837 (Curran Associates, Inc., 2022).
Wang, X. et al. Self-consistency improves chain of thought reasoning in language models. In Proc. Eleventh International Conference on Learning Representations. OpenReview.net (2022).
Chen, L. et al. Are more LLM calls all you need? towards scaling laws of compound inference systems. Preprint at arXiv (2024).
Jin, Q. et al. Matching patients to clinical trials with large language models. Nat. Commun. 15, 9074 (2024).
Shin, T., Razeghi, Y., Logan IV, R. L., Wallace, E. & Singh, S. AutoPrompt: eliciting knowledge from language models with automatically generated prompts. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 4222–4235 (Association for Computational Linguistics (ACL), 2020).
Cheng, J. et al. Black-box prompt optimization: aligning large language models without model training. Preprint at arXiv (2023).
Wen, Y., Wang, Z. & Sun, J. MindMap: knowledge graph prompting sparks graph of thoughts in large language models. Preprint at arXiv (2023).
Zakka, C. et al. Almanac: retrieval-augmented language models for clinical medicine. NEJM AI 1, AIoa2300068 (2024).
Arasteh, S. T. et al. RadioRAG: factual large language models for enhanced diagnostics in radiology using dynamic retrieval augmented generation. Preprint at arXiv (2024).
Wang, Z. & Sun, J. Trial2vec: Zero-shot clinical trial document similarity search using self-supervision. In Proc. Findings of the Association for Computational Linguistics: EMNLP 2022 6377–6390 (2022).
Jin, Q. et al. MedCPT: contrastive pre-trained transformers with large-scale PubMed search logs for zero-shot biomedical information retrieval. Bioinformatics 39, btad651 (2023).
Wu, T., Terry, M. & Cai, C. J. AI chains: Transparent and controllable human-ai interaction by chaining large language model prompts. In Proc. 2022 CHI Conference on Human Factors in Computing Systems 1–22 (2022).
Yao, S. et al. ReAct: synergizing reasoning and acting in language models. In Proc. Eleventh International Conference on Learning Representations. OpenReview.net (2022).
Wang, Z. et al. Accelerating clinical evidence synthesis with large language models. Preprint at arXiv (2024).
Wang, Z., Danek, B., Yang, Z., Chen, Z. & Sun, J. Can large language models replace data scientists in clinical research? Preprint at arXiv (2024).
Asai, A. et al. OpenScholar: synthesizing scientific literature with retrieval-augmented language models. Preprint at arXiv (2024).
OpenAI. Introducing deep research. (2025).
Gravitas, S. AutoGPT. (2023).
Wu, Q. et al. AutoGen: enabling next-gen LLM applications via multi-agent conversation framework. Preprint at arXiv (2023).
Tayebi Arasteh, S. et al. Large language models streamline automated machine learning for clinical studies. Nat. Commun. 15, 1603 (2024).
Trinh, T. H., Wu, Y., Le, Q. V., He, H. & Luong, T. Solving olympiad geometry without human demonstrations. Nature 625, 476–482 (2024).
Gu, Y. et al. Middleware for LLMs: tools are instrumental for language agents in complex environments. Preprint at arXiv (2024).
Gao, T., Yen, H., Yu, J. & Chen, D. Enabling large language models to generate text with citations. In Proc. 2023 Conference on Empirical Methods in Natural Language Processing 6465–6488 (2023).
Kim, Y. et al. MDAgents: an adaptive collaboration of LLMs for medical decision-making. In Proc. Advances in Neural Information Processing Systems. Vol. 37, 79410–79452 (2025).
Mukherjee, S. et al. Polaris: a safety-focused LLM constellation architecture for healthcare. Preprint at arXiv (2024).
Huang, K. et al. Automated hypothesis validation with agentic sequential falsifications. Preprint at arXiv (2025).
Swanson, K., Wu, W., Bulaong, N. L., Pak, J. E. & Zou, J. The virtual lab: AI agents design new SARS-CoV-2 nanobodies with experimental validation. Preprint at bioRxiv (2024).
Semnani, S., Yao, V., Zhang, H. C. & Lam, M. WikiChat: Stopping the hallucination of large language model chatbots by few-shot grounding on Wikipedia. In Proc. 2023 Conference on Empirical Methods in Natural Language Processing (2023).
Chen, L., Zaharia, M. & Zou, J. FrugalGPT: How to use large language models while reducing cost and improving performance. Preprint at arXiv (2023).
Tang, L. et al. Evaluating large language models on medical evidence summarization. npj Digit. Med. 6, 158 (2023).
Bhaskar, A., Fabbri, A. & Durrett, G. Prompted opinion summarization with GPT-3.5. In Proc. Findings of the Association for Computational Linguistics: ACL 2023 9282–9300 (2023).
Park, J. et al. Criteria2query 3.0: leveraging generative large language models for clinical trial eligibility query generation. J. Biomed. Inform. 154, 104649 (2024).
Datta, S. et al. AutoCriteria: a generalizable clinical trial eligibility criteria extraction system powered by large language models. J. Am. Med. Inform. Assoc. 31, 375–385 (2024).
Agrawal, M., Hegselmann, S., Lang, H., Kim, Y. & Sontag, D. Large language models are few-shot clinical information extractors. In Proc. Conference on Empirical Methods in Natural Language Processing 1998–2022 (2022).
Jeong, D., Garg, S., Lipton, Z. C. & Oberst, M. Medical adaptation of large language and vision-language models: Are we making progress? In Proc. 2024 Conference on Empirical Methods in Natural Language Processing, 12143–12170 (2024).
Zhang, G. et al. Closing the gap between open source and commercial large language models for medical evidence summarization. npj Digit. Med. 7, 239 (2024).
Klang, E. et al. A strategy for cost-effective large language model use at health system-scale. npj Digit. Med. 7, 320 (2024).
OpenAI. Function calling and other api updates. (2023).
Chase, H. LangChain. (2022).
Jin, Q., Yang, Y., Chen, Q. & Lu, Z. GeneGPT: augmenting large language models with domain tools for improved access to biomedical information. Bioinformatics 40, btae075 (2024).
Sayers, E. W. et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 47, D23 (2019).
Lin, C.-Y. ROUGE: a package for automatic evaluation of summaries. In Proc. Text Summarization Branches Out. 74–81 (Association for Computational Linguistics, 2004).
Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q. & Artzi, Y. BERTscore: evaluating text generation with BERT. In Proc. International Conference on Learning Representations. OpenReview.net (2019).
Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 30, 1134–1142 (2024).
Abacha, A. B., Yim, W.-w., Adams, G., Snider, N. & Yetisgen-Yildiz, M. Overview of the MEDIQA-Chat 2023 shared tasks on the summarization & generation of doctor-patient conversations. In Proc. 5th Clinical Natural Language Processing Workshop. 503–513 (Association for Computational Linguistics, 2023).
Nelson, H. Epic announces ambient clinical documentation EHR integration. Accessed 5 September 2024 (2023).
Yim, W.-w. et al. Aci-bench: a novel ambient clinical intelligence dataset for benchmarking automatic visit note generation. Sci. Data 10, 586 (2023).
Soroush, A. et al. Large language models are poor medical coders: benchmarking of medical code querying. NEJM AI 1, AIdbp2300040 (2024).
Wang, H., Gao, C., Dantona, C., Hull, B. & Sun, J. DRG-LLaMA: tuning LLaMA model to predict diagnosis-related group for hospitalized patients. npj Digit. Med. 7, 16 (2024).
Topaz, M., Shafran-Topaz, L. & Bowles, K. H. ICD-9 to ICD-10: evolution, revolution, and current debates in the United States. Perspect. Health Inf. Manag. 10, 1d (2013).
Dotson, P. CPT® codes: what are they, why are they necessary, and how are they developed? Adv. Wound Care 2, 583–587 (2013).
ICD-10-CM/PCS MS-DRG V40.1 Definitions Manual. Accessed 23 June 2025. https://www.cms.gov/icd10m/fy2023-version40.1-fullcode-cms/fullcode_cms/P0006.html.
Wornow, M. et al. Zero-shot clinical trial patient matching with LLMs. Preprint at arXiv (2024).
Nievas, M., Basu, A., Wang, Y. & Singh, H. Distilling large language models for matching patients to clinical trials. J. Am. Med. Inform. Assoc. 31, 1953–1963 (2024).
Unlu, O. et al. Retrieval-augmented generation–enabled GPT-4 for clinical trial screening. NEJM AI 1, AIoa2400181 (2024).
Wong, C. et al. Scaling clinical trial matching using large language models: a case study in oncology. In Proc. Machine Learning for Healthcare Conference. 846–862 (Proceedings of Machine Learning Research (PMLR), 2023).
Shaib, C. et al. Summarizing, simplifying, and synthesizing medical evidence using GPT-3 (with varying success). In Proc. Annual Meeting of the Association for Computational Linguistics (Association for Computational Linguistics, 2023).
Yun, H., Marshall, I., Trikalinos, T. & Wallace, B. Appraising the potential uses and harms of LLMs for medical systematic reviews. In Proc. Conference on Empirical Methods in Natural Language Processing (Association for Computational Linguistics, 2023).
Page, M. J. et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. Syst. Rev. 10, 89 (2021).
Wang, Z. et al. A foundation model for human-ai collaboration in medical literature mining. Preprint at arXiv (2025).
Wang, S., Scells, H., Koopman, B. & Zuccon, G. Can ChatGPT write a good Boolean query for systematic review literature search? In Proc. 46th International ACM SIGIR Conference on Research and Development in Information Retrieval 1426–1436 (ACM, 2023).
National Institute of Standards and Technology. NIST Privacy Framework: A Tool for Improving Privacy through Enterprise Risk Management, Version 1.0. Technical Report (National Institute of Standards and Technology, 2020).
National Institute of Standards and Technology. The NIST Cybersecurity Framework (CSF) 2.0. Technical Report NIST CSWP 29 (U.S. Department of Commerce, 2024).
Quentin, C., Steinhagen, D., Francis, M. & Streff, K. Towards a triad for data privacy. In Proc. Annual Hawaii International Conference on System Sciences (2020). https://doi.org/10.24251/hicss.2020.535.
Tabassi, E. Artificial Intelligence Risk Management Framework (AI RMF 1.0): AI RMF (1.0). Technical Report (2023).
Nori, H. et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. Preprint at arXiv (2023).
Chaves, J. M. Z. et al. Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation. Preprint at arXiv (2024).
Zhang, S. et al. Large-scale domain-specific pretraining for biomedical vision-language processing. Preprint at arXiv (2023).
Huang, Z., Bianchi, F., Yuksekgonul, M., Montine, T. J. & Zou, J. A visual-language foundation model for pathology image analysis using medical Twitter. Nat. Med. 29, 2307–2316 (2023).
Lu, M. Y. et al. A visual-language foundation model for computational pathology. Nat. Med. 30, 863–874 (2024).
Li, C. et al. LLaVA-Med: training a large language-and-vision assistant for biomedicine in one day. In Proc. Advances in Neural Information Processing Systems. Vol. 36 (Curran Associates, Inc., 2023).
Gu, Y. et al. BiomedJourney: counterfactual biomedical image generation by instruction-learning from multimodal patient journeys. Preprint at arXiv (2023).
Bluethgen, C. et al. A vision-language foundation model for the generation of realistic chest X-ray images. Nat. Biomed. Eng. 9, 494–506 (2025).
Xu, H. et al. A whole-slide foundation model for digital pathology from real-world data. Nature 630, 181–188 (2024).
Huang, L. et al. A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst. 43, 1–55 (2025).
Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. Med. 29, 1930–1940 (2023).
Wu, K. et al. An automated framework for assessing how well LLMs cite relevant medical references. Nat. Commun. 16, 3615 (2025).
Sams, C. M., Fanous, A. H. & Daneshjou, R. Human-artificial intelligence interaction research is crucial for medical artificial intelligence implementation. J. Invest. Dermatol. (2024).
Carlini, N. et al. Extracting training data from large language models. In Proc. 30th USENIX Security Symposium (USENIX Security 21). 2633–2650 (USENIX Association, 2021).
Theodorou, B., Xiao, C. & Sun, J. Synthesize high-dimensional longitudinal electronic health records via hierarchical autoregressive language model. Nat. Commun. 14, 5305 (2023).
Das, T., Wang, Z. & Sun, J. TWIN: Personalized clinical trial digital twin generation. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 402–413 (Association for Computing Machinery (ACM), 2023). https://doi.org/10.1145/3580305.3599370.
Torfi, A., Fox, E. A. & Reddy, C. K. Differentially private synthetic medical data generation using convolutional GANs. Inf. Sci. 586, 485–500 (2022).
Schaeffer, R., Miranda, B. & Koyejo, S. Are emergent abilities of large language models a mirage? In Proc. Advances in Neural Information Processing Systems. Vol. 36 (Curran Associates, Inc., 2024).
Zini, J. E. & Awad, M. On the explainability of natural language processing deep models. ACM Comput. Surv. 55, 1–31 (2022).
Chen, W., Ma, X., Wang, X. & Cohen, W. W. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. Trans. Mach. Learn. Res. (2023).
Bereska, L. & Gavves, E. Mechanistic interpretability for AI safety—a review. Preprint at arXiv (2024).
Li, X. & Zhang, T. An exploration on artificial intelligence application: From security, privacy and ethic perspective. In Proc. IEEE International Conference on Cloud Computing and Big Data Analysis 416–420 (IEEE, 2017).
Wu, K. et al. Characterizing the clinical adoption of medical AI devices through us insurance claims. NEJM AI 1, AIoa2300030 (2023).
Bakken, S. AI in health: keeping the human in the loop. J. Am. Med. Inform. Assoc. 30, 1225–1226 (2023).
Singhvi, A. et al. DSPy assertions: Computational constraints for self-refining language model pipelines. Preprint at arXiv (2023).
Chase, H. LangSmith. (2024).
Chen, L., Zaharia, M. & Zou, J. How is ChatGPT's behavior changing over time? Harv. Data Sci. Rev. 6 (2024). https://doi.org/10.1162/99608f92.5317da47.
Confident AI. DeepEval. (2023).
Yuksekgonul, M. et al. Optimizing generative AI by backpropagating language model feedback. Nature 639, 609–616 (2025).
Habicht, J. et al. Closing the accessibility gap to mental health treatment with a personalized self-referral chatbot. Nat. Med. 30, 595–602 (2024).
Pais, C. et al. Large language models for preventing medication direction errors in online pharmacies. Nat. Med. 30, 1574–1582 (2024).
Jin, Q. et al. Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine. npj Digit. Med. 7, 190 (2024).
Zhou, J. et al. Pre-trained multimodal large language model enhances dermatological diagnosis using SkinGPT-4. Nat. Commun. 15, 5649 (2024).
Zhang, K. et al. A generalist vision–language foundation model for diverse biomedical tasks. Nat. Med. 1–13 (2024).
Kresevic, S. et al. Optimization of hepatological clinical guidelines interpretation by large language models: a retrieval augmented generation-based framework. npj Digit. Med. 7, 102 (2024).
Jiang, L. Y. et al. Health system-scale language models are all-purpose prediction engines. Nature 619, 357–362 (2023).
Sandmann, S., Riepenhausen, S., Plagwitz, L. & Varghese, J. Systematic analysis of ChatGPT, Google Search and Llama 2 for clinical decision support tasks. Nat. Commun. 15, 2050 (2024).
Keloth, V. K. et al. Advancing entity recognition in biomedicine via instruction tuning of large language models. Bioinformatics 40, btae163 (2024).
Huang, J. et al. A critical assessment of using ChatGPT for extracting structured data from clinical notes. npj Digit. Med. 7, 106 (2024).
He, H. et al. De novo generation of SARS-CoV-2 antibody CDRH3 with a pre-trained generative large language model. Nat. Commun. 15, 6867 (2024).
