A perspective for adapting generalist AI to specialized medical AI applications and their challenges

  • Choi, E. et al. RETAIN: an interpretable predictive model for healthcare using reverse time attention mechanism. In Proc. Advances in Neural Information Processing Systems. Vol. 29 (Curran Associates, Inc., 2016).

  • Wang, Z., Wu, Z., Agarwal, D. & Sun, J. MedCLIP: Contrastive learning from unpaired medical images and text. In Proc. 2022 Conference on Empirical Methods in Natural Language Processing. 3876–3887 (Association for Computational Linguistics, 2022).

  • Wang, Z. & Sun, J. PromptEHR: conditional electronic healthcare records generation with prompt learning. In Proc. Conference on Empirical Methods in Natural Language Processing 2873–2885 (Association for Computational Linguistics, 2022).

  • Moor, M. et al. Foundation models for generalist medical artificial intelligence. Nature 616, 259–265 (2023).

  • Brown, T. et al. Language models are few-shot learners. In Proc. Advances in Neural Information Processing Systems. Vol. 33, 1877–1901 (Curran Associates, Inc., 2020).

  • Wan, P. et al. Outpatient reception via collaboration between nurses and a large language model: a randomized controlled trial. Nat. Med. 30, 2878–2885 (2024).

  • Yang, J. et al. Harnessing the power of LLMs in practice: a survey on ChatGPT and beyond. ACM Trans. Knowl. Discov. Data 18, 1–32 (2024).

  • Tian, S. et al. Opportunities and challenges for ChatGPT and large language models in biomedicine and health. Brief. Bioinform. 25, bbad493 (2024).

  • Au Yeung, J. et al. AI chatbots not yet ready for clinical use. Front. Digit. Health 5, 1161098 (2023).

  • Sarraju, A. et al. Appropriateness of cardiovascular disease prevention recommendations obtained from a popular online chat-based artificial intelligence model. JAMA 329, 842–844 (2023).

  • Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).

  • Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proc. Advances in Neural Information Processing Systems. Vol. 33, 9459–9474 (Curran Associates, Inc., 2020).

  • Xiong, G., Jin, Q., Lu, Z. & Zhang, A. Benchmarking retrieval-augmented generation for medicine. In Proc. Findings of the Association for Computational Linguistics: ACL 2024 (eds Ku, L.-W., Martins, A. & Srikumar, V.) 6233–6251 (Association for Computational Linguistics, Bangkok, Thailand, 2024).

  • Nakano, R. et al. WebGPT: Browser-assisted question-answering with human feedback. Preprint at arXiv (2021).

  • Jin, Q. et al. AgentMD: empowering language agents for risk prediction with large-scale clinical tool learning. Preprint at arXiv (2024).

  • Nori, H. et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. Preprint at arXiv (2023).

  • Zaharia, M. et al. The shift from models to compound AI systems. Berkeley Artificial Intelligence Research Blog (2024).

  • Lin, J., Xu, H., Wang, Z., Wang, S. & Sun, J. Panacea: a foundation model for clinical trial search, summarization, design, and recruitment. Preprint at arXiv (2024).

  • Wang, H. et al. Towards adapting open-source large language models for expert-level clinical note generation. Preprint at arXiv (2024).

  • Khattab, O. et al. DSPy: compiling declarative language model calls into self-improving pipelines. In Proc. R0-FoMo: Robustness of Few-shot and Zero-shot Learning in Large Foundation Models (2023).

  • Boiko, D. A., MacKnight, R., Kline, B. & Gomes, G. Autonomous chemical research with large language models. Nature 624, 570–578 (2023).

  • Touvron, H. et al. Llama 2: Open foundation and fine-tuned chat models. Preprint at arXiv (2023).

  • Chen, Z. et al. MEDITRON-70b: scaling medical pretraining for large language models. Preprint at arXiv (2023).

  • Jiang, A. Q. et al. Mixtral of experts. Preprint at arXiv (2024).

  • Wei, J. et al. Finetuned language models are zero-shot learners. In Proc. International Conference on Learning Representations. OpenReview.net (2022).

  • Li, X. L. & Liang, P. Prefix-Tuning: optimizing continuous prompts for generation. In Proc. 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. Vol. Long Papers, 4582–4597 (Association for Computational Linguistics (ACL), 2021).

  • Hu, E. J. et al. LoRA: low-rank adaptation of large language models. In Proc. International Conference on Learning Representations. OpenReview.net (2022).

  • Ouyang, L. et al. Training language models to follow instructions with human feedback. In Proc. Advances in Neural Information Processing Systems. Vol. 35, 27730–27744 (Curran Associates, Inc., 2022).

  • Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. In Proc. Advances in Neural Information Processing Systems. Vol. 35, 24824–24837 (Curran Associates, Inc., 2022).

  • Wang, X. et al. Self-consistency improves chain of thought reasoning in language models. In Proc. Eleventh International Conference on Learning Representations. OpenReview.net (2023).

  • Chen, L. et al. Are more LLM calls all you need? towards scaling laws of compound inference systems. Preprint at arXiv (2024).

  • Jin, Q. et al. Matching patients to clinical trials with large language models. Nat. Commun. 15, 9074 (2024).

  • Shin, T., Razeghi, Y., Logan IV, R. L., Wallace, E. & Singh, S. AutoPrompt: eliciting knowledge from language models with automatically generated prompts. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 4222–4235 (Association for Computational Linguistics (ACL), 2020).

  • Cheng, J. et al. Black-box prompt optimization: aligning large language models without model training. Preprint at arXiv (2023).

  • Wen, Y., Wang, Z. & Sun, J. MindMap: knowledge graph prompting sparks graph of thoughts in large language models. Preprint at arXiv (2023).

  • Zakka, C. et al. Almanac: retrieval-augmented language models for clinical medicine. NEJM AI 1, AIoa2300068 (2024).

  • Arasteh, S. T. et al. RadioRAG: factual large language models for enhanced diagnostics in radiology using dynamic retrieval augmented generation. Preprint at arXiv (2024).

  • Wang, Z. & Sun, J. Trial2vec: Zero-shot clinical trial document similarity search using self-supervision. In Proc. Findings of the Association for Computational Linguistics: EMNLP 2022 6377–6390 (2022).

  • Jin, Q. et al. MedCPT: contrastive pre-trained transformers with large-scale PubMed search logs for zero-shot biomedical information retrieval. Bioinformatics 39, btad651 (2023).

  • Wu, T., Terry, M. & Cai, C. J. AI chains: transparent and controllable human-AI interaction by chaining large language model prompts. In Proc. 2022 CHI Conference on Human Factors in Computing Systems 1–22 (2022).

  • Yao, S. et al. ReAct: synergizing reasoning and acting in language models. In Proc. Eleventh International Conference on Learning Representations. OpenReview.net (2023).

  • Wang, Z. et al. Accelerating clinical evidence synthesis with large language models. (2024).

  • Wang, Z., Danek, B., Yang, Z., Chen, Z. & Sun, J. Can large language models replace data scientists in clinical research? Preprint at arXiv (2024).

  • Asai, A. et al. OpenScholar: synthesizing scientific literature with retrieval-augmented language models. Preprint at arXiv (2024).

  • OpenAI. Introducing deep research. (2025).

  • Gravitas, S. AutoGPT. (2023).

  • Wu, Q. et al. AutoGen: enabling next-gen LLM applications via multi-agent conversation framework. Preprint at arXiv (2023).

  • Tayebi Arasteh, S. et al. Large language models streamline automated machine learning for clinical studies. Nat. Commun. 15, 1603 (2024).

  • Trinh, T. H., Wu, Y., Le, Q. V., He, H. & Luong, T. Solving olympiad geometry without human demonstrations. Nature 625, 476–482 (2024).

  • Gu, Y. et al. Middleware for LLMs: tools are instrumental for language agents in complex environments. Preprint at arXiv (2024).

  • Gao, T., Yen, H., Yu, J. & Chen, D. Enabling large language models to generate text with citations. In Proc. 2023 Conference on Empirical Methods in Natural Language Processing 6465–6488 (2023).

  • Kim, Y. et al. MDAgents: an adaptive collaboration of LLMs for medical decision-making. In Proc. Advances in Neural Information Processing Systems. Vol. 37, 79410–79452 (2025).

  • Mukherjee, S. et al. Polaris: a safety-focused LLM constellation architecture for healthcare. Preprint at arXiv (2024).

  • Huang, K. et al. Automated hypothesis validation with agentic sequential falsifications. Preprint at arXiv (2025).

  • Swanson, K., Wu, W., Bulaong, N. L., Pak, J. E. & Zou, J. The virtual lab: AI agents design new SARS-CoV-2 nanobodies with experimental validation. Preprint at bioRxiv (2024).

  • Semnani, S., Yao, V., Zhang, H. C. & Lam, M. WikiChat: Stopping the hallucination of large language model chatbots by few-shot grounding on Wikipedia. In Proc. 2023 Conference on Empirical Methods in Natural Language Processing (2023).

  • Chen, L., Zaharia, M. & Zou, J. FrugalGPT: How to use large language models while reducing cost and improving performance. Preprint at arXiv (2023).

  • Tang, L. et al. Evaluating large language models on medical evidence summarization. npj Digit. Med. 6, 158 (2023).

  • Bhaskar, A., Fabbri, A. & Durrett, G. Prompted opinion summarization with GPT-3.5. In Proc. Findings of the Association for Computational Linguistics: ACL 2023 9282–9300 (2023).

  • Park, J. et al. Criteria2query 3.0: leveraging generative large language models for clinical trial eligibility query generation. J. Biomed. Inform. 154, 104649 (2024).

  • Datta, S. et al. AutoCriteria: a generalizable clinical trial eligibility criteria extraction system powered by large language models. J. Am. Med. Inform. Assoc. 31, 375–385 (2024).

  • Agrawal, M., Hegselmann, S., Lang, H., Kim, Y. & Sontag, D. Large language models are few-shot clinical information extractors. In Proc. Conference on Empirical Methods in Natural Language Processing 1998–2022 (2022).

  • Jeong, D., Garg, S., Lipton, Z. C. & Oberst, M. Medical adaptation of large language and vision-language models: Are we making progress? In Proc. 2024 Conference on Empirical Methods in Natural Language Processing, 12143–12170 (2024).

  • Zhang, G. et al. Closing the gap between open source and commercial large language models for medical evidence summarization. npj Digit. Med. 7, 239 (2024).

  • Klang, E. et al. A strategy for cost-effective large language model use at health system-scale. npj Digit. Med. 7, 320 (2024).

  • OpenAI. Function calling and other API updates. (2023).

  • Chase, H. LangChain. (2022).

  • Jin, Q., Yang, Y., Chen, Q. & Lu, Z. GeneGPT: augmenting large language models with domain tools for improved access to biomedical information. Bioinformatics 40, btae075 (2024).

  • Sayers, E. W. et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 47, D23 (2019).

  • Lin, C.-Y. ROUGE: a package for automatic evaluation of summaries. In Proc. Text Summarization Branches Out. 74–81 (Association for Computational Linguistics, 2004).

  • Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q. & Artzi, Y. BERTscore: evaluating text generation with BERT. In Proc. International Conference on Learning Representations. OpenReview.net (2019).

  • Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 30, 1134–1142 (2024).

  • Abacha, A. B., Yim, W.-w., Adams, G., Snider, N. & Yetisgen-Yildiz, M. Overview of the MEDIQA-Chat 2023 shared tasks on the summarization & generation of doctor-patient conversations. In Proc. 5th Clinical Natural Language Processing Workshop. 503–513 (Association for Computational Linguistics, 2023).

  • Nelson, H. Epic announces ambient clinical documentation EHR integration. (2023). Accessed June 2025.

  • Yim, W.-w. et al. Aci-bench: a novel ambient clinical intelligence dataset for benchmarking automatic visit note generation. Sci. Data 10, 586 (2023).

  • Soroush, A. et al. Large language models are poor medical coders: benchmarking of medical code querying. NEJM AI 1, AIdbp2300040 (2024).

  • Wang, H., Gao, C., Dantona, C., Hull, B. & Sun, J. DRG-LLaMA: tuning LLaMA model to predict diagnosis-related group for hospitalized patients. npj Digit. Med. 7, 16 (2024).

  • Topaz, M., Shafran-Topaz, L. & Bowles, K. H. ICD-9 to ICD-10: evolution, revolution, and current debates in the United States. Perspect. Health Inf. Manag. 10, 1d (2013).

  • Dotson, P. CPT® codes: what are they, why are they necessary, and how are they developed? Adv. Wound Care 2, 583–587 (2013).

  • ICD-10-CM/PCS MS-DRG V40.1 Definitions Manual. Accessed June 23, 2025. https://www.cms.gov/icd10m/fy2023-version40.1-fullcode-cms/fullcode_cms/P0006.html.

  • Wornow, M. et al. Zero-shot clinical trial patient matching with LLMs. Preprint at arXiv (2024).

  • Nievas, M., Basu, A., Wang, Y. & Singh, H. Distilling large language models for matching patients to clinical trials. J. Am. Med. Inform. Assoc. 31, 1953–1963 (2024).

  • Unlu, O. et al. Retrieval-augmented generation–enabled GPT-4 for clinical trial screening. NEJM AI 1, AIoa2400181 (2024).

  • Wong, C. et al. Scaling clinical trial matching using large language models: a case study in oncology. In Proc. Machine Learning for Healthcare Conference. 846–862 (Proceedings of Machine Learning Research (PMLR), 2023).

  • Shaib, C. et al. Summarizing, simplifying, and synthesizing medical evidence using GPT-3 (with varying success). In Proc. Annual Meeting of the Association for Computational Linguistics (Association for Computational Linguistics, 2023).

  • Yun, H., Marshall, I., Trikalinos, T. & Wallace, B. Appraising the potential uses and harms of LLMs for medical systematic reviews. In Proc. Conference on Empirical Methods in Natural Language Processing (Association for Computational Linguistics, 2023).

  • Page, M. J. et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. Syst. Rev. 10, 89 (2021).

  • Wang, Z. et al. A foundation model for human-AI collaboration in medical literature mining. Preprint at arXiv (2025).

  • Wang, S., Scells, H., Koopman, B. & Zuccon, G. Can ChatGPT write a good Boolean query for systematic review literature search? In Proc. 46th International ACM SIGIR Conference on Research and Development in Information Retrieval 1426–1436 (ACM, 2023).

  • National Institute of Standards and Technology. NIST Privacy Framework: A Tool for Improving Privacy through Enterprise Risk Management, Version 1.0. Technical Report (National Institute of Standards and Technology, 2020).

  • National Institute of Standards and Technology. The NIST Cybersecurity Framework (CSF) 2.0. Technical Report NIST CSWP 29 (U.S. Department of Commerce, 2024).

  • Quentin, C., Steinhagen, D., Francis, M. & Streff, K. Towards a triad for data privacy. In Proc. Annual Hawaii International Conference on System Sciences (2020). https://doi.org/10.24251/hicss.2020.535.

  • Tabassi, E. Artificial Intelligence Risk Management Framework (AI RMF 1.0). Technical Report (National Institute of Standards and Technology, 2023).

  • Nori, H. et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. Preprint at arXiv [cs.CL] (2023).

  • Chaves, J. M. Z. et al. Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation. Preprint at arXiv [cs.CL] (2024).

  • Zhang, S. et al. Large-scale domain-specific pretraining for biomedical vision-language processing. Preprint at arXiv (2023).

  • Huang, Z., Bianchi, F., Yuksekgonul, M., Montine, T. J. & Zou, J. A visual-language foundation model for pathology image analysis using medical Twitter. Nat. Med. 29, 2307–2316 (2023).

  • Lu, M. Y. et al. A visual-language foundation model for computational pathology. Nat. Med. 30, 863–874 (2024).

  • Li, C. et al. LLaVA-Med: training a large language-and-vision assistant for biomedicine in one day. In Proc. Advances in Neural Information Processing Systems. Vol. 36 (Curran Associates, Inc., 2023).

  • Gu, Y. et al. BiomedJourney: Counterfactual biomedical image generation by instruction-learning from multimodal patient journeys. Preprint at arXiv [cs.CV] (2023).

  • Bluethgen, C. et al. A vision-language foundation model for the generation of realistic chest X-ray images. Nat. Biomed. Eng. 9, 494–506 (2025).

  • Xu, H. et al. A whole-slide foundation model for digital pathology from real-world data. Nature 630, 181–188 (2024).

  • Huang, L. et al. A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst. 43, 1–55 (2025).

  • Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. Med. 29, 1930–1940 (2023).

  • Wu, K. et al. An automated framework for assessing how well LLMs cite relevant medical references. Nat. Commun. 16, 3615 (2025).

  • Sams, C. M., Fanous, A. H. & Daneshjou, R. Human-artificial intelligence interaction research is crucial for medical artificial intelligence implementation. J. Invest. Dermatol. (2024).

  • Carlini, N. et al. Extracting training data from large language models. In Proc. 30th USENIX Security Symposium (USENIX Security 21). 2633–2650 (USENIX Association, 2021).

  • Theodorou, B., Xiao, C. & Sun, J. Synthesize high-dimensional longitudinal electronic health records via hierarchical autoregressive language model. Nat. Commun. 14, 5305 (2023).

  • Das, T., Wang, Z. & Sun, J. TWIN: personalized clinical trial digital twin generation. In Proc. 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 402–413 (Association for Computing Machinery (ACM), 2023). https://doi.org/10.1145/3580305.3599370.

  • Torfi, A., Fox, E. A. & Reddy, C. K. Differentially private synthetic medical data generation using convolutional GANs. Inf. Sci. 586, 485–500 (2022).

  • Schaeffer, R., Miranda, B. & Koyejo, S. Are emergent abilities of large language models a mirage? In Proc. Advances in Neural Information Processing Systems. Vol. 36 (Curran Associates, Inc., 2024).

  • Zini, J. E. & Awad, M. On the explainability of natural language processing deep models. ACM Comput. Surv. 55, 1–31 (2022).

  • Chen, W., Ma, X., Wang, X. & Cohen, W. W. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. Trans. Mach. Learn. Res. (2023).

  • Bereska, L. & Gavves, E. Mechanistic interpretability for AI safety—a review. Preprint at arXiv (2024).

  • Li, X. & Zhang, T. An exploration on artificial intelligence application: From security, privacy and ethic perspective. In Proc. IEEE International Conference on Cloud Computing and Big Data Analysis 416–420 (IEEE, 2017).

  • Wu, K. et al. Characterizing the clinical adoption of medical AI devices through us insurance claims. NEJM AI 1, AIoa2300030 (2023).

  • Bakken, S. AI in health: keeping the human in the loop. J. Am. Med. Inform. Assoc. 30, 1225–1226 (2023).

  • Singhvi, A. et al. DSPy assertions: Computational constraints for self-refining language model pipelines. Preprint at arXiv (2023).

  • Chase, H. Langsmith. (2024).

  • Chen, L., Zaharia, M. & Zou, J. How is ChatGPT's behavior changing over time? Harv. Data Sci. Rev. 6 (2024). https://doi.org/10.1162/99608f92.5317da47.

  • Confident AI. DeepEval. (2023).

  • Yuksekgonul, M. et al. Optimizing generative AI by backpropagating language model feedback. Nature 639, 609–616 (2025).

  • Habicht, J. et al. Closing the accessibility gap to mental health treatment with a personalized self-referral chatbot. Nat. Med. 30, 595–602 (2024).

  • Pais, C. et al. Large language models for preventing medication direction errors in online pharmacies. Nat. Med. 30, 1574–1582 (2024).

  • Jin, Q. et al. Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine. npj Digit. Med. 7, 190 (2024).

  • Zhou, J. et al. Pre-trained multimodal large language model enhances dermatological diagnosis using SKINGPT-4. Nat. Commun. 15, 5649 (2024).

  • Zhang, K. et al. A generalist vision–language foundation model for diverse biomedical tasks. Nat. Med. 1–13 (2024).

  • Kresevic, S. et al. Optimization of hepatological clinical guidelines interpretation by large language models: a retrieval augmented generation-based framework. npj Digit. Med. 7, 102 (2024).

  • Jiang, L. Y. et al. Health system-scale language models are all-purpose prediction engines. Nature 619, 357–362 (2023).

  • Sandmann, S., Riepenhausen, S., Plagwitz, L. & Varghese, J. Systematic analysis of ChatGPT, Google Search and Llama 2 for clinical decision support tasks. Nat. Commun. 15, 2050 (2024).

  • Keloth, V. K. et al. Advancing entity recognition in biomedicine via instruction tuning of large language models. Bioinformatics 40, btae163 (2024).

  • Huang, J. et al. A critical assessment of using ChatGPT for extracting structured data from clinical notes. npj Digit. Med. 7, 106 (2024).

  • He, H. et al. De novo generation of SARS-CoV-2 antibody CDRH3 with a pre-trained generative large language model. Nat. Commun. 15, 6867 (2024).
