Arora, A. & Arora, A. The promise of large language models in health care. Lancet 401, 641 (2023).
Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
McDuff, D. et al. Towards accurate differential diagnosis with large language models. Nature 642, 451–457 (2025).
Shi, W. et al. EHRAgent: Code empowers large language models for few-shot complex tabular reasoning on electronic health records. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 22315–22339 (2024).
Gottweis, J. et al. Towards an AI co-scientist. arXiv preprint (2025).
Khasentino, J. et al. A personal health large language model for sleep and fitness coaching. Nat. Med. https://doi.org/10.1038/s41591-025-03888-0 (2025).
Merrill, M. A. et al. Transforming wearable data into personal health insights using large language model agents. Nat. Commun. 17, 1143 (2026).
Heydari, A. A. et al. The anatomy of a personal health agent. arXiv preprint https://arxiv.org/abs/2508.20148 (2025).
Fraser, H. et al. Comparison of diagnostic and triage accuracy of Ada Health and WebMD symptom checkers, ChatGPT, and physicians for patients in an emergency department: Clinical data analysis study. JMIR mHealth uHealth 11, e49995 (2023).
Tu, T. et al. Towards conversational diagnostic artificial intelligence. Nature 642, 442–450 (2025).
Chang, C. T. et al. Red teaming ChatGPT in medicine to yield real-world insights on model behavior. npj Digit. Med. 8, 149 (2025).
Yang, Z., Meng, Z., Zheng, X. & Wattenhofer, R. Assessing adversarial robustness of large language models: An empirical study. arXiv preprint https://arxiv.org/abs/2405.02764 (2024).
Cao, Y. et al. Toward generalizable evaluation in the LLM era: A survey beyond benchmarks. arXiv preprint https://arxiv.org/abs/2504.18838 (2025).
Liang, P. et al. Holistic evaluation of language models. Transactions on Machine Learning Research (2022).
Likert, R. A Technique for the Measurement of Attitudes (Archives of Psychology, 22, 1932).
Landis, J. R. & Koch, G. G. The measurement of observer agreement for categorical data. Biometrics 33 (1977).
Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q. & Artzi, Y. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations (2020).
Westland, J. C. Information loss and bias in Likert survey responses. PLoS ONE 17, e0271949 (2022).
Elangovan, A., Liu, L., Xu, L., Bodapati, S. & Roth, D. ConSiDERS-The-Human evaluation framework: Rethinking human evaluation for generative large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (2024).
Shankar, S., Zamfirescu-Pereira, J. D., Hartmann, B., Parameswaran, A. G. & Arawjo, I. Who validates the validators? Aligning LLM-assisted evaluation of LLM outputs with human preferences. In UIST '24: Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology (2024).
Chiang, C.-H. & Lee, H.-Y. Can large language models be an alternative to human evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Volume 1: Long Papers (2023).
Chang, Y. et al. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology (2023).
Guo, Z. et al. Evaluating large language models: A comprehensive survey. arXiv preprint (2023).
Abbasian, M. et al. Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI. npj Digit. Med. 7, 1–14 (2024).
Weidinger, L. et al. Sociotechnical safety evaluation of generative AI systems. arXiv preprint (2023).
Tam, T. Y. C. et al. A framework for human evaluation of large language models in healthcare derived from literature review. npj Digit. Med. 7, 1–20 (2024).
Vu, T. et al. Foundational autoraters: Taming large language models for better automatic evaluation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (2024).
Zhong, M. et al. Towards a unified multi-dimensional evaluator for text generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2023–2038 (2022).
Min, S. et al. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 12076–12100 (2023).
Lee, Y. et al. CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists. In HEAL Workshop at CHI (2024).
Dasgupta, S., Frost, N., Moshkovitz, M. & Rashtchian, C. Explainable k-means and k-medians clustering. In Proceedings of the 37th International Conference on Machine Learning (2020).
Gemini Team, Google. Gemini: A family of highly capable multimodal models. arXiv preprint https://arxiv.org/abs/2312.11805 (2023).
Saab, K. et al. Capabilities of Gemini models in medicine. arXiv preprint (2024).
Yang, L. et al. Advancing multimodal medical capabilities of Gemini. arXiv preprint (2024).
Kanjee, Z., Crowe, B. & Rodman, A. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA 330, 78–80 (2023).
StatPearls. Ace the endocrinology, diabetes, & metabolism exam. https://www.statpearls.com/boardreview/Endocrinology (2025).
American Board of Internal Medicine. Endocrinology, diabetes, & metabolism exam scoring. https://www.abim.org/maintenance-of-certification/assessment-information/endocrinology-diabetes-metabolism/scoring-results (2025).
American Board of Internal Medicine. Initial certification pass rates. https://www.abim.org/Media/yeqiumdc/certification-pass-rates.pdf (2024).
BoardVitals. Cardiology board review questions. https://www.boardvitals.com/cardiology-board-review/ (2025).
Panickssery, A., Bowman, S. R. & Feng, S. LLM evaluators recognize and favor their own generations. In The 38th Annual Conference on Neural Information Processing Systems https://openreview.net/forum?id=4NJBV6Wp0h (2024).
Liu, A. et al. DeepSeek-V3 technical report. arXiv preprint https://arxiv.org/abs/2412.19437 (2025).
Hurst, A. et al. GPT-4o system card. arXiv preprint https://arxiv.org/abs/2410.21276 (2024).
OpenAI. OpenAI o3 model. https://openai.com/index/introducing-o3-and-o4-mini/ (2025).
Prieto, J. L. New Fitbit study explores metabolic health. https://blog.google/products/fitbit/new-quest-fitbit-study-metabolic-health/ (2024).
Metwally, A. A. et al. Insulin resistance prediction from wearables and routine blood biomarkers. Nature (in press). https://arxiv.org/abs/2505.03784.
The Cleveland Clinic. Hypercholesterolemia. https://my.clevelandclinic.org/health/diseases/23921-hypercholesterolemia (2025).
Fabbri, A. R. et al. SummEval: Re-evaluating summarization evaluation. Trans. Assoc. Comput. Linguist. (2020).
Gopalakrishnan, K. et al. Topical-Chat: Towards knowledge-grounded open-domain conversations. In Proc. INTERSPEECH (2019).
Clark, E. et al. All that's 'human' is not gold: Evaluating human evaluation of generated text. arXiv preprint https://arxiv.org/abs/2107.00061 (2021).
Pfohl, S. R. et al. A toolbox for surfacing health equity harms and biases in large language models. Nat. Med. 30, 3590–3600 (2024).
Gehrmann, S., Clark, E. & Sellam, T. Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text. arXiv preprint https://arxiv.org/abs/2202.06935 (2022).
Yu, F. When AIs judge AIs: The rise of Agent-as-a-Judge evaluation for LLMs. arXiv preprint https://arxiv.org/abs/2508.02994 (2025).
Fisher, R. A. Statistical Methods for Research Workers (Oliver & Boyd, Edinburgh, 1925).
Bartko, J. J. The intraclass correlation coefficient as a measure of reliability. Psychol. Rep. 19 (1966).
Shrout, P. & Fleiss, J. Intraclass correlations: Uses in assessing rater reliability. Psychol. Bull. 86, 420–428 (1979).
Liljequist, D., Elfving, B. & Roaldsen, K. S. Intraclass correlation – a discussion and demonstration of basic features. PLoS ONE 14, e0219854 (2019).
Hackl, V., Müller, A. E., Granitzer, M. & Sailer, M. Is GPT-4 a reliable rater? Evaluating consistency in GPT-4's text ratings. Front. Educ. 8, 1272229 (2023).
A scalable framework for evaluating health language models