Application of large language models in clinical record correction : a comprehensive study on various retraining methods

Authors: Maitín, Ana María; Nogales Moyano, Alberto; Fernández-Rincón, Sergio; Aranguren, Enrique; Cervera Barba, Emilio Juan; Denizon Arranz, Sophia; Mateos-Rodríguez, Alonso A.; García Tejedor, Álvaro José

Dates: 2026-01-27; 2025-02-01

Citation: Maitin, A M, Nogales, A, Fernández-Rincón, S, Aranguren, E, Cervera-Barba, E, Denizon-Arranz, S, Mateos-Rodríguez, A & García-Tejedor, Á J 2025, 'Application of large language models in clinical record correction : a comprehensive study on various retraining methods', Journal of the American Medical Informatics Association : JAMIA, vol. 32, no. 2, pp. 341-348. https://doi.org/10.1093/jamia/ocae302

ISSN: 1067-5027
PubMedCentral: PMC11756697
unpaywall: 10.1093/jamia/ocae302
Handle: https://hdl.handle.net/10641/7515

Publisher Copyright: © The Author(s) 2024. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved.

Objectives: We evaluate the effectiveness of large language models (LLMs), specifically GPT-based (GPT-3.5 and GPT-4) and Llama-2 models (13B and 7B architectures), in autonomously assessing clinical records (CRs) to enhance medical education and diagnostic skills.

Materials and Methods: Various techniques, including prompt engineering, fine-tuning (FT), and low-rank adaptation (LoRA), were implemented and compared on Llama-2 7B. These methods were assessed using prompts in both English and Spanish to determine their adaptability to different languages. Performance was benchmarked against GPT-3.5, GPT-4, and Llama-2 13B.

Results: GPT-based models, particularly GPT-4, demonstrated promising performance closely aligned with specialist evaluations. Application of FT on Llama-2 7B improved text comprehension in Spanish, equating its performance to that of Llama-2 13B with English prompts. Low-rank adaptation significantly enhanced performance, surpassing GPT-3.5 results when combined with FT. This indicates LoRA’s effectiveness in adapting open-source models for specific tasks.

Discussion: While GPT-4 showed superior performance, FT and LoRA on Llama-2 7B proved crucial in improving language comprehension and task-specific accuracy. Identified limitations highlight the need for further research.

Conclusion: This study underscores the potential of LLMs in medical education, providing an innovative, effective approach to CR correction. Low-rank adaptation emerged as the most effective technique, enabling open-source models to perform on par with proprietary models. Future research should focus on overcoming current limitations to further improve model performance.

Identifier: 8440099
Language: eng
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Keywords: LLMs; artificial intelligence; clinical records; retraining; Health Informatics
Publication type: Journal Article; Research Support, Non-U.S. Gov't
Type: journal article
Access: open access
DOI: 10.1093/jamia/ocae302
Scopus: https://www.scopus.com/pages/publications/85216606517
Scopus (cited by): https://www.scopus.com/pages/publications/85216606517#tab=citedBy