A new systematic review finds that only 5% of medical evaluations of large language models (LLMs) use real patient data, and that significant gaps remain in the assessment of bias, fairness, and a broad range of healthcare tasks, highlighting the need for more comprehensive evaluation methods.
Study: Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. Image credit: BOY ANTHONY/Shutterstock.com
In a recent study published in JAMA, researchers in the United States (US) conducted a systematic review of how existing large language models (LLMs) used in medical applications are evaluated, examining the medical tasks, data and assessment types involved, and the medical specialties in which LLM applications have been most explored.
Background
The use of artificial intelligence (AI) in healthcare has advanced rapidly, particularly with the development of LLMs. Unlike predictive AI, which is used to predict the outcome of a process, generative AI based on LLMs can create a wide range of new content, including images, audio, and text.
Based on user input, LLMs can generate structured and largely consistent text responses, making them valuable in healthcare. LLMs have already been applied to note-taking in some US health systems and are also being considered more broadly in medicine to improve efficiency and patient care.
However, the sudden interest in LLMs has led to unstructured testing across various fields, and their performance in clinical practice has been mixed. Some studies have found LLM responses to be largely superficial and often inaccurate, whereas others have found them to be as accurate as responses from human clinicians.
This discrepancy highlights the need to systematically evaluate the performance of LLMs in medical settings.
About the study
For this comprehensive systematic review, researchers searched for preprints and peer-reviewed studies on LLM evaluation in healthcare published from January 2022 to February 2024. This two-year window was chosen to capture papers published after the launch of the AI chatbot ChatGPT in November 2022.
Three independent reviewers screened studies and included them in the review if they focused on LLM evaluation in healthcare. Basic biological research and research on complex issues were excluded.
The studies were then categorized based on the type of data evaluated, medical tasks, natural language processing (NLP) and natural language understanding tasks, medical specialty, and evaluation dimensions. The classification framework was developed from existing lists of medical tasks, established assessment models, and input from medical experts.
The classification framework recorded whether real patient data were evaluated and covered 19 healthcare tasks, spanning patient care and administrative functions. Six NLP tasks, including summarization and question answering, were also included in the classification.
Additionally, seven evaluation dimensions were identified, covering aspects such as accuracy, factuality, and toxicity. Studies were also grouped into 22 medical specialties. The researchers then used descriptive statistics to summarize the findings, calculating proportions and frequencies for each category.
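As an illustration only, and not code from the study, the minimal sketch below shows how such frequencies and proportions might be tallied once each study has been tagged with its categories; the records and labels here are hypothetical.

```python
from collections import Counter

# Hypothetical study records; the review's actual data and labels differ.
studies = [
    {"specialty": "Internal medicine", "uses_real_patient_data": False},
    {"specialty": "Ophthalmology", "uses_real_patient_data": True},
    {"specialty": "Internal medicine", "uses_real_patient_data": False},
]

total = len(studies)

# Frequency and proportion of studies per medical specialty.
specialty_counts = Counter(s["specialty"] for s in studies)
for specialty, count in specialty_counts.most_common():
    print(f"{specialty}: {count} ({count / total:.1%})")

# Proportion of studies that evaluated real patient data.
real_data = sum(s["uses_real_patient_data"] for s in studies)
print(f"Real patient data: {real_data / total:.1%}")
```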
Results
This review found that evaluation of LLMs in healthcare is uneven, with significant gaps in the scope of tasks covered and in data usage. Of the 519 studies included in the review, only 5% used real patient data, with most relying on expert-generated clinical vignettes or medical examination questions.
Most of the studies focused on assessing LLM medical knowledge, particularly through examinations such as the United States Medical Licensing Examination (USMLE).
Patient care tasks, such as making diagnoses and recommending treatments, were also relatively common. However, administrative tasks, such as writing clinical notes and assigning billing codes, were rarely evaluated.
Among NLP tasks, most studies focused on question answering, including responses to general medical questions. Approximately 25% of the studies examined text classification and information extraction, but tasks such as conversational dialogue and summarization remained underexplored.
The most frequently examined evaluation dimension was accuracy (95.4% of studies), followed by comprehensiveness (47%). Few studies assessed ethical dimensions such as bias, toxicity, and fairness.
Although more than 20% of the studies were not specific to any medical specialty, internal medicine, ophthalmology, and surgery were the most represented specialties, while medical genetics and nuclear medicine were the least investigated.
Conclusions
Overall, this review highlighted the need for standardized evaluation methods and consensus frameworks for evaluating LLM applications in healthcare.
The researchers believe that the use of real patient data in LLM evaluation should be encouraged, and that expanding evaluations to administrative tasks and to underrepresented medical specialties would be highly beneficial.
Journal reference:
Bedi, S., Liu, Y., Orr-Ewing, L., Dash, D., Koyejo, S., Callahan, A., Fries, J. A., Wornow, M., Swaminathan, A., Lehmann, L. S., Hong, H. J., Kashyap, M., Chaurasia, A. R., Shah, N. R., Singh, K., Tazbaz, T., Milstein, A., Pfeffer, M. A., and Shah, N. H. (2024). Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. JAMA. doi:10.1001/jama.2024.21700. https://jamanetwork.com/journals/jama/fullarticle/2825147