
Possibilities and limitations of using a large language model to respond to patient messages

A new study by researchers at Mass General Brigham shows that large language models (LLMs), a type of generative AI, can help reduce physician workload and improve patient education when used to craft responses to patient messages. The study also identified limitations of LLMs that could affect patient safety, suggesting that vigilant monitoring of LLM-generated communications is essential for safe use. The findings, published in The Lancet Digital Health, emphasize the need for a measured approach to LLM implementation.

Growing administrative and documentation responsibilities have contributed to rising physician burnout. To help streamline and automate physician workflows, electronic health record (EHR) vendors have implemented generative AI algorithms to help physicians craft messages to patients; however, the efficacy, safety, and clinical impact of their use were unknown.

“Generative AI has the potential to provide a ‘best of both worlds’ scenario, reducing the burden on the physician while better educating the patient. However, based on our team’s experience working with LLMs, we are concerned about the potential risks associated with integrating LLMs into messaging systems. As LLM integration into EHRs becomes increasingly common, our goal in this study was to identify relevant benefits and shortcomings.”


Danielle Bitterman, MD, corresponding author, faculty member of the Artificial Intelligence in Medicine (AIM) program at Mass General Brigham and physician in the Department of Radiation Oncology at Brigham and Women’s Hospital

For the study, the researchers used OpenAI’s GPT-4, a foundation LLM, to generate 100 scenarios about patients with cancer, each paired with an associated patient question. No questions from real patients were used for the study. Six radiation oncologists answered the questions manually; then GPT-4 generated answers to the same questions. Finally, the same radiation oncologists were given the LLM-generated responses for review and editing. The radiation oncologists did not know whether GPT-4 or a human had written the answers, and in 31% of cases they believed that an LLM-generated answer had been written by a human.

On average, physician-written responses were shorter than LLM-generated responses. GPT-4 tended to include more educational background for patients but was less directive in its instructions. The physicians reported that LLM assistance improved their perceived efficiency, and they judged the LLM-generated responses to be safe in 82.1 percent of cases and acceptable to send to a patient without further editing in 58.3 percent of cases. The researchers also identified shortcomings: if left unedited, 7.1 percent of LLM-generated responses could pose a risk to the patient and 0.6 percent could pose a risk of death, most often because the GPT-4 response failed to urgently instruct the patient to seek immediate medical attention.

Notably, the LLM-generated/clinician-edited responses were more similar in length and content to the LLM-generated responses than to the manual responses. In many cases, physicians retained LLM-generated educational content, indicating that they viewed it as valuable. While this may aid patient education, the researchers emphasize that over-reliance on LLMs may also pose risks, given their demonstrated shortcomings.

The rise of AI tools in healthcare has the potential to positively reshape the continuum of care, and it is imperative to balance their innovative potential with a commitment to safety and quality. Mass General Brigham is at the forefront of the responsible use of AI and conducts rigorous research into new and emerging technologies to inform the integration of AI into healthcare delivery, workforce support, and administrative processes. Mass General Brigham is currently leading a pilot that integrates generative AI into the electronic health record to craft responses to patient portal messages, testing the technology in a range of ambulatory practices across the healthcare system.

Going forward, the study authors plan to investigate how patients perceive LLM-based communication and how patients’ racial and demographic characteristics influence LLM-generated responses, given known algorithmic biases in LLMs.

“Keeping a human in the loop is an essential safety step when it comes to using AI in medicine, but it is not a single solution,” Bitterman said. “As providers rely more on LLMs, we could miss errors that could lead to patient harm. This study demonstrates the need for systems to monitor the quality of LLMs, training for physicians to appropriately monitor LLM output, increased AI literacy for both patients and physicians, and, at a fundamental level, a better understanding of how to address the mistakes LLMs make.”

Source: Mass General Brigham

Journal reference:

Chen, S., et al. (2024). The effect of using a large language model to respond to patient messages. The Lancet Digital Health. https://doi.org/10.1016/S2589-7500(24)00060-8