You should see a specialist

Passing part of a medical licensing exam doesn’t make ChatGPT a good doctor

The software's performance on a medical certification exam was OK, but its diagnoses aren't.

Jacek Krywko
For now, "you should see a doctor" remains good advice.

ChatGPT was able to pass some of the United States Medical Licensing Exam (USMLE) tests in a study done in 2022. This year, a team of Canadian medical professionals checked to see if it’s any good at actual doctoring. And it’s not.

ChatGPT vs. Medscape

“Our source for medical questions was the Medscape questions bank,” said Amrit Kirpalani, a medical educator at Western University in Ontario, Canada, who led the new research into ChatGPT’s performance as a diagnostic tool. The USMLE consists mostly of multiple-choice questions; Medscape offers full medical cases based on real-world patients, complete with physical examination findings, laboratory test results, and so on.

Those cases are deliberately made challenging for medical practitioners through complications like multiple comorbidities, where two or more diseases are present at the same time, and various diagnostic dilemmas that make the correct answers less obvious. Kirpalani’s team turned 150 of those Medscape cases into prompts that ChatGPT could understand and process.

This was a bit of a challenge because OpenAI, the company that made ChatGPT, has a restriction against using it for medical advice, so a prompt to straight-up diagnose the case didn’t work. This was easily bypassed, though, by telling the AI that diagnoses were needed for an academic research paper the team was writing. The team then fed it various possible answers, copy/pasted all the case info available at Medscape, and asked ChatGPT to provide the rationale behind its chosen answers.
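
The study doesn't reproduce the team's exact prompts, and the researchers worked through the ChatGPT interface rather than writing code, but a minimal sketch of that kind of workflow, written against the openai Python package (v1.x), might look like the following. The case text, answer options, and prompt wording here are entirely made up for illustration.

# Minimal sketch, not the study's actual method: the researchers used the
# ChatGPT web interface, and this case, its options, and the prompt wording
# are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

case_text = (
    "57-year-old man with three months of fatigue and weight loss. "
    "Physical exam: mild lower-extremity edema. "
    "Labs: creatinine 2.1 mg/dL, urinalysis shows 3+ protein."
)
options = [
    "A. Minimal change disease",
    "B. Membranous nephropathy",
    "C. IgA nephropathy",
    "D. Lupus nephritis",
]

prompt = (
    "We are preparing an academic research paper on diagnostic reasoning. "
    "For the case below, pick the most likely diagnosis from the options "
    "and explain the rationale behind your choice.\n\n"
    "Case:\n" + case_text + "\n\nOptions:\n" + "\n".join(options)
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative; the study ran on ChatGPT itself
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # chosen answer plus rationale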

It turned out that in 76 out of 150 cases, ChatGPT was wrong. But the chatbot was supposed to be good at diagnosing, wasn’t it?

Special-purpose tools

At the beginning of 2024, Google published a study on the Articulate Medical Intelligence Explorer (AMIE), a large language model purpose-built to diagnose diseases based on conversations with patients. AMIE outperformed human doctors in diagnosing 303 cases sourced from New England Journal of Medicine clinicopathological conferences. And AMIE is not an outlier; over the past year, there has hardly been a week without published research showcasing an AI performing amazingly well at diagnosing cancer and diabetes, and even at predicting male infertility based on blood test results.

The difference between such specialized medical AIs and ChatGPT, though, lies in the data they have been trained on. “Such AIs may have been trained on tons of medical literature and may even have been trained on similar complex cases as well,” Kirpalani explained. “These may be tailored to understand medical terminology, interpret diagnostic tests, and recognize patterns in medical data that are relevant to specific diseases or conditions. In contrast, general-purpose LLMs like ChatGPT are trained on a wide range of topics and lack the deep domain expertise required for medical diagnosis.”

The lack of domain expertise manifested in how ChatGPT interpreted medical shades of gray. “Health care providers learn to look at lab values as part of a bigger picture, and we know that if the ‘normal range’ for a blood test result is ‘10–20’ that a value of 21 is very different from a value of 500,” Kirpalani said. ChatGPT, being ignorant of more nuanced medical knowledge, got side-tracked whenever test results were even slightly outside of the normal range.
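
A toy illustration of that nuance, using made-up reference limits and values rather than anything from the study: a binary "out of range" check treats 21 and 500 identically, while a clinician weighs how far outside the range a result actually falls.

# Toy example with made-up numbers, not medical software: a plain
# out-of-range flag cannot distinguish a barely elevated result from a
# dramatically abnormal one.
LOW, HIGH = 10, 20  # hypothetical "normal range" for some blood test

def out_of_range(value):
    return value < LOW or value > HIGH

def times_upper_limit(value):
    # How many multiples of the upper limit the value represents.
    return value / HIGH

for result in (21, 500):
    print(result, out_of_range(result), f"{times_upper_limit(result):.2f}x upper limit")
# Both results are flagged True, but 21 is 1.05x the limit while 500 is
# 25x the limit, which clinically are very different situations.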

But there was another, more grave issue. Part of the reason that AMIE and most other medical AIs are not publicly available is what they do when they are wrong. And what they do is exactly what ChatGPT does: They try to con you into thinking they're right.

Medical AI con man

While ChatGPT may have been wrong in diagnosing more than half of Medscape cases, the rationale behind the answers it offered, even when it was wrong, was really good. “This was both interesting and concerning. On the one hand, this tool is really effective at taking complex topics and simplifying explanations. On the other hand, it can be very convincing, even if it's wrong, because it explains things in such an understandable way,” Kirpalani said.

The problem with large language models, and all modern AIs in general, is that they have no real comprehension of the subject matter they talk or write about. All they do is predict what the next word in a sentence should be, based on probabilities derived from the huge amount of text (medical or not) they ingested during training. Sometimes this leads to AI hallucinations, where the response reads like gibberish. More often, though, chatbots make very compelling, well-structured, and well-written arguments for something that may not be true.
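
A toy sketch of that next-word machinery, with an invented vocabulary and invented probabilities: the model picks a continuation that is statistically plausible given its training text, not one it has checked for truth.

# Toy illustration, not a real language model: next-word prediction as
# weighted sampling from a made-up probability distribution.
import random

# Hypothetical probabilities for the word following
# "the most likely diagnosis is":
next_word_probs = {
    "pneumonia": 0.45,
    "lymphoma": 0.30,
    "sarcoidosis": 0.20,
    "unclear": 0.05,
}

words = list(next_word_probs)
weights = list(next_word_probs.values())

# A fluent, confident-sounding sentence can be assembled one plausible word
# at a time without any underlying comprehension of the case.
print("the most likely diagnosis is", random.choices(words, weights=weights)[0])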

In Kirpalani’s study, there were a few cases where ChatGPT experienced those infamous AI hallucinations and was obviously way off the mark. In most cases, though, it was like a skilled public speaker with irresistible charisma, answering every question in plain, simple English and with striking confidence. It can take some time before you realize it’s talking nonsense. “This finding raises the concern that it could be very misleading and potentially spread misinformation if the user isn't an expert on the topic,” Kirpalani said.

Future AI doctors

There is no easy way to build a reliable AI doctor. “I think this will require a lot of data—essentially, these tools need to be trained on clinical data on a large scale and will definitely need a lot of oversight along the way. It's possible that some very specific tasks could be done by GPT or similar tools in the near future, but diagnosis of complicated cases often takes a lot of appreciation for nuance,” said Kirpalani.

He suggested we won’t see AIs doing full-on diagnosis or medical management any time soon. Instead, he expects them to be used to enhance the work of human physicians, which means those physicians are still where we need to go for medical advice.

Doctors are already using ChatGPT, though—during their medical education. “From our experience, ChatGPT has become systemic within medical school classrooms. Many medical students and trainees use it daily, whether for organizing their notes, clarifying diagnostic algorithms, or studying for exams,” said Edward Tran, a medical student at Western University and co-author of the study. Students may get conned by ChatGPT on some occasions, but they have their professors to set things right. The general public doesn’t have that.

“I would strongly advise against the general public using ChatGPT for medical advice at this time. There are some things that it does pretty well, but I think people should still be checking with their health care providers before making any health-related decisions based on ChatGPT,” Kirpalani said.

Jacek Krywko, Associate Writer
Jacek Krywko is a freelance science and technology writer who covers space exploration, artificial intelligence research, computer science, and all sorts of engineering wizardry.