Human clinicians use a combination of image recognition (based on what we’ve seen before) and semantic reasoning (based on what we’ve read or heard about) to diagnose skin conditions. The additional dimension of clinical context has historically given people a competitive edge over computer vision programs. Many conditions, especially rare ones with vague presentations, are impossible to diagnose purely visually and can be recognized only with careful history-taking in addition to visual inspection.
Here, we demonstrate the capability of large language model technology, in particular GPT4o, to convincingly use clinical context to enhance dermatological image interpretation in ways analogous to the semantic reasoning of a real dermatologist. An important caveat, consistent with recent cautionary research findings from Apple, is that the “reasoning” of GPT4o is susceptible to influence from erroneous or irrelevant information.
When the right context is gathered and submitted to the LLM at the right time, as is the proprietary capability of MDandMe, our GPT4o-based dermatology framework can achieve >80% accuracy across a complex diagnosis-verified dataset of 113 images featuring 88 conditions. If key context is not collected or used, as is the tendency of GPT4o on ChatGPT, the LLM more frequently anchors on the wrong path and achieves only around 50% accuracy on the same dataset. By comparison, Google Lens, a traditional computer vision program that does not use clinical context, found the right answer in 52% of cases. See Appendix Figure 1 for details.
Google Lens outperformed all LLMs, even with context, on some rare but distinctive phenomena such as flagellate shiitake dermatitis. However, it underperformed on common diagnoses with vaguer visual characteristics, such as acne and spider bites, as well as on diagnoses that depend on clinical history, such as Behçet's syndrome, a chronic condition that causes recurrent painful sores on mucous membranes. See the Appendix for a more detailed performance breakdown and the raw data.
Computer vision has been around since the 1960s. What’s the big deal?
Traditional computer vision programs, which largely rely on matching pixel patterns uninterpretable to the human mind, have surpassed dermatologists’ visual capabilities for select tasks in recent years. These tasks have been limited to narrow use cases, such as screening for skin cancers like melanoma. In general dermatology, where recognition tasks are broad and not standardized, humans have the definitive upper hand.
As of late 2024, all of the major commercial LLMs have introduced “multi-modal” capabilities that can interpret images in addition to text. In theory, this suggests a new era in computer vision in which pixel patterns can be matched to human-verifiable textual descriptions of dermatological phenomena (e.g., clustered erythematous vesicles arranged in a dermatomal distribution) and then associated with relevant diagnoses (in this case, shingles). This method would allow AI to recognize dermatological conditions based on the same descriptions that dermatologists study in medical literature, rather than necessarily having to be trained on immense amounts of annotated images. Semantic “reasoning” may even allow an LLM to accurately diagnose an image of a rare condition that it has only “read” about, without ever having been exposed to an image of it before.
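To make this concrete, the snippet below is a minimal sketch, not our production prompt, of how a multimodal LLM can be asked to verbalize an image in standard dermatological terminology before proposing diagnoses. It uses the OpenAI Python SDK; the prompt wording and function name are illustrative only.

# Minimal sketch: ask a multimodal LLM to describe a lesion in dermatological
# terms, then propose diagnoses consistent with that description.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def describe_then_diagnose(image_path: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "Describe this skin finding using standard dermatological "
                    "terminology (morphology, color, distribution), then list "
                    "the diagnoses most consistent with that description."
                )},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

Forcing the model to verbalize the morphology first is what lets a human reviewer check whether a subsequent diagnosis follows from what the model actually “saw.”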
In a set of head-to-head trials between today’s most powerful LLMs (see Appendix Figure 2), only GPT4o could consistently provide medically accurate descriptions of dermatological images and use them to meaningfully improve its outputs when provided with non-specific clinical context. The context was designed not to be diagnostic by itself but to increase the probability of certain diagnoses; for example: “34-year-old man with finding under arm. Denies itching, pain, or other symptoms. Takes metformin for T2D.” Anthropic’s Claude 3.5 Sonnet comes in second place, providing the best guesses from clinical context alone but demonstrating only marginal synergy when image and context are combined.
Claude 3.5 Sonnet, Gemini 1.5 Pro, and Llama 3.2 Vision, while familiar with dermatological language, have a tendency to hallucinate and use the wrong terminology when describing images. Inaccurate LLM-produced descriptions, as well as injection of inaccurate user descriptions, can bias the interpretation of the condition, especially in weaker models. In the case of Llama 3.2 Vision and Gemini 1.5 Pro, mischaracterization of the image is often so severe that adding dermatological images interferes with the models’ performance relative to using clinical context alone. In other words, these models are better off guessing than attempting to “see.”
What we are doing at MDandMe
At MDandMe, we provide an AI physician best friend to help people make sense of their health at every step of their healthcare journey. We rigorously test, compare, and deploy the latest LLMs to perform the tasks at which they are strongest.
For dermatological conditions, we have designed a framework that gathers key clinical context for a user-submitted image. We mitigate the bias and premature anchoring seen with image submission to ChatGPT and other chatbots by assessing the situation only after the relevant details have been gathered. We also selectively remove under-informed interpretations from memory when more information becomes available.
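The sketch below is a deliberately simplified illustration of this idea, not our actual framework; the required-context fields and class names are hypothetical. The point is that assessment is deferred until the relevant details exist, and earlier interpretations are discarded once new information arrives.

# Illustrative simplification only: defer assessment until context is gathered,
# and drop under-informed interpretations when new details arrive.
from dataclasses import dataclass, field
from typing import Optional

REQUIRED_CONTEXT = ["duration", "symptoms", "medical_history", "exposures"]  # hypothetical fields

@dataclass
class DermCase:
    image_path: str
    context: dict = field(default_factory=dict)
    interpretation: Optional[str] = None

    def missing_context(self) -> list[str]:
        return [k for k in REQUIRED_CONTEXT if k not in self.context]

    def add_context(self, key: str, value: str) -> None:
        self.context[key] = value
        # Any interpretation made before this detail was known is discarded,
        # so the next assessment is not anchored on an under-informed guess.
        self.interpretation = None

    def ready_for_assessment(self) -> bool:
        return not self.missing_context()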
Give our AI, Arora, a try today at mdme.ai/chat
Why LLMs are not a threat to real dermatologists
We anticipate that LLMs will help patients more accurately treat minor skin conditions with over-the-counter topical medications and understand when a doctor’s visit is required. Our research also suggests that properly leveraged LLMs may be an asset to primary care physicians when confronted with less common dermatological concerns. However, given that LLMs lack definitive judgement capabilities and that the demand for dermatologists far exceeds their ability to schedule timely appointments, we do not foresee LLMs competing directly with human dermatologists.
We anticipate significant improvement in multi-modal LLMs in the coming years, but in-person visits with dermatologists also carry physical advantages that are difficult for software to reproduce. In older populations especially, many if not most concerning lesions are found through dermatologist-led skin checks rather than patient self-reporting. While dermatoscopes may become more readily available for at-home AI-guided use, shave biopsies and in-office procedures such as cryotherapy are likely to remain limited to licensed professionals.
Appendix
Figure 1. Accuracy of diagnoses was compared between ChatGPT (the top general-purpose LLM from our studies), Google Lens (the most widely accessible traditional computer vision program), and MDandMe. We gave ChatGPT information about the user's age and one presenting complaint, and classified Google's response as incorrect only if the correct diagnosis did not appear on the first full page of related images, since all of these would be immediately viewable to a user. For most conditions tested, MDandMe outperforms both ChatGPT and Google Lens, as ChatGPT, as of the date of writing, fails to ask for relevant clinical context and Google Lens does not use clinical context at all.
Figure 2. We tested the ability of leading LLMs with vision capabilities to diagnose 113 dermatologic images, with and without clinical context. No-image, clinical-context-only trials served as controls for cases where the LLM could guess the diagnosis "blind" based simply on the clinical context. The models tested were ChatGPT (gpt-4o-2024-11-20 and chatgpt-4o-latest), Anthropic Claude 3.5 Sonnet, Google Gemini-1.5-pro, and Meta Llama-3.2-11B-vision. GPT4o was the definitive best-performing LLM in all categories, and chatgpt-4o-latest specifically was marginally superior to gpt-4o-2024-11-20.
Figure 3. Examples of LLM-generated image descriptions and proposed diagnoses, edited for brevity. Original results are available in the dataset. LLMs are autoregressive, meaning that all previously generated tokens, in addition to the initial prompt, affect the probability distribution of the next token in the response. Thus, inaccurate descriptions of images are associated with misdiagnoses that are consistent with the description but not the image. This also means that having the LLM repeat certain clinical characteristics in its response can help increase the weight of that context in determining the diagnosis. In the examples above, despite receiving the same prompt, GPT4o was the only model that arrived at the right answer and the only one to accurately pick up on key image characteristics and clinical context.
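As a reminder of the underlying math, an autoregressive model factors the probability of its response y_1, ..., y_T given the prompt x as

    p(y_1, \ldots, y_T \mid x) = \prod_{t=1}^{T} p(y_t \mid x, y_1, \ldots, y_{t-1})

so an early descriptive token that mischaracterizes the image raises the conditional probability of every later token consistent with that description, rather than with the image itself.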
Methods
Direct performance comparison trials were conducted between API-accessed OpenAI chatgpt-4o-latest and gpt-4o-2024-11-20, Anthropic claude-3-5-sonnet-20241022, Google gemini-1.5-pro, and Meta llama-3.2-11b-vision (llama-3.2-90b-vision was significantly slower and not notably superior, so it was omitted). Repeats using the same model and setup yielded results that were scored within 2% of each other.
The clinical context was designed to provide information that was not diagnostic by itself but would increase the likelihood of the correct diagnosis; for example: “48F with patch without hair on her scalp. Hair otherwise normal. Medical history includes RA and a high ANA.” This is analogous to US medical licensing exam image questions, where analysis of both the textual context and the image is often necessary to arrive at the right answer. Unlike the USMLE and much of the existing literature on LLMs, we did not provide multiple-choice answers, in order to increase the difficulty and real-world applicability of the task.
As a compromise, we permitted the LLM to provide its top-3 differential and adjudicated an output as correct if the correct diagnosis was contained within this differential. Specifically, if any entry was an exact match or a medically equivalent synonym of the real diagnosis, the output was marked as “correct.” Although this is less rigorous than forcing a single diagnosis, we believe it is similar to real clinical practice, in which history and exam typically yield a narrow differential, while the ultimate diagnosis requires further testing or follow-up. In the case of Google Lens, if the correct result was shown on the first page (the top 15 to 20 results), the output was considered correct.
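The adjudication rule can be summarized by the sketch below. It assumes a hand-curated synonym table; the names and entries shown are illustrative stand-ins, since in practice the synonym judgment was made by a human reviewer.

# Sketch of the adjudication rule: an output counts as correct if any of the
# top-3 differential diagnoses matches the verified diagnosis or a medical synonym.
SYNONYMS = {
    "herpes zoster": {"shingles"},
    "dermatographic urticaria": {"dermatographism"},
}  # illustrative entries only

def normalize(dx: str) -> str:
    return dx.strip().lower()

def is_match(candidate: str, truth: str) -> bool:
    c, t = normalize(candidate), normalize(truth)
    return c == t or c in SYNONYMS.get(t, set()) or t in SYNONYMS.get(c, set())

def score_output(differential: list[str], truth: str) -> bool:
    return any(is_match(dx, truth) for dx in differential[:3])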
Dataset
We assembled a dataset (view here) of 113 diagnosis-verified images featuring 88 conditions, sourced from two dermatology atlases, DermaAmin and Atlas Dermatologico, as well as a few other online resources that specify a confirmed diagnosis. We included all six Fitzpatrick skin types, varied lighting and contextual clues, and 30 rare conditions that affect fewer than 1 in 10,000 people. The dataset was intentionally assembled to resemble images that patients could take themselves without professional equipment.
Limitations
GPT4o typically shows better descriptive and diagnostic capabilities than all other models. While the present article suggests that synergies can be observed when clinical context and dermatological images are used together, the data it contains does not prove that GPT4o uses reasoning when reading dermatological images, because the images were sourced from publicly accessible online repositories. If GPT4o had been trained directly on these images, then the added context could simply be the other piece of the puzzle needed to help it remember what it had previously “seen.” Additionally, if GPT4o knows the correct diagnosis from its training data, it may in theory infer the correct description of the image without truly being able to extract its characteristics.
Based on trials with images not available on the internet, we believe the conclusions in this article hold. Furthermore, unlike Google Lens, GPT4o tends to make mistakes when a phenomenon does not look similar to another condition but could be described in similar terms. For example, a long streak running down someone’s entire forearm (dermatographic urticaria) was described as a “linear, serpiginous reddish rash” and diagnosed in a few trials as larva migrans and scabies (both parasitic conditions that can produce small burrows visible on the skin), even though these diagnoses are not possible at that scale.