Clinical Governance
Clinical information quality is of utmost importance at MDandMe. We achieve first-class accuracy by guard-railing and selectively leveraging carefully curated large language model technology at four levels:
1. Data gathering
Through a multi-pronged approach of relevant data collection (1.1), summarization (1.2), and restructuring (1.3), we extend the base capabilities of leading foundational large language models (LLMs), including OpenAI GPTs, so that their real-world performance approaches their well-validated strength on standardized exams (see citations in section 2). This contrasts with responses from ChatGPT or a simple LLM wrapper, which tend to draw conclusions prematurely and cannot benefit from a maximally complete clinical context.
1.1 Relevant data collection
The MDandMe symptom checker is a tightly guard-railed system that leverages an internal medical ontology rooted in SNOMED-CT, the global standard for clinical documentation and reporting, from which we can draw on more than 300,000 specific concepts. Our medical interview methodology follows Harvard Medical School guidelines and includes history of present illness questions, clarification questions, review of systems questions, and risk factor questions that may cover social history, past medical history, medications, and surgical history.
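For illustration, the sketch below shows one way an ontology-rooted guard rail of this kind can be structured: follow-up questions are only drawn from concepts in the ontology, keeping the interview on rails. The SymptomConcept class, the example SNOMED-CT mapping, and the stage names are hypothetical stand-ins, not our internal schema.

```python
# Minimal, illustrative sketch of an ontology-guarded interview flow.
# SymptomConcept, INTERVIEW_STAGES, and the example mapping are hypothetical.
from dataclasses import dataclass, field


@dataclass
class SymptomConcept:
    snomed_code: str                                  # SNOMED-CT concept identifier
    label: str                                        # human-readable concept name
    follow_up_questions: list[str] = field(default_factory=list)


# Interview stages mirror the structure described above: history of present
# illness, clarification, review of systems, and risk factors.
INTERVIEW_STAGES = ["hpi", "clarification", "review_of_systems", "risk_factors"]

# Illustrative ontology entry (code shown for example purposes only).
ONTOLOGY = {
    "29857009": SymptomConcept(
        snomed_code="29857009",
        label="Chest pain",
        follow_up_questions=[
            "When did the pain start?",
            "Does the pain radiate to your arm, jaw, or back?",
        ],
    ),
}


def next_questions(reported_codes: list[str], stage: str) -> list[str]:
    """Return only ontology-backed questions for the given interview stage."""
    if stage not in INTERVIEW_STAGES:
        raise ValueError(f"Unknown interview stage: {stage}")
    questions: list[str] = []
    for code in reported_codes:
        concept = ONTOLOGY.get(code)
        if concept is not None:
            questions.extend(concept.follow_up_questions)
    return questions


print(next_questions(["29857009"], stage="hpi"))
```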
1.2 Summarization
We summarize data gathered in the interview to allow for more efficient processing. In 2023, a Stanford-based study found that GPT-generated summaries of patient histories, written from patient interview transcripts, were on par with those produced by senior resident physicians [1].
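A minimal sketch of such a summarization step is shown below, assuming the OpenAI Python SDK; the prompt wording and model choice are illustrative, not our production configuration.

```python
# Illustrative sketch of an interview summarization step (not production code).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def summarize_interview(transcript: str) -> str:
    """Condense a completed interview into a concise history of present illness."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model choice for illustration
        messages=[
            {
                "role": "system",
                "content": (
                    "Summarize the following patient interview as a concise "
                    "history of present illness. Do not add facts that are "
                    "not present in the transcript."
                ),
            },
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content
```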
1.3 Restructuring
We extract structured data from conversations and carry forward key structured fields from prior conversations to further improve the accuracy of our assessments. Studies, including a recent UT Southwestern analysis, demonstrate that GPT can extract structured clinical data with nearly 100% accuracy in important cases [2].
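The following sketch illustrates one way structured fields can be extracted so they can be carried forward, assuming the OpenAI Python SDK's JSON output mode; the field names and prompt are hypothetical examples.

```python
# Hypothetical sketch of structured-data extraction from a conversation.
import json

from openai import OpenAI

client = OpenAI()

# Example fields only; the real schema is not shown here.
EXTRACTION_FIELDS = ["chief_complaint", "symptom_duration", "medications", "allergies"]


def extract_structured_data(conversation: str) -> dict:
    """Pull key fields out of free text so they can be reused in later sessions."""
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # forces valid JSON output
        messages=[
            {
                "role": "system",
                "content": (
                    "Extract the following fields from the conversation and "
                    f"return them as a JSON object: {', '.join(EXTRACTION_FIELDS)}. "
                    "Use null for any field that was not mentioned."
                ),
            },
            {"role": "user", "content": conversation},
        ],
    )
    return json.loads(response.choices[0].message.content)
```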
2. Data assessment
Between 2023 and 2024, several rigorous peer-reviewed academic studies showed that OpenAI GPT-4 performs similarly to or better than human physicians on multiple benchmarks, including board exam performance (2.1) and clinical scenario interpretation (2.2). While MDandMe does not encourage any behavior without medical guidance, does not give concrete diagnoses, and defers to the judgment of the user's clinicians where relevant, our clinical team has evaluated the information we provide to be on par with the literature cited below.
2.1 Board exam performance
OpenAI GPT-4 has been shown in multiple peer-reviewed studies to perform on physician licensing and board exams at levels comparable to, or stronger than, human physicians in the USA [3], the UK [4], Israel [5], Germany [6], and Canada [7], among other countries with similar medical standards.
2.2 Difficult clinical scenarios
Another relevant point of comparison for MDandMe is health forums, where other users or volunteer experts can weigh in with helpful information; this is also a feature of the MDandMe community. However, the insights MDandMe itself provides in the community (as opposed to those from users) are of analogous quality to GPT-4's analysis of complex clinical cases, which was found to outperform 99.98% of human respondents to Case Challenges published by the New England Journal of Medicine from 2017 to 2023 [8].
3. Test result explanation
MDandMe can help users make sense of test results, including blood tests, imaging reports, and medical notes. We explain medical jargon, contextualize the rationale of the user's clinicians, and provide information about relevant clinical guidelines and standard-of-care procedures. These use cases are supported by numerous peer-reviewed academic papers indicating that GPT-4 provides medical information of accuracy comparable to human experts [9]. Studies conducted at Harvard-affiliated hospitals also showed that GPT generates reliable, physician-level answers to patients' frequently asked questions [10] and that it outperformed attending physicians in a head-to-head comparison of explaining medical reasoning [11].
Nonetheless, we acknowledge that test results can vary with the equipment used, and certain nuances may be known only to the clinicians who ordered the tests. We have therefore guard-railed our result explanations to request lab-provided reference values wherever applicable (3.1) and to always defer to healthcare professional judgment (3.2).
3.1 Processing of external health measurements
MDandMe does not sell hardware or take health measurements within the app. We therefore request the original read-outs and reference values provided by labs or manufacturers, as health measurements can differ by equipment. If a user requests an interpretation without an original document or official reference value, we can provide generic health information but not a personalized interpretation.
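The sketch below illustrates this guard rail in principle: a personalized interpretation is produced only when the lab's own reference range is present, and otherwise the response falls back to generic information. The LabResult structure and the messages are illustrative, not our production data model.

```python
# Minimal, illustrative sketch of the reference-range guard rail.
from dataclasses import dataclass
from typing import Optional


@dataclass
class LabResult:
    analyte: str                       # e.g. "Hemoglobin"
    value: float
    unit: str
    ref_low: Optional[float] = None    # lab-provided reference range, if any
    ref_high: Optional[float] = None


def interpret(result: LabResult) -> str:
    """Only personalize when the lab's own reference range is available."""
    if result.ref_low is None or result.ref_high is None:
        return (
            f"We can share general information about {result.analyte}, but a "
            "personalized interpretation needs the reference range printed on "
            "your lab report."
        )
    if result.value < result.ref_low:
        position = "below"
    elif result.value > result.ref_high:
        position = "above"
    else:
        position = "within"
    return (
        f"Your {result.analyte} of {result.value} {result.unit} is {position} "
        f"the lab's reference range ({result.ref_low}-{result.ref_high} {result.unit})."
    )
```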
3.2 Deference to healthcare professionals
In addition to always suggesting that users confirm our information with, and seek advice from, real healthcare professionals, MDandMe requests original reports and notes written by healthcare professionals, where applicable, to better inform its explanations. Our explanations are designed not to contradict the claims or actions of the healthcare professionals involved in the care of our users.
4. Information extraction from images
MDandMe applies a safety filter to all images, assigning them to categories. Some categories, such as complex medical images (4.1), are rejected as LLMs cannot reliably interpret them. On the other hand, we leverage the optical character recognition (4.2) and dermatologic image description (4.3) capabilities of GPT-4V(ision).
4.1 Complex medical images
We acknowledge that some advanced medical images, such as radiology scans or pathology slides, are entirely uninterpretable to the lay public. In such cases, users cannot validate textual descriptions generated from the images. After surveying the current literature, we determined that LLMs are not yet capable of reliably helping non-professionals interpret advanced medical images; for example, a recent study by German radiologists estimates that the accuracy of GPT-4V at reading radiological images is 47% at best [12]. Thus, images taken with medical equipment do not pass the safety filter, and we notify the user that we cannot read the original image, only the accompanying imaging or pathology report.
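As a simplified illustration of the safety filter described above, the sketch below routes accepted image categories onward and rejects complex medical images with a user-facing message. The category names and messages are hypothetical; in practice the categorization itself would be model- or metadata-driven.

```python
# Illustrative sketch of the image safety filter (category names are assumptions).
from enum import Enum, auto


class ImageCategory(Enum):
    WRITTEN_DOCUMENT = auto()   # lab reports, notes -> OCR path (4.2)
    SKIN = auto()               # dermatologic photos -> description path (4.3)
    COMPLEX_MEDICAL = auto()    # radiology scans, pathology slides -> rejected (4.1)
    OTHER = auto()


REJECTED = {ImageCategory.COMPLEX_MEDICAL, ImageCategory.OTHER}


def safety_filter(category: ImageCategory) -> tuple[bool, str]:
    """Return whether the image may proceed, plus a user-facing message."""
    if category in REJECTED:
        return False, (
            "We can't read this image directly. Please upload the imaging or "
            "pathology report instead."
        )
    return True, "Image accepted for processing."
```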
4.2 Images of written documents
Optical character recognition (OCR) is a well-studied and robust capability of machine learning models beyond LLMs. When coupled with OCR, GPT-4 has been shown to extract data from unstructured pathology reports with a 1% error rate and high concordance with human expert work [13]. The accuracy of GPT-4o OCR has been reported to be 94% on a challenging dataset that includes blurry, dim, or otherwise difficult-to-read images [14]. Because we screen text-based images for quality before applying OCR, our text extraction accuracy is close to 100% in most cases.
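Below is a hypothetical example of the kind of quality screen that might run before OCR, using Pillow for a simple brightness and contrast check; the thresholds are arbitrary placeholders, not our production values.

```python
# Hypothetical pre-OCR quality screen (thresholds are illustrative only).
from PIL import Image, ImageStat

MIN_BRIGHTNESS = 40.0   # reject very dim photos
MIN_CONTRAST = 20.0     # reject washed-out or nearly blank photos


def passes_quality_screen(path: str) -> bool:
    """Cheap brightness/contrast screen before an image is sent for OCR."""
    gray = Image.open(path).convert("L")   # convert to grayscale
    stats = ImageStat.Stat(gray)
    brightness = stats.mean[0]
    contrast = stats.stddev[0]
    return brightness >= MIN_BRIGHTNESS and contrast >= MIN_CONTRAST
```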
4.3 Images of skin
Our approach to dermatological images is to extract a textual description from the uploaded photo, let users validate the accuracy of that description, and base our subsequent responses on the text rather than the original image. This keeps the information we provide within the well-validated realm of GPT capabilities. By incorporating clinical context as described above, early data suggest we can achieve a diagnostic accuracy of around 90% [15].
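This two-step flow can be sketched as follows, assuming the OpenAI Python SDK: the image is first converted into a plain-language description the user can verify, and later answers draw only on that validated text. The prompts and model name are illustrative assumptions.

```python
# Illustrative two-step flow for skin images: describe, then answer from text.
import base64

from openai import OpenAI

client = OpenAI()


def describe_skin_image(image_path: str) -> str:
    """Step 1: turn the photo into plain-language text the user can check."""
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            "Describe the visible skin finding in plain language: "
                            "color, size, borders, and texture."
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{encoded}"},
                    },
                ],
            }
        ],
    )
    return response.choices[0].message.content


def answer_from_description(description: str, question: str) -> str:
    """Step 2: later responses use only the validated text, not the image."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Answer using only this validated description: " + description,
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```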
Citations
[1] Nayak, A., Alkaitis, M. S., Nayak, K., Nikolov, M., Weinfurt, K. P., & Schulman, K. (2023). Comparison of history of present illness summaries generated by a chatbot and senior internal medicine residents. JAMA Internal Medicine, 183(9), 1026-1027.
[2] Huang, J., Yang, D. M., Rong, R., Nezafati, K., Treager, C., Chi, Z., ... & Xie, Y. (2024). A critical assessment of using ChatGPT for extracting structured data from clinical notes. npj Digital Medicine, 7(1), 106.
[3] Brin, D., Sorin, V., Vaid, A., Soroush, A., Glicksberg, B. S., Charney, A. W., ... & Klang, E. (2023). Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. Scientific Reports, 13(1), 16492.
[4] Sadeq, M. A., Ghorab, R. M. F., Ashry, M. H., Abozaid, A. M., Banihani, H. A., Salem, M., ... & Moawad, M. H. E. D. (2024). AI chatbots show promise but limitations on UK medical exam questions: a comparative performance study. Scientific Reports, 14(1), 18859.
[5] Katz, U., Cohen, E., Shachar, E., Somer, J., Fink, A., Morse, E., ... & Wolf, I. (2024). GPT versus Resident Physicians—A Benchmark Based on Official Board Scores. NEJM AI, 1(5), AIdbp2300192.
[6] Meyer, A., Riese, J., & Streichert, T. (2024). Comparison of the performance of GPT-3.5 and GPT-4 with that of medical students on the written German medical licensing examination: observational study. JMIR Medical Education, 10, e50965.
[7] Mousavi, M., Shafiee, S., Harley, J. M., Cheung, J. C. K., & Rahimi, S. A. (2024). Performance of generative pre-trained transformers (GPTs) in Certification Examination of the College of Family Physicians of Canada. Family Medicine and Community Health, 12(Suppl 1).
[8] Eriksen, A. V., Möller, S., & Ryg, J. (2023). Use of GPT-4 to diagnose complex clinical cases. NEJM AI, 1(1).
[9] Jo, E., Song, S., Kim, J. H., Lim, S., Kim, J. H., Cha, J. J., ... & Joo, H. J. (2024). Assessing GPT-4's Performance in Delivering Medical Advice: Comparative Analysis With Human Experts. JMIR Medical Education, 10(1), e51282.
[10] Roldan-Vasquez, E., Mitri, S., Bhasin, S., Bharani, T., Capasso, K., Haslinger, M., ... & James, T. A. (2024). Reliability of artificial intelligence chatbot responses to frequently asked questions in breast surgical oncology. Journal of Surgical Oncology.
[11] Mitchell, J., Beth Israel Deaconess Medical Center. (2024, April 1). Chatbot Outperformed Physicians in Clinical Reasoning in Head-To-Head Study. https://www.bidmc.org/about-bidmc/news/2024/04/chatbot-outperformed-physicians-in-clinical-reasoning-in-head-to-head-study
[12] Busch, F., Han, T., Makowski, M. R., Truhn, D., Bressem, K. K., & Adams, L. (2024). Integrating Text and Image Analysis: Exploring GPT-4V's Capabilities in Advanced Radiological Applications Across Subspecialties. Journal of Medical Internet Research, 26, e54948.
[13] Truhn, D., Loeffler, C. M., Müller-Franzes, G., Nebelung, S., Hewitt, K. J., Brandner, S., ... & Kather, J. N. (2024). Extracting structured information from unstructured histopathology reports using generative pre-trained transformer 4 (GPT-4). The Journal of Pathology, 262(3), 310-319.
[14] Roboflow. (2024, March 16). Best OCR Models for Text Recognition in Images. https://blog.roboflow.com/best-ocr-models-text-recognition/
[15] Pillai, A., Parappally-Joseph, S., & Hardin, J. (2024). Evaluating the Diagnostic and Treatment Recommendation Capabilities of GPT-4 Vision in Dermatology. medRxiv.