Rizkala T., Muench N., Hassan C., Dinis-Ribeiro M., Generative AI Working Group.

BACKGROUND: This study assessed the effectiveness of large language models (LLMs) in generating lay summaries for patient education on the management of precancerous lesions and early neoplasia in the stomach.

METHODS: In this pilot study, we used a two-period, crossover, blinded design to compare a ChatGPT-4o summary with a Digestive Cancers Europe (DiCE) summary. Two panels rated the materials: expert physicians and DiCE Patient Advisory Committee members. Experts scored accuracy, completeness, comprehensibility, and satisfaction across five sections; patients rated overall completeness, comprehensibility, and satisfaction. Paired comparisons used mixed-effects estimates. Readability was assessed with the Flesch-Kincaid grade level (FKGL) and the SMOG index.

RESULTS: Median expert ratings were similar between materials across metrics. For the overall summary, median (range; IQR) scores for ChatGPT-4o vs. DiCE were: accuracy 5 (4-6; 1) vs. 5 (3-6; 1) (P = 0.10); completeness 4 (3-5; 1) vs. 4 (2-5; 1) (P = 0.27); comprehensibility 4 (3-5; 1) vs. 4 (2-5; 1) (P = 0.33); and satisfaction 4 (2-5; 1) vs. 3 (1-5; 2) (P = 0.53). Patient ratings mirrored the expert ratings. Neither summary met guideline-recommended readability levels on either FKGL or SMOG.

CONCLUSION: ChatGPT-4o produced patient materials comparable to those from DiCE, but both require readability optimization; a human-in-the-loop workflow and further testing across prompts and models are warranted.

Generative artificial intelligence for patient education material on gastric cancer prevention.

