
CHATBOTS IN LIMB LENGTHENING AND RECONSTRUCTION SURGERY

The European Orthopaedic Research Society (EORS) 32nd Annual Meeting, Aalborg, Denmark, 18–20 September 2024.



Abstract

Introduction

The recent introduction of chatbots has provided an interactive medium for answering patient questions. The accuracy of these programs' responses in limb lengthening and reconstruction surgery has not previously been determined. The purpose of this study was therefore to assess the accuracy of answers from three free AI chatbot platforms to 23 common questions about limb lengthening and reconstruction treatment.

Method

We generated a list of 23 common questions asked by parents before their child's limb lengthening and reconstruction surgery. Each question was posed to three different AI chatbots (ChatGPT 3.5 [OpenAI], Google Bard, and Microsoft Copilot [Bing]) by three different answer retrievers on separate computers between November 17 and November 18, 2023. Each retriever posed each question to each chatbot only once. The nine answers per question (3 answer retrievers × 3 chatbots) were randomized and platform-blinded before being rated by three orthopedic surgeons, as sketched below. All responses were graded with the 4-point rating system reported by Mika et al.
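The randomization and platform-blinding step might be implemented along the following lines. This is a minimal sketch only; the study does not describe its implementation, and every name below is hypothetical:

    import random

    def blind_responses(responses, seed=0):
        """Shuffle the nine (chatbot, retriever, text) records for one
        question and hide the platform labels, returning the blinded
        list shown to raters plus a key for later unblinding."""
        rng = random.Random(seed)
        shuffled = list(responses)
        rng.shuffle(shuffled)
        key, blinded = {}, []
        for i, (chatbot, retriever, text) in enumerate(shuffled, start=1):
            key[i] = (chatbot, retriever)   # unblinding key, withheld from raters
            blinded.append((i, text))       # raters see only an ID and the answer
        return blinded, key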

Results

ChatGPT had the best (lowest) response accuracy score (RAS), with a mean of 1.73 ± 0.88 across all three raters (rater means, 1.62–1.81) and a median of 2. The mean RAS for Google Bard and Microsoft Copilot was 2.32 ± 0.97 and 3.14 ± 0.82, respectively, with rater means ranging from 2.10 to 2.48 and from 2.86 to 3.54. The differences between the mean RAS values were statistically significant (p < 0.0001). The median scores for Google Bard and Microsoft Copilot were 2 and 3, respectively.
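As a worked illustration of these summary statistics, consider the sketch below. The ratings are toy values, not the study's data, and the abstract does not name the significance test used; a rank-based Kruskal-Wallis test is shown purely as an assumption suited to ordinal 1–4 ratings:

    from statistics import mean, median, stdev
    from scipy import stats  # assumed dependency for the hypothesis test

    # Toy 1-4 ratings, NOT the study data; the real study pooled
    # 23 questions x 3 retrievers x 3 raters per platform.
    chatgpt = [2, 1, 2, 3, 1, 2, 1, 2]
    bard    = [2, 3, 2, 2, 3, 2, 3, 1]
    copilot = [3, 4, 3, 3, 4, 2, 4, 3]

    for name, scores in [("ChatGPT", chatgpt), ("Bard", bard), ("Copilot", copilot)]:
        print(f"{name}: {mean(scores):.2f} +/- {stdev(scores):.2f}, median {median(scores)}")

    # Ratings are ordinal, so a rank-based omnibus test is one plausible choice.
    h, p = stats.kruskal(chatgpt, bard, copilot)
    print(f"Kruskal-Wallis H = {h:.2f}, p = {p:.4f}")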

Conclusion

On the Response Accuracy Score, ChatGPT's responses were rated satisfactory, requiring minimal clarification, whereas Microsoft Copilot's responses were rated either satisfactory, requiring moderate clarification, or unsatisfactory, requiring substantial clarification.
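For reference, the 4-point scale can be expressed as a simple mapping. Grades 2–4 are worded as in this abstract; the wording of grade 1 is taken from Mika et al. and should be checked against the original:

    # Response Accuracy Score (Mika et al.); lower is better.
    RAS_SCALE = {
        1: "Excellent response not requiring clarification",  # wording assumed from Mika et al.
        2: "Satisfactory, requiring minimal clarification",
        3: "Satisfactory, requiring moderate clarification",
        4: "Unsatisfactory, requiring substantial clarification",
    }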


Corresponding author: Christopher Iobst