Orthopaedic Proceedings
Vol. 106-B, Issue SUPP_18 | Pages 61 - 61
14 Nov 2024
Bafor A Iobst C Francis KT Strub D Kold S

Introduction. The recent introduction of chatbots has provided an interactive medium for answering patient questions. The accuracy of chatbot responses on limb lengthening and reconstruction surgery has not previously been determined. The purpose of this study was therefore to assess the accuracy of answers from three free AI chatbot platforms to 23 common questions about treatment for limb lengthening and reconstruction. Method. We generated a list of 23 questions commonly asked by parents before their child's limb lengthening and reconstruction surgery. Each question was posed to three AI chatbots (ChatGPT 3.5 [OpenAI], Google Bard, and Microsoft Copilot [Bing!]) by three different answer retrievers on separate computers between November 17 and November 18, 2023. Each answer retriever posed each question to each chatbot only once. For each question, the nine answers (3 answer retrievers × 3 chatbots) were randomized and platform-blinded before being rated by three orthopedic surgeons. The 4-point rating system reported by Mika et al. was used to grade all responses. Result. ChatGPT had the best response accuracy score (RAS), with a mean of 1.73 ± 0.88 across all three raters (individual rater means ranged from 1.62 to 1.81) and a median of 2. The mean RAS for Google Bard and Microsoft Copilot was 2.32 ± 0.97 and 3.14 ± 0.82, respectively, with individual rater means ranging from 2.10 to 2.48 and from 2.86 to 3.54. The median scores for Google Bard and Microsoft Copilot were 2 and 3, respectively. The differences between the mean RAS values were statistically significant (p < 0.0001). Conclusion. Using the response accuracy score, the responses from ChatGPT were judged satisfactory, requiring minimal clarification, while the responses from Microsoft Copilot were either satisfactory, requiring moderate clarification, or unsatisfactory, requiring substantial clarification.
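
The design described in the Method can be sketched in a few lines of Python to make the response counts explicit. This is only an illustration under stated assumptions: the abstract does not say how the randomization and platform-blinding were carried out, so the per-question shuffle, the neutral labels such as `Q01-A`, the retriever names, and the helper `blinded_order` are all hypothetical.

```python
import itertools
import random

# Taken from the abstract: 23 questions, 3 chatbots, 3 answer retrievers, 3 raters.
CHATBOTS = ["ChatGPT 3.5", "Google Bard", "Microsoft Copilot"]
RETRIEVERS = ["retriever_1", "retriever_2", "retriever_3"]  # hypothetical labels
N_QUESTIONS = 23
N_RATERS = 3


def blinded_order(question_id, seed=0):
    """Shuffle the nine answers to one question and give them neutral labels.

    Assumption: the study's actual blinding procedure is not described;
    this is one plausible way to hide the platform from the raters.
    Returns a key mapping blinded label -> (chatbot, retriever), which
    would be withheld from the raters until grading is complete.
    """
    answers = list(itertools.product(CHATBOTS, RETRIEVERS))  # 3 x 3 = 9 answers
    rng = random.Random(seed * 1000 + question_id)  # reproducible per question
    rng.shuffle(answers)
    return {
        f"Q{question_id:02d}-{chr(ord('A') + i)}": source
        for i, source in enumerate(answers)
    }


total_responses = N_QUESTIONS * len(CHATBOTS) * len(RETRIEVERS)  # 207
total_ratings = total_responses * N_RATERS  # 621
ratings_per_chatbot = N_QUESTIONS * len(RETRIEVERS) * N_RATERS  # 207

key = blinded_order(1)
```

Under these assumptions each rater grades 207 responses, and each chatbot's mean score is based on 207 individual ratings (23 questions × 3 retrievers × 3 raters).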