To examine whether natural language processing (NLP) using a clinically based large language model (LLM) could be used to predict patient selection for total hip or total knee arthroplasty (THA/TKA) from routinely available free-text radiology reports. Data pre-processing and analyses were conducted according to the Artificial intelligence to Revolutionize the patient Care pathway in Hip and knEe aRthroplastY (ARCHERY) project protocol. This included use of de-identified Scottish regional clinical data of patients referred for consideration of THA/TKA, held in a secure data environment designed for artificial intelligence (AI) inference. Only preoperative radiology reports were included. NLP algorithms were based on the freely available GatorTron model, a LLM trained on over 82 billion words of de-identified clinical text. Two inference tasks were performed: assessment after model-fine tuning (50 Epochs and three cycles of k-fold cross validation), and external validation.Aims
Methods
To examine whether Natural Language Processing (NLP) using a state-of-the-art clinically based Large Language Model (LLM) could predict patient selection for Total Hip Arthroplasty (THA), across a range of routinely available clinical text sources. Data pre-processing and analyses were conducted according to the Ai to Revolutionise the patient Care pathway in Hip and Knee arthroplasty (ARCHERY) project protocol ( There were 3911, 1621 and 1503 patient text documents included from the sources of referral letters, radiology reports and clinic letters respectively. All letter sources displayed significant class imbalance, with only 15.8%, 24.9%, and 5.9% of patients linked to the respective text source documentation having undergone surgery. Untrained model performance was poor, with F1 scores (harmonic mean of precision and recall) of 0.02, 0.38 and 0.09 respectively. This did however improve with model training, with mean scores (range) of 0.39 (0.31–0.47), 0.57 (0.48–0.63) and 0.32 (0.28–0.39) across the 5 folds of cross-validation. Performance deteriorated on external validation across all three groups but remained highest for the radiology report cohort. Even with further training on a large cohort of routinely collected free-text data a clinical LLM fails to adequately perform clinical inference in NLP tasks regarding identification of those selected to undergo THA. This likely relates to the complexity and heterogeneity of free-text information and the way that patients are determined to be surgical candidates.
There is increasing popularity in the use of artificial intelligence and machine-learning techniques to provide diagnostic and prognostic models for various aspects of Trauma & Orthopaedic surgery. However, correct interpretation of these models is difficult for those without specific knowledge of computing or health data science methodology. Lack of current reporting standards leads to the potential for significant heterogeneity in the design and quality of published studies. We provide an overview of machine-learning techniques for the lay individual, including key terminology and best practice reporting guidelines. Cite this article: