
Open Access

Systematic Review

A systematic review of natural language processing applications in Trauma & Orthopaedics




Abstract

Aims

The prevalence of artificial intelligence (AI) algorithms within the Trauma & Orthopaedics (T&O) literature has greatly increased over the last ten years. One increasingly explored aspect of AI is natural language processing (NLP): the automated interpretation of the free-text data that are prevalent in electronic medical records. We set out to review the current evidence for applications of NLP methodology in T&O, including assessment of study design and reporting.

Methods

MEDLINE, Allied and Complementary Medicine (AMED), Excerpta Medica Database (EMBASE), and Cochrane Central Register of Controlled Trials (CENTRAL) were screened for studies pertaining to NLP in T&O from database inception to 31 December 2023. An additional grey literature search was performed. NLP quality assessment followed the criteria outlined by Farrow et al in 2021, with two independent reviewers (classification as absent, incomplete, or complete). Reporting was performed according to the Synthesis Without Meta-Analysis (SWiM) guidelines. The review protocol was registered on the Prospective Register of Systematic Reviews (PROSPERO; registration no. CRD42022291714).

Results

The final review included 31 articles (published between 2012 and 2021). The most common subspeciality areas included trauma, arthroplasty, and spine; 13% (4/31) related to online reviews/social media, 42% (13/31) to clinical notes/operation notes, 42% (13/31) to radiology reports, and 3% (1/31) to a systematic review. According to the reporting criteria, 16% (5/31) were considered good quality, 74% (23/31) average quality, and 6% (2/31) poor quality. The most commonly absent reporting criteria were evaluation of missing data (26/31), sample size calculation (31/31), and external validation of the study results (29/31). Code and data availability were also poorly documented in most studies.

Conclusion

Application of NLP is becoming increasingly common in T&O; however, published article quality is mixed, with few high-quality studies. There are key consistent deficiencies in published work relating to NLP which ultimately influence the potential for clinical application. Open science is an important part of research transparency that should be encouraged in NLP algorithm development and reporting.

Cite this article: Bone Jt Open 2025;6(3):264–274.

Take home message

This study highlights some of the key potential uses of natural language processing in Trauma & Orthopaedics.

It also identifies some methodological concerns with the currently available literature on the subject.

Introduction

There has been a massive influx of publications regarding artificial intelligence (AI) applications in the domain of Trauma & Orthopaedics (T&O).1 One AI technique is natural language processing (NLP), which enables processing and analysis of large amounts of natural language or 'free-text' (for example, written information contained within a clinical letter) data.

It is estimated that approximately 80% of healthcare data are in an unstructured or 'free-text' format.2 These data have the potential to provide a veritable wealth of useful information to guide clinical practice and research. NLP allows users to turn these unstructured data into meaningful material for analysis.
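
To illustrate the general principle, the toy sketch below (not the method of any included study; all field names and patterns are hypothetical) shows how even a single rule-based step can turn one free-text sentence into a structured record suitable for analysis.

```python
import re

# Toy example: one free-text sentence converted into a structured record
# using simple rules. Field names and patterns are hypothetical.
note = "Patient underwent right total hip arthroplasty; no intraoperative complications."

record = {"laterality": None, "procedure": None, "complication_documented": None}

laterality = re.search(r"\b(left|right)\b", note, re.IGNORECASE)
if laterality:
    record["laterality"] = laterality.group(1).lower()

if re.search(r"total hip arthroplasty|THA", note, re.IGNORECASE):
    record["procedure"] = "total hip arthroplasty"

# Very crude negation handling: a complication mention preceded by "no"
# in the same clause is treated as absent.
mentioned = re.search(r"complication", note, re.IGNORECASE)
negated = re.search(r"\bno\b[^.;]*complication", note, re.IGNORECASE)
record["complication_documented"] = bool(mentioned and not negated)

print(record)  # {'laterality': 'right', 'procedure': 'total hip arthroplasty', 'complication_documented': False}
```

Real pipelines in the studies reviewed below range from rule sets of this kind through to machine-learning models trained on large annotated corpora.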

NLP is not without its challenges, in particular the risk of identifying protected healthcare information contained within the free-text resource. Techniques such as 'Hidden In Plain Sight' (HIPS) methods have been developed to attempt to maintain free-text structure while ensuring anonymity,3 but this in itself requires dedicated health data science infrastructure. Ethical concerns have also been raised about granting access to large volumes of anonymized free-text healthcare data without consent, although previous evidence has suggested that this is supported if particular safeguarding structures are in place.4
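
The sketch below illustrates only the general principle behind 'hidden in plain sight' de-identification, and is not the implementation used in the cited work: identifiers that are detected are replaced with realistic surrogates rather than obvious placeholders, so that any identifier the detector misses does not stand out to a reader.

```python
import random
import re

# Illustrative sketch of the surrogate-substitution principle behind
# "hidden in plain sight" de-identification (not the cited implementation).
SURROGATE_SURNAMES = ["Smith", "Brown", "Wilson", "Campbell"]

def hide_in_plain_sight(text, detected_names):
    # Replace each detected name with a realistic surrogate so that any
    # name the detector misses does not stand out as the only real one.
    for name in detected_names:
        surrogate = random.choice(SURROGATE_SURNAMES)
        text = re.sub(rf"\b{re.escape(name)}\b", surrogate, text)
    return text

note = "Reviewed Mr Jones in clinic today; Mr Jones reports improving hip pain."
print(hide_in_plain_sight(note, detected_names=["Jones"]))
```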

Despite these challenges, there has been evidence of successful use of NLP applications within the healthcare setting. Examples include delirium detection in the intensive care unit,5 surveillance of patients at high risk of upper gastrointestinal cancer,6 and predicting outcomes of critical care patients.7

Development of NLP applications has been reported within T&O, such as development of an arthroplasty database8 and fracture identification.9 No study to date has, however, methodically assessed the available NLP literature, including an evaluation of study quality and analysis of reported performance metrics. We therefore set out to perform a systematic review of NLP applications within T&O to better appraise current applications and guide future use.

Methods

This systematic review was performed and reported according to the PRISMA statement.10 Registration prior to study commencement was undertaken on the Prospective Register of Systematic Reviews (PROSPERO) no. CRD42022291714.

Search strategy

Relevant articles were identified through a search of MEDLINE, Allied and Complementary Medicine Database (AMED), Excerpta Medica Database (EMBASE), and Cochrane Central Register of Controlled Trials (CENTRAL). An additional search of the grey literature was also undertaken using OrthoSearch (an orthopaedic-specific database which contains abstracts, articles, and associated media information).11 All electronic searches were undertaken from database inception to 31 December 2023. Full electronic search terms are shown in Supplementary Table i. Reference lists from all extracted studies were reviewed for potentially eligible manuscripts.

Eligibility criteria

All studies that involved research related to the use of NLP in the setting of T&O and associated subspecialities were included. Exclusion criteria were studies involving other surgical or medical specialities, use of other AI techniques that were not specifically identified as NLP, publications in relation to generative AI, and non-English language publications.

Study identification

Two assessors (LF, AR) independently screened search output titles and abstracts for articles which met the eligibility criteria. Full-text review was undertaken to determine eligibility.

Data extraction

Data extraction was undertaken using a prespecified proforma by two independent assessors (LF, AR). Fields included: 1) Design overview: author, year, subspeciality, and NLP domain (e.g. online reviews/social media, or clinical/operation notes); 2) Introduction reporting: study aims; 3) Methods reporting: data source, data quality, data pre-processing, missing data, testing/training/internal validation, external validation, model type, and sample size calculation; 4) Results reporting: sample reporting, performance metrics, model evaluation, and model explanation; 5) Conclusions reporting: clinical practice interpretation, limitations, and future research; and 6) Open science: code and data availability.

Quality assessment

To our knowledge, there are no current globally defined reporting guidelines that relate specifically to NLP. We therefore assessed compliance with the reporting guidelines outlined by Farrow et al,1 with each domain categorized as either complete, incomplete, or absent. Code and data availability were assessed separately. These reporting guidelines were chosen due to their specific relation to AI applications in T&O, with inclusion of reporting quality across several domains for the introduction, methods, results, and conclusions separately. An overall cumulative score (total/34) was derived, with score tertiles used to aid interpretability of the final score. Manuscripts with scores < 11, 11 to 22, and 23 to 34 were deemed poor, average, and good quality, respectively. Any disagreement regarding individual scoring of domains between data extractors was resolved by discussion.
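
As a minimal sketch of the scoring described above (assuming 17 domains each scored 0, 1, or 2, summed to a total out of 34 and banded into tertiles), the example below reproduces the categorization for the per-domain scores reported for Shah et al in Table I.

```python
# Sketch of the scoring and banding described above: 17 reporting domains,
# each scored 0 (absent), 1 (incomplete), or 2 (complete), summed to a
# total out of 34 and banded into tertiles.
def categorize(total):
    if total < 11:
        return "poor"
    if total <= 22:
        return "average"
    return "good"

# Worked example using the per-domain scores reported for Shah et al (Table I).
shah_scores = [1, 2, 0, 2, 0, 2, 0, 2, 0, 0, 2, 0, 2, 2, 2, 0, 0]
total = sum(shah_scores)            # 17
print(total, categorize(total))     # 17 average
```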

Pooled performance metrics

Where feasible, according to study reporting, pooled performance metrics (mean and range) were assessed. These included: model accuracy; sensitivity (recall); specificity; precision (positive predictive value); area under the receiver operating characteristic curve (AUROC); F1 score; and calibration. Where scores for multiple cohorts were reported, the highest-performing model output was chosen for inclusion. All scores were defined by the individual study authors, including decisions around ground-truth labels.
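
For clarity, the worked example below shows how the first five of these metrics are derived from a confusion matrix; the counts are invented rather than taken from any included study, and AUROC and calibration are omitted because they additionally require predicted probabilities.

```python
# Worked example of the pooled binary-classification metrics, computed from
# an invented confusion matrix (not data from any included study). AUROC and
# calibration cannot be derived from these four counts alone.
tp, fp, fn, tn = 90, 5, 10, 95

accuracy = (tp + tn) / (tp + fp + fn + tn)
sensitivity = tp / (tp + fn)              # recall
specificity = tn / (tn + fp)
precision = tp / (tp + fp)                # positive predictive value
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(f"accuracy={accuracy:.2f} sensitivity={sensitivity:.2f} "
      f"specificity={specificity:.2f} precision={precision:.2f} f1={f1:.2f}")
```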

Statistical analysis

Given the nature of the included data, meta-analysis was not feasible and therefore reporting was performed according to the Synthesis Without Meta-Analysis (SWiM) criteria.12 Studies have been grouped by NLP domain, with assessment of study heterogeneity and evidence certainty determined by the variability of study validity/bias within each domain.

Results

Research results

Using the pre-specified search criteria, 602 potentially eligible records were identified. Following full-text assessment, 36 manuscripts were included.8,9,13-46 Figure 1 depicts the flow diagram of the full search process. The number of articles published per year increased from one between 2012 and 2017 to a peak of 13 in 2021.

Fig. 1 Study selection process. NLP, natural language processing.

Characteristics of included studies

Study characteristics, incorporating the quality assessment scoring for each manuscript, are detailed in Table I.

Table I.

Summary of included studies including reporting assessment.

Design overview Introduction reporting Methods reporting Results reporting Conclusions reporting Open science
First author Year Sub-speciality NLP domain Study aims Data source Data quality Data pre-processing Missing data Test, train, and validation methods External validation ML output Sample size calculation Sample reporting Model evaluation Model explanation Clinical practice interpretation Limitations Future research Code availability Data availability Overall (/34)
Shah 2020 Arthroplasty Clinical notes/operation notes 1 2 0 2 0 2 0 2 0 0 2 0 2 2 2 0 0 17
Mohammadi 2020 Arthroplasty Clinical notes/operation notes 2 2 0 2 1 2 0 2 0 2 2 0 2 2 2 0 1 21
Blaker 2021 Trauma Clinical notes/operation notes 2 2 2 1 1 1 0 1 2 1 1 0 2 1 0 0 0 15
Karhade 2020 Spine Clinical notes/operation notes 2 2 1 2 0 2 0 2 0 1 2 2 2 2 1 0 0 21
Karhade 2020 Spine Clinical notes/operation notes 2 2 2 2 1 2 0 1 0 2 2 2 1 2 2 0 0 23
Sagheb 2021 Arthroplasty Clinical notes/operation notes 1 2 1 2 0 1 0 1 0 2 1 2 1 2 1 0 0 17
Wyles 2019 Arthroplasty Clinical notes/operation notes 2 2 2 2 0 0 2 1 0 2 1 1 2 2 1 2 0 22
Tibbo 2019 Trauma Clinical notes/operation notes 2 2 2 2 0 1 0 2 0 1 1 0 2 2 1 0 0 18
Fu 2021 Arthroplasty Clinical notes/operation notes 2 2 2 2 0 2 0 2 0 1 2 0 2 2 1 0 0 20
Karhade 2021 Spine Clinical notes/operation notes 2 2 0 2 0 0 0 1 0 1 2 2 2 2 2 0 0 17
Thirukumaran 2019 General Clinical notes/operation notes 2 2 2 2 0 2 0 1 0 2 2 2 2 2 2 0 0 23
Borjali 2021 Arthroplasty Clinical notes/operation notes 2 2 2 2 0 2 0 2 0 0 1 1 2 2 1 0 0 19
Karhade 2020 Spine Clinical notes/operation notes 2 2 2 2 0 2 0 1 0 2 2 2 2 2 1 0 0 22
Wyles 2022 Arthroplasty Clinical notes/operation notes 1 2 2 2 1 1 2 2 0 1 1 0 2 2 1 0 0 20
Karhade 2022 Spine Clinical notes/operation notes 1 2 2 2 0 2 2 2 0 1 2 2 2 2 1 0 0 23
Flores-Balado 2023 Arthroplasty Clinical notes/operation notes 1 2 1 1 0 1 1 2 0 1 1 2 1 1 0 0 0 15
Tavabi 2022 Sports Clinical notes/operation notes 1 2 1 1 1 2 1 2 0 0 1 1 1 1 1 0 0 16
Kita 2022 Arthroplasty Clinical notes/operation notes 2 2 1 2 0 2 1 1 0 1 1 1 1 2 1 0 0 18
Langerhuizen 2021 General Online reviews/social media 2 1 0 2 1 1 0 1 0 1 0 0 1 1 0 0 2 13
Bovonratwet 2021 Arthroplasty Online reviews/social media 2 2 1 1 0 0 0 1 0 2 1 0 2 1 0 0 0 13
Menendez 2019 Shoulder and Elbow Online reviews/social media 2 2 0 0 0 0 0 1 0 2 0 0 2 1 0 0 0 10
Dominy 2021 Spine Online reviews/social media 2 1 0 1 0 0 0 1 0 1 1 0 2 1 0 0 0 10
Groot 2020 Tumour Radiology reports for feature detection/classification 1 2 0 2 2 2 0 2 0 2 2 2 2 2 2 0 0 23
dos Santos 2019 Foot and Ankle Radiology reports for feature detection/classification 2 2 0 1 0 0 0 1 0 2 2 0 2 2 0 0 1 16
Wang 2018 Trauma Radiology reports for feature detection/classification 1 2 2 2 0 2 0 1 0 0 2 0 1 1 1 0 0 16
Wagholikar 2013 Trauma Radiology reports for feature detection/classification 2 1 2 1 0 0 0 2 0 0 2 2 2 1 2 0 0 17
Grundmeier 2016 Trauma Radiology reports for feature detection/classification 2 2 2 2 0 2 0 2 0 1 2 2 2 2 1 0 0 22
Do 2012 Trauma Radiology reports for feature detection/classification 2 1 0 2 0 1 0 1 0 0 2 0 2 2 2 0 0 15
Kolanu 2021 Trauma Radiology reports for feature detection/classification 2 2 1 2 0 1 2 2 0 1 2 1 2 2 1 0 0 21
Olthof 2021 Trauma Radiology reports for feature detection/classification 2 2 2 2 0 2 0 2 0 1 2 2 1 1 0 0 0 19
Foufi 2018 Trauma Radiology reports for feature detection/classification 1 1 1 1 0 0 0 1 0 0 1 0 1 0 1 0 0 8
Galbusera 2021 Spine Radiology reports for feature detection/classification 2 2 2 2 0 1 0 2 0 1 2 1 1 2 0 0 0 18
Jungman 2021 Trauma Radiology reports for feature detection/classification 2 2 2 2 0 1 0 2 0 2 2 0 2 2 0 0 0 19
Tan 2018 Spine Radiology reports for feature detection/classification 1 2 2 2 0 2 0 2 0 1 2 2 2 2 1 0 0 21
Huhdanpaa 2017 Spine Radiology reports for feature detection/classification 2 2 2 1 0 2 0 2 0 2 2 1 1 1 0 0 0 14
Buchlak 2021 Arthroplasty Systematic reviews 2 2 1 2 0 0 0 2 0 2 1 2 1 2 2 1 2 23
Note: for the reporting assessment, 0 indicates the domain is absent from the manuscript, 1 indicates partial completion, and 2 indicates full completion. Total score across 17 domains = 34 points. Reporting criteria adapted from Farrow et al.1

Of the included articles, ten related to trauma, ten to arthroplasty, nine to spinal surgery, three to general orthopaedics, one to foot and ankle surgery, one to shoulder and elbow surgery, one to sports surgery, and one to tumour surgery.

With regards to NLP domains, the most commonly used were clinical or operation notes (50%) and radiology reports (36%). Use in assessment of online reviews/social media and systematic reviews was less common (11% and 3%, respectively).

Overall assessment of study reporting

Of the 36 included studies, the median quality score was 18/34 (IQR 16 to 21); 11% were categorized as good quality, 83% average quality, and 6% poor quality. The most common incomplete study reporting fields were evaluation of missing data, external validation, and a sample size calculation. The top three most frequently completed reporting criteria were study aims, data source, and data pre-processing. Figure 2 demonstrates the bar plot of overall study reporting outcomes.

Fig. 2 Summary of overall results.

Reporting domains

Full details of the reporting domains for each individual study are demonstrated in Table I. These are taken from the study by Farrow et al.1

Introduction reporting: 26/36 (72% of included studies) had clear documentation of the study aims, with the remainder having at least partial completion.

Methods reporting: All studies at least partially identified their data source, with only 5/36 (14%) providing no details regarding the quality of the supplied data. A total of 25/36 (69%) studies fully indicated the preprocessing steps undertaken prior to model training and testing, with only one study providing no preprocessing information. Missing data and external validation were, however, poorly documented, with each of these fields absent from 29/36 (81%) of studies. Overall, 9/36 (25%) studies did not provide any indication of their testing, training, and validation methods. All studies at least partially reported the type of NLP algorithm output. Only one study provided any form of sample size calculation for model development.

Results reporting: Reporting regarding the sample population was fully performed in 13/36 (36%), with model evaluation fully performed in 21/36 (58%). In all, 15/36 (42%) cases did not provide any reference to explainability of the developed model.

Conclusions reporting: All studies made some reference to potential clinical practice interpretation, with the vast majority (35/36; 97%) describing the study limitations. A total of 11/36 (31%) did not provide any reference regarding requirements for potential future research in their manuscript.

Open science: Only one study provided the code for algorithm development and testing, with two studies providing the data in an open-source forum.

NLP domain: clinical notes/operation notes

Of the identified studies, 18 related to NLP analysis of clinical or operation notes.8,14,18-20,24,25,28,29,31,35,38,39,42-46 Of these, nine related to arthroplasty, five to spinal surgery, two to trauma, one to general orthopaedics, and one to sports surgery. The most common application was to identify adverse outcomes, for example re-admission or surgical complications. Automated database/registry creation was also featured. The median quality assessment for studies in this domain was 19/34 (IQR 17 to 22); 17% were considered good quality and 83% average quality. No study relating to clinical notes or operation notes was identified as poor quality.

NLP domain: radiology reports for feature detection/classification

Several studies (n = 13) related to application of NLP to radiology reports for feature detection and classification.9,15-17,21-23,30,32,34,37,40,41 Eight related to trauma, three to spinal surgery, one to foot and ankle, and one to tumour. The most common application was the identification of presence or absence of a fracture (± classification). Median quality assessment was 18/34 (IQR 16 to 21); 8% were considered good quality, 84% average quality, and 8% poor quality.
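
As a purely hypothetical sketch of this type of task, the snippet below trains a simple bag-of-words classifier on a handful of invented radiology report sentences; the included studies used far larger labelled corpora and a range of rule-based and machine-learning approaches.

```python
# Hypothetical sketch of fracture present/absent classification from
# radiology report text. Reports and labels are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reports = [
    "Displaced fracture of the distal radius.",
    "No acute fracture or dislocation identified.",
    "Comminuted intertrochanteric femoral fracture.",
    "Normal alignment. No bony injury seen.",
]
labels = [1, 0, 1, 0]  # 1 = fracture present, 0 = no fracture

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(reports, labels)

print(model.predict(["Transverse fracture of the midshaft of the tibia."]))
```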

NLP domains: online reviews/social media and systematic review

Four online reviews/social media studies were included,13,26,27,33 along with one study concerning the use of NLP to perform a systematic review (evaluating arthroplasty).36 The online reviews/social media studies related to general orthopaedics, arthroplasty, shoulder and elbow surgery, and spinal surgery (one each). The main application of NLP to online reviews/social media was automated assessment of patient experience/feedback using sentiment analysis. Median quality assessment was 12/34 (IQR 10 to 13); 20% were considered good quality, 40% average quality, and 20% poor quality.
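
The snippet below is a deliberately simple, lexicon-based illustration of sentiment scoring applied to invented review comments; it does not reproduce the approach of any particular included study, which typically used more sophisticated sentiment models.

```python
# Lexicon-based sentiment scoring of invented patient review comments;
# illustrative only.
POSITIVE = {"excellent", "caring", "painless", "recommend", "happy"}
NEGATIVE = {"rude", "painful", "delayed", "complication", "unhappy"}

def sentiment_score(comment):
    words = comment.lower().replace(".", " ").replace(",", " ").split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

reviews = [
    "Excellent surgeon, caring team, would recommend.",
    "Appointment was delayed and the staff were rude.",
]
for review in reviews:
    print(sentiment_score(review), review)  # prints 3, then -2
```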

Pooled performance metrics

In all, 20/36 studies (56%) reported at least one performance metric outcome. No single study reported results across all the domains assessed. Only five studies reported model calibration. The mean (range) performance metric outcomes for included studies (where reported) are detailed in Table II.

Table II.

Reported performance metrics.

Study details Accuracy Sensitivity (recall) Specificity Precision (positive predictive value) Area under the receiver operating curve F1 score Calibration
Shah et al8 0.94
Mohammadi et al14 0.79 0.27 0.82
Groot et al15 0.94 0.82 0.97 0.97 0.96 0.73
Dos Santos et al16 0.77 0.63 1.00 1.00 0.85
Wang et al17 0.93 1.00 1.00
Blaker et al18 0.77
Karhade et al39 0.89 0.99 0.89 0.99 0.89 1.17
Karhade et al39 0.86 0.93 0.51 0.92 0.64 0.61
Wagholikar et al21 0.92
Grundmeier et al22 0.95 0.97 0.92 0.95
Do et al23 0.79 0.90 0.95 0.90
Sagheb et al24 0.98 1.00 1.00 1.00
Wyles et al25 0.99
Tibbo et al28 1.00 1.00
Fu et al29 0.89 0.99 1.00 0.91
Kolanu et al9 0.99 1.00 0.97
Olthof et al30 0.96 0.95 0.98 0.99 0.95
Karhade et al20 0.70 1.54
Foufi et al32 0.97
Galbusera et al34 0.98 0.95 0.99 0.95
Thirukumaran et al35 0.97 0.97 0.96 0.97
Buchlak et al36 0.68
Jungman et al37 0.81 0.83 0.82
Borjali et al38 1.00 1.00
Karhade et al39 0.94 1.00 0.83
Tan et al40 0.94 0.95 0.98
Huhdanpaa et al41 0.70 0.99 0.90 0.79
Wyles et al42 1.00
Karhade et al31 0.83 0.98 0.79 0.95 0.81 3.08
Flores-Balado et al44 0.99 0.91 0.19 0.99 0.32
Tavabi et al45 1.00 1.00 1.00 1.00
Kita et al46 1.00 1.00 0.99
Mean values 0.94 0.91 0.97 0.84 0.91 0.86 1.43

Discussion

The application of NLP to T&O represents a significant opportunity to use the vast quantities of unstructured free-text data generated from routine healthcare interactions, for example in providing summaries of electronic health records or automated analysis of radiology reports. We identified three key domains of current NLP use: clinical/operation notes; radiology reports; and social media/online review posts. Reported performance measure outcomes were almost universally positive (average scores > 80% across all domains); however, there were relatively few high-quality studies identified according to the reporting criteria used. The most problematic areas related to reporting of missing data assessment, external validation, and sample size calculation. Many studies also failed to share the code underlying their NLP algorithms or to report data availability, contrary to open science principles. Development and widespread use of specific reporting standards related to the application of NLP to healthcare is essential to the appropriate development and reporting of future work in this area.

Our study is, to our knowledge, the first systematic review to focus on applications of NLP in relation to T&O. The results are consistent with reviews of NLP applications in other fields. For example, Davidson et al47 examined NLP applications in radiology and identified that the key reporting domains that were poorly represented in studies were external validation, data availability, and code availability. The domains of missing data assessment and sample size assessment were not part of the reporting criteria used in that study, but are areas of critical importance to the correct application of NLP techniques for data analysis. It should be noted that, despite high-impact publications governing sample size calculations for other aspects of AI inference,48 there are currently no peer-reviewed published guidelines regarding calculation of the optimum sample size for NLP development. This is likely to depend significantly on the NLP approach (for example, large language model (LLM) development/fine-tuning vs a rule-based algorithm), and should be a key research priority moving forwards.

Other applications of AI to T&O appear to suffer from similar issues when considering study reporting. Dijkstra et al49 evaluated 45 machine learning (ML)-based prediction models and identified that the risk of bias (according to the Prediction model Risk of Bias Assessment Tool (PROBAST))50 was high across the majority of included studies, with documented issues around small sample sizes, inadequate management of missing data, and lack of appropriate study reporting.

It therefore appears that the key methodological issues around study design and reporting are consistent across AI applications within T&O. The importance of model calibration appears to be particularly underappreciated, which is likely impacted by limited understanding of AI terminology and interpretation by orthopaedic surgeons.51 There is a need for a unified and collaborative approach encompassing all key stakeholders (clinicians, data scientists, statisticians, patients, providers) to maximize future applicability. Use of a development and deployment structure is integral to this process and to realizing the potential of NLP applications in the field of T&O.
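
To illustrate what even a basic calibration check involves (using invented predictions and outcomes rather than data from any included study), predicted probabilities can be binned and compared against observed event rates:

```python
# Invented example of a basic calibration check: group predicted
# probabilities into bins and compare the mean prediction in each bin
# with the observed event rate.
preds  = [0.10, 0.15, 0.20, 0.40, 0.45, 0.50, 0.70, 0.80, 0.85, 0.90]
actual = [0,    0,    0,    1,    0,    1,    1,    1,    1,    1]

bins = [(0.0, 1 / 3), (1 / 3, 2 / 3), (2 / 3, 1.01)]
for lo, hi in bins:
    idx = [i for i, p in enumerate(preds) if lo <= p < hi]
    mean_pred = sum(preds[i] for i in idx) / len(idx)
    obs_rate = sum(actual[i] for i in idx) / len(idx)
    print(f"bin {lo:.2f}-{hi:.2f}: mean predicted {mean_pred:.2f}, observed {obs_rate:.2f}")
```

A well-calibrated model would show close agreement between the predicted and observed columns across bins.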

Limitations of our study include the wide spectrum of different NLP approaches ranging from simple rule-based methods to LLMs. This makes a focused assessment challenging due to the heterogeneity of how these methods are typically applied and reported. Given the lack of currently available validated reporting criteria related to NLP, we used a published non-specific checklist that may be limited in some methodological domains and categorization accuracy. This study does, however, provide the first structured assessment of current applications of NLP within the T&O literature, which provides an understanding of some of the current limitations and subsequent lack of progress towards real-world implementation. It also highlights typical key deficiencies in reporting that can guide improvements in future NLP research.

In conclusion, NLP techniques have significant potential to revolutionize current approaches to data analysis, allowing use and assessment of vast quantities of unstructured free-text data that were previously a largely untapped resource. There are, however, several issues with study design and reporting that must be addressed to realize the potential for clinical practice integration. Appreciation of the importance of model calibration remains low. Sharing of code and data (where feasible) should become part of routine practice in order to maximize transparency in keeping with open science principles.


Correspondence should be sent to Luke Farrow. E-mail:

References

1. Farrow L, Zhong M, Ashcroft GP, Anderson L, Meek RMD. Interpretation and reporting of predictive or diagnostic machine-learning research in Trauma & Orthopaedics. Bone Joint J. 2021;103-B(12):1754-1758.

2. Bitran H. From free text to FHIR: text analytics for health launches new feature to boost interoperability. 2022. https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/from-free-text-to-fhir-text-analytics-for-health-launches-new/ba-p/3257066 (date last accessed 10 February 2025).

3. Farrow L, Wilde K, Dymiter J, et al. Use of “hidden in plain sight” de-identification methodology in electronic healthcare data provides minimal risk of misidentification: results from the iCAIRD Safe Haven artificial intelligence platform. Int J Popul Data Sci. 2022;25(3):2023.

4. Ford E, Oswald M, Hassan L, Bozentko K, Nenadic G, Cassell J. Should free-text data in electronic medical records be shared for research? A citizens’ jury study in the UK. J Med Ethics. 2020;46(6):367-377.

5. Young M, Holmes NE, Kishore K, et al. Natural language processing diagnosed behavioural disturbance phenotypes in the intensive care unit: characteristics, prevalence, trajectory, treatment, and outcomes. Crit Care. 2023;27(1):425.

6. Li J, Hu S, Shi C, et al. A deep learning and natural language processing-based system for automatic identification and surveillance of high-risk patients undergoing upper endoscopy: a multicenter study. EClinMed. 2022;53:101704.

7. Marafino BJ, Park M, Davies JM, et al. Validation of prediction models for critical care outcomes using natural language processing of electronic health record data. JAMA Netw Open. 2018;1(8):e185097.

8. Shah RF, Bini S, Vail T. Data for registry and quality review can be retrospectively collected using natural language processing from unstructured charts of arthroplasty patients. Bone Joint J. 2020;102-B(7_Supple_B):99-104.

9. Kolanu N, Brown AS, Beech A, Center JR, White CP. Natural language processing of radiology reports for the identification of patients with fracture. Arch Osteoporos. 2021;16(1):6.

10. Moher D, Liberati A, Tetzlaff J, Altman DG, PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med. 2009;6(7):e1000097.

11. No authors listed. OrthoSearch. https://orthosearch.org.uk (date last accessed 10 February 2025).

12. Campbell M, McKenzie JE, Sowden A, et al. Synthesis without meta-analysis (SWiM) in systematic reviews: reporting guideline. BMJ. 2020;368:l6890.

13. Langerhuizen DWG, Brown LE, Doornberg JN, Ring D, Kerkhoffs GMMJ, Janssen SJ. Analysis of online reviews of orthopaedic surgeons and orthopaedic practices using natural language processing. J Am Acad Orthop Surg. 2021;29(8):337-344.

14. Mohammadi R, Jain S, Namin AT, et al. Predicting unplanned readmissions following a hip or knee arthroplasty: retrospective observational study. JMIR Med Inform. 2020;8(11):e19761.

15. Groot OQ, Bongers MER, Karhade AV, et al. Natural language processing for automated quantification of bone metastases reported in free-text bone scintigraphy reports. Acta Oncol. 2020;59(12):1455-1460.

16. Pinto Dos Santos D, Brodehl S, Baeßler B, et al. Structured report data can be used to develop deep learning algorithms: a proof of concept in ankle radiographs. Insights Imaging. 2019;10(1):93.

17. Wang Y, Mehrabi S, Sohn S, Atkinson EJ, Amin S, Liu H. Natural language processing of radiology reports for identification of skeletal site-specific fractures. BMC Med Inform Decis Mak. 2019;19(Suppl 3):73.

18. Blaker K, Wijewardene A, White E, et al. Electronic search programs are effective in identifying patients with minimal trauma fractures. Osteoporos Int. 2022;33(2):435-441.

19. Karhade AV, Bongers MER, Groot OQ, et al. Natural language processing for automated detection of incidental durotomy. Spine J. 2020;20(5):695-700.

20. Karhade AV, Bongers MER, Groot OQ, et al. Development of machine learning and natural language processing algorithms for preoperative prediction and automated identification of intraoperative vascular injury in anterior lumbar spine surgery. Spine J. 2021;21(10):1635-1642.

21. Wagholikar A, Zuccon G, Nguyen A, et al. Automated classification of limb fractures from free-text radiology reports using a clinician-informed gazetteer methodology. Australas Med J. 2013;6(5):301-307.

22. Grundmeier RW, Masino AJ, Casper TC, et al. Identification of long bone fractures in radiology reports using natural language processing to support healthcare quality improvement. Appl Clin Inform. 2016;7(4):1051-1068.

23. Do BH, Wu AS, Maley J, Biswal S. Automatic retrieval of bone fracture knowledge using natural language processing. J Digit Imaging. 2013;26(4):709-713.

24. Sagheb E, Ramazanian T, Tafti AP, et al. Use of natural language processing algorithms to identify common data elements in operative notes for knee arthroplasty. J Arthroplasty. 2021;36(3):922-926.

25. Wyles CC, Tibbo ME, Fu S, et al. Use of natural language processing algorithms to identify common data elements in operative notes for total hip arthroplasty. J Bone Joint Surg Am. 2019;101-A(21):1931-1938.

26. Bovonratwet P, Shen TS, Islam W, Ast MP, Haas SB, Su EP. Natural language processing of patient-experience comments after primary total knee arthroplasty. J Arthroplasty. 2021;36(3):927-934.

27. Menendez ME, Shaker J, Lawler SM, Ring D, Jawa A. Negative patient-experience comments after total shoulder arthroplasty. J Bone Joint Surg Am. 2019;101-A(4):330-337.

28. Tibbo ME, Wyles CC, Fu S, et al. Use of natural language processing tools to identify and classify periprosthetic femur fractures. J Arthroplasty. 2019;34(10):2216-2219.

29. Fu S, Wyles CC, Osmon DR, et al. Automated detection of periprosthetic joint infections and data elements using natural language processing. J Arthroplasty. 2021;36(2):688-692.

30. Olthof AW, Shouche P, Fennema EM, et al. Machine learning based natural language processing of radiology reports in orthopaedic trauma. Comput Methods Programs Biomed. 2021;208:106304.

31. Karhade AV, Lavoie-Gagne O, Agaronnik N, et al. Natural language processing for prediction of readmission in posterior lumbar fusion patients: which free-text notes have the most utility? Spine J. 2022;22(2):272-277.

32. Foufi V, Lanteri S, Gaudet-Blavignac C, Remy P, Montet X, Lovis C. Automatic annotation tool to support supervised machine learning for scaphoid fracture detection. Stud Health Technol Inform. 2018;255:210-214.

33. Dominy CL, Arvind V, Tang JE, et al. Scoliosis surgery in social media: a natural language processing approach to analyzing the online patient perspective. Spine Deform. 2022;10(2):239-246.

34. Galbusera F, Cina A, Bassani T, Panico M, Sconfienza LM. Automatic diagnosis of spinal disorders on radiographic images: leveraging existing unstructured datasets with natural language processing. Glob Spine J. 2023;13(5):1257-1266.

35. Thirukumaran CP, Zaman A, Rubery PT, et al. Natural language processing for the identification of surgical site infections in orthopaedics. J Bone Joint Surg Am. 2019;101-A(24):2167-2174.

36. Buchlak QD, Clair J, Esmaili N, Barmare A, Chandrasekaran S. Clinical outcomes associated with robotic and computer-navigated total knee arthroplasty: a machine learning-augmented systematic review. Eur J Orthop Surg Traumatol. 2022;32(5):915-931.

37. Jungmann F, Kämpgen B, Hahn F, et al. Natural language processing of radiology reports to investigate the effects of the COVID-19 pandemic on the incidence and age distribution of fractures. Skeletal Radiol. 2022;51(2):375-380.

38. Borjali A, Magnéli M, Shin D, Malchau H, Muratoglu OK, Varadarajan KM. Natural language processing with deep learning for medical adverse event detection from free-text medical narratives: a case study of detecting total hip replacement dislocation. Comput Biol Med. 2021;129:104140.

39. Karhade AV, Bongers MER, Groot OQ, et al. Can natural language processing provide accurate, automated reporting of wound infection requiring reoperation after lumbar discectomy? Spine J. 2020;20(10):1602-1609.

40. Tan WK, Hassanpour S, Heagerty PJ, et al. Comparison of natural language processing rules-based and machine-learning systems to identify lumbar spine imaging findings related to low back pain. Acad Radiol. 2018;25(11):1422-1432.

41. Huhdanpaa HT, Tan WK, Rundell SD, et al. Using natural language processing of free-text radiology reports to identify type 1 Modic endplate changes. J Digit Imaging. 2018;31(1):84-90.

42. Wyles CC, Fu S, Odum SL, et al. External validation of natural language processing algorithms to extract common data elements in THA operative notes. J Arthroplasty. 2023;38(10):2081-2084.

43. Karhade AV, Oosterhoff JHF, Groot OQ, et al. Can we geographically validate a natural language processing algorithm for automated detection of incidental durotomy across three independent cohorts from two continents? Clin Orthop Relat Res. 2022;480(9):1766-1775.

44. Flores-Balado Á, Castresana Méndez C, Herrero González A, et al. Using artificial intelligence to reduce orthopedic surgical site infection surveillance workload: algorithm design, validation, and implementation in 4 Spanish hospitals. Am J Infect Control. 2023;51(11):1225-1229.

45. Tavabi N, Pruneski J, Golchin S, et al. Building large-scale registries from unstructured clinical notes using a low-resource natural language processing pipeline. Health Informatics. 2022.

46. Kita K, Uemura K, Takao M, et al. Use of artificial intelligence to identify data elements for The Japanese Orthopaedic Association National Registry from operative records. J Orthop Sci. 2023;28(6):1392-1399.

47. Davidson EM, Poon MTC, Casey A, et al. The reporting quality of natural language processing studies: systematic review of studies of radiology reports. BMC Med Imaging. 2021;21:142.

48. Riley RD, Ensor J, Snell KIE, et al. Calculating the sample size required for developing a clinical prediction model. BMJ. 2020;368:m441.

49. Dijkstra H, van de Kuit A, de Groot T, et al. Systematic review of machine-learning models in orthopaedic trauma. Bone Jt Open. 2024;5(1):9-19.

50. Moons KGM, Wolff RF, Riley RD, et al. PROBAST: a tool to assess risk of bias and applicability of prediction model studies: explanation and elaboration. Ann Intern Med. 2019;170(1):W1-W33.

51. Ormond MJ, Clement ND, Harder BG, Farrow L, Glester A. Acceptance and understanding of artificial intelligence in medical research among orthopaedic surgeons. Bone Jt Open. 2023;4(9):696-703.

Author contributions

L. Farrow: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing. He is the guarantor.

A. Raja: Conceptualization, Data curation, Investigation, Project administration, Writing – review & editing

M. Zhong: Conceptualization, Supervision, Writing – review & editing

L. Anderson: Conceptualization, Methodology, Supervision, Writing – review & editing

Funding statement

The author(s) disclose receipt of the following financial or material support for the research, authorship, and/or publication of this article: funding for open access publication is provided by the Chief Scientist Office.

ICMJE COI statement

The author(s) disclose receipt of the following financial or material support for the research, authorship, and/or publication of this article: L. Farrow is currently in receipt of a Chief Scientist Office Scotland Clinical Academic Fellowship, which is focused on the use of artificial intelligence techniques (including natural language processing) to improve the clinical care pathway in those referred for hip and knee arthroplasty. L. Farrow is also the guarantor and confirms that all listed authors meet the authorship criteria.

Data sharing

The data that support the findings for this study are available to other researchers from the corresponding author upon reasonable request.

Ethical review statement

Ethical review was not required due to the nature of the study.

Open access funding

The open access funding for this paper was provided by the Chief Scientist Office (ref. CAF 21/06).

Supplementary material

Table showing the example search strategy, and the PRISMA checklist.

© 2025 Farrow et al. This is an open-access article distributed under the terms of the Creative Commons Attribution Non-Commercial No Derivatives (CC BY-NC-ND 4.0) licence, which permits the copying and redistribution of the work only, and provided the original author and source are credited. See https://creativecommons.org/licenses/by-nc-nd/4.0/