Abstract
Introduction
Real-time tracking of surgical tools has applications in the assessment of surgical skill and operating room (OR) workflow. Accordingly, efforts have been devoted to developing low-cost systems that track the location of surgical tools in real time without significant augmentation to the tools themselves. Deep learning methods have recently shown success in a multitude of computer vision tasks, including object detection, and thus show potential for application to surgical tool tracking. The objective of the current study was to develop and evaluate a deep learning-based computer vision system that uses a single camera for the detection and pose estimation of multiple surgical tools routinely used in both knee and hip arthroplasty.
Methods
A computer vision system was developed for the detection and six-degree-of-freedom (6-DoF) pose estimation of two surgical tools (mallet and broach handle) using only RGB camera frames. The deep learning approach consisted of a single convolutional neural network (CNN) for object detection and semantic keypoint prediction, followed by an optimization step that placed the known tool geometries into the local camera coordinate system. Inference on a camera frame of size 256-by-352 pixels took 0.3 seconds. The object detection component of the system was evaluated on a manually annotated stream of video frames. The accuracy of the system was evaluated by comparing the estimated pose (position and orientation) of a tool with the ground-truth pose determined using three retroreflective markers placed on each tool and a 14-camera motion capture system (Vicon, Centennial, CO). Marker positions on each tool were transformed into the local camera coordinate system and compared to their estimated locations.
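The optimization step described above, placing a known tool geometry into the camera coordinate system from predicted 2D keypoints, is a perspective-n-point style problem. The following is a minimal illustrative sketch of that idea, not the authors' implementation: it assumes a pinhole camera model with hypothetical intrinsics, an axis-angle pose parameterization, and a generic least-squares solver that minimizes reprojection error between predicted keypoints and the projected 3D model points.

```python
# Illustrative sketch only (assumed pipeline, not the authors' code):
# recover a 6-DoF pose by minimizing reprojection error of known 3D
# keypoints on the tool against CNN-predicted 2D keypoints.
import numpy as np
from scipy.optimize import least_squares


def rodrigues(rvec):
    """Convert an axis-angle vector to a 3x3 rotation matrix."""
    theta = np.linalg.norm(rvec)
    if theta < 1e-12:
        return np.eye(3)
    k = rvec / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)


def project(points_3d, rvec, tvec, fx, fy, cx, cy):
    """Project 3D model points into the image with a pinhole camera."""
    pc = points_3d @ rodrigues(rvec).T + tvec  # model -> camera frame
    u = fx * pc[:, 0] / pc[:, 2] + cx
    v = fy * pc[:, 1] / pc[:, 2] + cy
    return np.stack([u, v], axis=1)


def estimate_pose(keypoints_2d, model_3d, fx, fy, cx, cy):
    """Solve for pose (axis-angle rvec, translation tvec in meters)."""
    def residuals(params):
        rvec, tvec = params[:3], params[3:]
        return (project(model_3d, rvec, tvec, fx, fy, cx, cy)
                - keypoints_2d).ravel()

    # Initialize at identity rotation, 1 m in front of the camera.
    x0 = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 1.0])
    sol = least_squares(residuals, x0)
    return sol.x[:3], sol.x[3:]
```

In practice a library routine such as OpenCV's `solvePnP` serves the same role; the sketch simply makes the reprojection-error objective explicit.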
Results
Detection accuracy determined from frame-wise confusion matrices was 82% and 95% for the mallet and broach handle, respectively. Object detection and keypoint predictions were qualitatively assessed. Marker error resulting from pose estimation was as low as 1.3 cm across the evaluation scenes. Pose estimation of the tools in each evaluation scene was also qualitatively assessed.
Discussion
The proposed computer vision system combined a CNN with optimization to estimate the 6-DoF pose of surgical tools from only RGB camera frames. The system's object detection component performed on par with the state-of-the-art object detection literature, and pose estimates were efficiently computed from the CNN predictions. The current system has implications for surgical skill assessment and for operations-based research to improve operating room efficiency. However, future development of the object detection and keypoint prediction components is needed to minimize potential pose error. Nominal marker errors of 1.3 cm demonstrate the potential of this system to yield accurate pose estimates of surgical tools.
For any figures or tables, please contact the authors directly.