Aims. To develop a core outcome set of measurements from postoperative radiographs that can be used to assess technical skill in performing dynamic hip screw (DHS) fixation and hemiarthroplasty, and to validate these against Van der Vleuten’s criteria for effective assessment.

Methods. A Delphi exercise was undertaken at a regional major trauma centre to identify candidate measurement items. The feasibility of taking these measurements was tested by two of the authors (HKJ, GTRP). Validity and reliability were examined using the radiographs of operations performed by orthopaedic residents (n = 28) participating in a multicentre randomized controlled educational trial (ISRCTN20431944). Trainees were divided into novice and intermediate groups, defined as having performed fewer than ten or at least ten cases, respectively, for each of DHS and hemiarthroplasty at baseline. The procedure-based assessment (PBA) global rating score was taken as the gold standard assessment for the purposes of concurrent validity. Intra- and inter-rater reliability testing was performed on a random subset of 25 cases.

Results. In total, 327 DHS and 248 hemiarthroplasty procedures were performed by 28 postgraduate year (PGY) 3 to 5 orthopaedic trainees during the 2014 to 2015 surgical training year at nine NHS hospitals in the West Midlands, UK. Overall, 109 PBAs were completed for DHS and 80 for hemiarthroplasty. Expert consensus identified four ‘final product analysis’ (FPA) radiological parameters of technical success for DHS: tip-apex distance (TAD); lag screw position in the femoral head; flushness of the plate against the lateral femoral cortex; and eight-cortex hold of the plate screws. Three parameters were identified for hemiarthroplasty: leg length discrepancy; femoral stem alignment; and femoral offset. Face validity, content validity, and feasibility were excellent.
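
As an illustration of how an FPA parameter such as TAD can be derived from radiograph measurements, the sketch below assumes the commonly used approach of summing the screw tip-to-apex distances on the anteroposterior (AP) and lateral views, each corrected for magnification using the known lag screw diameter. The text above does not state which measurement method was used in this study, so this is an assumption, and the function and parameter names are hypothetical.

```python
def tip_apex_distance(x_ap_mm, x_lat_mm, d_true_mm, d_ap_mm, d_lat_mm):
    """Magnification-corrected tip-apex distance (TAD) in millimetres.

    Assumed method (not confirmed by the source text):
        TAD = x_ap * (d_true / d_ap) + x_lat * (d_true / d_lat)

    x_ap_mm, x_lat_mm : distance from the lag screw tip to the apex of the
                        femoral head, as measured on the AP and lateral views
    d_true_mm         : known (manufactured) diameter of the lag screw
    d_ap_mm, d_lat_mm : lag screw diameter as measured on each view, used to
                        estimate the magnification of that radiograph
    """
    if min(d_true_mm, d_ap_mm, d_lat_mm) <= 0:
        raise ValueError("screw diameters must be positive")
    if x_ap_mm < 0 or x_lat_mm < 0:
        raise ValueError("tip-to-apex distances cannot be negative")

    # Scale each measured distance by that view's magnification correction.
    corrected_ap = x_ap_mm * (d_true_mm / d_ap_mm)
    corrected_lat = x_lat_mm * (d_true_mm / d_lat_mm)
    return corrected_ap + corrected_lat


# Hypothetical measurements, for illustration only (not study data).
tad = tip_apex_distance(x_ap_mm=12.0, x_lat_mm=10.0,
                        d_true_mm=8.0, d_ap_mm=10.0, d_lat_mm=8.0)
print(f"TAD = {tad:.1f} mm")
```

With these made-up inputs the AP view is corrected by a factor of 0.8 and the lateral view needs no correction, giving a TAD of roughly 19.6 mm.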
For all measurements, performance was better in the intermediate group than in the novice group, and the difference was statistically significant for TAD (p < 0.001) and femoral stem alignment (p = 0.023). Concurrent validity was poor when measured against the global PBA score. This may be explained by the fact that they are measuring difference