OBJECTIVES - We sought to validate the use of crowdsourced surgical video assessment in the evaluation of urology residents performing flexible ureteroscopic laser lithotripsy.
METHODS - We collected video feeds from 30 intrarenal ureteroscopic laser lithotripsy cases in which residents, postgraduate year (PGY) two through six, handled the ureteroscope. The video feeds were annotated to represent overall performance and to contain the parts of the procedure being scored. Videos were submitted to a commercially available surgical video evaluation platform (Crowd-Sourced Assessment of Technical Skills). We used a validated ureteroscopic laser lithotripsy global assessment tool, modified to include only those domains that could be evaluated on the captured video. Videos were evaluated by crowd workers recruited through Amazon's Mechanical Turk platform and by five endourology-trained experts. Mean scores were calculated, and intraclass correlation coefficients (ICCs) were computed for the expert domain and total scores. ICCs were estimated using a linear mixed-effects model. Spearman rank correlation coefficients were calculated to measure the strength of the relationship between the mean crowd scores and the mean expert scores.
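As an illustration of the two statistics named above, the following is a minimal sketch and not the authors' analysis code. It assumes a balanced table in which every video is scored by every rater, and it swaps in the one-way ANOVA estimator of the ICC in place of the linear mixed-effects model used in the study. The function names (`icc_oneway`, `ranks`, `spearman`) and all scores in the example are hypothetical.

```python
from statistics import mean


def icc_oneway(ratings):
    """One-way random-effects ICC for a balanced videos-by-raters table.

    Assumption: this ANOVA estimator stands in for the mixed-model
    estimate described in the text; it is not the study's own method.
    """
    n = len(ratings)       # number of videos
    k = len(ratings[0])    # number of raters per video
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    # Mean square between videos and mean square within videos.
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Hypothetical scores: 4 videos, each rated by 3 experts.
    expert_scores = [[12, 15, 9], [14, 10, 13], [11, 16, 12], [13, 9, 15]]
    # Hypothetical mean crowd score for the same 4 videos.
    crowd_means = [13.1, 12.4, 14.0, 12.9]
    expert_means = [mean(row) for row in expert_scores]
    print("expert ICC:", icc_oneway(expert_scores))
    print("crowd vs expert Spearman:", spearman(crowd_means, expert_means))
```

An ICC near 1 indicates that raters largely agree on which videos score higher, while values near or below 0 indicate that rater disagreement dominates the differences between videos. For unbalanced data, where not every rater scores every video, a mixed-effects model is the more appropriate tool.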
RESULTS - A total of 30 videos were reviewed 2488 times by 487 crowd workers and five expert endourologists. ICCs between expert raters were all below accepted levels of correlation (0.30), with the overall score having an ICC of <0.001. For individual domains, the crowd scores did not correlate with expert scores, except for the stone retrieval domain (0.60, p = 0.015). In addition, crowdsourced scores had a negative correlation with PGY level (0.44, p = 0.019).
CONCLUSIONS - There is poor agreement between experts and poor correlation between expert and crowd scores when evaluating video feeds of ureteroscopic laser lithotripsy. The use of intraoperative video of ureteroscopy with laser lithotripsy to assess resident trainee skills does not appear reliable. This is further supported by the lack of a positive correlation between crowd scores and advancing PGY level.