# [2106.01035] Towards Unified Surgical Skill Assessment

### Significance

"Beep. Your surgical skill scored 9 out of 100, Dr. Kim"

### Keypoints

• Propose a method for surgical skill assessment based on multiple aspects of the surgical process
• Experimentally show strengths of the proposed method

### Review

#### Background

The outcome of a surgery can depend on the skills of the surgeon. Evaluating surgical skills in an objective way can be useful for surgeons and students in training, since the scores can be regarded as feedback from a skilled surgeon. Surgical skill can be evaluated from many facets, such as surgical tool usage, clearness of the surgical field, or the pattern of surgical events. For example, compared with an unskilled surgeon, a skilled surgeon would show more precise and smoother tool trajectories, better visibility of the surgical field, and a more linear progression of surgical events. Most previous studies on automated surgical skill assessment rely on only one of these aspects. The authors propose a method that incorporates (i) tool-related, (ii) proxy-related (clearness of the operating field), and (iii) event-related aspects of the surgical process to provide a surgical skill score.

[Figure: Different aspects of surgical skills]

#### Keypoints

##### Propose a method for surgical skill assessment based on multiple aspects of the surgical process

Four input features (V: visual; T: tool; P: proxy; E: event) are extracted from a surgical video, each as a sequence of vectors. Each feature sequence $X_{m}$ is processed independently by an encoding function $\phi_{m}$, and a score function $\lambda_{m}$ then outputs the score sequence $S_{m}$, where $m\in \{\texttt{V},\texttt{T},\texttt{P},\texttt{E} \}$. Although each score sequence represents the skill score based on one aspect of the surgical video, the dependencies between these aspects are not yet incorporated into the model. The authors therefore propose to infer an importance weight $W_{m}$ using all four input features as the prior, so that the relative importance of each aspect is computed from all four aspects. Specifically, the aggregation function $\psi$ aggregates all four features, followed by a weight function $w_{m}$ which outputs the weight of each aspect: $W_{m} = \mathrm{softmax}(w_{m}(\psi(X_{\mathtt{V}},X_{\mathtt{T}},X_{\mathtt{P}},X_{\mathtt{E}})))$. The final skill score $q$ is computed by averaging over the aspects and summing over time:

\begin{align} q = \frac{1}{4}\sum_{m}\sum_{i=0}^{L} S_{m,i}W_{m,i}, \end{align}

where $L$ is the length of the video and $i$ is the index of the video frames.

[Figure: Schematic illustration of the proposed method]

The model is trained to minimize the MSE loss between the score $q$ and the surgical skill score annotated by an expert surgeon. The authors further suggest adding a self-supervised contrastive loss that forecasts upcoming features from the current embedding.
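The weighted aggregation of per-aspect scores into $q$ can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the array layout (aspects by frames), and in particular the choice to normalize the softmax across the four aspects at each frame are assumptions made here for illustration, since the review does not state the softmax axis. The random inputs stand in for the outputs of $\lambda_{m}$ and $w_{m}$.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def aggregate_skill_score(scores, weight_logits):
    """Combine per-aspect score sequences into a single skill score q.

    scores:        array of shape (M, L), the S_{m,i} values
    weight_logits: array of shape (M, L), the pre-softmax outputs of w_m
    ASSUMPTION: softmax is taken across the M aspects for every frame.
    """
    weights = softmax(weight_logits, axis=0)  # W_{m,i}
    num_aspects = scores.shape[0]             # 4 in the paper (V, T, P, E)
    # q = (1 / M) * sum_m sum_i S_{m,i} * W_{m,i}
    return float((scores * weights).sum() / num_aspects)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S = rng.random((4, 10))            # placeholder scores, 4 aspects, 10 frames
    logits = rng.normal(size=(4, 10))  # placeholder weight logits
    print(aggregate_skill_score(S, logits))
```

With equal logits every aspect receives weight 1/4 at each frame, so the weights only change the result when the model judges some aspect more informative than the others.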

##### Experimentally show strengths of the proposed method

The proposed method is evaluated on the JIGSAWS dataset, which contains three simulated tasks (suturing, needle passing, knot tying), and on an in-house dataset of gastrectomy surgical videos that include lymph node dissection. The encoding function $\phi_{m}$ is defined as ResNet-101 for aspect $\texttt{V}$; a spatial histogram of segmentation masks along with tool position, velocity, and angles for $\texttt{T}$; an operating field clearness measure implemented according to a previous work for $\texttt{P}$; and a multi-stage temporal convolutional network for $\texttt{E}$.

The proposed method shows a higher Spearman's correlation with the ground truth than the baseline methods on the in-house surgical dataset.

[Results: Performance of the proposed method on the in-house dataset]

The proposed method also outperforms the baselines on the JIGSAWS dataset, suggesting strengths of the method.

[Results: Performance of the proposed method on the JIGSAWS dataset. SU: suturing; NP: needle passing; KT: knot tying]
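For readers unfamiliar with the evaluation metric, Spearman's correlation is the Pearson correlation computed on the ranks of the two variables, so it rewards predicted scores that order the videos the same way as the expert annotations. The sketch below is a plain NumPy illustration, not the authors' evaluation code; the function names and the example scores are made up for this purpose.

```python
import numpy as np


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks


def spearman_rho(predicted, annotated):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rp = average_ranks(predicted)
    ra = average_ranks(annotated)
    return float(np.corrcoef(rp, ra)[0, 1])


if __name__ == "__main__":
    # Hypothetical scores for five videos (illustrative values only).
    predicted = [12.0, 30.5, 18.2, 25.0, 9.1]
    annotated = [10, 28, 22, 20, 8]
    print(spearman_rho(predicted, annotated))
```

Because only the ordering matters, a model whose scores are on a different scale from the annotations can still reach a correlation of 1 as long as it ranks the videos identically.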

For the ablation study results and the qualitative analysis of the model outputs, refer to the original paper.