After a brief survey of the current literature, it appears that no widely used, standardized benchmark for ASR on clinical data exists. Given the vast resources available in the TalkBank corpus, it is feasible to build such a benchmark and evaluate how well a few commercial ASR systems perform this task.
Although no single baseline exists for benchmarking ASR on clinical datasets, several separate efforts address individual components of this front.
Evaluation Datasets
As perhaps an exemplar of the lack of standardization in ASR performance evaluation, the Whisper ASR model ((Radford et al. 2022)) was not evaluated on one particular benchmark but instead on a series of multi-domain benchmarks.
This is perhaps for good reason: recent results (discussed below) show that single-domain benchmarks do not describe performance well across other domains, or even other usage methods. The battery of tests run by (Radford et al. 2022) can therefore be thought of as a single multi-usage suite covering a good span of the recent standard ASR performance tests; among them:
Standard Datasets
- CORAAL: a dataset of high-quality but lower-fidelity recordings of African-American vernacular speech of varying degrees in conversation (incl. cross-talk, etc.) ((Farrington and Kendall 2021))
- EARNINGS: a set of benchmark datasets of earnings calls within various financial industries ((Rio et al. 2021))
- TED-LIUM 3: a dataset of high-fidelity recordings of full-length TED talks ((Hernandez et al. 2018))
In addition to the three evaluation datasets above, the model was also evaluated on LibriSpeech (Panayotov et al. 2015), a gold-standard corpus of open-source, high-fidelity recordings of audiobooks.
“Home brew” Benchmarks
In addition to the three published, standard datasets above, Radford et al. also used a series of self-selected datasets of varying quality.
- Rev16: Rev.AI’s clean podcast transcription dataset https://www.rev.ai/blog/podcast-transcription-benchmark-part-1/
- Meanwhile: Recordings of Stephen Colbert’s Meanwhile segments
- Kincaid46: a selection of YouTube videos drawn from a blog post comparing commercial transcription services: https://medium.com/descript/which-automatic-transcription-service-is-the-most-accurate-2018-2e859b23ed19
Evaluation Metrics
Most benchmarks still report results in terms of word error rate (WER) and the standard lexical distance metric BLEU ((Papineni et al. 2001)). These are two generally well-accepted ways of reporting ASR performance, and they suffice for most of the datasets cited above.
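To make the standard metric concrete, here is a minimal sketch of WER as it is conventionally defined: the token-level edit distance (substitutions, insertions, and deletions) between a reference transcript and an ASR hypothesis, divided by the reference length. The tokenization (whitespace split) is a simplifying assumption; real benchmarks normalize punctuation, casing, and disfluency codes first.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance between the two token sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion out of six reference words
```

Note that WER treats every token equally, which is exactly the limitation discussed below for clinical data.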
However, some very recent results ((Shor et al. 2023)) indicate that BLEU and WER themselves do not capture what is clinically relevant. Some ASR mistakes (such as errors on the investigator’s speech, or errors unrelated to the disfluency being observed) matter far less than others (errors on the participant’s speech, especially missed filled pauses, wrongly coded utterances, etc.). Shor et al. also present an alternative metric to quantify such errors: training a BERT model ((Devlin et al. 2018)) on the classification task of “clinician preference” (i.e., predicting which of two transcription errors would be less problematic to a clinician), then using that model’s judgments to evaluate ASR performance.
This last method is likely overkill. However, it is worth discussing whether richer information (such as separately binning missed clinically significant markers, like filled pauses) in addition to simple BLEU and WER would be useful as we develop our own benchmarks.
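One lightweight version of that binning idea can be sketched as follows: alongside the overall WER, separately count clinically significant tokens that appear in the reference transcript but are dropped by the ASR hypothesis. The marker set and counting scheme here are illustrative assumptions, not a published metric.

```python
from collections import Counter

# Assumed marker set for the sketch; a real benchmark would draw these
# from the transcript's annotation scheme (e.g. CHAT filled-pause codes).
FILLED_PAUSES = {"um", "uh", "er", "eh"}

def missed_markers(reference: str, hypothesis: str, markers=FILLED_PAUSES) -> int:
    """Count marker tokens present in the reference but absent from the hypothesis."""
    ref_counts = Counter(t for t in reference.lower().split() if t in markers)
    hyp_counts = Counter(t for t in hypothesis.lower().split() if t in markers)
    return sum(max(ref_counts[m] - hyp_counts[m], 0) for m in ref_counts)

ref = "um I uh went to the uh store"
hyp = "I went to the uh store"   # ASR dropped "um" and one "uh"
print(missed_markers(ref, hyp))  # → 2
```

Reporting this count next to WER would let two systems with similar overall error rates be distinguished by how often they erase the markers clinicians actually care about.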
Discussion
(Szymański et al. 2020) (sec. 3, “Call to Action”) offers some guidance on the design of robust ASR benchmarks. Among them:
- Higher-quality annotations, such as the morphology information we provide with %mor, to aid language model training
- A broader range of human dialects and variations covered
- Performance across many recording domains (various processing of audio signals, properties of the signal itself, etc.)
Though TalkBank contains a wealth of data, individual corpora often have little variation, which Szymański et al. show causes degraded performance. It is therefore useful to create a benchmark that spans multiple problem domains and recording schemes, providing a reproducible and more accurate assessment of a given model.
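The reporting side of such a multi-domain benchmark can be sketched simply: compute WER per domain and macro-average across domains, so that degraded performance in any one domain stays visible rather than being diluted in a pooled score. The domain names and sample pairs below are invented for illustration.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via rolling-array edit distance over tokens."""
    r, h = reference.split(), hypothesis.split()
    prev = list(range(len(h) + 1))
    for i, rt in enumerate(r, 1):
        cur = [i] + [0] * len(h)
        for j, ht in enumerate(h, 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (rt != ht))
        prev = cur
    return prev[-1] / len(r)

# domain -> list of (reference, hypothesis) pairs; data invented for the sketch
benchmark = {
    "audiobooks":   [("the quick brown fox", "the quick brown fox")],
    "earnings":     [("revenue grew ten percent", "revenue grew ten per cent")],
    "conversation": [("um yeah I guess so", "yeah I guess so")],
}

per_domain = {dom: sum(wer(r, h) for r, h in pairs) / len(pairs)
              for dom, pairs in benchmark.items()}
macro_wer = sum(per_domain.values()) / len(per_domain)
print(per_domain, macro_wer)
```

Publishing the per-domain breakdown alongside the macro-average is what makes the benchmark diagnostic rather than a single opaque number.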
Szymański et al. also raise another issue in their paper: “due to legal constraints … we are not able to provide the community with neither the benchmark data nor the detailed information about evaluated systems” ((Szymański et al. 2020), section 2). The anonymization of benchmarked models seen in both Szymański et al. and Radford et al. may point to a legal barrier specific to benchmarking existing commercial ASR systems.