Artificial Intelligence (AI) has been defined as “computer systems able to perform tasks normally requiring human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages” (Oxford Languages).
Automatic Speech Recognition (ASR) systems are therefore one type of AI.
This study investigated whether ASR transcription systems were capable of producing reliable police interview transcripts. It focussed on how well these systems handle non-standard dialects and speech from poor-quality audio.
The ASR systems produced transcripts with more, and more problematic, errors from recordings of speakers with non-standard accents. We therefore need to be cautious about the use of such technology in high-stakes settings such as police interviews.
Transcripts of police interviews are important in the criminal justice system. However, using human transcribers to produce these documents is a time-intensive and costly process. With recent advancements in ASR, it seems likely that automatic transcription could be used to process police-suspect interviews.
However, we don't yet know how good these systems are at transcribing these interactions, particularly when the speakers do not have a standard accent or the audio quality is less than ideal. These questions were addressed in this study.
Automatic Speech Recognition (ASR) software requires two key components:
a phonetic model (the acoustic properties of the sounds of the language); and
a lexical model (all the ways the sounds of the language can be combined into words, words into phrases, and so on).
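The interplay between the two models can be shown with a toy sketch. The words and probabilities below are invented for illustration and are not from the study; a real decoder scores thousands of candidates over continuous audio, but the principle is the same: the winning word is the one that scores best on both models combined.

```python
# Toy illustration of how the two ASR model components interact.
# All scores here are made-up numbers, purely for this sketch.

# Phonetic model: how well each candidate word matches the audio signal.
# The two candidates sound similar, so the scores are close.
phonetic_score = {"witness": 0.4, "whiteness": 0.45}

# Lexical model: how plausible each word is in the surrounding phrase,
# e.g. after "the police interviewed the ...".
lexical_score = {"witness": 0.05, "whiteness": 0.0001}

# The decoder picks the word that maximises the combined score.
best = max(phonetic_score, key=lambda w: phonetic_score[w] * lexical_score[w])
print(best)  # "witness": the lexical model outweighs the slight acoustic edge
```

This also hints at why unrepresentative training data causes errors: if a speaker's accent or dialect words were rare in the training data, both scores can be systematically wrong for that speaker.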
The quality and effectiveness of both models depend on how representative the training data used to build them is.
Is it practical and/or safe to incorporate AI-based transcription, using ASR, into the formal processing of police-suspect interviews?
The study tested ASR transcription performance using two British English accents: Standard Southern British English and West Yorkshire English. Three commercially available ASR transcription systems were tested, using recordings at different levels of audio quality.
The analysis assessed different error types:
insertions: when the system adds in an extra word;
deletions: when the system leaves a word out; and
substitutions: when the system incorrectly swaps in a new word rather than transcribing the word that is spoken in the audio recording.
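These three error types are the categories counted by the standard word error rate metric. As a rough sketch (not the scoring tool actually used in the study), they can be identified by aligning the reference transcript against the ASR output word by word:

```python
# Illustrative sketch: counting insertions, deletions, and substitutions
# via a word-level Levenshtein alignment of reference vs. ASR hypothesis.

def count_errors(ref_words, hyp_words):
    """Return (substitutions, deletions, insertions) for hyp vs. ref."""
    n, m = len(ref_words), len(hyp_words)
    # dp[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = ref_words[i - 1] == hyp_words[j - 1]
            dp[i][j] = min(dp[i - 1][j - 1] + (0 if same else 1),  # match/substitution
                           dp[i - 1][j] + 1,                       # deletion
                           dp[i][j - 1] + 1)                       # insertion
    # Walk back through the table to classify each edit.
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref_words[i - 1] == hyp_words[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            i, j = i - 1, j - 1                  # correct word, no error
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            subs += 1; i, j = i - 1, j - 1       # substitution
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1; i -= 1                    # deletion
        else:
            ins += 1; j -= 1                     # insertion
    return subs, dels, ins

ref = "I was at home all evening".split()
hyp = "I was home all evening innit".split()
print(count_errors(ref, hyp))  # → (0, 1, 1): "at" deleted, "innit" inserted
```

Note that when several alignments tie on total cost, the split between the three categories can vary, though the overall error count stays the same.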
More errors were produced for the West Yorkshire English speech than for the Standard Southern British English speech. The majority of errors for the West Yorkshire speakers were substitution errors. These are potentially more damaging in high-stakes or forensic contexts because they change the meaning of what was said and could prime listeners or readers into thinking that something was said which never actually occurred in the original audio recording. However, many of the errors were quite easy to identify, and so should be picked up by a human checker.
Workshop talk (video) [will be posted a few weeks after the workshop]
Workshop talk (slides) [coming soon]