Artificial Intelligence (AI) has been defined as “computer systems able to perform tasks normally requiring human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages” (Oxford Languages).
Automatic Speech Recognition (ASR) systems are therefore one type of AI.
This study investigated whether ASR transcription systems were capable of producing reliable police interview transcripts. It focussed on how well these systems handle non-standard dialects and speech from poor-quality audio.
The ASR systems produced transcripts with more, and more problematic, errors from recordings of speakers with non-standard accents. We therefore need to be cautious about the use of such technology in high-stakes settings such as police interviews.
Transcripts of police interviews are important in the criminal justice system. However, using human transcribers to produce these documents is a time-intensive and costly process. With recent advancements in ASR, it seems likely that automatic transcription could be used to process police-suspect interviews.
However, we don't yet know how good these systems are at transcribing these interactions, particularly when the speakers do not have a standard accent or the audio quality is less than ideal. These questions were addressed in this study.
Automatic Speech Recognition (ASR) software requires two key components:
a phonetic model (the acoustic properties of the sounds of the language); and
a lexical model (all the ways the sounds of the language can be combined into words, words into phrases, and so on).
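The interplay between the two models can be shown with a toy sketch. The words and probabilities below are invented for illustration and are not from the study; a real decoder scores thousands of candidates over continuous audio, but the principle is the same: the winning word is the one that scores best on both models combined.

```python
# Toy illustration of how the two ASR model components interact.
# All scores here are made-up numbers, purely for this sketch.

# Phonetic model: how well each candidate word matches the audio signal.
# The two candidates sound similar, so the scores are close.
phonetic_score = {"witness": 0.4, "whiteness": 0.45}

# Lexical model: how plausible each word is in the surrounding phrase,
# e.g. after "the police interviewed the ...".
lexical_score = {"witness": 0.05, "whiteness": 0.0001}

# The decoder picks the word that maximises the combined score.
best = max(phonetic_score, key=lambda w: phonetic_score[w] * lexical_score[w])
print(best)  # "witness": the lexical model outweighs the slight acoustic edge
```

This also hints at why unrepresentative training data causes errors: if a speaker's accent or dialect words were rare in the training data, both scores can be systematically wrong for that speaker.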
The quality and effectiveness of both models depend on how representative the training data used to build them is.
Is it practical and/or safe to incorporate AI-based transcription, using ASR, into the formal processing of police-suspect interviews?
The study tested ASR transcription performance using two British English accents: Standard Southern British English and West Yorkshire English. Three commercially available ASR transcription systems were tested, using recordings at different levels of audio quality.
The analysis assessed different error types:
insertions: when the system adds in an extra word;
deletions: when the system leaves a word out; and
substitutions: when the system incorrectly swaps in a new word rather than transcribing the word that is spoken in the audio recording.
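These three error types are the categories counted by the standard word error rate metric. As a rough sketch (not the scoring tool actually used in the study), they can be identified by aligning the reference transcript against the ASR output word by word:

```python
# Illustrative sketch: counting insertions, deletions, and substitutions
# via a word-level Levenshtein alignment of reference vs. ASR hypothesis.

def count_errors(ref_words, hyp_words):
    """Return (substitutions, deletions, insertions) for hyp vs. ref."""
    n, m = len(ref_words), len(hyp_words)
    # dp[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = ref_words[i - 1] == hyp_words[j - 1]
            dp[i][j] = min(dp[i - 1][j - 1] + (0 if same else 1),  # match/substitution
                           dp[i - 1][j] + 1,                       # deletion
                           dp[i][j - 1] + 1)                       # insertion
    # Walk back through the table to classify each edit.
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref_words[i - 1] == hyp_words[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            i, j = i - 1, j - 1                  # correct word, no error
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            subs += 1; i, j = i - 1, j - 1       # substitution
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1; i -= 1                    # deletion
        else:
            ins += 1; j -= 1                     # insertion
    return subs, dels, ins

ref = "I was at home all evening".split()
hyp = "I was home all evening innit".split()
print(count_errors(ref, hyp))  # → (0, 1, 1): "at" deleted, "innit" inserted
```

Note that when several alignments tie on total cost, the split between the three categories can vary, though the overall error count stays the same.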
More errors were produced for the West Yorkshire English speech than for the Standard Southern British English speech. The majority of errors for the West Yorkshire speakers were substitution errors. These are potentially more damaging in high-stakes or forensic contexts because they change the meaning of what was said and could prime listeners or readers into thinking that something was said which never actually occurred in the original audio recording. However, many of the errors were quite easy to identify, and so should be picked up by a human checker.
Workshop talk (video) [will be posted a few weeks after the workshop]
Workshop talk (slides) [coming soon]