Speaker diarization (or speaker segmentation) is the process of automatically assigning a speaker identity to each segment of an audio file. Segmenting by speaker is useful in many applications for understanding who said what in a conversation, and the resulting speaker information is crucial for tasks such as emotion detection, behavioural analysis, and topic analysis. For instance, when processing a customer service call, one might want to analyze the sentiment of the customer and the service agent separately.
Typical speaker diarization pipelines involve two steps:
- Segmentation: detect all speaker change points. Common techniques have been based on the BIC (Bayesian Information Criterion) and on RNNs (Recurrent Neural Networks). While segmentation is mostly driven by acoustic cues (e.g. MFCC features), lexical cues (the transcript) can be used to further improve it.
- Clustering: identify which speaker uttered each segment. Systems based on agglomerative clustering have typically been successful here.
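The BIC-based segmentation mentioned above can be sketched as follows. The idea is to ask, for a candidate change point, whether the frames are better modelled by one Gaussian or by two (one per side); a positive delta-BIC favours a split. This is a minimal illustration, not a production implementation: it assumes diagonal covariances (dimensions treated as independent) to keep the maths simple, and the feature frames would in practice be MFCC vectors.

```python
import math

def log_det_diag(frames):
    """Log-determinant of a diagonal covariance estimate.

    Simplifying assumption: feature dimensions are treated as
    independent, so log|Sigma| is the sum of per-dimension log
    variances.
    """
    n = len(frames)
    logdet = 0.0
    for j in range(len(frames[0])):
        col = [f[j] for f in frames]
        mu = sum(col) / n
        var = sum((x - mu) ** 2 for x in col) / n
        logdet += math.log(var + 1e-10)  # floor to avoid log(0)
    return logdet

def delta_bic(frames, t, lam=1.0):
    """Delta-BIC for a candidate speaker change at frame index t.

    Positive values favour splitting, i.e. suggest a change point.
    lam is the usual BIC penalty weight (lambda).
    """
    n, d = len(frames), len(frames[0])
    left, right = frames[:t], frames[t:]
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * math.log(n)
    return (0.5 * n * log_det_diag(frames)
            - 0.5 * len(left) * log_det_diag(left)
            - 0.5 * len(right) * log_det_diag(right)
            - penalty)
```

In a full segmenter, `delta_bic` would be evaluated over a sliding window of candidate points, emitting a boundary wherever the score peaks above zero.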
Recent deep learning approaches have also been proposed that combine the tasks of segmentation and clustering.
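The clustering step can be sketched as a simple average-link agglomerative procedure over per-segment speaker embeddings. This is an illustrative toy, assuming each segment has already been reduced to a fixed-length embedding (e.g. an i-vector or neural speaker embedding) and using cosine distance; real systems would use an optimized library implementation and a tuned stopping threshold.

```python
import math

def cosine_dist(u, v):
    """Cosine distance between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def agglomerative_cluster(embeddings, threshold=0.5):
    """Average-link agglomerative clustering of segment embeddings.

    Start with one cluster per segment and repeatedly merge the
    closest pair of clusters until the smallest inter-cluster
    distance exceeds the threshold. Each surviving cluster is
    treated as one speaker; it holds the indices of its segments.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def avg_dist(c1, c2):
        return sum(cosine_dist(embeddings[i], embeddings[j])
                   for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        d, a, b = min(((avg_dist(c1, c2), i, j)
                       for i, c1 in enumerate(clusters)
                       for j, c2 in enumerate(clusters) if i < j),
                      key=lambda x: x[0])
        if d > threshold:
            break  # remaining clusters are distinct speakers
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters
```

The threshold effectively decides how many speakers come out, which is why agglomerative systems are attractive when the speaker count is unknown in advance.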