Building a Speaker Diarization System: Lessons from VoxSRC 2023

Authors

  • Davit S. Karamyan, Russian-Armenian University; Krisp.ai
  • Grigor A. Kirakosyan, Krisp.ai; Institute of Mathematics of NAS RA

DOI

https://doi.org/10.51408/1963-0109

Keywords

Speaker recognition, Speaker diarization, VoxSRC 2023

Abstract

Speaker diarization is the process of partitioning an audio recording into segments corresponding to individual speakers. In this paper, we present a robust speaker diarization system and describe its architecture. We discuss the key components necessary for building a strong diarization system: voice activity detection (VAD), speaker embedding, and clustering. Our system won the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2023, a widely recognized competition for evaluating speaker diarization systems.
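
The three-stage pipeline named in the abstract (VAD to isolate speech, per-segment speaker embeddings, and clustering to group segments by speaker) can be sketched in miniature. Everything in this sketch is a hypothetical stand-in: the energy threshold VAD, the two-statistic "embedding", and the greedy cosine clustering merely illustrate the control flow, where a real system would use a neural VAD, a trained speaker-embedding network, and a stronger clustering method such as spectral clustering.

```python
import math

def energy_vad(samples, frame_len=160, threshold=0.01):
    """Toy VAD: mark a frame as speech when its mean energy exceeds a threshold."""
    flags = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        flags.append(energy > threshold)
    return flags

def toy_embedding(samples):
    """Stand-in for a neural speaker embedding: two crude waveform statistics."""
    n = len(samples)
    mean_abs = sum(abs(x) for x in samples) / n
    # Zero-crossing rate, a rough proxy for spectral content.
    zcr = sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0) / n
    return (mean_abs, zcr)

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster(embeddings, threshold=0.95):
    """Greedy clustering: attach each segment to the first similar centroid,
    otherwise open a new speaker cluster."""
    labels, centroids = [], []
    for emb in embeddings:
        for k, c in enumerate(centroids):
            if cosine(emb, c) >= threshold:
                labels.append(k)
                break
        else:
            centroids.append(emb)
            labels.append(len(centroids) - 1)
    return labels
```

On synthetic input, two low-pitched segments and two high-pitched segments end up in two clusters, and silent audio produces no speech frames, which is the qualitative behavior each stage contributes to the full system.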

Published

2023-11-30

How to Cite

Karamyan, D. S., & Kirakosyan, G. A. (2023). Building a Speaker Diarization System: Lessons from VoxSRC 2023. Mathematical Problems of Computer Science, 60, 52–62. https://doi.org/10.51408/1963-0109