A Factorial Deep Markov Model For Unsupervised Disentangled Representation Learning From Speech

Abstract

We present the Factorial Deep Markov Model (FDMM) for representation learning of speech. The FDMM learns disentangled, interpretable, and lower-dimensional latent representations from speech without supervision. We use a static and a dynamic latent variable to exploit the fact that information in a speech signal evolves at different time scales. Latent representations learned by the FDMM outperform a baseline i-vector system on speaker verification and dialect identification, and they also reduce the error rate of a phone recognition system in a domain-mismatch scenario.

Publication
In International Conference on Acoustics, Speech, and Signal Processing (ICASSP)