A Factorial Deep Markov Model For Unsupervised Disentangled Representation Learning From Speech

Sameer Khurana, Shafiq Joty, Ahmed Ali, James Glass

Abstract

We present the Factorial Deep Markov Model (FDMM) for representation learning of speech. The FDMM learns disentangled, interpretable, and lower-dimensional latent representations from speech without supervision. We use a static latent variable and a dynamic latent variable to exploit the fact that information in a speech signal evolves at different time scales. Latent representations learned by the FDMM outperform a baseline i-vector system on speaker verification and dialect identification, and they also reduce the error rate of a phone recognition system in a domain-mismatch scenario.
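To make the static/dynamic split concrete, here is a toy sampler for a factorial generative process of this kind. It is a minimal sketch, not the authors' model. It assumes one static latent drawn per utterance and one dynamic latent per frame that follows first-order Markov transitions, with each frame generated from both. The learned neural transition and emission networks are replaced by fixed random tanh layers. The function name `sample_fdmm`, all dimensions, and the noise scales are hypothetical.

```python
import numpy as np

def sample_fdmm(num_frames, static_dim=4, dynamic_dim=8, obs_dim=40, seed=0):
    """Sample one utterance from a toy factorial deep Markov model.

    Hypothetical sketch: random tanh layers stand in for learned
    transition/emission networks; all sizes are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    # Random weights standing in for learned networks (assumption).
    W_trans = rng.normal(scale=0.5, size=(dynamic_dim, dynamic_dim))
    W_emit_dyn = rng.normal(scale=0.5, size=(obs_dim, dynamic_dim))
    W_emit_static = rng.normal(scale=0.5, size=(obs_dim, static_dim))

    # Static latent: drawn once per utterance (slowly varying information).
    z_static = rng.standard_normal(static_dim)

    z_dyn = np.zeros(dynamic_dim)  # initial dynamic state
    frames, dyn_states = [], []
    for _ in range(num_frames):
        # Dynamic latent: first-order Markov chain p(z_t | z_{t-1}).
        z_dyn = np.tanh(W_trans @ z_dyn) + 0.1 * rng.standard_normal(dynamic_dim)
        # Emission p(x_t | z_t, z_static): each frame depends on both latents.
        mean = np.tanh(W_emit_dyn @ z_dyn + W_emit_static @ z_static)
        frames.append(mean + 0.05 * rng.standard_normal(obs_dim))
        dyn_states.append(z_dyn)
    return z_static, np.stack(dyn_states), np.stack(frames)

z_s, z_d, x = sample_fdmm(num_frames=100)
print(z_s.shape, z_d.shape, x.shape)  # (4,) (100, 8) (100, 40)
```

Only the generative direction is shown. Learning such a model also requires an inference network and a variational objective, which this sketch omits.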

Type

Conference paper

Publication

In International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Date

April 2019
