We observe that the time-varying content representations extracted from automatic speech recognition (ASR) models encode not only the semantics of a speech signal (i.e., what is said) but also rich pronunciation and prosody information (i.e., how it is said).
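To make this concrete, such frame-level representations can be obtained from the hidden states of a pretrained ASR encoder. The sketch below is illustrative only, not the extraction pipeline used in this work; the checkpoint and layer choice are our own assumptions.

```python
# Hypothetical sketch: extracting time-varying content features from a
# pretrained ASR encoder. The checkpoint and the use of the final hidden
# layer are assumptions, not this paper's exact setup.
import torch
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor

model_name = "facebook/wav2vec2-base-960h"  # assumed ASR checkpoint
extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name).eval()

waveform = torch.randn(16000)  # 1 s of dummy 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    # hidden_states: (batch, frames, dim) -- one vector per ~20 ms frame,
    # carrying phonetic content along with pronunciation/prosody cues
    hidden_states = model(inputs.input_values).last_hidden_state
```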
Using a multi-decoder architecture, our BN2BN model maps the local content features of utterances from any number of source accents to those of any number of target accents within a single model, effectively recasting accent conversion as a machine translation problem.
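For illustration, a minimal sketch of such a many-to-many mapper follows: a shared encoder consumes frame-level bottleneck features from any source accent, and a dedicated decoder head per target accent emits the converted features. All module names, dimensions, and accent labels are our own illustrative assumptions, and the sketch omits the attention and duration modeling a full sequence-to-sequence translation model would use.

```python
# Hypothetical sketch of a many-to-many BN2BN mapper: a shared encoder
# over source-accent bottleneck features, plus one decoder per target
# accent. All names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class BN2BN(nn.Module):
    def __init__(self, bn_dim: int, hidden_dim: int, target_accents: list):
        super().__init__()
        # shared encoder over frame-level bottleneck features
        self.encoder = nn.LSTM(bn_dim, hidden_dim, num_layers=2,
                               batch_first=True, bidirectional=True)
        # one lightweight decoder head per target accent
        self.decoders = nn.ModuleDict({
            accent: nn.Sequential(
                nn.Linear(2 * hidden_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, bn_dim),
            )
            for accent in target_accents
        })

    def forward(self, bn_feats: torch.Tensor, target_accent: str) -> torch.Tensor:
        # bn_feats: (batch, frames, bn_dim) from any source accent
        encoded, _ = self.encoder(bn_feats)
        return self.decoders[target_accent](encoded)

model = BN2BN(bn_dim=256, hidden_dim=512, target_accents=["us", "uk", "in"])
src = torch.randn(4, 100, 256)              # dummy source-accent features
converted = model(src, target_accent="uk")  # (4, 100, 256)
```

Because only the decoder heads are accent-specific, adding a new target accent amounts to attaching one more head to the shared encoder rather than training a separate conversion model per accent pair.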