ConceptBeam: Concept Driven Target Speech Extraction
· 2022
· Open Access
· DOI: https://doi.org/10.1145/3503161.3548397
· OA: W4288054834
We propose a novel framework for target speech extraction based on semantic information, called ConceptBeam. Target speech extraction means extracting the speech of a target speaker from a mixture. Typical approaches exploit properties of the audio signal, such as harmonic structure and direction of arrival. In contrast, ConceptBeam tackles the problem with semantic clues. Specifically, we extract the speech of speakers talking about a concept, i.e., a topic of interest, using a concept specifier such as an image or a speech signal. Solving this novel problem would open the door to innovative applications such as listening systems that focus on a particular topic discussed in a conversation. Unlike keywords, concepts are abstract notions, making it challenging to represent a target concept directly. In our scheme, a concept is encoded as a semantic embedding by mapping the concept specifier into a shared embedding space. This modality-independent space can be built by means of deep metric learning on paired data consisting of images and their spoken captions. We use it to bridge modality-dependent information, i.e., the speech segments in the mixture, and the specified, modality-independent concept. As a proof of concept, we performed experiments using a set of images associated with spoken captions: we generated speech mixtures from the spoken captions and used the images or speech signals as the concept specifiers. We then extracted the target speech using the acoustic characteristics of the segments identified as matching the concept in the shared embedding space. We compared ConceptBeam with two baseline methods, one based on keywords obtained from recognition systems and the other based on sound source separation, and show that ConceptBeam clearly outperforms both baselines and effectively extracts speech based on the semantic representation.
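To make the shared embedding space concrete, here is a minimal sketch of how such a modality-independent space could be trained with deep metric learning on paired images and spoken captions. Everything here is an illustrative assumption, not the authors' architecture: the encoder designs, the embedding size `EMBED_DIM`, and the contrastive (InfoNCE-style) loss are just one common way to realize deep metric learning; the paper's exact loss and networks may differ.

```python
# Sketch: training a shared image/speech embedding space (assumed PyTorch setup).
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 512  # assumed size of the shared embedding space


class ImageEncoder(nn.Module):
    """Maps an image to a point in the shared space (placeholder CNN)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, EMBED_DIM),
        )

    def forward(self, images):  # images: (batch, 3, H, W)
        return F.normalize(self.net(images), dim=-1)  # unit-norm embeddings


class SpeechEncoder(nn.Module):
    """Maps a spoken caption (e.g., log-mel frames) into the same space."""
    def __init__(self, n_mels=40):
        super().__init__()
        self.rnn = nn.GRU(n_mels, 256, batch_first=True)
        self.proj = nn.Linear(256, EMBED_DIM)

    def forward(self, mels):  # mels: (batch, time, n_mels)
        _, h = self.rnn(mels)                  # h: (1, batch, 256)
        return F.normalize(self.proj(h[-1]), dim=-1)


def contrastive_loss(img_emb, spk_emb, temperature=0.07):
    """Pull paired image/speech embeddings together, push mismatched pairs apart."""
    logits = img_emb @ spk_emb.t() / temperature           # (batch, batch) similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)  # diagonal = true pairs
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Because both encoders project into the same normalized space and are trained symmetrically, either modality can later serve as the concept specifier, which is what lets the framework accept an image or a speech signal interchangeably.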
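The bridging step at inference time can likewise be sketched: embed the concept specifier, embed each segment of the mixture, and keep the segments whose embeddings are most similar to the concept. The segmentation scheme, the cosine-similarity criterion, and the `threshold` value below are all assumptions for illustration; the paper does not prescribe this exact selection rule.

```python
# Sketch: selecting mixture segments that match the specified concept
# (hypothetical helper, assuming unit-norm embeddings from the encoders above).
import torch
import torch.nn.functional as F

def select_target_segments(concept_emb, segment_embs, threshold=0.5):
    """Return indices of mixture segments whose embedding matches the concept.

    concept_emb:  (embed_dim,)          unit-norm embedding of the specifier
    segment_embs: (n_segments, dim)     unit-norm embeddings of mixture segments
    """
    sims = segment_embs @ concept_emb   # cosine similarity per segment
    return (sims > threshold).nonzero(as_tuple=True)[0], sims


if __name__ == "__main__":
    concept = F.normalize(torch.randn(512), dim=0)
    segments = F.normalize(torch.randn(10, 512), dim=1)
    idx, sims = select_target_segments(concept, segments)
    print(idx, sims)
```

The acoustic characteristics of the selected segments (for instance, a speaker embedding computed from them) could then condition a downstream extraction network, in the spirit of informed target speech extraction; the abstract describes this final extraction step but not its internals.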