Nobukatsu Hojo
Data stream-pairwise bottleneck transformer for engagement estimation from video conversation
This study aims to assess participant engagement in multiparty conversations using video and audio data. For this task, the interaction among numerous data streams, such as video and audio from multiple participants, should be modeled effe…
ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking Theory of Mind
Existing Theory of Mind (ToM) benchmarks diverge from real-world scenarios in three aspects: 1) they assess a limited range of mental states such as beliefs, 2) false beliefs are not comprehensively explored, and 3) the diverse personality…
Multimodal Fine-Grained Apparent Personality Trait Recognition: Joint Modeling of Big Five and Questionnaire Item-level Scores
This paper presents a novel method for automatically recognizing people's apparent personality traits as perceived by others. In previous studies, apparent personality trait recognition from multimodal human behavior is often modeled to di…
End-to-End Joint Target and Non-Target Speakers ASR
This paper proposes a novel automatic speech recognition (ASR) system that can transcribe individual speaker's speech while identifying whether they are target or non-target speakers from multi-talker overlapped speech. Target-speaker ASR …
Downstream Task Agnostic Speech Enhancement with Self-Supervised Representation Loss
Self-supervised learning (SSL) is the latest breakthrough in speech processing, especially for label-scarce downstream tasks, as it leverages massive unlabeled audio data. The noise robustness of SSL is one of the important challenges to …
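The title's key idea, training the enhancement front-end with a loss computed in a frozen self-supervised model's representation space so that it serves arbitrary downstream tasks, can be sketched as below. The toy enhancer, the stand-in SSL encoder, and all shapes are illustrative assumptions, not the paper's models.

```python
# Minimal sketch (not the paper's implementation): train a speech-enhancement
# front-end with a loss computed in the feature space of a frozen SSL encoder.
import torch
import torch.nn as nn

class TinyEnhancer(nn.Module):
    """Placeholder enhancement network operating on raw waveforms."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(32, 1, kernel_size=9, padding=4),
        )
    def forward(self, wav):            # wav: (batch, samples)
        return self.net(wav.unsqueeze(1)).squeeze(1)

class TinySSLEncoder(nn.Module):
    """Stand-in for a frozen self-supervised encoder (e.g. a wav2vec-style model)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(1, 64, kernel_size=400, stride=160)   # ~10 ms hop
    def forward(self, wav):
        return self.conv(wav.unsqueeze(1)).transpose(1, 2)          # (batch, frames, 64)

enhancer, ssl = TinyEnhancer(), TinySSLEncoder()
for p in ssl.parameters():             # the SSL encoder stays frozen
    p.requires_grad_(False)

opt = torch.optim.Adam(enhancer.parameters(), lr=1e-4)
noisy = torch.randn(2, 16000)          # dummy 1-second noisy/clean pair
clean = torch.randn(2, 16000)

opt.zero_grad()
enhanced = enhancer(noisy)
# SSL representation loss: match features of enhanced and clean speech, so the
# front-end preserves what downstream tasks need rather than just the waveform.
loss = nn.functional.l1_loss(ssl(enhanced), ssl(clean))
loss.backward()
opt.step()
```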
MaskCycleGAN-VC: Learning Non-parallel Voice Conversion with Filling in Frames
Non-parallel voice conversion (VC) is a technique for training voice converters without a parallel corpus. Cycle-consistent adversarial network-based VCs (CycleGAN-VC and CycleGAN-VC2) are widely accepted as benchmark methods. However, owi…
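A minimal sketch of the two ingredients the title and abstract name, assuming toy generators: a CycleGAN-style cycle-consistency loss between non-parallel domains, and "filling in frames", i.e. masking random frames of the input mel-spectrogram so the converter also learns to fill the gap. None of the layer sizes or hyperparameters here are the authors'.

```python
# Sketch only: cycle-consistency training with random frame masking ("filling in
# frames"), using toy generators; not MaskCycleGAN-VC's actual architecture.
import torch
import torch.nn as nn

mel_dim = 80

def make_generator():
    return nn.Sequential(nn.Conv1d(mel_dim + 1, 128, 5, padding=2), nn.ReLU(),
                         nn.Conv1d(128, mel_dim, 5, padding=2))

G_xy, G_yx = make_generator(), make_generator()    # source->target, target->source

def mask_frames(mel, max_len=32):
    """Zero out a random span of frames and return the mask as an extra channel."""
    b, d, t = mel.shape
    mask = torch.ones(b, 1, t)
    for i in range(b):
        span = torch.randint(1, max_len + 1, (1,)).item()
        start = torch.randint(0, t - span, (1,)).item()
        mask[i, :, start:start + span] = 0.0
    return mel * mask, mask

x = torch.randn(4, mel_dim, 128)                   # source-speaker mel-spectrograms
x_masked, m = mask_frames(x)
y_fake = G_xy(torch.cat([x_masked, m], dim=1))     # convert while filling the gap
x_cycle = G_yx(torch.cat([y_fake, torch.ones_like(m)], dim=1))

# Cycle-consistency loss: converting to the target domain and back should
# recover the original (unmasked) source spectrogram.
cycle_loss = nn.functional.l1_loss(x_cycle, x)
# (The adversarial and identity losses of CycleGAN-VC2 would be added here.)
```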
Communication with Desired Voice
We convey and understand our intentions and feelings through speech. We also change the impression we want others to have of us by controlling our voice, including intonation, speaking characteristics, and rhythm. Unfortunately, the types of voice a…
CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectrogram Conversion
Non-parallel voice conversion (VC) is a technique for learning mappings between source and target speeches without using a parallel corpus. Recently, cycle-consistent adversarial network (CycleGAN)-VC and CycleGAN-VC2 have shown promising …
VoiceGrad: Non-Parallel Any-to-Many Voice Conversion with Annealed Langevin Dynamics
In this paper, we propose a non-parallel any-to-many voice conversion (VC) method termed VoiceGrad. Inspired by WaveGrad, a recently introduced novel waveform generation method, VoiceGrad is based upon the concepts of score matching and La…
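The sampling procedure the abstract refers to, annealed Langevin dynamics driven by a learned score (the gradient of the log density), looks roughly like the sketch below. The score network, noise schedule, and step sizes are placeholders, and VoiceGrad's speaker conditioning is omitted.

```python
# Sketch of annealed Langevin dynamics with a learned score model; the network
# and schedule are toys, not VoiceGrad's actual model.
import torch
import torch.nn as nn

feat_dim = 80
score_net = nn.Sequential(nn.Linear(feat_dim + 1, 256), nn.ReLU(),
                          nn.Linear(256, feat_dim))    # estimates grad log p(x | sigma)

def score(x, sigma):
    sig = torch.full((x.shape[0], 1), sigma)
    return score_net(torch.cat([x, sig], dim=1))

def annealed_langevin(x, sigmas=(1.0, 0.5, 0.25, 0.1), steps=30, eps=2e-5):
    """Refine x through a sequence of decreasing noise levels."""
    for sigma in sigmas:
        step = eps * (sigma / sigmas[-1]) ** 2          # step size per noise level
        for _ in range(steps):
            noise = torch.randn_like(x)
            x = x + 0.5 * step * score(x, sigma) + (step ** 0.5) * noise
    return x

x0 = torch.randn(4, feat_dim)      # start from noise (or source-speaker features)
converted = annealed_langevin(x0)
```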
Nonparallel Voice Conversion with Augmented Classifier Star Generative Adversarial Networks
We previously proposed a method that allows for nonparallel voice conversion (VC) by using a variant of generative adversarial networks (GANs) called StarGAN. The main features of our method, called StarGAN-VC, are as follows: First, it re…
Estimating Sentence Final Tone Labels using Dialogue-Act Information for Text-to-Speech Synthesis within a Spoken Dialogue System
This paper proposes a novel sentence final tone label estimation method using dialogue-act (DA) information for text-to-speech synthesis within a spoken dialogue system. Estimating appropriate sentence final tone labels is considered essent…
Many-to-Many Voice Transformer Network
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework, which enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech. We previously pro…
DNN-based Speech Synthesis using Dialogue-Act Information and Its Evaluation with Respect to Illocutionary Act Naturalness
This paper aims at improving the naturalness of synthesized speech generated by a text-to-speech (TTS) system within a spoken dialogue system with respect to “how natural the system’s intention is perceived via the synthesized speech”. We call t…
ConvS2S-VC: Fully Convolutional Sequence-to-Sequence Voice Conversion
This article proposes a voice conversion (VC) method using sequence-to-sequence (seq2seq or S2S) learning, which flexibly converts not only the voice characteristics but also the pitch contour and duration of input speech. The proposed met…
StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion
Non-parallel multi-domain voice conversion (VC) is a technique for learning mappings among multiple domains without relying on parallel data. This is important but challenging owing to the requirement of learning multiple mappings and the …
CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion
Non-parallel voice conversion (VC) is a technique for learning the mapping from source to target speech without relying on parallel data. This is an important task, but it has been challenging due to the disadvantages of the training condi…
WaveCycleGAN2: Time-domain Neural Post-filter for Speech Waveform Generation
WaveCycleGAN has recently been proposed to bridge the gap between natural and synthesized speech waveforms in statistical parametric speech synthesis and provides fast inference with a moving average model rather than an autoregressive mod…
AttS2S-VC: Sequence-to-Sequence Voice Conversion with Attention and Context Preservation Mechanisms
This paper describes a method based on sequence-to-sequence (Seq2Seq) learning with attention and context preservation mechanisms for voice conversion (VC) tasks. Seq2Seq has been outstanding at numerous tasks involving sequence modeling …
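The attention mechanism that such Seq2Seq VC models rely on, where each decoder step forms a soft alignment over encoder frames, can be illustrated with generic dot-product attention; this is not the paper's exact attention or its context preservation mechanism.

```python
# Generic dot-product attention for a seq2seq decoder step; illustrative only,
# not AttS2S-VC's exact attention or context-preservation mechanism.
import torch

def attend(decoder_state, encoder_outputs):
    """decoder_state: (batch, d); encoder_outputs: (batch, src_len, d)."""
    scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(2)).squeeze(2)
    weights = torch.softmax(scores, dim=1)            # soft alignment over source frames
    context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
    return context, weights

enc = torch.randn(2, 100, 256)       # encoded source-speech frames
dec = torch.randn(2, 256)            # current decoder state
context, align = attend(dec, enc)    # context feeds the next decoder prediction
```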
WaveCycleGAN: Synthetic-to-natural speech waveform conversion using cycle-consistent adversarial networks
We propose a learning-based filter that allows us to directly modify a synthetic speech waveform into a natural speech waveform. Speech-processing systems using a vocoder framework such as statistical parametric speech synthesis and voice …
ACVAE-VC: Non-parallel many-to-many voice conversion with auxiliary classifier variational autoencoder
This paper proposes a non-parallel many-to-many voice conversion (VC) method using a variant of the conditional variational autoencoder (VAE) called an auxiliary classifier VAE (ACVAE). The proposed method has three key features. First, it…
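A rough sketch of the objective suggested by the title, assuming toy networks: a conditional VAE term (reconstruction plus KL) combined with an auxiliary speaker classifier applied to the decoder output. Dimensions and loss weights are arbitrary, not ACVAE-VC's.

```python
# Rough sketch of an auxiliary-classifier VAE objective for VC; toy networks,
# not ACVAE-VC's actual architecture or hyperparameters.
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, latent_dim, n_speakers = 80, 16, 4
encoder = nn.Linear(feat_dim + n_speakers, 2 * latent_dim)    # outputs mean, log-var
decoder = nn.Linear(latent_dim + n_speakers, feat_dim)
classifier = nn.Linear(feat_dim, n_speakers)                   # auxiliary speaker classifier

x = torch.randn(8, feat_dim)                                   # acoustic feature frames
y = F.one_hot(torch.randint(0, n_speakers, (8,)), n_speakers).float()

mu, logvar = encoder(torch.cat([x, y], dim=1)).chunk(2, dim=1)
z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()           # reparameterization trick
x_hat = decoder(torch.cat([z, y], dim=1))

recon = F.mse_loss(x_hat, x)                                   # reconstruction term
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL term of the ELBO
# Auxiliary classifier term: reconstructed/converted features should be
# recognizable as the conditioning speaker.
aux = F.cross_entropy(classifier(x_hat), y.argmax(dim=1))

loss = recon + kl + aux
loss.backward()
```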
StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks
This paper proposes a method that allows non-parallel many-to-many voice conversion (VC) by using a variant of a generative adversarial network (GAN) called StarGAN. Our method, which we call StarGAN-VC, is noteworthy in that it (1) requir…
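The many-to-many pattern the abstract describes, a single generator shared across all speaker pairs and conditioned on a target-speaker code, can be sketched with toy modules as below; the adversarial, classification, and cycle terms follow the generic StarGAN recipe rather than the paper's exact formulation.

```python
# Sketch of the many-to-many conversion pattern: one generator conditioned on a
# target-speaker code; toy modules and weights, not StarGAN-VC's networks.
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, n_speakers = 36, 4
G = nn.Linear(feat_dim + n_speakers, feat_dim)     # single generator for all speaker pairs
D = nn.Linear(feat_dim, 1)                         # real/fake discriminator
C = nn.Linear(feat_dim, n_speakers)                # domain (speaker) classifier

x = torch.randn(8, feat_dim)                                         # source features
tgt = F.one_hot(torch.randint(0, n_speakers, (8,)), n_speakers).float()
src = F.one_hot(torch.randint(0, n_speakers, (8,)), n_speakers).float()

x_fake = G(torch.cat([x, tgt], dim=1))             # convert toward the target speaker
x_cyc = G(torch.cat([x_fake, src], dim=1))         # convert back toward the source

adv_loss = F.binary_cross_entropy_with_logits(D(x_fake), torch.ones(8, 1))
cls_loss = F.cross_entropy(C(x_fake), tgt.argmax(dim=1))   # sound like the target speaker
cyc_loss = F.l1_loss(x_cyc, x)                              # preserve linguistic content
g_loss = adv_loss + cls_loss + 10.0 * cyc_loss
```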
Generative adversarial network-based approach to signal reconstruction from magnitude spectrograms
In this paper, we address the problem of reconstructing a time-domain signal (or a phase spectrogram) solely from a magnitude spectrogram. Since magnitude spectrograms do not contain phase information, we must restore or infer phase inform…
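For reference, the classical iterative solution to this problem is the Griffin-Lim algorithm, which alternates between enforcing the known magnitude and projecting back onto spectrograms of real signals. The sketch below shows that baseline, not the paper's GAN-based approach; the frame parameters are arbitrary.

```python
# Classic Griffin-Lim iteration for reconstructing a waveform from a magnitude
# spectrogram -- the standard baseline, not the paper's GAN-based method.
import numpy as np
import librosa

def griffin_lim(magnitude, n_fft=1024, hop=256, n_iter=60):
    # Start from random phase and alternate between the two constraint sets:
    # (1) spectrograms with the given magnitude, (2) spectrograms of real signals.
    angles = np.exp(2j * np.pi * np.random.rand(*magnitude.shape))
    for _ in range(n_iter):
        signal = librosa.istft(magnitude * angles, hop_length=hop, win_length=n_fft)
        rebuilt = librosa.stft(signal, n_fft=n_fft, hop_length=hop)
        angles = np.exp(1j * np.angle(rebuilt))            # keep only the phase estimate
    return librosa.istft(magnitude * angles, hop_length=hop, win_length=n_fft)

sr = 22050
y = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)         # 1-second test tone
mag = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))
y_rec = griffin_lim(mag)                                    # waveform with inferred phase
```

librosa also ships this iteration directly as librosa.griffinlim, which can stand in for the hand-written loop above.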
DNN-Based Speech Synthesis Using Speaker Codes
Deep neural network (DNN)-based speech synthesis can produce more natural synthesized speech than the conventional HMM-based speech synthesis. However, it has not been revealed whether the synthesized speech quality can be improved by utilizing …
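The speaker-code idea, appending an auxiliary speaker identity vector to the linguistic input so that a single DNN acoustic model covers multiple speakers, can be sketched as follows; the layer sizes and the learned-embedding realization are assumptions for illustration.

```python
# Sketch of speaker-code conditioning in a DNN acoustic model for TTS: a speaker
# identity vector is appended to the frame-level linguistic features at the input.
# Layer sizes and the embedding are illustrative, not the paper's configuration.
import torch
import torch.nn as nn

n_speakers, ling_dim, acoustic_dim, code_dim = 10, 300, 80, 32

speaker_code = nn.Embedding(n_speakers, code_dim)       # learned code per speaker
acoustic_model = nn.Sequential(
    nn.Linear(ling_dim + code_dim, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, acoustic_dim),                       # acoustic features per frame
)

ling = torch.randn(16, ling_dim)                        # frame-level linguistic features
spk = torch.randint(0, n_speakers, (16,))               # speaker index per frame
pred = acoustic_model(torch.cat([ling, speaker_code(spk)], dim=1))
```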