Ziqiang Shi
TrueCount: Improving Open-World Object Counting with Visual-Language Models and Dynamic Multi-Modal Inputs
Generative Modelling with High-Order Langevin Dynamics
Diffusion generative modelling (DGM) based on stochastic differential equations (SDEs) with score matching has achieved unprecedented results in data generation. In this paper, we propose a novel fast high-quality generative modelling meth…
ItôWave: Itô Stochastic Differential Equation Is All You Need For Wave Generation
In this paper, we propose a vocoder based on a pair of forward and reverse-time linear stochastic differential equations (SDE). The solutions of this SDE pair are two stochastic processes, one of which turns the distribution of wave, that …
Iton: End-to-End Audio Generation with Ito Stochastic Differential Equations
Schrowave: Realistic Voice Generation by Solving Two-Stage Conditional Schrodinger Bridge Problems
Multi-modal Affect Analysis using standardized data within subjects in the Wild
Human affective recognition is an important factor in human-computer interaction. However, methods developed with in-the-wild data are not yet accurate enough for practical use. In this paper, we introduce the affective recognition m…
ItôTTS and ItôWave: Linear Stochastic Differential Equation Is All You Need For Audio Generation
In this paper, we propose to unify the two aspects of voice synthesis, namely text-to-speech (TTS) and vocoder, into one framework based on a pair of forward and reverse-time linear stochastic differential equations (SDE). The solutions of…
HiCOMEX: Facial Action Unit Recognition Based on Hierarchy Intensity Distribution and COMEX Relation Learning
The detection of facial action units (AUs) has been actively studied, and has become the subject of competitions, owing to its wide-ranging applications. In this paper, we propose a novel framework for AU detection from a single input image by grasping the …
Toward the pre-cocktail party problem with TasTas+.
Deep neural networks with dual-path bi-directional long short-term memory (BiLSTM) blocks have proved very effective in sequence modeling, especially in speech separation, e.g. DPRNN-TasNet \cite{luo2019dual}, TasTas \cite{shi2020s…
Toward Speech Separation in The Pre-Cocktail Party Problem with TasTas
In this note, we propose to use TasTas \cite{shi2020speech} for an end-to-end approach to monaural speech separation in the pre-cocktail party problem. Our experiments on the public WSJ0-5mix data corpus result in a 10.41 dB SDR improvement…
Speech Separation Based on Multi-Stage Elaborated Dual-Path Deep BiLSTM with Auxiliary Identity Loss
Deep neural networks with dual-path bi-directional long short-term memory (BiLSTM) blocks have proved very effective in sequence modeling, especially in speech separation. This work investigates how to extend dual-path BiLSTM to re…
SingCubic: Cyclic Incremental Newton-type Gradient Descent with Cubic Regularization for Non-Convex Optimization
In this work, we generalize and unify two recent, completely different works, \cite{shi2015large} and \cite{cartis2012adaptive}, into one by proposing the cyclic incremental Newton-type gradient descent with cubic regulariz…
Hodge and Podge: Hybrid Supervised Sound Event Detection with Multi-Hot MixMatch and Composition Consistence Training
In this paper, we propose a method called Hodge and Podge for sound event detection. We demonstrate Hodge and Podge on the dataset of Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 Challenge Task 4. This task aims …
LaFurca: Iterative Multi-Stage Refined End-to-End Monaural Speech Separation Based on Context-Aware Dual-Path Deep Parallel Inter-Intra Bi-LSTM
Deep neural networks with dual-path bi-directional long short-term memory (BiLSTM) blocks have proved very effective in sequence modeling, especially in speech separation, e.g. DPRNN-TasNet \cite{luo2019dual}. In this paper, we pro…
LaFurca: Iterative Refined Speech Separation Based on Context-Aware Dual-Path Parallel Bi-LSTM
Deep neural networks with dual-path bi-directional long short-term memory (BiLSTM) blocks have proved very effective in sequence modeling, especially in speech separation, e.g. DPRNN-TasNet \cite{luo2019dual}. In this paper, we pro…
HODGEPODGE: Sound event detection based on ensemble of semi-supervised learning methods
In this paper, we present a method called HODGEPODGE for large-scale detection of sound events using weakly labeled, synthetic, and unlabeled data proposed in the Detection and Classification of Acoustic Scenes and Events (…
Learning from Adversarial Features for Few-Shot Classification
Many recent few-shot learning methods concentrate on designing novel model architectures. In this paper, we instead show that with a simple backbone convolutional network we can even surpass state-of-the-art classification accuracy. The es…
FurcaNeXt: End-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks
Deep dilated temporal convolutional networks (TCN) have proved very effective in sequence modeling. In this paper, we propose several improvements to TCN for the end-to-end approach to monaural speech separation, which consist of 1)…
FurcaNet: An end-to-end deep gated convolutional, long short-term memory, deep neural networks for single channel speech separation
Deep gated convolutional networks have proved very effective in single-channel speech separation. However, current state-of-the-art frameworks often train the gated convolutional networks in the time-frequency (TF) domain…
Is CQT more suitable for monaural speech separation than STFT? an empirical study
Short-time Fourier transform (STFT) is used as the front end of many popular successful monaural speech separation methods, such as deep clustering (DPCL), permutation invariant training (PIT) and their various variants. Since the frequenc…
HODGEPODGE: Sound Event Detection Based on Ensemble of Semi-Supervised Learning Methods
In this paper, we present a method called HODGEPODGE for large-scale detection of sound events using weakly labeled, synthetic, and unlabeled data proposed in the Detection and Classification of Acoustic Scenes and Events …
Deep Clustering With Constant Q Transform For Multi-Talker Single Channel Speech Separation
Deep clustering is a state-of-the-art deep learning-based method for multi-talker speaker-independent speech separation. It solves the label ambiguity problem by mapping time-frequency (TF) bins of the mixed spectrogram to an emb…
A Double Joint Bayesian Approach for J-Vector Based Text-dependent Speaker Verification
The J-vector has proved very effective in text-dependent speaker verification with short-duration speech. However, current state-of-the-art back-end classifiers, e.g. the joint Bayesian model, cannot make full use of such deep featu…
Multi-view Probability Linear Discrimination Analysis for Multi-view Vector Based Text Dependent Speaker Verification.
Multi-view (Joint) Probability Linear Discrimination Analysis for Multi-view Feature Verification
Multi-view features have proved very effective in many multimedia applications. However, current back-end classifiers cannot make full use of such features. In this paper, we propose a method to model the multi-faceted informa…
A better convergence analysis of the block coordinate descent method for large scale machine learning
This paper considers the problem of unconstrained minimization of large-scale smooth convex functions having block-coordinate-wise Lipschitz continuous gradients. Block coordinate descent (BCD) methods are among the first optimization …
Empirical study of PROXTONE and PROXTONE⁺ for Fast Learning of Large Scale Sparse Models
PROXTONE is a novel and fast method for optimizing large-scale non-smooth convex problems \cite{shi2015large}. In this work, we try to use the PROXTONE method to solve large-scale non-smooth non-convex problems, for example traini…