Acoustic Features Fusion using Attentive Multi-channel Deep Architecture

Paper
Code

Abstract

In this paper, we present a novel deep fusion architecture for audio classification tasks. The multi-channel model is built from deep convolutional layers, with a different acoustic feature passed through each channel. To let information flow across the channels, we introduce attention feature maps that aid in the alignment of frames. The outputs of the channels are merged using interaction parameters that non-linearly aggregate the representative features. Finally, we evaluate the proposed architecture on three benchmark datasets: DCASE-2016 and LITIS Rouen (acoustic scene recognition), and CHiME-Home (audio tagging). Our experimental results suggest that the architecture outperforms the standard baselines on both acoustic scene recognition and audio tagging.

Complementary Acoustic Features

Mel and log-Mel features form a set of complementary acoustic features (CAF). Mel frequencies capture classes that lie in the higher frequency domain, while log-Mel frequencies capture classes that lie in the lower frequency domain. We conjecture that passing these features through a multi-channel model makes it possible to efficiently combine the complementary properties they exhibit.
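
As a rough illustration of how the two features relate, the sketch below computes Mel-band energies and their logarithm from a single power-spectrum frame. The triangular filterbank, the common 2595 * log10(1 + f/700) Mel mapping, the function names, and all parameter values are assumptions made for illustration; the feature extraction actually used for the experiments is not described here.

```python
import math

def hz_to_mel(f_hz):
    # Common Mel-scale mapping (an assumption; other variants exist).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Return n_mels triangular filters over the n_fft // 2 + 1 FFT bins."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 band edges, equally spaced on the Mel scale.
    edges_hz = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (mid - lo)
            falling = (hi - f) / (hi - mid)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank

def mel_and_log_mel(power_frame, bank, eps=1e-10):
    """Return (Mel energies, log-Mel energies) for one power-spectrum frame."""
    mel = [sum(w * p for w, p in zip(row, power_frame)) for row in bank]
    log_mel = [math.log(e + eps) for e in mel]  # eps avoids log(0)
    return mel, log_mel

# Hypothetical usage: a flat power spectrum with 33 bins (n_fft = 64).
bank = mel_filterbank(n_mels=8, n_fft=64, sample_rate=16000)
mel, log_mel = mel_and_log_mel([1.0] * 33, bank)
```

The logarithm compresses the dynamic range, so the two outputs emphasize different parts of the same spectrum; in the multi-channel setting each would be fed to its own channel.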

Multi-Channel Deep Fusion

Early Fusion
Late Fusion
Parameter Sharing
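
The abstract describes attention feature maps that align frames across channels and interaction parameters that merge the channel outputs non-linearly. The sketch below is one possible reading of those two ideas, not the authors' implementation: the distance-based attention score, the attention-weighted averaging, the concatenation used for early fusion, and the `tanh` merge with scalar weights `w_a`, `w_b`, `w_ab` are all assumptions chosen for illustration. In a trained model these operations would act on convolutional feature maps, and the weights would be learned.

```python
import math

def attention_matrix(frames_a, frames_b):
    """A[i][j] is high when frame i of channel a resembles frame j of channel b."""
    return [[1.0 / (1.0 + math.dist(fa, fb)) for fb in frames_b]
            for fa in frames_a]

def attend(att, frames_other):
    """For each row of att, return the attention-weighted average of frames_other."""
    dim = len(frames_other[0])
    out = []
    for weights in att:
        total = sum(weights)
        out.append([sum(w * f[d] for w, f in zip(weights, frames_other)) / total
                    for d in range(dim)])
    return out

def early_fusion(frames_a, frames_b):
    """Append to every frame a context vector drawn from the other channel."""
    att = attention_matrix(frames_a, frames_b)
    att_t = [list(col) for col in zip(*att)]  # transpose for the b -> a direction
    ctx_a = attend(att, frames_b)
    ctx_b = attend(att_t, frames_a)
    fused_a = [fa + ca for fa, ca in zip(frames_a, ctx_a)]  # list concatenation
    fused_b = [fb + cb for fb, cb in zip(frames_b, ctx_b)]
    return fused_a, fused_b

def late_fusion(out_a, out_b, w_a, w_b, w_ab):
    """Merge two channel outputs with hypothetical scalar interaction weights."""
    return [math.tanh(w_a * a + w_b * b + w_ab * a * b)
            for a, b in zip(out_a, out_b)]

# Hypothetical usage with tiny hand-made feature frames.
mel_frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
log_mel_frames = [[0.0, 0.7], [1.1, 1.4]]
fused_mel, fused_log_mel = early_fusion(mel_frames, log_mel_frames)
merged = late_fusion([0.5, -0.2], [0.1, 0.3], w_a=1.0, w_b=1.0, w_ab=0.5)
```

The two channels may have different numbers of frames; the attention matrix handles this because each frame is compared against every frame of the other channel.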

Experiments and Results

Datasets

DCASE-2016
LITIS Rouen
CHiME-Home

Hyperparameters

Baselines

For DCASE-2016, we use a Gaussian Mixture Model (GMM) with MFCCs (including delta and acceleration coefficients) as the baseline system; this baseline is provided by the DCASE-2016 organizers. The other baseline is a DNN with mel components. For LITIS Rouen, we use the HOG+CQA and DNN+MFCC results as baselines. For CHiME-Home, we use the standard MFCC+GMM and mel+DNN baselines.
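
The delta and acceleration coefficients mentioned for the MFCC baseline can be sketched with the widely used regression formula, where acceleration is simply the delta of the delta. The window width of 2 and the edge handling (repeating the first and last frame) are assumptions; the organizers' baseline may use different settings.

```python
def delta(frames, width=2):
    """Regression-based delta over a list of per-frame coefficient vectors."""
    n_frames = len(frames)
    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    out = []
    for t in range(n_frames):
        row = []
        for d in range(len(frames[t])):
            acc = 0.0
            for n in range(1, width + 1):
                # Clamp indices so edge frames are repeated (an assumption).
                ahead = frames[min(t + n, n_frames - 1)][d]
                behind = frames[max(t - n, 0)][d]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out

# Hypothetical usage: one coefficient per frame, rising linearly over time.
mfcc = [[float(t)] for t in range(9)]
d1 = delta(mfcc)   # delta coefficients
d2 = delta(d1)     # acceleration coefficients
# Concatenate static, delta, and acceleration coefficients frame by frame.
features = [s + a + b for s, a, b in zip(mfcc, d1, d2)]
```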

Results

Note: the model variants compared are vanilla (no fusion), early fusion (EF), late fusion (LF), and hybrid (EF+LF).

Conclusion

In this paper, we present a multi-channel architecture for the fusion of complementary acoustic features. Our approach rests on the observation that introducing attention parameters between the channels results in better convergence. The proposed technique is general and can be applied to any audio classification task. A possible extension of our work would be to pass pairs or triplets of audio samples from similar classes through the multi-channel architecture. This could help align diverse audio samples of similar classes, making the model more robust to samples that are difficult to classify.