Publications

Slot Attention-based Concept Discovery To Mitigate Spurious Correlations

Published in CVPR (Submitted), 2023

Models relying on spurious correlations present in their training data can yield brittle predictions and introduce undesired biases. In this paper, we introduce a mechanism to mitigate these spurious correlations by leveraging unsupervised concept discovery. In the forward pass, we decompose images using object-centric slots, each attending to distinct regions in the input. During training, we create clusters of concepts by matching extracted slot-wise features with a learned dictionary of vector quantized codes. This procedure provides a means of monitoring and influencing the features learned by the classifier. Specifically, by controlling the sampling in stochastic gradient descent, we over-represent images of desired concepts, thereby reducing the impact of spurious correlations. We assess our method on various benchmark datasets for subpopulation shifts, demonstrating consistent improvements in performance without human-annotated groups.
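As a rough illustration of the mechanism described above, the NumPy sketch below vector-quantizes slot features against a codebook, derives per-image concept membership, and up-weights images containing rare concepts when sampling mini-batches. All shapes, names, and the inverse-frequency weighting scheme are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shapes: N images, S slots per image, D-dim slot features,
# K codebook vectors (all sizes and names are illustrative).
N, S, D, K = 8, 4, 16, 3
slot_feats = rng.normal(size=(N, S, D))   # stand-in for slot-attention outputs
codebook = rng.normal(size=(K, D))        # stand-in for the learned VQ dictionary

# Vector-quantize each slot: assign it to its nearest codebook entry.
dists = np.linalg.norm(slot_feats[:, :, None, :] - codebook[None, None], axis=-1)
codes = dists.argmin(axis=-1)             # (N, S) concept ids

# An image "contains" a concept if any of its slots maps to that code.
contains = np.zeros((N, K), dtype=bool)
for img in range(N):
    contains[img, np.unique(codes[img])] = True

# Over-represent rare concepts: weight each image by the inverse frequency
# of the rarest concept it contains, then normalize into a sampling
# distribution for drawing SGD mini-batches.
freq = contains.mean(axis=0)              # dataset-level concept frequencies
weights = np.array([1.0 / freq[contains[i]].min() for i in range(N)])
probs = weights / weights.sum()

batch = rng.choice(N, size=4, p=probs)    # one up-weighted mini-batch
print(probs.round(3), batch)
```

Sampling with these probabilities biases SGD toward images of under-represented concepts, which is the lever the paper uses against spurious correlations.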

Download here

Continual learning with foundation models: An empirical study of latent replay

Published in 1st Conference on Lifelong Learning Agents, 2022

Rapid development of large-scale pre-training has resulted in foundation models that can act as effective feature extractors on a variety of downstream tasks and domains. Motivated by this, we study the efficacy of pre-trained vision models as a foundation for downstream continual learning (CL) scenarios. Our goal is twofold. First, we want to understand the compute-accuracy trade-off between CL in the raw-data space and in the latent space of pre-trained encoders. Second, we investigate how the characteristics of the encoder, the pre-training algorithm and data, and the resulting latent space affect CL performance. To this end, we compare the efficacy of various pre-trained models in large-scale benchmarking scenarios with a vanilla replay setting applied in the latent and in the raw-data space. Notably, this study shows how transfer, forgetting, task similarity, and learning depend on the input data characteristics and not necessarily on the CL algorithms. First, we show that under some circumstances reasonable CL performance can readily be achieved with a non-parametric classifier at negligible compute. We then show how models pre-trained on broader data result in better performance for various replay sizes. We explain this with the representational similarity and transfer properties of these representations. Finally, we show the effectiveness of self-supervised (SSL) pre-training for downstream domains that are out-of-distribution relative to the pre-training domain. We point out and validate several research directions that can further increase the efficacy of latent CL, including representation ensembling. The diverse set of datasets used in this study can serve as a compute-efficient playground for further CL research.
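A minimal sketch of the latent-CL setting discussed above: features from a frozen stand-in encoder are stored in a latent replay buffer across tasks and classified with a non-parametric nearest-class-mean classifier. The encoder, tasks, and dimensions are toy assumptions, not the benchmark setup of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

D_in, D_lat = 10, 6
W = rng.normal(size=(D_in, D_lat))

def encode(x):
    # Stand-in for a frozen pre-trained encoder; the study itself uses
    # large pre-trained vision models as feature extractors.
    return np.tanh(x @ W)

# Two sequential toy "tasks", each introducing new classes.
tasks = [
    {0: rng.normal(0.0, 1.0, size=(20, D_in)),
     1: rng.normal(3.0, 1.0, size=(20, D_in))},
    {2: rng.normal(-3.0, 1.0, size=(20, D_in))},
]

# Latent replay buffer: store (feature, label) pairs, never raw inputs.
buffer_feats, buffer_labels = [], []
for task in tasks:
    for label, x in task.items():
        buffer_feats.append(encode(x))
        buffer_labels.append(np.full(len(x), label))

feats = np.concatenate(buffer_feats)
labels = np.concatenate(buffer_labels)

# Non-parametric nearest-class-mean classifier over the latent space.
means = {c: feats[labels == c].mean(axis=0) for c in np.unique(labels)}

def predict(x):
    z = encode(np.atleast_2d(x))
    classes = sorted(means)
    dists = np.stack([np.linalg.norm(z - means[c], axis=1) for c in classes])
    return np.array(classes)[dists.argmin(axis=0)]

all_x = np.concatenate([x for task in tasks for x in task.values()])
acc = (predict(all_x) == labels).mean()
print(f"nearest-class-mean accuracy: {acc:.2f}")
```

Because only latent vectors are buffered and the classifier is non-parametric, the compute cost after encoding is negligible, which is the trade-off the paper highlights.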

Download here

AvaTr: One-Shot Speaker Extraction with Transformers

Published in InterSpeech 2021, 2021

To extract the voice of a target speaker mixed with a variety of other sounds, such as white and ambient noise or the voices of interfering speakers, we extend the Transformer network to attend to the information most relevant to the target speaker, using the characteristics of the target voice as contextual information. The idea has a natural interpretation in terms of the selective attention theory. Specifically, we propose two models that incorporate the voice characteristics into the Transformer, based on different insights into where the feature selection should take place. Both models yield excellent performance, on par with or better than published state-of-the-art models on the speaker extraction task, including separating the speech of novel speakers not seen during training.
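The selective-attention idea can be illustrated with plain scaled dot-product attention in which an enrollment embedding of the target speaker serves as the query over mixture frames. This is a toy sketch under assumed shapes, not either of the two proposed AvaTr variants.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

T, D = 5, 8                           # mixture frames, feature dim (toy sizes)
mixture = rng.normal(size=(T, D))     # frame features of the noisy mixture
speaker_emb = rng.normal(size=D)      # enrollment embedding of the target voice

# Scaled dot-product attention with the speaker embedding as the query:
# frames matching the target voice characteristics receive higher weight.
scores = mixture @ speaker_emb / np.sqrt(D)
weights = softmax(scores)
context = weights @ mixture           # speaker-conditioned summary of the mixture
print(weights.round(3), context.shape)
```

The two models in the paper differ in where this conditioning on the voice characteristics is injected into the Transformer; the query-conditioning above is just one natural placement.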

Download here

Multi-Image Super-Resolution for Remote Sensing using Deep Recurrent Networks

Published in CVPR 2020 Workshop, 2020

High-resolution satellite imagery is critical for various earth observation applications related to environment monitoring, geoscience, forecasting, and land use analysis. However, the scarcity of providers and the need for high-frequency revisits make such high-quality imagery costly to acquire, restricting its accessibility in many fields. In this work, we present a data-driven, multi-image super-resolution approach to alleviate these problems. Our approach is based on an end-to-end deep neural network that consists of an encoder, a fusion module, and a decoder. The encoder extracts efficient, co-registered feature representations from low-resolution images of a scene. A Gated Recurrent Unit (GRU)-based module acts as the fusion module, aggregating the features into a combined representation. Finally, a decoder reconstructs the super-resolved image. The proposed model is evaluated on the PROBA-V dataset released in a recent competition held by the European Space Agency. Our results show that it performs among the top contenders and offers a new practical solution for real-world applications.
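The GRU-based fusion step can be sketched with a minimal GRU cell that folds per-image encoder features into a single fused vector; a real model would operate on spatial feature maps, and all weights and dimensions here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

D = 16                                    # feature dim (toy size)
# Hypothetical encoder outputs: one feature vector per low-res image of a scene.
lr_feats = [rng.normal(size=D) for _ in range(5)]

# Minimal GRU cell with random weights (a trained model learns these).
Wz, Uz = rng.normal(size=(D, D)) * 0.1, rng.normal(size=(D, D)) * 0.1
Wr, Ur = rng.normal(size=(D, D)) * 0.1, rng.normal(size=(D, D)) * 0.1
Wh, Uh = rng.normal(size=(D, D)) * 0.1, rng.normal(size=(D, D)) * 0.1

h = np.zeros(D)                           # fused representation, updated per image
for x in lr_feats:
    z = sigmoid(Wz @ x + Uz @ h)          # update gate
    r = sigmoid(Wr @ x + Ur @ h)          # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))
    h = (1 - z) * h + z * h_tilde

# `h` would then be passed to a decoder that reconstructs the super-resolved image.
print(h.shape)
```

The gated update lets the fusion module weigh each additional low-resolution view against the representation accumulated so far, rather than simply averaging the views.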

Download here