Konstantinos G. Derpanis
Revisiting Image Fusion for Multi-Illuminant White-Balance Correction
White balance (WB) correction in scenes with multiple illuminants remains a persistent challenge in computer vision. Recent methods explored fusion-based approaches, where a neural network linearly blends multiple sRGB versions of an input…
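The abstract describes fusion methods that linearly blend several sRGB renderings of the same input, each white-balanced for a different illuminant. As a minimal sketch of that blending step (the weight maps here are random stand-ins; in the methods the abstract refers to, a network would predict them):

```python
import numpy as np

def blend_renderings(renderings, weights):
    """Per-pixel convex combination of K white-balanced sRGB renderings.

    renderings: (K, H, W, 3) array; weights: (K, H, W) with weights.sum(0) == 1.
    """
    return (weights[..., None] * renderings).sum(axis=0)

K, H, W = 3, 4, 4
renderings = np.random.rand(K, H, W, 3)      # stand-in sRGB renderings
logits = np.random.rand(K, H, W)             # stand-in for predicted weights
weights = logits / logits.sum(axis=0, keepdims=True)  # normalize over K
fused = blend_renderings(renderings, weights)
print(fused.shape)  # (4, 4, 3)
```

Because the weights are non-negative and sum to one at each pixel, the fused value always lies within the range spanned by the input renderings.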
Geometry-Aware Diffusion Models for Multiview Scene Inpainting
In this paper, we focus on 3D scene inpainting, where parts of an input image set, captured from different viewpoints, are masked out. The main challenge lies in generating plausible image completions that are geometrically consistent acro…
Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment
We present Universal Sparse Autoencoders (USAEs), a framework for uncovering and aligning interpretable concepts spanning multiple pretrained deep neural networks. Unlike existing concept-based interpretability methods, which focus on a si…
Visual Concept Connectome (VCC): Open World Concept Discovery and their Interlayer Connections in Deep Models
Understanding what deep network models capture in their learned representations is a fundamental challenge in computer vision. We present a new methodology for understanding such vision models, the Visual Concept Connectome (VCC), which dis…
PolyOculus: Simultaneous Multi-view Image-based Novel View Synthesis
This paper considers the problem of generative novel view synthesis (GNVS), generating novel, plausible views of a scene given a limited number of known views. Here, we propose a set-based generative model that can simultaneously generate …
Understanding Video Transformers via Universal Concept Discovery
This paper studies the problem of concept-based interpretability of transformer representations for videos. Concretely, we seek to explain the decision-making process of video transformers based on high-level, spatiotemporal concepts that …
Reconstructive Latent-Space Neural Radiance Fields for Efficient 3D Scene Representations
Neural Radiance Fields (NeRFs) have proven to be powerful 3D representations, capable of high quality novel view synthesis of complex scenes. While NeRFs have been applied to graphics, vision, and robotics, problems with slow rendering spe…
GePSAn: Generative Procedure Step Anticipation in Cooking Videos
We study the problem of future step anticipation in procedural videos. Given a video of an ongoing procedural activity, we predict a plausible next procedure step described in rich natural language. While most previous work focuses on the pr…
Dual-Camera Joint Deblurring-Denoising
Recent image enhancement methods have shown the advantages of using a pair of long and short-exposure images for low-light photography. These image modalities offer complementary strengths and weaknesses. The former yields an image that is…
Watch Your Steps: Local Image and Scene Editing by Text Instructions
Denoising diffusion models have enabled high-quality image generation and editing. We present a method to localize the desired edit region implicit in a text instruction. We leverage InstructPix2Pix (IP2P) and identify the discrepancy betw…
StepFormer: Self-supervised Step Discovery and Localization in Instructional Videos
Instructional videos are an important resource to learn procedural tasks from human demonstrations. However, the instruction steps in such videos are typically short and sparse, with most of the video being irrelevant to the procedure. Thi…
Long-Term Photometric Consistent Novel View Synthesis with Diffusion Models
Novel view synthesis from a single input image is a challenging task, where the goal is to generate a new view of a scene from a desired camera pose that may be separated by a large motion. The highly uncertain nature of this synthesis tas…
Reference-guided Controllable Inpainting of Neural Radiance Fields
The popularity of Neural Radiance Fields (NeRFs) for view synthesis has led to a desire for NeRF editing tools. Here, we focus on inpainting regions in a view-consistent and controllable manner. In addition to the typical NeRF inputs and m…
SPIn-NeRF: Multiview Segmentation and Perceptual Inpainting with Neural Radiance Fields
Neural Radiance Fields (NeRFs) have emerged as a popular approach for novel view synthesis. While NeRFs are quickly being adapted for a wider set of applications, intuitively editing NeRF scenes is still an open challenge. One important ed…
Quantifying and Learning Static vs. Dynamic Information in Deep Spatiotemporal Networks
There is limited understanding of the information captured by deep spatiotemporal models in their intermediate representations. For example, while evidence suggests that action recognition algorithms are heavily influenced by visual appear…
SAGE: Saliency-Guided Mixup with Optimal Rearrangements
Data augmentation is a key element for training accurate models by reducing overfitting and improving generalization. For image classification, the most popular data augmentation techniques range from simple photometric and geometrical tra…
A Deeper Dive Into What Deep Spatiotemporal Networks Encode: Quantifying Static vs. Dynamic Information
Deep spatiotemporal models are used in a variety of computer vision tasks, such as action recognition and video object segmentation. Currently, there is a limited understanding of what information is captured by these models in their inter…
P3IV: Probabilistic Procedure Planning from Instructional Videos with Weak Supervision
In this paper, we study the problem of procedure planning in instructional videos. Here, an agent must produce a plausible sequence of actions that can transform the environment from a given start to a desired goal state. When learning pro…
Uncertainty-based Cross-Modal Retrieval with Probabilistic Representations
Probabilistic embeddings have proven useful for capturing polysemous word meanings, as well as ambiguity in image matching. In this paper, we study the advantages of probabilistic embeddings in a cross-modal setting (i.e., text and images)…
Temporal Transductive Inference for Few-Shot Video Object Segmentation
Few-shot video object segmentation (FS-VOS) aims at segmenting video frames using a few labelled examples of classes not seen during initial training. In this paper, we present a simple but effective temporal transductive inference (TTI) a…
Semantic Keypoint-Based Pose Estimation from Single RGB Frames
This paper presents an approach to estimating the continuous 6-DoF pose of an object from a single RGB image. The approach combines semantic keypoints predicted by a convolutional network (convnet) with a deformable shape model. Unlike …
Simpler Does It: Generating Semantic Labels with Objectness Guidance
Existing weakly or semi-supervised semantic segmentation methods utilize image or box-level supervision to generate pseudo-labels for weakly labeled images. However, due to the lack of strong supervision, the generated pseudo-labels are of…
Drop-DTW: Aligning Common Signal Between Sequences While Dropping Outliers
In this work, we consider the problem of sequence-to-sequence alignment for signals containing outliers. Assuming the absence of outliers, the standard Dynamic Time Warping (DTW) algorithm efficiently computes the optimal alignment between…
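The abstract builds on standard Dynamic Time Warping (DTW), which computes the optimal alignment between two sequences by dynamic programming. A minimal sketch of that classic recurrence (not the paper's Drop-DTW variant, which additionally handles outliers):

```python
import numpy as np

def dtw_cost(x, y):
    """Classic DTW: minimal cumulative cost of aligning sequences x and y.

    Uses squared difference as the local cost; each cell takes the cheapest
    of the match, insertion, and deletion moves.
    """
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j],      # advance in x only
                                 D[i, j - 1],      # advance in y only
                                 D[i - 1, j - 1])  # match both
    return D[n, m]

print(dtw_cost([0, 1, 2], [0, 1, 1, 2]))  # 0.0: warping absorbs the repeat
```

Note the O(nm) table: DTW aligns every element of both sequences, which is exactly why outliers are a problem it cannot skip over.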
SegMix: Co-occurrence Driven Mixup for Semantic Segmentation and Adversarial Robustness
In this paper, we present a strategy for training convolutional neural networks to effectively resolve interference arising from competing hypotheses relating to inter-categorical information throughout the network. The premise is based on…
Global Pooling, More than Meets the Eye: Position Information is Encoded Channel-Wise in CNNs
In this paper, we challenge the common assumption that collapsing the spatial dimensions of a 3D (spatial-channel) tensor in a convolutional neural network (CNN) into a vector via global pooling removes all spatial information. Specificall…
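The operation in question is global average pooling, which collapses the spatial dimensions of a (channel, height, width) feature tensor into one scalar per channel; the common assumption the abstract challenges is that this vector retains no positional information. A minimal sketch of the operation itself:

```python
import numpy as np

def global_avg_pool(feats):
    """Global average pooling: (C, H, W) feature tensor -> (C,) vector."""
    return feats.mean(axis=(1, 2))

# Toy 2-channel, 3x3 feature map.
feats = np.arange(2 * 3 * 3, dtype=float).reshape(2, 3, 3)
pooled = global_avg_pool(feats)
print(pooled)  # [ 4. 13.] -- one average per channel, spatial dims gone
```

Every spatial arrangement of the same values pools to the same vector, which is what makes the paper's finding (that position is nonetheless encoded channel-wise) counterintuitive.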
Learning Multi-Scale Photo Exposure Correction
Capturing photographs with wrong exposures remains a major source of errors in camera-based imaging. Exposure problems are categorized as either: (i) overexposed, where the camera exposure was too long, resulting in bright and washed-out i…
Representation Learning via Global Temporal Alignment and Cycle-Consistency
We introduce a weakly supervised method for representation learning based on aligning temporal sequences (e.g., videos) of the same process (e.g., human action). The main idea is to use the global temporal ordering of latent correspondence…