Samyak Datta
Movie Gen: A Cast of Media Foundation Models
We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio. We also show additional capabilities such as precise instruction-based video editing and ge…
DISGO: Automatic End-to-End Evaluation for Scene Text OCR
This paper discusses the challenges of optical character recognition (OCR) on natural scenes, which is harder than OCR on documents due to the wild content and various image backgrounds. We propose to uniformly use word error rates (WER) a…
Episodic Memory Question Answering
Egocentric augmented reality devices such as wearable glasses passively capture visual data as a human wearer tours a home environment. We envision a scenario wherein the human communicates with an AI agent powering such a device by asking…
Integrating Egocentric Localization for More Realistic Point-Goal Navigation Agents
Recent work has presented embodied agents that can navigate to point-goal targets in novel indoor environments with near-perfect accuracy. However, these agents are equipped with idealized sensors for localization and take deterministic ac…
Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment
We address the problem of grounding free-form textual phrases by using weak supervision from image-caption pairs. We propose a novel end-to-end model that uses caption-to-image retrieval as a 'downstream' task to guide the process of phras…
Embodied Question Answering in Photorealistic Environments with Point Cloud Perception
To help bridge the gap between internet vision-style problems and the goal of vision for embodied perception we instantiate a large-scale navigation task -- Embodied Question Answering [1] in photo-realistic environments (Matterport 3D). W…
Unsupervised Learning of Face Representations
We present an approach for unsupervised training of CNNs in order to learn discriminative face representations. We mine supervised training data by noting that multiple faces in the same video frame must belong to different persons and the…
Embodied Question Answering
We present a new AI task -- Embodied Question Answering (EmbodiedQA) -- where an agent is spawned at a random location in a 3D environment and asked a question ("What color is the car?"). In order to answer, the agent must first intelligen…