Rameswar Panda
TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments
Large Language Model (LLM) agents are rapidly emerging as powerful systems for automating tasks across domains. Yet progress in the open-source community is constrained by the lack of high-quality, permissively licensed tool-agentic trainin…
FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference
The size and compute characteristics of modern large language models have led to an increased interest in developing specialized kernels tailored for particular training and inference workloads. Existing kernels primarily optimize for comp…
PaTH Attention: Position Encoding via Accumulating Householder Transformations
The attention mechanism is a core primitive in modern large language models (LLMs) and AI more broadly. Since attention by itself is permutation-invariant, position encoding is essential for modeling structured domains such as language. Ro…
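The abstract's point that attention is order-blind without position encoding is easy to check directly. Below is a minimal sketch, not taken from the paper, showing that plain (unmasked) softmax attention with no position encoding is permutation-equivariant: shuffling the input tokens merely shuffles the outputs, so order carries no signal unless something like RoPE, or the accumulated Householder transformations of the title, injects it.

```python
# Minimal sketch (not from the paper): plain softmax attention without position
# encoding is permutation-equivariant, so it cannot distinguish token order.
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ V

perm = rng.permutation(T)
out, out_perm = attend(X), attend(X[perm])
assert np.allclose(out[perm], out_perm)   # shuffled inputs -> identically shuffled outputs
```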
Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence
We introduce Granite Vision, a lightweight large language model with vision capabilities, specifically designed to excel in enterprise use cases, particularly in visual document understanding. Our model is trained on a comprehensive instru…
Scaling Stick-Breaking Attention: An Efficient Implementation and In-depth Study
The self-attention mechanism traditionally relies on the softmax operator, necessitating positional embeddings like RoPE, or position biases to account for token order. But current methods using these still face length generalisation challenges…
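For readers unfamiliar with the stick-breaking idea, here is a rough sketch of one way such weights can be formed; the exact parameterization in the paper may differ. Each key, starting from the most recent, takes a sigmoid-gated share of the remaining attention mass, so recency is built in without a softmax or an explicit position encoding.

```python
# Rough sketch (details may differ from the paper): stick-breaking attention
# weights for one query. Walking from the nearest key backwards, each key
# "breaks off" a sigmoid-gated share of the remaining probability mass.
import numpy as np

def stick_breaking_weights(q, K):
    """q: (d,) query; K: (t, d) keys at earlier positions, oldest first."""
    beta = 1.0 / (1.0 + np.exp(-(K @ q)))   # sigmoid "break" per key
    t = len(beta)
    weights = np.zeros(t)
    remaining = 1.0
    for j in range(t - 1, -1, -1):          # nearest key first
        weights[j] = beta[j] * remaining
        remaining *= 1.0 - beta[j]
    return weights                          # sums to at most 1

rng = np.random.default_rng(0)
w = stick_breaking_weights(rng.normal(size=8), rng.normal(size=(5, 8)))
print(w, w.sum())
```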
Calibrating Expressions of Certainty
We present a novel approach to calibrating linguistic expressions of certainty, e.g., "Maybe" and "Likely". Unlike prior work that assigns a single score to each certainty phrase, we model uncertainty as distributions over the simplex to c…
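As a purely illustrative contrast between the two representations (the Beta parameters below are invented for the example; the paper's parameterization over the simplex may differ): a single score collapses a phrase like "maybe" to one number, whereas a distribution also captures how vague the phrase is.

```python
# Illustrative only: made-up Beta parameters, not the paper's model.
import numpy as np

rng = np.random.default_rng(0)
phrase_as_score = {"maybe": 0.5, "likely": 0.75}      # prior-work style: one number per phrase
phrase_as_dist = {"maybe": (2, 2), "likely": (6, 2)}  # hypothetical Beta(a, b) per phrase

for phrase, (a, b) in phrase_as_dist.items():
    samples = rng.beta(a, b, size=100_000)
    lo, hi = np.quantile(samples, [0.1, 0.9])
    print(f"{phrase!r}: mean={samples.mean():.2f}, 80% interval=({lo:.2f}, {hi:.2f})")
```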
SITAR: Semi-supervised Image Transformer for Action Recognition
Recognizing actions from a limited set of labeled videos remains a challenge as annotating visual data is not only tedious but also can be expensive due to its classified nature. Moreover, handling spatio-temporal data using deep 3D transfor…
Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler
Finding the optimal learning rate for language model pretraining is a challenging task. This is not only because there is a complicated correlation between learning rate, batch size, number of training tokens, model size, and other hyperpa…
Scaling Granite Code Models to 128K Context
This paper introduces long-context Granite code models that support effective context windows of up to 128K tokens. Our solution for scaling context length of Granite 3B/8B code models from 2K/4K to 128K consists of a light-weight continua…
The infrastructure powering IBM's Gen AI model development
AI Infrastructure plays a key role in the speed and cost-competitiveness of developing and deploying advanced AI models. The current demand for powerful AI infrastructure for model training is driven by the emergence of generative AI and f…
Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks
Large language models (LLMs) have recently shown tremendous promise in serving as the backbone to agentic systems, as demonstrated by their performance in multi-faceted, challenging benchmarks like SWE-Bench and Agent-Bench. However, to re…
Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts
We present Self-MoE, an approach that transforms a monolithic LLM into a compositional, modular system of self-specialized experts, named MiXSE (MiXture of Self-specialized Experts). Our approach leverages self-specialization, which constr…
Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
Key-value (KV) caching plays an essential role in accelerating decoding for transformer-based autoregressive large language models (LLMs). However, the amount of memory required to store the KV cache can become prohibitive at long sequence…
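A back-of-the-envelope sketch of why the KV cache becomes the bottleneck, and how sharing it across adjacent layers helps; the model dimensions below are illustrative, not the paper's.

```python
# Illustrative numbers only: KV-cache memory grows with layers * KV heads *
# head_dim * sequence length. Sharing one KV cache between each pair of
# adjacent layers, as cross-layer attention proposes, roughly halves it.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem  # 2x for K and V

baseline = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=128_000, batch=1)
shared   = kv_cache_bytes(n_layers=16, n_kv_heads=8, head_dim=128, seq_len=128_000, batch=1)
print(f"baseline: {baseline / 2**30:.1f} GiB, with 2-way cross-layer sharing: {shared / 2**30:.1f} GiB")
```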
Granite Code Models: A Family of Open Foundation Models for Code Intelligence
Large Language Models (LLMs) trained on code are revolutionizing the software development process. Increasingly, code LLMs are being integrated into software development environments to improve the productivity of human programmers, and LL…
Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models
Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4× compared to dense models without sacrificing performance, making them more efficient in computation-bounded scenarios. However, MoE models generally req…
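A minimal sketch of where the saving comes from, not the paper's exact training recipe: the same MoE layer can be evaluated densely (all experts, weighted by the router) or sparsely (only the top-k experts), and the sparse path is what cuts inference compute.

```python
# Minimal MoE sketch (illustrative, not the paper's recipe): dense evaluation
# uses every expert weighted by the router; sparse evaluation keeps only the
# top-k experts, which is where the inference compute savings come from.
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 16, 8, 2
W_router = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_experts)]

def moe(x, sparse):
    logits = x @ W_router
    gates = np.exp(logits - logits.max()); gates /= gates.sum()
    idx = np.argsort(gates)[-k:] if sparse else np.arange(n_experts)
    return sum(gates[i] * (x @ experts[i]) for i in idx)

x = rng.normal(size=d)
print(np.linalg.norm(moe(x, sparse=False) - moe(x, sparse=True)))  # dense vs sparse gap
```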
Mitigating the Impact of Outlier Channels for Language Model Quantization with Activation Regularization
We consider the problem of accurate quantization for language models, where both the weights and activations are uniformly quantized to 4 bits per parameter, the lowest bitwidth format natively supported by GPU hardware. In this context, t…
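To make the outlier-channel problem concrete, here is an illustrative sketch (generic symmetric uniform quantization, not the paper's method): a single large channel stretches the quantization range, inflating the rounding error for every other channel.

```python
# Illustrative sketch (not the paper's method): symmetric uniform 4-bit
# quantization with a per-tensor scale. One outlier channel widens the range,
# so the step size grows and the error on ordinary channels blows up.
import numpy as np

def quantize_dequantize(x, bits=4):
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.clip(np.round(x / scale), -2 ** (bits - 1), 2 ** (bits - 1) - 1) * scale

rng = np.random.default_rng(0)
acts = rng.normal(size=(128, 64))
err_plain = np.abs(quantize_dequantize(acts) - acts).mean()
acts[:, 0] *= 50.0                                   # inject one outlier channel
err_outlier = np.abs(quantize_dequantize(acts) - acts).mean()
print(f"mean abs error without outlier: {err_plain:.4f}, with outlier channel: {err_outlier:.4f}")
```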
Scattered Mixture-of-Experts Implementation
We present ScatterMoE, an implementation of Sparse Mixture-of-Experts (SMoE) on GPUs. ScatterMoE builds upon existing implementations, overcoming some of their limitations to improve inference and training speed, and memory footprint. Th…
Data Engineering for Scaling Language Models to 128K Context
We study the continual pretraining recipe for scaling language models' context lengths to 128K, with a focus on data engineering. We hypothesize that long context modeling, in particular the ability to utilize information at arbitr…
API Pack: A Massive Multi-Programming Language Dataset for API Call Generation
We introduce API Pack, a massive multi-programming language dataset containing over one million instruction-API calls for improving the API call generation capabilities of large language models. Our evaluation highlights three key findings…
Diversity Measurement and Subset Selection for Instruction Tuning Datasets
We aim to select data subsets for the fine-tuning of large language models to more effectively follow instructions. Prior work has emphasized the importance of diversity in dataset curation but relied on heuristics such as the number of ta…
Gated Linear Attention Transformers with Hardware-Efficient Training
Transformers with linear attention allow for efficient parallel training but can simultaneously be formulated as an RNN with 2D (matrix-valued) hidden states, thus enjoying linear-time inference complexity. However, linear attention genera…
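The parallel/recurrent duality mentioned in the abstract can be checked in a few lines. The sketch below uses plain causal linear attention without the paper's gating; GLA additionally applies a learned decay/gate to the matrix-valued state at each step.

```python
# Minimal sketch (plain causal linear attention, no gating): the parallel form
# sum_{j<=t} (q_t . k_j) v_j equals an RNN whose hidden state is the d x d
# matrix S_t = S_{t-1} + v_t k_t^T, giving linear-time inference.
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))

# parallel (training-style) form with a causal mask
out_parallel = np.tril(Q @ K.T) @ V

# recurrent (inference-style) form with a matrix-valued state
S = np.zeros((d, d))
out_recurrent = np.zeros((T, d))
for t in range(T):
    S = S + np.outer(V[t], K[t])      # GLA would apply a learned gate to S here
    out_recurrent[t] = S @ Q[t]

assert np.allclose(out_parallel, out_recurrent)
```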
Learning Human Action Recognition Representations Without Real Humans
Pre-training on massive video datasets has become essential to achieve high action recognition performance on smaller downstream datasets. However, most large-scale video datasets contain images of people and hence are accompanied with iss…
LangNav: Language as a Perceptual Representation for Navigation
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN), with a focus on low-data settings. Our approach uses off-the-shelf vision systems for image captioning and object detection to convert …
Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models
Vision and Language (VL) models offer an effective method for aligning representation spaces of images and text, leading to numerous applications such as cross-modal retrieval, visual question answering, captioning, and more. However, the …
Going Beyond Nouns With Vision & Language Models Using Synthetic Data
Large-scale pre-trained Vision & Language (VL) models have shown remarkable performance in many applications, enabling replacing a fixed set of supported classes with zero-shot open vocabulary reasoning over (almost arbitrary) natural lang…
Neural Architecture Search for Effective Teacher-Student Knowledge Transfer in Language Models
Large pretrained language models have achieved state-of-the-art results on a variety of downstream tasks. Knowledge Distillation (KD) into a smaller student model addresses their inefficiency, allowing for deployment in resource-constraine…
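For context on the KD primitive the abstract refers to, here is the standard temperature-softened distillation loss (generic Hinton-style KD, not the paper's architecture-search procedure).

```python
# Generic KD sketch: the student matches the teacher's temperature-softened
# output distribution via a KL term, scaled by T^2 as is conventional.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    p_t = softmax(teacher_logits / temperature)
    log_p_s = np.log(softmax(student_logits / temperature))
    return temperature ** 2 * np.sum(p_t * (np.log(p_t) - log_p_s), axis=-1).mean()

rng = np.random.default_rng(0)
print(kd_loss(rng.normal(size=(4, 10)), rng.normal(size=(4, 10))))
```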
MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge
Large scale Vision-Language (VL) models have shown tremendous success in aligning representations between visual and text modalities. This enables remarkable progress in zero-shot recognition, image generation & editing, and many other exc…
Multitask Prompt Tuning Enables Parameter-Efficient Transfer Learning
Prompt tuning, in which a base pretrained model is adapted to each task via conditioning on learned prompt vectors, has emerged as a promising approach for efficiently adapting large language models to multiple downstream tasks. However, e…
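A minimal sketch of the vanilla prompt tuning setup the abstract describes; the multitask decomposition that MPT adds on top is not shown. Only the small prompt matrix is trained, while the base model stays frozen.

```python
# Vanilla prompt tuning sketch (MPT's shared/task-specific decomposition omitted):
# a small matrix of learned prompt vectors is prepended to the token embeddings,
# and only those vectors are updated during adaptation.
import numpy as np

rng = np.random.default_rng(0)
d_model, prompt_len, seq_len = 64, 8, 20

prompt = rng.normal(scale=0.02, size=(prompt_len, d_model))   # the only trainable parameters
token_embeddings = rng.normal(size=(seq_len, d_model))        # from the frozen base model

model_input = np.concatenate([prompt, token_embeddings], axis=0)
print(model_input.shape)   # (prompt_len + seq_len, d_model); fed to the frozen transformer
```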