Rameswar Panda
TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments
Large Language Model (LLM) agents are rapidly emerging as powerful systems for automating tasks across domains. Yet progress in the open-source community is constrained by the lack of high-quality, permissively licensed tool-agentic trainin…
FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference
The size and compute characteristics of modern large language models have led to an increased interest in developing specialized kernels tailored for particular training and inference workloads. Existing kernels primarily optimize for comp…
PaTH Attention: Position Encoding via Accumulating Householder Transformations
The attention mechanism is a core primitive in modern large language models (LLMs) and AI more broadly. Since attention by itself is permutation-invariant, position encoding is essential for modeling structured domains such as language. Ro…
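The abstract's point that attention is order-blind without position encoding is easy to check directly. Below is a minimal sketch, not taken from the paper, showing that plain (unmasked) softmax attention with no position encoding is permutation-equivariant: shuffling the input tokens merely shuffles the outputs, so order carries no signal unless something like RoPE, or the accumulated Householder transformations of the title, injects it.

```python
# Minimal sketch (not from the paper): plain softmax attention without position
# encoding is permutation-equivariant, so it cannot distinguish token order.
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ V

perm = rng.permutation(T)
out, out_perm = attend(X), attend(X[perm])
assert np.allclose(out[perm], out_perm)   # shuffled inputs -> identically shuffled outputs
```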
Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence
We introduce Granite Vision, a lightweight large language model with vision capabilities, specifically designed to excel in enterprise use cases, particularly in visual document understanding. Our model is trained on a comprehensive instru…
Scaling Stick-Breaking Attention: An Efficient Implementation and In-depth Study
The self-attention mechanism traditionally relies on the softmax operator, necessitating positional embeddings like RoPE, or position biases to account for token order. But current methods using these still face length generalisation challenges…
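For readers unfamiliar with the stick-breaking idea, here is a rough sketch of one way such weights can be formed; the exact parameterization in the paper may differ. Each key, starting from the most recent, takes a sigmoid-gated share of the remaining attention mass, so recency is built in without a softmax or an explicit position encoding.

```python
# Rough sketch (details may differ from the paper): stick-breaking attention
# weights for one query. Walking from the nearest key backwards, each key
# "breaks off" a sigmoid-gated share of the remaining probability mass.
import numpy as np

def stick_breaking_weights(q, K):
    """q: (d,) query; K: (t, d) keys at earlier positions, oldest first."""
    beta = 1.0 / (1.0 + np.exp(-(K @ q)))   # sigmoid "break" per key
    t = len(beta)
    weights = np.zeros(t)
    remaining = 1.0
    for j in range(t - 1, -1, -1):          # nearest key first
        weights[j] = beta[j] * remaining
        remaining *= 1.0 - beta[j]
    return weights                          # sums to at most 1

rng = np.random.default_rng(0)
w = stick_breaking_weights(rng.normal(size=8), rng.normal(size=(5, 8)))
print(w, w.sum())
```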
Calibrating Expressions of Certainty
We present a novel approach to calibrating linguistic expressions of certainty, e.g., "Maybe" and "Likely". Unlike prior work that assigns a single score to each certainty phrase, we model uncertainty as distributions over the simplex to c…
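As a purely illustrative contrast between the two representations (the Beta parameters below are invented for the example; the paper's parameterization over the simplex may differ): a single score collapses a phrase like "maybe" to one number, whereas a distribution also captures how vague the phrase is.

```python
# Illustrative only: made-up Beta parameters, not the paper's model.
import numpy as np

rng = np.random.default_rng(0)
phrase_as_score = {"maybe": 0.5, "likely": 0.75}      # prior-work style: one number per phrase
phrase_as_dist = {"maybe": (2, 2), "likely": (6, 2)}  # hypothetical Beta(a, b) per phrase

for phrase, (a, b) in phrase_as_dist.items():
    samples = rng.beta(a, b, size=100_000)
    lo, hi = np.quantile(samples, [0.1, 0.9])
    print(f"{phrase!r}: mean={samples.mean():.2f}, 80% interval=({lo:.2f}, {hi:.2f})")
```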
SITAR: Semi-supervised Image Transformer for Action Recognition
Recognizing actions from a limited set of labeled videos remains a challenge as annotating visual data is not only tedious but also can be expensive due to its classified nature. Moreover, handling spatio-temporal data using deep 3D transfor…
Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler
Finding the optimal learning rate for language model pretraining is a challenging task. This is not only because there is a complicated correlation between learning rate, batch size, number of training tokens, model size, and other hyperpa…
Scaling Granite Code Models to 128K Context
This paper introduces long-context Granite code models that support effective context windows of up to 128K tokens. Our solution for scaling context length of Granite 3B/8B code models from 2K/4K to 128K consists of a light-weight continua…
The infrastructure powering IBM's Gen AI model development
AI Infrastructure plays a key role in the speed and cost-competitiveness of developing and deploying advanced AI models. The current demand for powerful AI infrastructure for model training is driven by the emergence of generative AI and f…
Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks
Large language models (LLMs) have recently shown tremendous promise in serving as the backbone to agentic systems, as demonstrated by their performance in multi-faceted, challenging benchmarks like SWE-Bench and Agent-Bench. However, to re…
Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts
We present Self-MoE, an approach that transforms a monolithic LLM into a compositional, modular system of self-specialized experts, named MiXSE (MiXture of Self-specialized Experts). Our approach leverages self-specialization, which constr…
Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
Key-value (KV) caching plays an essential role in accelerating decoding for transformer-based autoregressive large language models (LLMs). However, the amount of memory required to store the KV cache can become prohibitive at long sequence…
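A back-of-the-envelope sketch of why the KV cache becomes the bottleneck, and how sharing it across adjacent layers helps; the model dimensions below are illustrative, not the paper's.

```python
# Illustrative numbers only: KV-cache memory grows with layers * KV heads *
# head_dim * sequence length. Sharing one KV cache between each pair of
# adjacent layers, as cross-layer attention proposes, roughly halves it.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem  # 2x for K and V

baseline = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=128_000, batch=1)
shared   = kv_cache_bytes(n_layers=16, n_kv_heads=8, head_dim=128, seq_len=128_000, batch=1)
print(f"baseline: {baseline / 2**30:.1f} GiB, with 2-way cross-layer sharing: {shared / 2**30:.1f} GiB")
```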
Granite Code Models: A Family of Open Foundation Models for Code Intelligence
Large Language Models (LLMs) trained on code are revolutionizing the software development process. Increasingly, code LLMs are being integrated into software development environments to improve the productivity of human programmers, and LL…
Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models
Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4× compared to dense models without sacrificing performance, making them more efficient in computation-bounded scenarios. However, MoE models generally req…
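A minimal sketch of where the saving comes from, not the paper's exact training recipe: the same MoE layer can be evaluated densely (all experts, weighted by the router) or sparsely (only the top-k experts), and the sparse path is what cuts inference compute.

```python
# Minimal MoE sketch (illustrative, not the paper's recipe): dense evaluation
# uses every expert weighted by the router; sparse evaluation keeps only the
# top-k experts, which is where the inference compute savings come from.
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 16, 8, 2
W_router = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_experts)]

def moe(x, sparse):
    logits = x @ W_router
    gates = np.exp(logits - logits.max()); gates /= gates.sum()
    idx = np.argsort(gates)[-k:] if sparse else np.arange(n_experts)
    return sum(gates[i] * (x @ experts[i]) for i in idx)

x = rng.normal(size=d)
print(np.linalg.norm(moe(x, sparse=False) - moe(x, sparse=True)))  # dense vs sparse gap
```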
Mitigating the Impact of Outlier Channels for Language Model Quantization with Activation Regularization
We consider the problem of accurate quantization for language models, where both the weights and activations are uniformly quantized to 4 bits per parameter, the lowest bitwidth format natively supported by GPU hardware. In this context, t…
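To make the outlier-channel problem concrete, here is an illustrative sketch (generic symmetric uniform quantization, not the paper's method): a single large channel stretches the quantization range, inflating the rounding error for every other channel.

```python
# Illustrative sketch (not the paper's method): symmetric uniform 4-bit
# quantization with a per-tensor scale. One outlier channel widens the range,
# so the step size grows and the error on ordinary channels blows up.
import numpy as np

def quantize_dequantize(x, bits=4):
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.clip(np.round(x / scale), -2 ** (bits - 1), 2 ** (bits - 1) - 1) * scale

rng = np.random.default_rng(0)
acts = rng.normal(size=(128, 64))
err_plain = np.abs(quantize_dequantize(acts) - acts).mean()
acts[:, 0] *= 50.0                                   # inject one outlier channel
err_outlier = np.abs(quantize_dequantize(acts) - acts).mean()
print(f"mean abs error without outlier: {err_plain:.4f}, with outlier channel: {err_outlier:.4f}")
```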
Scattered Mixture-of-Experts Implementation
We present ScatterMoE, an implementation of Sparse Mixture-of-Experts (SMoE) on GPUs. ScatterMoE builds upon existing implementations, overcoming some of their limitations to improve inference and training speed, and memory footprint. Th…
Data Engineering for Scaling Language Models to 128K Context
We study the continual pretraining recipe for scaling language models' context lengths to 128K, with a focus on data engineering. We hypothesize that long context modeling, in particular the ability to utilize information at arbitr…
API Pack: A Massive Multi-Programming Language Dataset for API Call Generation
We introduce API Pack, a massive multi-programming language dataset containing over one million instruction-API calls for improving the API call generation capabilities of large language models. Our evaluation highlights three key findings…
Diversity Measurement and Subset Selection for Instruction Tuning Datasets
We aim to select data subsets for the fine-tuning of large language models to more effectively follow instructions. Prior work has emphasized the importance of diversity in dataset curation but relied on heuristics such as the number of ta…
Gated Linear Attention Transformers with Hardware-Efficient Training
Transformers with linear attention allow for efficient parallel training but can simultaneously be formulated as an RNN with 2D (matrix-valued) hidden states, thus enjoying linear-time inference complexity. However, linear attention genera…
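The parallel/recurrent duality mentioned in the abstract can be checked in a few lines. The sketch below uses plain causal linear attention without the paper's gating; GLA additionally applies a learned decay/gate to the matrix-valued state at each step.

```python
# Minimal sketch (plain causal linear attention, no gating): the parallel form
# sum_{j<=t} (q_t . k_j) v_j equals an RNN whose hidden state is the d x d
# matrix S_t = S_{t-1} + v_t k_t^T, giving linear-time inference.
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))

# parallel (training-style) form with a causal mask
out_parallel = np.tril(Q @ K.T) @ V

# recurrent (inference-style) form with a matrix-valued state
S = np.zeros((d, d))
out_recurrent = np.zeros((T, d))
for t in range(T):
    S = S + np.outer(V[t], K[t])      # GLA would apply a learned gate to S here
    out_recurrent[t] = S @ Q[t]

assert np.allclose(out_parallel, out_recurrent)
```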
Learning Human Action Recognition Representations Without Real Humans
Pre-training on massive video datasets has become essential to achieve high action recognition performance on smaller downstream datasets. However, most large-scale video datasets contain images of people and hence are accompanied with iss…
LangNav: Language as a Perceptual Representation for Navigation
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN), with a focus on low-data settings. Our approach uses off-the-shelf vision systems for image captioning and object detection to convert …
Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models
Vision and Language (VL) models offer an effective method for aligning representation spaces of images and text, leading to numerous applications such as cross-modal retrieval, visual question answering, captioning, and more. However, the …
Going Beyond Nouns With Vision & Language Models Using Synthetic Data
Large-scale pre-trained Vision & Language (VL) models have shown remarkable performance in many applications, enabling replacing a fixed set of supported classes with zero-shot open vocabulary reasoning over (almost arbitrary) natural lang…
Neural Architecture Search for Effective Teacher-Student Knowledge Transfer in Language Models
Large pretrained language models have achieved state-of-the-art results on a variety of downstream tasks. Knowledge Distillation (KD) into a smaller student model addresses their inefficiency, allowing for deployment in resource-constraine…
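For context on the KD primitive the abstract refers to, here is the standard temperature-softened distillation loss (generic Hinton-style KD, not the paper's architecture-search procedure).

```python
# Generic KD sketch: the student matches the teacher's temperature-softened
# output distribution via a KL term, scaled by T^2 as is conventional.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    p_t = softmax(teacher_logits / temperature)
    log_p_s = np.log(softmax(student_logits / temperature))
    return temperature ** 2 * np.sum(p_t * (np.log(p_t) - log_p_s), axis=-1).mean()

rng = np.random.default_rng(0)
print(kd_loss(rng.normal(size=(4, 10)), rng.normal(size=(4, 10))))
```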
MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge
Large scale Vision-Language (VL) models have shown tremendous success in aligning representations between visual and text modalities. This enables remarkable progress in zero-shot recognition, image generation & editing, and many other exc…
Multitask Prompt Tuning Enables Parameter-Efficient Transfer Learning
Prompt tuning, in which a base pretrained model is adapted to each task via conditioning on learned prompt vectors, has emerged as a promising approach for efficiently adapting large language models to multiple downstream tasks. However, e…
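A minimal sketch of the vanilla prompt tuning setup the abstract describes; the multitask decomposition that MPT adds on top is not shown. Only the small prompt matrix is trained, while the base model stays frozen.

```python
# Vanilla prompt tuning sketch (MPT's shared/task-specific decomposition omitted):
# a small matrix of learned prompt vectors is prepended to the token embeddings,
# and only those vectors are updated during adaptation.
import numpy as np

rng = np.random.default_rng(0)
d_model, prompt_len, seq_len = 64, 8, 20

prompt = rng.normal(scale=0.02, size=(prompt_len, d_model))   # the only trainable parameters
token_embeddings = rng.normal(size=(seq_len, d_model))        # from the frozen base model

model_input = np.concatenate([prompt, token_embeddings], axis=0)
print(model_input.shape)   # (prompt_len + seq_len, d_model); fed to the frozen transformer
```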