Tong Geng
Probabilistic Token Alignment for Large Language Model Fusion
Training large language models (LLMs) from scratch can yield models with unique functionalities and strengths, but it is costly and often leads to redundant capabilities. A more cost-effective alternative is to fuse existing pre-trained LL…
VTBench: Evaluating Visual Tokenizers for Autoregressive Image Generation
Autoregressive (AR) models have recently shown strong performance in image generation, where a critical component is the visual tokenizer (VT) that maps continuous pixel inputs to discrete token sequences. The quality of the VT largely def…
ACiS: Complex Processing in the Switch Fabric
For the last three decades a core use of FPGAs has been for processing communication: FPGA-based SmartNICs are in widespread use from the datacenter to IoT. Augmenting switches with FPGAs, however, has been less studied, but has numerous a…
Diff-PIC: Revolutionizing Particle-In-Cell Nuclear Fusion Simulation with Diffusion Models
The rapid development of AI highlights the pressing need for sustainable energy, a critical global challenge for decades. Nuclear fusion, generally seen as an ultimate solution, has been the focus of intensive research for nearly a century…
A systematic evaluation of computational methods for cell segmentation
Cell segmentation is a fundamental task in analyzing biomedical images. Many computational methods have been developed for cell segmentation and instance segmentation, but their performances are not well understood in various scenarios. We…
Inertial Confinement Fusion Forecasting via Large Language Models
Controlled fusion energy is deemed pivotal for the advancement of human civilization. In this study, we introduce LPI-LLM, a novel integration of Large Language Models (LLMs) with classical reservoir computing paradigms tailored…
Accelerating Communication in Deep Learning Recommendation Model Training with Dual-Level Adaptive Lossy Compression
DLRM is a state-of-the-art recommendation system model that has gained widespread adoption across various industry applications. The large size of DLRM models, however, necessitates the use of multiple devices/GPUs for efficient training. …
Prototypical Transformer as Unified Motion Learners
In this work, we introduce the Prototypical Transformer (ProtoFormer), a general and unified framework that approaches various motion tasks from a prototype perspective. ProtoFormer seamlessly integrates prototype learning with Transformer…
SmartFuse: Reconfigurable Smart Switches to Accelerate Fused Collectives in HPC Applications
Communication switches have sometimes been augmented to process collectives, e.g., in the IBM BlueGene and Mellanox SHArP switches. In this work, we find that there is a great acceleration opportunity through the further augmentation of sw…
Evaluating Emerging AI/ML Accelerators: IPU, RDU, and NVIDIA/AMD GPUs
The relentless advancement of artificial intelligence (AI) and machine learning (ML) applications necessitates the development of specialized hardware accelerators capable of handling the increasing complexity and computational demands. Tr…
FPGA-Accelerated Range-Limited Molecular Dynamics
Long timescale Molecular Dynamics (MD) simulation of small molecules is crucial in drug design and basic science. To accelerate a small data set that is executed for a large number of iterations, high efficiency is required. Recent work in…
SUPPORTING ENERGY-BASED LEARNING WITH AN ISING MACHINE SUBSTRATE: A CASE STUDY ON RBM
Nature apparently does a lot of computation constantly. If we can harness some of that computation at an appropriate level, we can potentially perform certain types of computation (much) faster and more efficiently than we can do with a von…
LinGCN: Structural Linearized Graph Convolutional Network for Homomorphically Encrypted Inference
The growth of Graph Convolution Network (GCN) model sizes has revolutionized numerous applications, surpassing human performance in areas such as personal healthcare and financial systems. The deployment of GCNs in the cloud raises privacy…
ClusterFormer: Clustering As A Universal Visual Learner
This paper presents CLUSTERFORMER, a universal vision model that is based on the CLUSTERing paradigm with TransFORMER. It comprises two novel designs: 1. recurrent cross-attention clustering, which reformulates the cross-attention mechanis…
Accel-GCN: High-Performance GPU Accelerator Design for Graph Convolution Networks
Graph Convolutional Networks (GCNs) are pivotal in extracting latent information from graph data across various domains, yet their acceleration on mainstream GPUs is challenged by workload imbalance and memory access irregularity. To addre…
Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors
Tensor Cores have been an important unit to accelerate Fused Matrix Multiplication Accumulation (MMA) in all NVIDIA GPUs since Volta Architecture. To program Tensor Cores, users have to use either legacy wmma APIs or current mma APIs. Lega…
A length adaptive algorithm-hardware co-design of transformer on FPGA through sparse attention and dynamic pipelining
Transformers have been considered among the most important deep learning models since 2018, in part because they establish state-of-the-art (SOTA) records and could potentially replace existing Deep Neural Networks (DNNs). Despite the remarkabl…
CEAZ
As HPC systems continue to grow to exascale, the amount of data that needs to be saved or transmitted is exploding. To this end, many previous works have studied using error-bounded lossy compressors to reduce the data size and improve the…
APNN-TC
Over the years, accelerating neural networks with quantization has been widely studied. Unfortunately, prior efforts with diverse precisions (e.g., 1-bit weights and 2-bit activations) are usually restricted by limited precision support on…
Optimizing FPGA-based Accelerator Design for Large-Scale Molecular Similarity Search
Molecular similarity search has been widely used in drug discovery to identify structurally similar compounds from large molecular databases rapidly. With the increasing size of chemical libraries, there is growing interest in the efficien…
Binary Complex Neural Network Acceleration on FPGA
Being able to learn from complex data with phase information is imperative for many signal processing applications. Today's real-valued deep neural networks (DNNs) have shown efficiency in latent information analysis but fall short when a…
CEAZ: Accelerating Parallel I/O via Hardware-Algorithm Co-Designed Adaptive Lossy Compression
As HPC systems continue to grow to exascale, the amount of data that needs to be saved or transmitted is exploding. To this end, many previous works have studied using error-bounded lossy compressors to reduce the data size and improve the…
CEAZ: Accelerating Parallel I/O via Hardware-Algorithm Co-Design of Efficient and Adaptive Lossy Compression
As supercomputers continue to grow to exascale, the amount of data that needs to be saved or transmitted is exploding. To this end, many previous works have studied using error-bounded lossy compressors to reduce the data size and improve …
APNN-TC: Accelerating Arbitrary Precision Neural Networks on Ampere GPU Tensor Cores
Over the years, accelerating neural networks with quantization has been widely studied. Unfortunately, prior efforts with diverse precisions (e.g., 1-bit weights and 2-bit activations) are usually restricted by limited precision support on…
ARENA: Asynchronous Reconfigurable Accelerator Ring to Enable Data-Centric Parallel Computing
The next generation HPC and data centers are likely to be reconfigurable and data-centric due to the trend of hardware specialization and the emergence of data-driven applications. In this work, we propose ARENA – an asynchronous reconfigu…
BCNN: Binary Complex Neural Network
Binarized neural networks, or BNNs, show great promise in edge-side applications with resource limited hardware, but raise the concerns of reduced accuracy. Motivated by the complex neural networks, in this paper we introduce complex repre…