Luo Mai
MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems
The sparse Mixture-of-Experts (MoE) architecture is increasingly favored for scaling Large Language Models (LLMs) efficiently, but it depends on heterogeneous compute and memory resources. These factors jointly affect system Cost, Accuracy…
MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching
This paper presents MoE-Gen, a high-throughput MoE inference system optimized for single-GPU execution. Existing inference systems rely on model-based or continuous batching strategies, originally designed for interactive inference, which …
MotionCanvas: Cinematic Shot Design with Controllable Image-to-Video Generation
This paper presents a method that allows users to design cinematic video shots in the context of image-to-video generation. Shot design, a critical aspect of filmmaking, involves meticulously planning both camera movements and object motio…
Pushing the Boundaries of State Space Models for Image and Video Generation
While Transformers have become the dominant architecture for visual generation, linear attention models, such as the state-space models (SSM), are increasingly recognized for their efficiency in processing long visual sequences. However, t…
Progressive Growing of Video Tokenizers for Temporally Compact Latent Spaces
Video tokenizers are essential for latent video diffusion models, converting raw video data into spatiotemporally compressed latent spaces for efficient training. However, extending state-of-the-art video tokenizers to achieve a temporal c…
GaussianVideo: Efficient Video Representation via Hierarchical Gaussian Splatting
Efficient neural representations for dynamic video scenes are critical for applications ranging from video compression to interactive simulations. Yet, existing methods often face challenges related to high memory usage, lengthy training t…
Mycosphere Notes 521–571: A special edition of fungal biodiversity to celebrate Kevin D. Hyde's 70th birthday and his exceptional contributions to Mycology
This special edition of Mycosphere Notes commemorates the 70th birthday of Kevin D. Hyde, a seminal figure in fungal taxonomy whose work has profoundly influenced the study of fungal diversity and classification. In this paper, we provide …
TAB: Transformer Attention Bottlenecks enable User Intervention and Debugging in Vision-Language Models
Multi-head self-attention (MHSA) is a key component of Transformers, a widely popular architecture in both language and vision. Multiple heads intuitively enable different parallel processes over the same input. Yet, they also obscure the …
ServerlessLLM: Low-Latency Serverless Inference for Large Language Models
This paper presents ServerlessLLM, a distributed system designed to support low-latency serverless inference for Large Language Models (LLMs). By harnessing the substantial near-GPU storage and memory capacities of inference servers, Serve…
Tenplex: Dynamic Parallelism for Deep Learning using Parallelizable Tensor Collections
Deep learning (DL) jobs use multi-dimensional parallelism, i.e. combining data, model, and pipeline parallelism, to use large GPU clusters efficiently. Long-running jobs may experience changes to their GPU allocation: (i) resource elastici…
GEAR: A GPU-Centric Experience Replay System for Large Reinforcement Learning Models
This paper introduces a distributed, GPU-centric experience replay system, GEAR, designed to perform scalable reinforcement learning (RL) with large sequence models (such as transformers). With such models, existing systems such as Reverb …
Large Sequence Models for Sequential Decision-Making: A Survey
Transformer architectures have facilitated the development of large-scale and general-purpose sequence models for prediction tasks in natural language processing and computer vision, e.g., GPT-3 and Swin Transformer. Although originally de…
Quiver: Supporting GPUs for Low-Latency, High-Throughput GNN Serving with Workload Awareness
Systems for serving inference requests on graph neural networks (GNN) must combine low latency with high throughput, but they face irregular computation due to skew in the number of sampled graph nodes and aggregated GNN features. This mak…
TorchOpt: An Efficient Library for Differentiable Optimization
Recent years have witnessed the booming of various differentiable optimization algorithms. These algorithms exhibit different execution patterns, and their execution needs massive computational resources that go beyond a single CPU and GPU…
A Theoretical Understanding of Gradient Bias in Meta-Reinforcement Learning
Gradient-based Meta-RL (GMRL) refers to methods that maintain two-level optimisation procedures wherein the outer-loop meta-learner guides the inner-loop gradient-based reinforcement learner to achieve fast adaptations. In this paper, we d…
MegBA: A GPU-Based Distributed Library for Large-Scale Bundle Adjustment
Large-scale Bundle Adjustment (BA) requires massive memory and computation resources that existing BA libraries struggle to provide. In this paper, we propose MegBA, a GPU-based distributed BA library. MegBA can provide massi…
Fast and Flexible Human Pose Estimation with HyperPose
Estimating human pose is an important yet challenging task in multimedia applications. Existing pose estimation libraries target reproducing standard pose estimation algorithms. When it comes to customising these algorithms for real-wor…
Parallel Fully Convolutional Network for Semantic Segmentation
Fully convolutional networks (FCNs) have been widely applied for dense classification tasks such as semantic segmentation. As a large number of works based on FCNs are proposed, various semantic segmentation models have been improved signi…
Move Fast and Meet Deadlines: Fine-grained Real-time Stream Processing with Cameo
Resource provisioning in multi-tenant stream processing systems faces the dual challenges of keeping resource utilization high (without over-provisioning), and ensuring performance isolation. In our common production use cases, where strea…
Efficient Reinforcement Learning Development with RLzoo
Many researchers and developers are exploring the adoption of Deep Reinforcement Learning (DRL) techniques in their applications. However, they often find such adoption challenging. Existing DRL libraries provide poor support for prototypin…
RLzoo: A Comprehensive and Adaptive Reinforcement Learning Library.
Recently, we have seen a rapidly growing adoption of Deep Reinforcement Learning (DRL) technologies. Fully achieving the promise of these technologies in practice is, however, extremely difficult. Users have to invest tremendous efforts in…
KungFu: Making Training in Distributed Machine Learning Adaptive
When using distributed machine learning (ML) systems to train models on a cluster of worker machines, users must configure a large number of parameters: hyper-parameters (e.g. the batch size and the learning rate) affect model convergence…
CROSSBOW: Scaling Deep Learning with Small Batch Sizes on Multi-GPU Servers
Deep learning models are trained on servers with many GPUs, and training must scale with the number of GPUs. Systems such as TensorFlow and Caffe2 train models with parallel synchronous stochastic gradient descent: they process a batch of …
Towards efficient big data processing in data centres
Large data processing systems require a high degree of coordination, and exhibit network bottlenecks due to massive communication data. This motivates my PhD study to propose system control mechanisms that improve monitoring and coordinati…
TensorLayer: A Versatile Library for Efficient Deep Learning Development
Deep learning has enabled major advances in the fields of computer vision, natural language processing, and multimedia among many others. Developing a deep learning system is arduous and complex, as it involves constructing neural network …
Emu: Rapid Prototyping of Networking Services
Due to their performance and flexibility, FPGAs are an attractive platform for the execution of network functions. It has been a challenge for a long time though to make FPGA programming accessible to a large audience of developers. An app…
Extending programs with debug-related features, with application to hardware development
The capacity and programmability of reconfigurable hardware such as FPGAs has improved steadily over the years, but they do not readily provide any mechanisms for monitoring or debugging running programs. Such mechanisms need to be written…