FairKV: Balancing Per-Head KV Cache for Fast Multi-GPU Inference
2025 · Open Access
DOI: https://doi.org/10.48550/arxiv.2502.15804
KV cache techniques in Transformer models reduce redundant computation at the expense of substantially increased memory usage, making KV cache compression an important and popular research topic. Recently, state-of-the-art KV cache compression methods have adopted imbalanced, per-head allocation algorithms that dynamically adjust the KV cache budget for each attention head, achieving excellent performance in single-GPU scenarios. However, we observe that such imbalanced compression leads to significant load imbalance in multi-GPU inference: some GPUs become overburdened while others remain underutilized. In this paper, we propose FairKV, a method designed to ensure fair memory usage among attention heads in systems employing imbalanced KV cache compression. The core technique of FairKV is Fair-Copying, which replicates a small subset of memory-intensive attention heads across GPUs using data parallelism to mitigate load imbalance. Our experiments on popular models, including LLaMA 70B and Mistral 24B, demonstrate that FairKV increases throughput by 1.66x compared to standard tensor-parallel inference. Our code will be released as open source upon acceptance.
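The abstract's Fair-Copying idea, replicating a few memory-intensive heads across GPUs while placing the remaining heads greedily, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' released implementation: the function name, the per-head budget inputs, and the "replicate the top-k heaviest heads" policy are all hypothetical simplifications of the paper's approach.

```python
# Illustrative sketch (NOT the authors' code): balance per-head KV cache
# budgets across GPUs. The heaviest heads are replicated on every GPU so
# their cost is shared via data parallelism; the rest are assigned to the
# currently least-loaded GPU (greedy longest-processing-time placement).

def assign_heads(budgets, num_gpus, replicate_top=1):
    """Return (placement, loads): placement maps head index -> GPU ids,
    loads is the resulting KV-cache load per GPU."""
    loads = [0.0] * num_gpus
    placement = {}

    # Visit heads heaviest-first so the greedy step balances well.
    order = sorted(range(len(budgets)), key=lambda h: -budgets[h])

    for rank, h in enumerate(order):
        if rank < replicate_top:
            # Replicate this memory-intensive head on all GPUs; its load
            # is spread evenly across them (data parallelism over requests).
            share = budgets[h] / num_gpus
            for g in range(num_gpus):
                loads[g] += share
            placement[h] = list(range(num_gpus))
        else:
            # Otherwise place the whole head on the least-loaded GPU.
            g = min(range(num_gpus), key=loads.__getitem__)
            loads[g] += budgets[h]
            placement[h] = [g]
    return placement, loads
```

With one dominant head (e.g. budgets `[10, 1, 1, 1]` on 2 GPUs), pure greedy placement leaves loads of 10 vs 3, while replicating that head yields near-equal loads, which is the imbalance the paper targets.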
Record Details
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2502.15804
- PDF URL: https://arxiv.org/pdf/2502.15804
- OA Status: green
- OpenAlex ID: https://openalex.org/W4414835297
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4414835297 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2502.15804 (Digital Object Identifier)
- Title: FairKV: Balancing Per-Head KV Cache for Fast Multi-GPU Inference
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025
- Publication date: 2025-02-19
- Authors: Bin Zhao, Ke Cheng, Ao Yuan, Ye Tian, Ruiguang Zhong, Chengchen Hu, Tong Yang, Lian Yu (in order)
- Landing page: https://arxiv.org/abs/2502.15804 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2502.15804 (direct link to full-text PDF)
- Open access: Yes (a free full text is available)
- OA status: green (per OpenAlex)
- OA URL: https://arxiv.org/pdf/2502.15804
- Cited by: 0 (total citation count in OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4414835297 |
| doi | https://doi.org/10.48550/arxiv.2502.15804 |
| ids.doi | https://doi.org/10.48550/arxiv.2502.15804 |
| ids.openalex | https://openalex.org/W4414835297 |
| fwci | |
| type | preprint |
| title | FairKV: Balancing Per-Head KV Cache for Fast Multi-GPU Inference |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10036 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9740999937057495 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Advanced Neural Network Applications |
| topics[1].id | https://openalex.org/T11181 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9111999869346619 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1705 |
| topics[1].subfield.display_name | Computer Networks and Communications |
| topics[1].display_name | Advanced Data Storage Technologies |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2502.15804 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2502.15804 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2502.15804 |
| locations[1].id | doi:10.48550/arxiv.2502.15804 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2502.15804 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5101811603 |
| authorships[0].author.orcid | https://orcid.org/0000-0002-2544-5263 |
| authorships[0].author.display_name | Bin Zhao |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Zhao, Bingzhe |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5016780746 |
| authorships[1].author.orcid | https://orcid.org/0000-0003-0336-6916 |
| authorships[1].author.display_name | Ke Cheng |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Cheng, Ke |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5005432115 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-8558-5604 |
| authorships[2].author.display_name | Ao Yuan |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Yuan, Aomufei |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5084754062 |
| authorships[3].author.orcid | https://orcid.org/0009-0003-5474-9156 |
| authorships[3].author.display_name | Ye Tian |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Tian, Yuxuan |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5119852630 |
| authorships[4].author.orcid | |
| authorships[4].author.display_name | Ruiguang Zhong |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Zhong, Ruiguang |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5112303004 |
| authorships[5].author.orcid | https://orcid.org/0000-0003-2384-1454 |
| authorships[5].author.display_name | Chengchen Hu |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Hu, Chengchen |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5115597097 |
| authorships[6].author.orcid | |
| authorships[6].author.display_name | Tong Yang |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Yang, Tong |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5024112563 |
| authorships[7].author.orcid | |
| authorships[7].author.display_name | Lian Yu |
| authorships[7].author_position | last |
| authorships[7].raw_author_name | Yu, Lian |
| authorships[7].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2502.15804 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | FairKV: Balancing Per-Head KV Cache for Fast Multi-GPU Inference |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10036 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9740999937057495 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Advanced Neural Network Applications |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2502.15804 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2502.15804 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2502.15804 |
| primary_location.id | pmh:oai:arXiv.org:2502.15804 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2502.15804 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2502.15804 |
| publication_date | 2025-02-19 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 88, 115 |
| abstract_inverted_index.In | 82 |
| abstract_inverted_index.KV | 0, 20, 31, 44, 103 |
| abstract_inverted_index.an | 23 |
| abstract_inverted_index.as | 73, 161 |
| abstract_inverted_index.at | 11 |
| abstract_inverted_index.be | 159 |
| abstract_inverted_index.by | 148 |
| abstract_inverted_index.in | 3, 54, 99 |
| abstract_inverted_index.is | 111 |
| abstract_inverted_index.of | 14, 109, 118 |
| abstract_inverted_index.on | 133 |
| abstract_inverted_index.to | 7, 65, 91, 127, 151 |
| abstract_inverted_index.we | 58, 85 |
| abstract_inverted_index.24b | 141 |
| abstract_inverted_index.70b | 138 |
| abstract_inverted_index.Our | 131, 156 |
| abstract_inverted_index.The | 106 |
| abstract_inverted_index.aim | 6 |
| abstract_inverted_index.and | 25, 139 |
| abstract_inverted_index.for | 47 |
| abstract_inverted_index.the | 12, 43 |
| abstract_inverted_index.GPUs | 75, 123 |
| abstract_inverted_index.code | 157 |
| abstract_inverted_index.core | 107 |
| abstract_inverted_index.data | 125 |
| abstract_inverted_index.each | 48 |
| abstract_inverted_index.fair | 93 |
| abstract_inverted_index.load | 67, 129 |
| abstract_inverted_index.open | 162 |
| abstract_inverted_index.some | 74 |
| abstract_inverted_index.such | 61 |
| abstract_inverted_index.that | 40, 60, 144 |
| abstract_inverted_index.this | 83 |
| abstract_inverted_index.upon | 164 |
| abstract_inverted_index.when | 69 |
| abstract_inverted_index.will | 158 |
| abstract_inverted_index.1.66x | 149 |
| abstract_inverted_index.LLaMA | 137 |
| abstract_inverted_index.among | 96 |
| abstract_inverted_index.cache | 1, 21, 32, 45, 104 |
| abstract_inverted_index.head, | 50 |
| abstract_inverted_index.heads | 98, 121 |
| abstract_inverted_index.leads | 64 |
| abstract_inverted_index.small | 116 |
| abstract_inverted_index.usage | 95 |
| abstract_inverted_index.using | 124 |
| abstract_inverted_index.which | 113 |
| abstract_inverted_index.while | 78 |
| abstract_inverted_index.FairKV | 110, 145 |
| abstract_inverted_index.across | 122 |
| abstract_inverted_index.adjust | 42 |
| abstract_inverted_index.become | 76 |
| abstract_inverted_index.budget | 46 |
| abstract_inverted_index.ensure | 92 |
| abstract_inverted_index.making | 19 |
| abstract_inverted_index.memory | 17, 94 |
| abstract_inverted_index.method | 89 |
| abstract_inverted_index.model, | 142 |
| abstract_inverted_index.models | 5 |
| abstract_inverted_index.others | 79 |
| abstract_inverted_index.paper, | 84 |
| abstract_inverted_index.reduce | 8 |
| abstract_inverted_index.remain | 80 |
| abstract_inverted_index.source | 163 |
| abstract_inverted_index.subset | 117 |
| abstract_inverted_index.tensor | 153 |
| abstract_inverted_index.topic. | 28 |
| abstract_inverted_index.usage, | 18 |
| abstract_inverted_index.FairKV, | 87 |
| abstract_inverted_index.Mistral | 140 |
| abstract_inverted_index.expense | 13 |
| abstract_inverted_index.methods | 34 |
| abstract_inverted_index.models, | 135 |
| abstract_inverted_index.observe | 59 |
| abstract_inverted_index.popular | 26, 134 |
| abstract_inverted_index.propose | 86 |
| abstract_inverted_index.systems | 100 |
| abstract_inverted_index.However, | 57 |
| abstract_inverted_index.compared | 150 |
| abstract_inverted_index.designed | 90 |
| abstract_inverted_index.mitigate | 128 |
| abstract_inverted_index.per-head | 37 |
| abstract_inverted_index.released | 160 |
| abstract_inverted_index.research | 27 |
| abstract_inverted_index.standard | 152 |
| abstract_inverted_index.Recently, | 29 |
| abstract_inverted_index.achieving | 51 |
| abstract_inverted_index.attention | 49, 97, 120 |
| abstract_inverted_index.deploying | 70 |
| abstract_inverted_index.employing | 101 |
| abstract_inverted_index.excellent | 52 |
| abstract_inverted_index.imbalance | 68 |
| abstract_inverted_index.implement | 35 |
| abstract_inverted_index.important | 24 |
| abstract_inverted_index.including | 136 |
| abstract_inverted_index.increased | 16 |
| abstract_inverted_index.increases | 146 |
| abstract_inverted_index.multi-GPU | 71 |
| abstract_inverted_index.redundant | 9 |
| abstract_inverted_index.technique | 108 |
| abstract_inverted_index.algorithms | 39 |
| abstract_inverted_index.allocation | 38 |
| abstract_inverted_index.imbalance. | 130 |
| abstract_inverted_index.imbalanced | 62, 102 |
| abstract_inverted_index.inference, | 72 |
| abstract_inverted_index.inference. | 155 |
| abstract_inverted_index.replicates | 114 |
| abstract_inverted_index.scenarios. | 56 |
| abstract_inverted_index.single-GPU | 55 |
| abstract_inverted_index.techniques | 2 |
| abstract_inverted_index.throughput | 147 |
| abstract_inverted_index.Transformer | 4 |
| abstract_inverted_index.acceptance. | 165 |
| abstract_inverted_index.compression | 22, 33, 63 |
| abstract_inverted_index.demonstrate | 143 |
| abstract_inverted_index.dynamically | 41 |
| abstract_inverted_index.experiments | 132 |
| abstract_inverted_index.imbalanced, | 36 |
| abstract_inverted_index.parallelism | 126, 154 |
| abstract_inverted_index.performance | 53 |
| abstract_inverted_index.significant | 66 |
| abstract_inverted_index.compression. | 105 |
| abstract_inverted_index.computations | 10 |
| abstract_inverted_index.overburdened | 77 |
| abstract_inverted_index.Fair-Copying, | 112 |
| abstract_inverted_index.substantially | 15 |
| abstract_inverted_index.underutilized. | 81 |
| abstract_inverted_index.memory-intensive | 119 |
| abstract_inverted_index.state-of-the-art | 30 |
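The `abstract_inverted_index.*` rows above are OpenAlex's inverted-index encoding of the abstract: each token maps to the list of word positions where it occurs. The plain text can be recovered by inverting that map; the helper below is a generic sketch for any such index, not part of the OpenAlex client libraries.

```python
# Rebuild plain text from an OpenAlex-style inverted index, which maps
# each token to the list of positions where it appears in the abstract.

def uninvert(inverted_index):
    """Return the abstract as a single space-joined string."""
    positions = []
    for token, idxs in inverted_index.items():
        for i in idxs:
            positions.append((i, token))
    # Sorting by position restores the original word order.
    positions.sort()
    return " ".join(token for _, token in positions)
```

For example, feeding it the rows above (as a `{"KV": [0, 20, ...], ...}` dict) reproduces the abstract shown at the top of this record.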
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 8 |
| citation_normalized_percentile |