FairKV: Balancing Per-Head KV Cache for Fast Multi-GPU Inference
2025 · Open Access
DOI: https://doi.org/10.48550/arxiv.2502.15804
KV cache techniques in Transformer models reduce redundant computation at the expense of substantially increased memory usage, making KV cache compression an important and popular research topic. Recently, state-of-the-art KV cache compression methods have adopted imbalanced, per-head allocation algorithms that dynamically adjust the KV cache budget for each attention head, achieving excellent performance in single-GPU scenarios. However, we observe that such imbalanced compression leads to significant load imbalance in multi-GPU inference: some GPUs become overburdened while others remain underutilized. In this paper, we propose FairKV, a method designed to ensure fair memory usage among attention heads in systems employing imbalanced KV cache compression. The core technique of FairKV is Fair-Copying, which replicates a small subset of memory-intensive attention heads across GPUs using data parallelism to mitigate load imbalance. Our experiments on popular models, including LLaMA 70B and Mistral 24B, demonstrate that FairKV increases throughput by 1.66x compared to standard tensor-parallel inference. Our code will be released as open source upon acceptance.
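The abstract's Fair-Copying idea, replicating a few memory-intensive heads across GPUs while placing the remaining heads greedily, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' released implementation: the function name, the per-head budget inputs, and the "replicate the top-k heaviest heads" policy are all hypothetical simplifications of the paper's approach.

```python
# Illustrative sketch (NOT the authors' code): balance per-head KV cache
# budgets across GPUs. The heaviest heads are replicated on every GPU so
# their cost is shared via data parallelism; the rest are assigned to the
# currently least-loaded GPU (greedy longest-processing-time placement).

def assign_heads(budgets, num_gpus, replicate_top=1):
    """Return (placement, loads): placement maps head index -> GPU ids,
    loads is the resulting KV-cache load per GPU."""
    loads = [0.0] * num_gpus
    placement = {}

    # Visit heads heaviest-first so the greedy step balances well.
    order = sorted(range(len(budgets)), key=lambda h: -budgets[h])

    for rank, h in enumerate(order):
        if rank < replicate_top:
            # Replicate this memory-intensive head on all GPUs; its load
            # is spread evenly across them (data parallelism over requests).
            share = budgets[h] / num_gpus
            for g in range(num_gpus):
                loads[g] += share
            placement[h] = list(range(num_gpus))
        else:
            # Otherwise place the whole head on the least-loaded GPU.
            g = min(range(num_gpus), key=loads.__getitem__)
            loads[g] += budgets[h]
            placement[h] = [g]
    return placement, loads
```

With one dominant head (e.g. budgets `[10, 1, 1, 1]` on 2 GPUs), pure greedy placement leaves loads of 10 vs 3, while replicating that head yields near-equal loads, which is the imbalance the paper targets.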
Record Details
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2502.15804
- PDF URL: https://arxiv.org/pdf/2502.15804
- OA Status: green
- OpenAlex ID: https://openalex.org/W4414835297
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4414835297 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2502.15804 (Digital Object Identifier)
- Title: FairKV: Balancing Per-Head KV Cache for Fast Multi-GPU Inference
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025
- Publication date: 2025-02-19
- Authors: Bin Zhao, Ke Cheng, Ao Yuan, Ye Tian, Ruiguang Zhong, Chengchen Hu, Tong Yang, Lian Yu (in order)
- Landing page: https://arxiv.org/abs/2502.15804 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2502.15804 (direct link to full-text PDF)
- Open access: Yes (a free full text is available)
- OA status: green (per OpenAlex)
- OA URL: https://arxiv.org/pdf/2502.15804
- Cited by: 0 (total citation count in OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4414835297 |
| doi | https://doi.org/10.48550/arxiv.2502.15804 |
| ids.doi | https://doi.org/10.48550/arxiv.2502.15804 |
| ids.openalex | https://openalex.org/W4414835297 |
| fwci | |
| type | preprint |
| title | FairKV: Balancing Per-Head KV Cache for Fast Multi-GPU Inference |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10036 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9740999937057495 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Advanced Neural Network Applications |
| topics[1].id | https://openalex.org/T11181 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9111999869346619 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1705 |
| topics[1].subfield.display_name | Computer Networks and Communications |
| topics[1].display_name | Advanced Data Storage Technologies |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2502.15804 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2502.15804 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2502.15804 |
| locations[1].id | doi:10.48550/arxiv.2502.15804 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2502.15804 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5101811603 |
| authorships[0].author.orcid | https://orcid.org/0000-0002-2544-5263 |
| authorships[0].author.display_name | Bin Zhao |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Zhao, Bingzhe |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5016780746 |
| authorships[1].author.orcid | https://orcid.org/0000-0003-0336-6916 |
| authorships[1].author.display_name | Ke Cheng |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Cheng, Ke |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5005432115 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-8558-5604 |
| authorships[2].author.display_name | Ao Yuan |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Yuan, Aomufei |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5084754062 |
| authorships[3].author.orcid | https://orcid.org/0009-0003-5474-9156 |
| authorships[3].author.display_name | Ye Tian |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Tian, Yuxuan |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5119852630 |
| authorships[4].author.orcid | |
| authorships[4].author.display_name | Ruiguang Zhong |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Zhong, Ruiguang |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5112303004 |
| authorships[5].author.orcid | https://orcid.org/0000-0003-2384-1454 |
| authorships[5].author.display_name | Chengchen Hu |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Hu, Chengchen |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5115597097 |
| authorships[6].author.orcid | |
| authorships[6].author.display_name | Tong Yang |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Yang, Tong |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5024112563 |
| authorships[7].author.orcid | |
| authorships[7].author.display_name | Lian Yu |
| authorships[7].author_position | last |
| authorships[7].raw_author_name | Yu, Lian |
| authorships[7].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2502.15804 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | FairKV: Balancing Per-Head KV Cache for Fast Multi-GPU Inference |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10036 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9740999937057495 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Advanced Neural Network Applications |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2502.15804 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2502.15804 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2502.15804 |
| primary_location.id | pmh:oai:arXiv.org:2502.15804 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2502.15804 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2502.15804 |
| publication_date | 2025-02-19 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 88, 115 |
| abstract_inverted_index.In | 82 |
| abstract_inverted_index.KV | 0, 20, 31, 44, 103 |
| abstract_inverted_index.an | 23 |
| abstract_inverted_index.as | 73, 161 |
| abstract_inverted_index.at | 11 |
| abstract_inverted_index.be | 159 |
| abstract_inverted_index.by | 148 |
| abstract_inverted_index.in | 3, 54, 99 |
| abstract_inverted_index.is | 111 |
| abstract_inverted_index.of | 14, 109, 118 |
| abstract_inverted_index.on | 133 |
| abstract_inverted_index.to | 7, 65, 91, 127, 151 |
| abstract_inverted_index.we | 58, 85 |
| abstract_inverted_index.24b | 141 |
| abstract_inverted_index.70b | 138 |
| abstract_inverted_index.Our | 131, 156 |
| abstract_inverted_index.The | 106 |
| abstract_inverted_index.aim | 6 |
| abstract_inverted_index.and | 25, 139 |
| abstract_inverted_index.for | 47 |
| abstract_inverted_index.the | 12, 43 |
| abstract_inverted_index.GPUs | 75, 123 |
| abstract_inverted_index.code | 157 |
| abstract_inverted_index.core | 107 |
| abstract_inverted_index.data | 125 |
| abstract_inverted_index.each | 48 |
| abstract_inverted_index.fair | 93 |
| abstract_inverted_index.load | 67, 129 |
| abstract_inverted_index.open | 162 |
| abstract_inverted_index.some | 74 |
| abstract_inverted_index.such | 61 |
| abstract_inverted_index.that | 40, 60, 144 |
| abstract_inverted_index.this | 83 |
| abstract_inverted_index.upon | 164 |
| abstract_inverted_index.when | 69 |
| abstract_inverted_index.will | 158 |
| abstract_inverted_index.1.66x | 149 |
| abstract_inverted_index.LLaMA | 137 |
| abstract_inverted_index.among | 96 |
| abstract_inverted_index.cache | 1, 21, 32, 45, 104 |
| abstract_inverted_index.head, | 50 |
| abstract_inverted_index.heads | 98, 121 |
| abstract_inverted_index.leads | 64 |
| abstract_inverted_index.small | 116 |
| abstract_inverted_index.usage | 95 |
| abstract_inverted_index.using | 124 |
| abstract_inverted_index.which | 113 |
| abstract_inverted_index.while | 78 |
| abstract_inverted_index.FairKV | 110, 145 |
| abstract_inverted_index.across | 122 |
| abstract_inverted_index.adjust | 42 |
| abstract_inverted_index.become | 76 |
| abstract_inverted_index.budget | 46 |
| abstract_inverted_index.ensure | 92 |
| abstract_inverted_index.making | 19 |
| abstract_inverted_index.memory | 17, 94 |
| abstract_inverted_index.method | 89 |
| abstract_inverted_index.model, | 142 |
| abstract_inverted_index.models | 5 |
| abstract_inverted_index.others | 79 |
| abstract_inverted_index.paper, | 84 |
| abstract_inverted_index.reduce | 8 |
| abstract_inverted_index.remain | 80 |
| abstract_inverted_index.source | 163 |
| abstract_inverted_index.subset | 117 |
| abstract_inverted_index.tensor | 153 |
| abstract_inverted_index.topic. | 28 |
| abstract_inverted_index.usage, | 18 |
| abstract_inverted_index.FairKV, | 87 |
| abstract_inverted_index.Mistral | 140 |
| abstract_inverted_index.expense | 13 |
| abstract_inverted_index.methods | 34 |
| abstract_inverted_index.models, | 135 |
| abstract_inverted_index.observe | 59 |
| abstract_inverted_index.popular | 26, 134 |
| abstract_inverted_index.propose | 86 |
| abstract_inverted_index.systems | 100 |
| abstract_inverted_index.However, | 57 |
| abstract_inverted_index.compared | 150 |
| abstract_inverted_index.designed | 90 |
| abstract_inverted_index.mitigate | 128 |
| abstract_inverted_index.per-head | 37 |
| abstract_inverted_index.released | 160 |
| abstract_inverted_index.research | 27 |
| abstract_inverted_index.standard | 152 |
| abstract_inverted_index.Recently, | 29 |
| abstract_inverted_index.achieving | 51 |
| abstract_inverted_index.attention | 49, 97, 120 |
| abstract_inverted_index.deploying | 70 |
| abstract_inverted_index.employing | 101 |
| abstract_inverted_index.excellent | 52 |
| abstract_inverted_index.imbalance | 68 |
| abstract_inverted_index.implement | 35 |
| abstract_inverted_index.important | 24 |
| abstract_inverted_index.including | 136 |
| abstract_inverted_index.increased | 16 |
| abstract_inverted_index.increases | 146 |
| abstract_inverted_index.multi-GPU | 71 |
| abstract_inverted_index.redundant | 9 |
| abstract_inverted_index.technique | 108 |
| abstract_inverted_index.algorithms | 39 |
| abstract_inverted_index.allocation | 38 |
| abstract_inverted_index.imbalance. | 130 |
| abstract_inverted_index.imbalanced | 62, 102 |
| abstract_inverted_index.inference, | 72 |
| abstract_inverted_index.inference. | 155 |
| abstract_inverted_index.replicates | 114 |
| abstract_inverted_index.scenarios. | 56 |
| abstract_inverted_index.single-GPU | 55 |
| abstract_inverted_index.techniques | 2 |
| abstract_inverted_index.throughput | 147 |
| abstract_inverted_index.Transformer | 4 |
| abstract_inverted_index.acceptance. | 165 |
| abstract_inverted_index.compression | 22, 33, 63 |
| abstract_inverted_index.demonstrate | 143 |
| abstract_inverted_index.dynamically | 41 |
| abstract_inverted_index.experiments | 132 |
| abstract_inverted_index.imbalanced, | 36 |
| abstract_inverted_index.parallelism | 126, 154 |
| abstract_inverted_index.performance | 53 |
| abstract_inverted_index.significant | 66 |
| abstract_inverted_index.compression. | 105 |
| abstract_inverted_index.computations | 10 |
| abstract_inverted_index.overburdened | 77 |
| abstract_inverted_index.Fair-Copying, | 112 |
| abstract_inverted_index.substantially | 15 |
| abstract_inverted_index.underutilized. | 81 |
| abstract_inverted_index.memory-intensive | 119 |
| abstract_inverted_index.state-of-the-art | 30 |
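The `abstract_inverted_index.*` rows above are OpenAlex's inverted-index encoding of the abstract: each token maps to the list of word positions where it occurs. The plain text can be recovered by inverting that map; the helper below is a generic sketch for any such index, not part of the OpenAlex client libraries.

```python
# Rebuild plain text from an OpenAlex-style inverted index, which maps
# each token to the list of positions where it appears in the abstract.

def uninvert(inverted_index):
    """Return the abstract as a single space-joined string."""
    positions = []
    for token, idxs in inverted_index.items():
        for i in idxs:
            positions.append((i, token))
    # Sorting by position restores the original word order.
    positions.sort()
    return " ".join(token for _, token in positions)
```

For example, feeding it the rows above (as a `{"KV": [0, 20, ...], ...}` dict) reproduces the abstract shown at the top of this record.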
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 8 |
| citation_normalized_percentile |