AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations
2024 · Open Access
· DOI: https://doi.org/10.48550/arxiv.2410.13212
Large language models (LLMs) have shown exceptional capabilities in a wide range of tasks, such as text generation and video generation. However, due to their massive parameter counts, these models often require substantial storage space, imposing significant constraints on the machines that deploy them. To overcome this limitation, one research direction compresses the models by replacing floating-point numbers with integers, a process known as quantization. Some recent studies suggest quantizing the key and value cache (KV cache) of LLMs, and design quantization techniques that treat the key and value matrices equivalently. This work delves deeper into the asymmetric structural roles of the KV cache, a phenomenon in which the transformer's output loss is more sensitive to the quantization of key matrices than to that of value matrices. We conduct a systematic examination of the attention output error resulting from key and value quantization. This phenomenon inspires us to propose an asymmetric quantization strategy. Our approach allows for 1-bit quantization of the KV cache by implementing distinct configurations for key and value matrices. We carry out experiments across a variety of datasets, demonstrating that our proposed approach allows up to 75% of decoder layers to be quantized with 1 bit while maintaining performance comparable to that of models with floating-point parameters.
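The idea in the abstract, giving key and value matrices different bit-widths on a per-layer basis and measuring the resulting attention output error, can be illustrated with a small sketch. This is not the authors' implementation: the per-vector min-max quantizer, the `layer_bit_config` helper, and its `l_k`, `l_v`, and `high_bits` parameters are hypothetical names chosen for illustration, and the random inputs stand in for a real KV cache.

```python
import math
import random

def quantize_dequantize(vec, bits):
    """Uniform min-max quantization of one vector to 2**bits levels, then dequantize."""
    lo, hi = min(vec), max(vec)
    levels = (1 << bits) - 1  # 1 bit -> a single step between lo and hi
    if hi == lo:
        return list(vec)
    scale = (hi - lo) / levels
    return [lo + round((x - lo) / scale) * scale for x in vec]

def layer_bit_config(num_layers, l_k, l_v, high_bits=2):
    """Hypothetical layer-wise config: the first l_k layers keep keys at high_bits,
    the first l_v layers keep values at high_bits, all remaining layers use 1 bit."""
    return [(high_bits if i < l_k else 1, high_bits if i < l_v else 1)
            for i in range(num_layers)]

def attention_output(q, K, V):
    """Single-query scaled dot-product attention."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
    m = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]

def output_error(q, K, V, key_bits, value_bits):
    """Mean squared error of the attention output after quantizing K and V."""
    ref = attention_output(q, K, V)
    Kq = [quantize_dequantize(k, key_bits) for k in K]
    Vq = [quantize_dequantize(v, value_bits) for v in V]
    out = attention_output(q, Kq, Vq)
    return sum((a - b) ** 2 for a, b in zip(ref, out)) / len(ref)

# Synthetic data (an assumption; a real study would use cached K/V from a model).
random.seed(0)
d, n = 16, 32
q = [random.gauss(0, 1) for _ in range(d)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

config = layer_bit_config(num_layers=8, l_k=4, l_v=2)
err_key_only = output_error(q, K, V, key_bits=1, value_bits=16)
err_value_only = output_error(q, K, V, key_bits=16, value_bits=1)
print(config)
print(err_key_only, err_value_only)
```

Comparing `err_key_only` with `err_value_only` on real cached tensors is one way to probe the sensitivity gap the abstract describes; on random data the two numbers carry no particular meaning.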
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2410.13212, https://arxiv.org/pdf/2410.13212
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4403579817
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4403579817 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2410.13212 (Digital Object Identifier)
- Title: AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2024 (year of publication)
- Publication date: 2024-10-17 (full publication date if available)
- Authors: Qian Tao, Wenyuan Yu, Jingren Zhou (list of authors in order)
- Landing page: https://arxiv.org/abs/2410.13212 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2410.13212 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2410.13212 (direct OA link when available)
- Concepts: Quantization (signal processing), Cache, Computer science, Bit (key), Layer (electronics), Parallel computing, Computer network, Algorithm, Materials science, Nanotechnology (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| id | https://openalex.org/W4403579817 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2410.13212 |
| ids.doi | https://doi.org/10.48550/arxiv.2410.13212 |
| ids.openalex | https://openalex.org/W4403579817 |
| fwci | |
| type | preprint |
| title | AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10829 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9740999937057495 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1705 |
| topics[0].subfield.display_name | Computer Networks and Communications |
| topics[0].display_name | Interconnection Networks and Systems |
| topics[1].id | https://openalex.org/T10054 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9660999774932861 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1708 |
| topics[1].subfield.display_name | Hardware and Architecture |
| topics[1].display_name | Parallel Computing and Optimization Techniques |
| topics[2].id | https://openalex.org/T11321 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9638000130653381 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1705 |
| topics[2].subfield.display_name | Computer Networks and Communications |
| topics[2].display_name | Error Correcting Code Techniques |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C28855332 |
| concepts[0].level | 2 |
| concepts[0].score | 0.7798011898994446 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q198099 |
| concepts[0].display_name | Quantization (signal processing) |
| concepts[1].id | https://openalex.org/C115537543 |
| concepts[1].level | 2 |
| concepts[1].score | 0.7162013649940491 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q165596 |
| concepts[1].display_name | Cache |
| concepts[2].id | https://openalex.org/C41008148 |
| concepts[2].level | 0 |
| concepts[2].score | 0.5679356455802917 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[2].display_name | Computer science |
| concepts[3].id | https://openalex.org/C117011727 |
| concepts[3].level | 2 |
| concepts[3].score | 0.48594987392425537 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q1278488 |
| concepts[3].display_name | Bit (key) |
| concepts[4].id | https://openalex.org/C2779227376 |
| concepts[4].level | 2 |
| concepts[4].score | 0.4208712875843048 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q6505497 |
| concepts[4].display_name | Layer (electronics) |
| concepts[5].id | https://openalex.org/C173608175 |
| concepts[5].level | 1 |
| concepts[5].score | 0.25344759225845337 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q232661 |
| concepts[5].display_name | Parallel computing |
| concepts[6].id | https://openalex.org/C31258907 |
| concepts[6].level | 1 |
| concepts[6].score | 0.21397081017494202 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q1301371 |
| concepts[6].display_name | Computer network |
| concepts[7].id | https://openalex.org/C11413529 |
| concepts[7].level | 1 |
| concepts[7].score | 0.17390650510787964 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q8366 |
| concepts[7].display_name | Algorithm |
| concepts[8].id | https://openalex.org/C192562407 |
| concepts[8].level | 0 |
| concepts[8].score | 0.12874725461006165 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q228736 |
| concepts[8].display_name | Materials science |
| concepts[9].id | https://openalex.org/C171250308 |
| concepts[9].level | 1 |
| concepts[9].score | 0.056502461433410645 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q11468 |
| concepts[9].display_name | Nanotechnology |
| keywords[0].id | https://openalex.org/keywords/quantization |
| keywords[0].score | 0.7798011898994446 |
| keywords[0].display_name | Quantization (signal processing) |
| keywords[1].id | https://openalex.org/keywords/cache |
| keywords[1].score | 0.7162013649940491 |
| keywords[1].display_name | Cache |
| keywords[2].id | https://openalex.org/keywords/computer-science |
| keywords[2].score | 0.5679356455802917 |
| keywords[2].display_name | Computer science |
| keywords[3].id | https://openalex.org/keywords/bit |
| keywords[3].score | 0.48594987392425537 |
| keywords[3].display_name | Bit (key) |
| keywords[4].id | https://openalex.org/keywords/layer |
| keywords[4].score | 0.4208712875843048 |
| keywords[4].display_name | Layer (electronics) |
| keywords[5].id | https://openalex.org/keywords/parallel-computing |
| keywords[5].score | 0.25344759225845337 |
| keywords[5].display_name | Parallel computing |
| keywords[6].id | https://openalex.org/keywords/computer-network |
| keywords[6].score | 0.21397081017494202 |
| keywords[6].display_name | Computer network |
| keywords[7].id | https://openalex.org/keywords/algorithm |
| keywords[7].score | 0.17390650510787964 |
| keywords[7].display_name | Algorithm |
| keywords[8].id | https://openalex.org/keywords/materials-science |
| keywords[8].score | 0.12874725461006165 |
| keywords[8].display_name | Materials science |
| keywords[9].id | https://openalex.org/keywords/nanotechnology |
| keywords[9].score | 0.056502461433410645 |
| keywords[9].display_name | Nanotechnology |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2410.13212 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2410.13212 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2410.13212 |
| locations[1].id | doi:10.48550/arxiv.2410.13212 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2410.13212 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5032588771 |
| authorships[0].author.orcid | https://orcid.org/0000-0001-5383-4808 |
| authorships[0].author.display_name | Qian Tao |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Tao, Qian |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5032456107 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-2917-5993 |
| authorships[1].author.display_name | Wenyuan Yu |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Yu, Wenyuan |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5057864403 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-4220-2634 |
| authorships[2].author.display_name | Jingren Zhou |
| authorships[2].author_position | last |
| authorships[2].raw_author_name | Zhou, Jingren |
| authorships[2].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2410.13212 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2024-10-21T00:00:00 |
| display_name | AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10829 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9740999937057495 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1705 |
| primary_topic.subfield.display_name | Computer Networks and Communications |
| primary_topic.display_name | Interconnection Networks and Systems |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2899084033, https://openalex.org/W2748952813, https://openalex.org/W2411923897, https://openalex.org/W4394546135, https://openalex.org/W4285347720, https://openalex.org/W4200259850, https://openalex.org/W2333831899, https://openalex.org/W2484894494, https://openalex.org/W2367385042 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2410.13212 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2410.13212 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2410.13212 |
| primary_location.id | pmh:oai:arXiv.org:2410.13212 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2410.13212 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2410.13212 |
| publication_date | 2024-10-17 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index.1 | 192 |
| abstract_inverted_index.a | 8, 63, 106, 124, 172 |
| abstract_inverted_index.KV | 104, 156 |
| abstract_inverted_index.To | 44 |
| abstract_inverted_index.We | 122, 167 |
| abstract_inverted_index.an | 144 |
| abstract_inverted_index.as | 14, 66 |
| abstract_inverted_index.by | 158 |
| abstract_inverted_index.in | 7, 62 |
| abstract_inverted_index.is | 113 |
| abstract_inverted_index.of | 11, 80, 103, 119, 127, 154, 174, 185, 202 |
| abstract_inverted_index.on | 39 |
| abstract_inverted_index.to | 24, 52, 116, 142, 187, 200 |
| abstract_inverted_index.up | 186 |
| abstract_inverted_index.us | 141 |
| abstract_inverted_index.(KV | 78 |
| abstract_inverted_index.75% | 188 |
| abstract_inverted_index.Our | 148 |
| abstract_inverted_index.The | 138 |
| abstract_inverted_index.and | 17, 75, 82, 90, 135, 164 |
| abstract_inverted_index.due | 23 |
| abstract_inverted_index.for | 59, 151, 162, 182 |
| abstract_inverted_index.key | 74, 89, 120, 134, 163 |
| abstract_inverted_index.one | 48 |
| abstract_inverted_index.our | 178 |
| abstract_inverted_index.out | 169 |
| abstract_inverted_index.the | 40, 54, 73, 88, 99, 109, 117, 128, 155, 183, 203 |
| abstract_inverted_index.Some | 68 |
| abstract_inverted_index.This | 94 |
| abstract_inverted_index.bit, | 193 |
| abstract_inverted_index.from | 133 |
| abstract_inverted_index.have | 3 |
| abstract_inverted_index.into | 98 |
| abstract_inverted_index.loss | 112 |
| abstract_inverted_index.more | 114 |
| abstract_inverted_index.such | 13 |
| abstract_inverted_index.text | 15 |
| abstract_inverted_index.that | 86, 177 |
| abstract_inverted_index.this | 46 |
| abstract_inverted_index.wide | 9 |
| abstract_inverted_index.with | 191, 205 |
| abstract_inverted_index.work | 95 |
| abstract_inverted_index.1-bit | 152 |
| abstract_inverted_index.LLMs, | 81 |
| abstract_inverted_index.LLMs. | 43 |
| abstract_inverted_index.Large | 0 |
| abstract_inverted_index.among | 20 |
| abstract_inverted_index.cache | 77, 157 |
| abstract_inverted_index.carry | 168 |
| abstract_inverted_index.error | 131 |
| abstract_inverted_index.known | 65 |
| abstract_inverted_index.model | 180 |
| abstract_inverted_index.often | 31 |
| abstract_inverted_index.range | 10 |
| abstract_inverted_index.roles | 102 |
| abstract_inverted_index.shown | 4 |
| abstract_inverted_index.their | 25 |
| abstract_inverted_index.these | 29 |
| abstract_inverted_index.those | 201 |
| abstract_inverted_index.treat | 87 |
| abstract_inverted_index.using | 56 |
| abstract_inverted_index.value | 76, 91, 136, 165 |
| abstract_inverted_index.video | 18 |
| abstract_inverted_index.where | 108 |
| abstract_inverted_index.while | 194 |
| abstract_inverted_index.Cache) | 79 |
| abstract_inverted_index.Cache, | 105 |
| abstract_inverted_index.across | 171 |
| abstract_inverted_index.allows | 150, 181 |
| abstract_inverted_index.count, | 28 |
| abstract_inverted_index.deeper | 97 |
| abstract_inverted_index.delves | 96 |
| abstract_inverted_index.layers | 190 |
| abstract_inverted_index.levels | 198 |
| abstract_inverted_index.models | 2, 30, 55, 204 |
| abstract_inverted_index.output | 111, 130 |
| abstract_inverted_index.recent | 69 |
| abstract_inverted_index.space, | 35 |
| abstract_inverted_index.tasks, | 12 |
| abstract_inverted_index.conduct | 123 |
| abstract_inverted_index.decoder | 189 |
| abstract_inverted_index.integer | 57 |
| abstract_inverted_index.massive | 26 |
| abstract_inverted_index.others. | 21 |
| abstract_inverted_index.process | 64 |
| abstract_inverted_index.propose | 143 |
| abstract_inverted_index.require | 32 |
| abstract_inverted_index.storage | 34 |
| abstract_inverted_index.studies | 70 |
| abstract_inverted_index.suggest | 71 |
| abstract_inverted_index.variety | 173 |
| abstract_inverted_index.However, | 22 |
| abstract_inverted_index.approach | 149 |
| abstract_inverted_index.compress | 53 |
| abstract_inverted_index.distinct | 160 |
| abstract_inverted_index.floating | 206 |
| abstract_inverted_index.imposing | 36 |
| abstract_inverted_index.inspires | 140 |
| abstract_inverted_index.language | 1 |
| abstract_inverted_index.machines | 41 |
| abstract_inverted_index.matrices | 92 |
| abstract_inverted_index.numbers, | 61 |
| abstract_inverted_index.overcome | 45 |
| abstract_inverted_index.proposed | 179 |
| abstract_inverted_index.proposes | 51 |
| abstract_inverted_index.research | 49 |
| abstract_inverted_index.attention | 129 |
| abstract_inverted_index.datasets, | 175 |
| abstract_inverted_index.deploying | 42 |
| abstract_inverted_index.designing | 83 |
| abstract_inverted_index.direction | 50 |
| abstract_inverted_index.matrices. | 121, 166 |
| abstract_inverted_index.parameter | 27 |
| abstract_inverted_index.resulting | 132 |
| abstract_inverted_index.sensitive | 115 |
| abstract_inverted_index.strategy. | 147 |
| abstract_inverted_index.asymmetric | 100, 145 |
| abstract_inverted_index.comparable | 199 |
| abstract_inverted_index.generation | 16 |
| abstract_inverted_index.phenomenon | 107, 139 |
| abstract_inverted_index.quantizing | 72 |
| abstract_inverted_index.structural | 101 |
| abstract_inverted_index.systematic | 125 |
| abstract_inverted_index.techniques | 85 |
| abstract_inverted_index.constraints | 38 |
| abstract_inverted_index.examination | 126 |
| abstract_inverted_index.exceptional | 5 |
| abstract_inverted_index.experiments | 170 |
| abstract_inverted_index.generation, | 19 |
| abstract_inverted_index.limitation, | 47 |
| abstract_inverted_index.maintaining | 196 |
| abstract_inverted_index.parameters. | 207 |
| abstract_inverted_index.performance | 197 |
| abstract_inverted_index.significant | 37 |
| abstract_inverted_index.substantial | 33 |
| abstract_inverted_index.capabilities | 6 |
| abstract_inverted_index.implementing | 159 |
| abstract_inverted_index.quantization | 84, 118, 146, 153, 184 |
| abstract_inverted_index.replacements | 58 |
| abstract_inverted_index.Quantization. | 67 |
| abstract_inverted_index.demonstrating | 176 |
| abstract_inverted_index.equivalently. | 93 |
| abstract_inverted_index.quantization. | 137 |
| abstract_inverted_index.transformer's | 110 |
| abstract_inverted_index.configurations | 161 |
| abstract_inverted_index.floating-point | 60 |
| abstract_inverted_index.simultaneously | 195 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 3 |
| citation_normalized_percentile |
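
The `abstract_inverted_index.*` rows above store the abstract as a mapping from each word to the positions where it occurs. A minimal sketch of how such a mapping could be turned back into running text follows; the function name `rebuild_abstract` is hypothetical, and `sample` holds only the first thirteen positions taken from the rows above (the other positions listed for "models", "in", "a", and "of" are left out to keep the excerpt short).

```python
def rebuild_abstract(inverted_index):
    """Rebuild text from a {word: [positions]} mapping."""
    by_position = {}
    for word, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = word
    # Emit words in ascending position order, separated by single spaces.
    return " ".join(by_position[pos] for pos in sorted(by_position))

# Excerpt of the index covering positions 0 through 12 only.
sample = {
    "Large": [0], "language": [1], "models": [2], "have": [3],
    "shown": [4], "exceptional": [5], "capabilities": [6], "in": [7],
    "a": [8], "wide": [9], "range": [10], "of": [11], "tasks,": [12],
}
print(rebuild_abstract(sample))
# prints "Large language models have shown exceptional capabilities in a wide range of tasks,"
```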