InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference
2024 · Open Access
DOI: https://doi.org/10.48550/arxiv.2409.04992
The widespread adoption of Large Language Models (LLMs) marks a significant milestone in generative AI. Nevertheless, the increasing context length and batch size in offline LLM inference escalate the memory requirement of the key-value (KV) cache, which imposes a huge burden on GPU VRAM, especially in resource-constrained scenarios (e.g., edge computing and personal devices). Several cost-effective solutions leverage host memory or SSDs to reduce storage costs and improve throughput for offline inference. However, they suffer significant performance penalties from intensive KV cache accesses over limited PCIe bandwidth. To address these issues, we propose InstInfer, a novel LLM inference system that offloads the most performance-critical computation (i.e., attention in the decoding phase) and data (i.e., the KV cache) to Computational Storage Drives (CSDs), which minimizes the enormous KV transfer overheads. InstInfer designs a dedicated flash-aware in-storage attention engine with KV cache management mechanisms to exploit the high internal bandwidth of CSDs instead of being limited by PCIe bandwidth. Optimized P2P transmission between the GPU and CSDs further reduces data migration overheads. Experimental results demonstrate that for a 13B model using an NVIDIA A6000 GPU, InstInfer improves throughput for long-sequence inference by up to 11.1$\times$ compared to existing SSD-based solutions such as FlexGen.
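
As a rough illustration of why the KV cache strains GPU memory and PCIe bandwidth, the sketch below estimates its size for a 13B-class model. The model shape (40 layers, hidden size 5120), FP16 storage, the batch and sequence sizes, and the 32 GB/s link figure are all assumptions chosen for the arithmetic, not values taken from the abstract; the function names are hypothetical.

```python
def kv_cache_bytes(num_layers, hidden_size, seq_len, batch_size, bytes_per_elem=2):
    """Bytes needed to hold keys and values for every layer and token.

    Assumes standard multi-head attention (no grouped-query sharing)
    and bytes_per_elem=2 for FP16 storage.
    """
    # Factor 2: one key vector and one value vector per token per layer.
    return 2 * num_layers * hidden_size * seq_len * batch_size * bytes_per_elem


def transfer_seconds(num_bytes, bandwidth_bytes_per_s):
    """Lower bound on the time to move num_bytes over a link of the given bandwidth."""
    return num_bytes / bandwidth_bytes_per_s


# Assumed 13B-class shape (illustrative, not stated in the abstract).
NUM_LAYERS = 40
HIDDEN_SIZE = 5120
ASSUMED_LINK_BYTES_PER_S = 32e9  # assumed host-to-GPU link bandwidth

per_token = kv_cache_bytes(NUM_LAYERS, HIDDEN_SIZE, seq_len=1, batch_size=1)
total = kv_cache_bytes(NUM_LAYERS, HIDDEN_SIZE, seq_len=4096, batch_size=16)
step_time = transfer_seconds(total, ASSUMED_LINK_BYTES_PER_S)

print(f"per token: {per_token / 2**10:.0f} KiB")           # per token: 800 KiB
print(f"4096 tokens x batch 16: {total / 2**30:.0f} GiB")  # 4096 tokens x batch 16: 50 GiB
print(f"one full pass over the link: {step_time:.2f} s")   # one full pass over the link: 1.68 s
```

Under these assumptions the cache alone is larger than the 48 GB of VRAM on an A6000, and streaming it across the link once per decoding step would cost more than a second, which is the kind of overhead the abstract attributes to host-memory and SSD offloading.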
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2409.04992, https://arxiv.org/pdf/2409.04992
- OA Status: green
- Cited By: 1
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4403617301
Raw OpenAlex JSON
| Field | Value | Description |
|---|---|---|
| OpenAlex ID | https://openalex.org/W4403617301 | Canonical identifier for this work in OpenAlex |
| DOI | https://doi.org/10.48550/arxiv.2409.04992 | Digital Object Identifier |
| Title | InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference | Work title |
| Type | preprint | OpenAlex work type |
| Language | en | Primary language |
| Publication year | 2024 | Year of publication |
| Publication date | 2024-09-08 | Full publication date if available |
| Authors | Xiurui Pan, Erzhong Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, Jie Zhang | List of authors in order |
| Landing page | https://arxiv.org/abs/2409.04992 | Publisher landing page |
| PDF URL | https://arxiv.org/pdf/2409.04992 | Direct link to full text PDF |
| Open access | Yes | Whether a free full text is available |
| OA status | green | Open access status per OpenAlex |
| OA URL | https://arxiv.org/pdf/2409.04992 | Direct OA link when available |
| Concepts | Inference, Context (archaeology), Computer science, Human–computer interaction, Artificial intelligence, Biology, Paleontology | Top concepts (fields/topics) attached by OpenAlex |
| Cited by | 1 | Total citation count in OpenAlex |
| Citations by year (recent) | 2025: 1 | Per-year citation counts (last 5 years) |
| Related works (count) | 10 | Other works algorithmically related by OpenAlex |
Full payload
| id | https://openalex.org/W4403617301 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2409.04992 |
| ids.doi | https://doi.org/10.48550/arxiv.2409.04992 |
| ids.openalex | https://openalex.org/W4403617301 |
| fwci | |
| type | preprint |
| title | InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11181 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9861000180244446 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1705 |
| topics[0].subfield.display_name | Computer Networks and Communications |
| topics[0].display_name | Advanced Data Storage Technologies |
| topics[1].id | https://openalex.org/T10036 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9785000085830688 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1707 |
| topics[1].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[1].display_name | Advanced Neural Network Applications |
| topics[2].id | https://openalex.org/T10054 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9635999798774719 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1708 |
| topics[2].subfield.display_name | Hardware and Architecture |
| topics[2].display_name | Parallel Computing and Optimization Techniques |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C2776214188 |
| concepts[0].level | 2 |
| concepts[0].score | 0.710321307182312 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q408386 |
| concepts[0].display_name | Inference |
| concepts[1].id | https://openalex.org/C2779343474 |
| concepts[1].level | 2 |
| concepts[1].score | 0.6678645610809326 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q3109175 |
| concepts[1].display_name | Context (archaeology) |
| concepts[2].id | https://openalex.org/C41008148 |
| concepts[2].level | 0 |
| concepts[2].score | 0.6363983154296875 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[2].display_name | Computer science |
| concepts[3].id | https://openalex.org/C107457646 |
| concepts[3].level | 1 |
| concepts[3].score | 0.32926493883132935 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q207434 |
| concepts[3].display_name | Human–computer interaction |
| concepts[4].id | https://openalex.org/C154945302 |
| concepts[4].level | 1 |
| concepts[4].score | 0.3088997006416321 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[4].display_name | Artificial intelligence |
| concepts[5].id | https://openalex.org/C86803240 |
| concepts[5].level | 0 |
| concepts[5].score | 0.0960400402545929 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q420 |
| concepts[5].display_name | Biology |
| concepts[6].id | https://openalex.org/C151730666 |
| concepts[6].level | 1 |
| concepts[6].score | 0.0 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q7205 |
| concepts[6].display_name | Paleontology |
| keywords[0].id | https://openalex.org/keywords/inference |
| keywords[0].score | 0.710321307182312 |
| keywords[0].display_name | Inference |
| keywords[1].id | https://openalex.org/keywords/context |
| keywords[1].score | 0.6678645610809326 |
| keywords[1].display_name | Context (archaeology) |
| keywords[2].id | https://openalex.org/keywords/computer-science |
| keywords[2].score | 0.6363983154296875 |
| keywords[2].display_name | Computer science |
| keywords[3].id | https://openalex.org/keywords/human–computer-interaction |
| keywords[3].score | 0.32926493883132935 |
| keywords[3].display_name | Human–computer interaction |
| keywords[4].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[4].score | 0.3088997006416321 |
| keywords[4].display_name | Artificial intelligence |
| keywords[5].id | https://openalex.org/keywords/biology |
| keywords[5].score | 0.0960400402545929 |
| keywords[5].display_name | Biology |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2409.04992 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2409.04992 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2409.04992 |
| locations[1].id | doi:10.48550/arxiv.2409.04992 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2409.04992 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5011711693 |
| authorships[0].author.orcid | https://orcid.org/0000-0002-2528-2660 |
| authorships[0].author.display_name | Xiurui Pan |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Pan, Xiurui |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5019668670 |
| authorships[1].author.orcid | https://orcid.org/0009-0004-4130-0912 |
| authorships[1].author.display_name | Erzhong Li |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Li, Endian |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5112014055 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-3766-0681 |
| authorships[2].author.display_name | Qiao Li |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Li, Qiao |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5018381533 |
| authorships[3].author.orcid | https://orcid.org/0000-0001-8407-2594 |
| authorships[3].author.display_name | Shengwen Liang |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Liang, Shengwen |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5111117324 |
| authorships[4].author.orcid | |
| authorships[4].author.display_name | Yizhou Shan |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Shan, Yizhou |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5015061573 |
| authorships[5].author.orcid | https://orcid.org/0000-0002-2161-8796 |
| authorships[5].author.display_name | Ke Zhou |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Zhou, Ke |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5007062927 |
| authorships[6].author.orcid | |
| authorships[6].author.display_name | Yingwei Luo |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Luo, Yingwei |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5100395178 |
| authorships[7].author.orcid | https://orcid.org/0000-0003-4293-7523 |
| authorships[7].author.display_name | Xiaolin Wang |
| authorships[7].author_position | middle |
| authorships[7].raw_author_name | Wang, Xiaolin |
| authorships[7].is_corresponding | False |
| authorships[8].author.id | https://openalex.org/A5101711826 |
| authorships[8].author.orcid | https://orcid.org/0000-0001-9803-7140 |
| authorships[8].author.display_name | Jie Zhang |
| authorships[8].author_position | last |
| authorships[8].raw_author_name | Zhang, Jie |
| authorships[8].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2409.04992 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2024-10-22T00:00:00 |
| display_name | InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11181 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9861000180244446 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1705 |
| primary_topic.subfield.display_name | Computer Networks and Communications |
| primary_topic.display_name | Advanced Data Storage Technologies |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2899084033, https://openalex.org/W2748952813, https://openalex.org/W2390279801, https://openalex.org/W4391913857, https://openalex.org/W2358668433, https://openalex.org/W4396701345, https://openalex.org/W2376932109, https://openalex.org/W2001405890, https://openalex.org/W4396696052 |
| cited_by_count | 1 |
| counts_by_year[0].year | 2025 |
| counts_by_year[0].cited_by_count | 1 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2409.04992 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2409.04992 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2409.04992 |
| primary_location.id | pmh:oai:arXiv.org:2409.04992 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2409.04992 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2409.04992 |
| publication_date | 2024-09-08 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 8, 37, 99, 135, 180 |
| abstract_inverted_index.KV | 84, 118, 130, 142 |
| abstract_inverted_index.To | 92 |
| abstract_inverted_index.an | 184 |
| abstract_inverted_index.as | 204 |
| abstract_inverted_index.by | 82, 158, 194 |
| abstract_inverted_index.in | 11, 22, 112 |
| abstract_inverted_index.of | 2, 30, 152, 155 |
| abstract_inverted_index.on | 40 |
| abstract_inverted_index.or | 60 |
| abstract_inverted_index.to | 62, 88, 121, 146, 196, 199 |
| abstract_inverted_index.up | 195 |
| abstract_inverted_index.we | 96 |
| abstract_inverted_index.13B | 181 |
| abstract_inverted_index.AI. | 13 |
| abstract_inverted_index.GPU | 42, 167 |
| abstract_inverted_index.LLM | 24, 101 |
| abstract_inverted_index.P2P | 164 |
| abstract_inverted_index.The | 0, 162 |
| abstract_inverted_index.and | 19, 51, 70, 115, 168 |
| abstract_inverted_index.due | 87 |
| abstract_inverted_index.for | 45, 66, 179, 191 |
| abstract_inverted_index.the | 15, 27, 31, 41, 72, 106, 128, 148, 159 |
| abstract_inverted_index.(KV) | 33 |
| abstract_inverted_index.CSDs | 153, 169 |
| abstract_inverted_index.GPU, | 187 |
| abstract_inverted_index.PCIe | 90, 160 |
| abstract_inverted_index.SSDs | 61 |
| abstract_inverted_index.data | 116, 172 |
| abstract_inverted_index.edge | 49 |
| abstract_inverted_index.from | 77 |
| abstract_inverted_index.high | 149 |
| abstract_inverted_index.host | 58 |
| abstract_inverted_index.huge | 38 |
| abstract_inverted_index.most | 107 |
| abstract_inverted_index.size | 21 |
| abstract_inverted_index.such | 203 |
| abstract_inverted_index.that | 104, 178 |
| abstract_inverted_index.they | 75 |
| abstract_inverted_index.with | 141 |
| abstract_inverted_index.A6000 | 186 |
| abstract_inverted_index.Large | 3 |
| abstract_inverted_index.VRAM, | 43 |
| abstract_inverted_index.batch | 20 |
| abstract_inverted_index.being | 156 |
| abstract_inverted_index.cache | 85, 143 |
| abstract_inverted_index.costs | 65 |
| abstract_inverted_index.marks | 7 |
| abstract_inverted_index.model | 182 |
| abstract_inverted_index.novel | 100 |
| abstract_inverted_index.parts | 120 |
| abstract_inverted_index.these | 94 |
| abstract_inverted_index.using | 183 |
| abstract_inverted_index.which | 35, 126 |
| abstract_inverted_index.(LLMs) | 6 |
| abstract_inverted_index.(e.g., | 48 |
| abstract_inverted_index.(i.e., | 110, 117 |
| abstract_inverted_index.Drives | 124 |
| abstract_inverted_index.Models | 5 |
| abstract_inverted_index.NVIDIA | 185 |
| abstract_inverted_index.burden | 39 |
| abstract_inverted_index.cache) | 119 |
| abstract_inverted_index.cache, | 34 |
| abstract_inverted_index.engine | 140 |
| abstract_inverted_index.length | 18 |
| abstract_inverted_index.memory | 28, 59 |
| abstract_inverted_index.phase) | 114 |
| abstract_inverted_index.reduce | 63 |
| abstract_inverted_index.suffer | 76 |
| abstract_inverted_index.system | 103 |
| abstract_inverted_index.(CSDs), | 125 |
| abstract_inverted_index.Several | 54 |
| abstract_inverted_index.Storage | 123 |
| abstract_inverted_index.address | 93 |
| abstract_inverted_index.between | 166 |
| abstract_inverted_index.context | 17 |
| abstract_inverted_index.designs | 134 |
| abstract_inverted_index.exploit | 147 |
| abstract_inverted_index.further | 170 |
| abstract_inverted_index.imposed | 81 |
| abstract_inverted_index.imposes | 36 |
| abstract_inverted_index.improve | 71 |
| abstract_inverted_index.instead | 154 |
| abstract_inverted_index.issues, | 95 |
| abstract_inverted_index.limited | 89, 157 |
| abstract_inverted_index.offline | 23, 67 |
| abstract_inverted_index.propose | 97 |
| abstract_inverted_index.reduces | 171 |
| abstract_inverted_index.results | 176 |
| abstract_inverted_index.storage | 64 |
| abstract_inverted_index.FlexGen. | 205 |
| abstract_inverted_index.Language | 4 |
| abstract_inverted_index.accesses | 86 |
| abstract_inverted_index.compared | 198 |
| abstract_inverted_index.decoding | 113 |
| abstract_inverted_index.enormous | 129 |
| abstract_inverted_index.escalate | 26 |
| abstract_inverted_index.existing | 200 |
| abstract_inverted_index.improves | 189 |
| abstract_inverted_index.internal | 150 |
| abstract_inverted_index.leverage | 57 |
| abstract_inverted_index.minimize | 127 |
| abstract_inverted_index.offloads | 105 |
| abstract_inverted_index.personal | 52 |
| abstract_inverted_index.transfer | 131 |
| abstract_inverted_index.InstInfer | 133, 188 |
| abstract_inverted_index.SSD-based | 201 |
| abstract_inverted_index.attention | 111, 139 |
| abstract_inverted_index.computing | 50 |
| abstract_inverted_index.dedicated | 136 |
| abstract_inverted_index.devices). | 53 |
| abstract_inverted_index.inference | 25, 68, 102, 193 |
| abstract_inverted_index.intensive | 83 |
| abstract_inverted_index.key-value | 32 |
| abstract_inverted_index.migration | 173 |
| abstract_inverted_index.milestone | 10 |
| abstract_inverted_index.optimized | 163 |
| abstract_inverted_index.penalties | 80 |
| abstract_inverted_index.scenarios | 47, 69 |
| abstract_inverted_index.solutions | 56, 202 |
| abstract_inverted_index.InstInfer, | 98 |
| abstract_inverted_index.bandwidth. | 91, 161 |
| abstract_inverted_index.bandwidths | 151 |
| abstract_inverted_index.especially | 44 |
| abstract_inverted_index.generative | 12 |
| abstract_inverted_index.in-storage | 138 |
| abstract_inverted_index.increasing | 16 |
| abstract_inverted_index.management | 144 |
| abstract_inverted_index.mechanisms | 145 |
| abstract_inverted_index.overheads. | 132, 174 |
| abstract_inverted_index.throughput | 190 |
| abstract_inverted_index.widespread | 1 |
| abstract_inverted_index.computation | 109 |
| abstract_inverted_index.demonstrate | 177 |
| abstract_inverted_index.flash-aware | 137 |
| abstract_inverted_index.performance | 79 |
| abstract_inverted_index.requirement | 29 |
| abstract_inverted_index.significant | 9, 78 |
| abstract_inverted_index.throughput. | 73 |
| abstract_inverted_index.Experimental | 175 |
| abstract_inverted_index.transmission | 165 |
| abstract_inverted_index.11.1$\times$, | 197 |
| abstract_inverted_index.Computational | 122 |
| abstract_inverted_index.Nevertheless, | 14, 74 |
| abstract_inverted_index.long-sequence | 192 |
| abstract_inverted_index.cost-effective | 55 |
| abstract_inverted_index.resource-constraint | 46 |
| abstract_inverted_index.performance-critical | 108 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 9 |
| citation_normalized_percentile | |
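
The `abstract_inverted_index.*` rows above store the abstract as a mapping from each token to the word positions where it occurs. A minimal sketch of turning such a mapping back into running text follows; the function name `rebuild_abstract` is hypothetical, and the sample dictionary is a small subset of the rows above, trimmed to positions 96 through 103 only.

```python
def rebuild_abstract(inverted_index):
    """Rebuild text from a mapping of token -> list of word positions."""
    by_position = {}
    for token, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = token
    # Join tokens in ascending position order; gaps in the numbering are skipped.
    return " ".join(by_position[pos] for pos in sorted(by_position))


# Subset of the payload rows, restricted to positions 96..103.
sample_index = {
    "we": [96],
    "propose": [97],
    "InstInfer,": [98],
    "a": [99],
    "novel": [100],
    "LLM": [101],
    "inference": [102],
    "system": [103],
}

print(rebuild_abstract(sample_index))
# we propose InstInfer, a novel LLM inference system
```

Feeding the function the full set of rows, with every listed position for each token, should reproduce the abstract as OpenAlex tokenized it, with punctuation still attached to the neighboring words.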