Pie: Pooling CPU Memory for LLM Inference
2024 · Open Access · DOI: https://doi.org/10.48550/arxiv.2411.09317
The rapid growth of LLMs has revolutionized natural language processing and AI analysis, but their increasing size and memory demands present significant challenges. A common solution is to spill over to CPU memory; however, traditional GPU-CPU memory swapping often results in higher latency and lower throughput. This paper introduces Pie, an LLM inference framework that addresses these challenges with performance-transparent swapping and adaptive expansion. By leveraging predictable memory access patterns and the high bandwidth of modern hardware such as the NVIDIA GH200 Grace Hopper Superchip, Pie swaps data concurrently with foreground computation, expanding effective memory without added latency. Adaptive expansion dynamically adjusts the CPU memory allocation based on real-time information, optimizing memory usage and performance under varying conditions. Pie maintains low computation latency, high throughput, and high elasticity. Our experimental evaluation demonstrates that Pie reaches an optimal swapping policy during cache warmup and balances increased memory capacity against a negligible impact on computation. With its extended capacity, Pie outperforms vLLM by up to 1.9X in throughput and 2X in latency. Additionally, Pie can reduce GPU memory usage by up to 1.67X while maintaining the same performance. Compared to FlexGen, an offline profiling-based swapping solution, Pie achieves orders of magnitude lower latency and 9.4X higher throughput.
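
The abstract's two mechanisms can be illustrated with a toy timing model. The sketch below is not Pie's implementation: the function names, the single serial copy stream, the unlimited prefetch depth, and the 5% tolerance are all assumptions made for illustration. It shows why a transfer that finishes while the previous layer is still computing adds no latency, and how a simple feedback rule could grow or shrink the number of layers kept in CPU memory based on measured latency.

```python
def synchronous_latency(compute_ms, transfer_ms):
    """Baseline: every layer waits for its own transfer, then computes."""
    return float(sum(compute_ms) + sum(transfer_ms))


def overlapped_latency(compute_ms, transfer_ms):
    """Transfers run back to back on a separate copy stream (assumed model).

    A layer starts computing once the previous layer has finished and its
    own data has arrived. Prefetch depth is assumed unlimited.
    """
    copy_done = 0.0      # time at which the current layer's data has arrived
    compute_done = 0.0   # time at which the previous layer finished computing
    for compute, transfer in zip(compute_ms, transfer_ms):
        copy_done += transfer
        compute_done = max(compute_done, copy_done) + compute
    return compute_done


def adapt_swapped_layers(current, measured_ms, baseline_ms, max_layers,
                         tolerance=0.05):
    """Hypothetical controller: back off on slowdown, otherwise expand."""
    if measured_ms > baseline_ms * (1.0 + tolerance):
        return max(0, current - 1)
    return min(max_layers, current + 1)


if __name__ == "__main__":
    compute = [4.0, 4.0, 4.0, 4.0]        # per-layer compute time (made up)
    fast_link = [0.0, 3.0, 3.0, 3.0]      # first layer already on the GPU
    slow_link = [0.0, 6.0, 6.0, 6.0]

    print(synchronous_latency(compute, fast_link))  # 25.0
    print(overlapped_latency(compute, fast_link))   # 16.0, same as compute only
    print(overlapped_latency(compute, slow_link))   # 22.0, compute stalls
```

In this model swapping is invisible only while each transfer is no longer than the computation it hides behind; once transfers take longer, the controller's measured latency exceeds the baseline and it reduces the swapped share.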
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2411.09317, https://arxiv.org/pdf/2411.09317
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4404450630
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4404450630 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2411.09317 (Digital Object Identifier)
- Title: Pie: Pooling CPU Memory for LLM Inference (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2024 (year of publication)
- Publication date: 2024-11-14 (full publication date if available)
- Authors: Yi Xu, Ziming Mao, Xiangxi Mo, Shu Liu, Ion Stoica (list of authors in order)
- Landing page: https://arxiv.org/abs/2411.09317 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2411.09317 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2411.09317 (direct OA link when available)
- Concepts: Pooling, Inference, Computer science, Parallel computing, Artificial intelligence (top concepts, fields, and topics attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4404450630 |
| doi | https://doi.org/10.48550/arxiv.2411.09317 |
| ids.doi | https://doi.org/10.48550/arxiv.2411.09317 |
| ids.openalex | https://openalex.org/W4404450630 |
| fwci | |
| type | preprint |
| title | Pie: Pooling CPU Memory for LLM Inference |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10054 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.8964999914169312 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1708 |
| topics[0].subfield.display_name | Hardware and Architecture |
| topics[0].display_name | Parallel Computing and Optimization Techniques |
| topics[1].id | https://openalex.org/T11181 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.8435999751091003 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1705 |
| topics[1].subfield.display_name | Computer Networks and Communications |
| topics[1].display_name | Advanced Data Storage Technologies |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C70437156 |
| concepts[0].level | 2 |
| concepts[0].score | 0.770053505897522 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q7228652 |
| concepts[0].display_name | Pooling |
| concepts[1].id | https://openalex.org/C2776214188 |
| concepts[1].level | 2 |
| concepts[1].score | 0.7212824821472168 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q408386 |
| concepts[1].display_name | Inference |
| concepts[2].id | https://openalex.org/C41008148 |
| concepts[2].level | 0 |
| concepts[2].score | 0.6134549975395203 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[2].display_name | Computer science |
| concepts[3].id | https://openalex.org/C173608175 |
| concepts[3].level | 1 |
| concepts[3].score | 0.4849858283996582 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q232661 |
| concepts[3].display_name | Parallel computing |
| concepts[4].id | https://openalex.org/C154945302 |
| concepts[4].level | 1 |
| concepts[4].score | 0.28265833854675293 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[4].display_name | Artificial intelligence |
| keywords[0].id | https://openalex.org/keywords/pooling |
| keywords[0].score | 0.770053505897522 |
| keywords[0].display_name | Pooling |
| keywords[1].id | https://openalex.org/keywords/inference |
| keywords[1].score | 0.7212824821472168 |
| keywords[1].display_name | Inference |
| keywords[2].id | https://openalex.org/keywords/computer-science |
| keywords[2].score | 0.6134549975395203 |
| keywords[2].display_name | Computer science |
| keywords[3].id | https://openalex.org/keywords/parallel-computing |
| keywords[3].score | 0.4849858283996582 |
| keywords[3].display_name | Parallel computing |
| keywords[4].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[4].score | 0.28265833854675293 |
| keywords[4].display_name | Artificial intelligence |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2411.09317 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2411.09317 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2411.09317 |
| locations[1].id | doi:10.48550/arxiv.2411.09317 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2411.09317 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5085919297 |
| authorships[0].author.orcid | https://orcid.org/0000-0003-1679-3271 |
| authorships[0].author.display_name | Yi Xu |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Xu, Yi |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5011833703 |
| authorships[1].author.orcid | https://orcid.org/0009-0006-5985-0968 |
| authorships[1].author.display_name | Ziming Mao |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Mao, Ziming |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5042410262 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | Xiangxi Mo |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Mo, Xiangxi |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5115590899 |
| authorships[3].author.orcid | https://orcid.org/0009-0000-0252-0673 |
| authorships[3].author.display_name | Shu Liu |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Liu, Shu |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5041920173 |
| authorships[4].author.orcid | https://orcid.org/0000-0002-5373-0088 |
| authorships[4].author.display_name | Ion Stoica |
| authorships[4].author_position | last |
| authorships[4].raw_author_name | Stoica, Ion |
| authorships[4].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2411.09317 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Pie: Pooling CPU Memory for LLM Inference |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10054 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.8964999914169312 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1708 |
| primary_topic.subfield.display_name | Hardware and Architecture |
| primary_topic.display_name | Parallel Computing and Optimization Techniques |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2899084033, https://openalex.org/W2748952813, https://openalex.org/W2953234277, https://openalex.org/W2626256601, https://openalex.org/W147410782, https://openalex.org/W2900413183, https://openalex.org/W4390975304, https://openalex.org/W3022252430, https://openalex.org/W4287804464 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2411.09317 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2411.09317 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2411.09317 |
| primary_location.id | pmh:oai:arXiv.org:2411.09317 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2411.09317 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2411.09317 |
| publication_date | 2024-11-14 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index.A | 23 |
| abstract_inverted_index.2X | 166 |
| abstract_inverted_index.AI | 11 |
| abstract_inverted_index.By | 64 |
| abstract_inverted_index.an | 50, 188 |
| abstract_inverted_index.by | 159, 176 |
| abstract_inverted_index.in | 40, 163, 167 |
| abstract_inverted_index.is | 26 |
| abstract_inverted_index.of | 3, 74 |
| abstract_inverted_index.on | 107, 150 |
| abstract_inverted_index.to | 27, 30, 161, 178, 186 |
| abstract_inverted_index.up | 160, 177 |
| abstract_inverted_index.CPU | 31, 103 |
| abstract_inverted_index.GPU | 173 |
| abstract_inverted_index.LLM | 51 |
| abstract_inverted_index.Our | 128 |
| abstract_inverted_index.Pie | 84, 118, 133, 156, 170, 193 |
| abstract_inverted_index.The | 0 |
| abstract_inverted_index.and | 10, 17, 43, 61, 70, 113, 125, 141, 165, 198 |
| abstract_inverted_index.but | 13 |
| abstract_inverted_index.can | 171 |
| abstract_inverted_index.has | 5 |
| abstract_inverted_index.its | 153 |
| abstract_inverted_index.low | 120 |
| abstract_inverted_index.the | 71, 78, 182 |
| abstract_inverted_index.1.9X | 162 |
| abstract_inverted_index.9.4X | 199 |
| abstract_inverted_index.LLMs | 4 |
| abstract_inverted_index.Pie, | 49 |
| abstract_inverted_index.This | 46 |
| abstract_inverted_index.With | 152 |
| abstract_inverted_index.data | 87 |
| abstract_inverted_index.high | 72, 123, 126 |
| abstract_inverted_index.like | 77 |
| abstract_inverted_index.over | 29 |
| abstract_inverted_index.same | 183 |
| abstract_inverted_index.size | 16 |
| abstract_inverted_index.that | 54, 132 |
| abstract_inverted_index.vLLM | 158 |
| abstract_inverted_index.with | 58, 147 |
| abstract_inverted_index.1.67X | 179 |
| abstract_inverted_index.GH200 | 80 |
| abstract_inverted_index.Grace | 81 |
| abstract_inverted_index.added | 97 |
| abstract_inverted_index.based | 106 |
| abstract_inverted_index.cache | 139 |
| abstract_inverted_index.lower | 44, 196 |
| abstract_inverted_index.often | 38 |
| abstract_inverted_index.paper | 47 |
| abstract_inverted_index.rapid | 1 |
| abstract_inverted_index.spill | 28 |
| abstract_inverted_index.their | 14 |
| abstract_inverted_index.these | 56 |
| abstract_inverted_index.under | 115 |
| abstract_inverted_index.usage | 112, 175 |
| abstract_inverted_index.while | 180 |
| abstract_inverted_index.Hopper | 82 |
| abstract_inverted_index.NVIDIA | 79 |
| abstract_inverted_index.access | 68 |
| abstract_inverted_index.common | 24 |
| abstract_inverted_index.during | 138 |
| abstract_inverted_index.growth | 2 |
| abstract_inverted_index.higher | 41, 200 |
| abstract_inverted_index.impact | 149 |
| abstract_inverted_index.memory | 18, 36, 67, 95, 104, 111, 145, 174 |
| abstract_inverted_index.modern | 75 |
| abstract_inverted_index.policy | 137 |
| abstract_inverted_index.reduce | 172 |
| abstract_inverted_index.warmup | 140 |
| abstract_inverted_index.GPU-CPU | 35 |
| abstract_inverted_index.adjusts | 102 |
| abstract_inverted_index.demands | 19 |
| abstract_inverted_index.enables | 85 |
| abstract_inverted_index.latency | 42, 197 |
| abstract_inverted_index.memory; | 32 |
| abstract_inverted_index.natural | 7 |
| abstract_inverted_index.offline | 189 |
| abstract_inverted_index.optimal | 135 |
| abstract_inverted_index.present | 20 |
| abstract_inverted_index.results | 39 |
| abstract_inverted_index.varying | 116 |
| abstract_inverted_index.without | 89, 96 |
| abstract_inverted_index.Adaptive | 99 |
| abstract_inverted_index.Compared | 185 |
| abstract_inverted_index.FlexGen, | 187 |
| abstract_inverted_index.achieves | 134, 194 |
| abstract_inverted_index.adaptive | 62 |
| abstract_inverted_index.balances | 143 |
| abstract_inverted_index.capacity | 146 |
| abstract_inverted_index.extended | 154 |
| abstract_inverted_index.hardware | 76 |
| abstract_inverted_index.however, | 33 |
| abstract_inverted_index.language | 8 |
| abstract_inverted_index.latency, | 122 |
| abstract_inverted_index.latency. | 98, 168 |
| abstract_inverted_index.patterns | 69 |
| abstract_inverted_index.solution | 25 |
| abstract_inverted_index.swapping | 37, 60, 88, 136, 191 |
| abstract_inverted_index.addresses | 55 |
| abstract_inverted_index.affecting | 90 |
| abstract_inverted_index.analysis, | 12 |
| abstract_inverted_index.bandwidth | 73 |
| abstract_inverted_index.capacity, | 155 |
| abstract_inverted_index.effective | 94 |
| abstract_inverted_index.expanding | 93 |
| abstract_inverted_index.expansion | 100 |
| abstract_inverted_index.framework | 53 |
| abstract_inverted_index.increased | 144 |
| abstract_inverted_index.inference | 52 |
| abstract_inverted_index.maintains | 119 |
| abstract_inverted_index.real-time | 108 |
| abstract_inverted_index.solution, | 192 |
| abstract_inverted_index.Superchip, | 83 |
| abstract_inverted_index.allocation | 105 |
| abstract_inverted_index.challenges | 57 |
| abstract_inverted_index.concurrent | 86 |
| abstract_inverted_index.evaluation | 130 |
| abstract_inverted_index.expansion. | 63 |
| abstract_inverted_index.foreground | 91 |
| abstract_inverted_index.increasing | 15 |
| abstract_inverted_index.introduces | 48 |
| abstract_inverted_index.leveraging | 65 |
| abstract_inverted_index.magnitudes | 195 |
| abstract_inverted_index.negligible | 148 |
| abstract_inverted_index.optimizing | 110 |
| abstract_inverted_index.processing | 9 |
| abstract_inverted_index.throughput | 164 |
| abstract_inverted_index.challenges. | 22 |
| abstract_inverted_index.computation | 121 |
| abstract_inverted_index.conditions. | 117 |
| abstract_inverted_index.dynamically | 101 |
| abstract_inverted_index.effectively | 142 |
| abstract_inverted_index.elasticity. | 127 |
| abstract_inverted_index.maintaining | 181 |
| abstract_inverted_index.outperforms | 157 |
| abstract_inverted_index.performance | 114 |
| abstract_inverted_index.predictable | 66 |
| abstract_inverted_index.significant | 21 |
| abstract_inverted_index.throughput, | 124 |
| abstract_inverted_index.throughput. | 45, 201 |
| abstract_inverted_index.traditional | 34 |
| abstract_inverted_index.computation, | 92 |
| abstract_inverted_index.computation. | 151 |
| abstract_inverted_index.demonstrates | 131 |
| abstract_inverted_index.experimental | 129 |
| abstract_inverted_index.information, | 109 |
| abstract_inverted_index.performance. | 184 |
| abstract_inverted_index.Additionally, | 169 |
| abstract_inverted_index.revolutionized | 6 |
| abstract_inverted_index.profiling-based | 190 |
| abstract_inverted_index.performance-transparent | 59 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 5 |
| citation_normalized_percentile | |
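
The `abstract_inverted_index.*` rows in the payload store the abstract as a map from each token to the word positions where it occurs. A minimal sketch of turning those flattened rows back into text follows; the function names `parse_rows` and `rebuild_abstract` are hypothetical helpers, and the parser assumes rows shaped exactly like the two-cell rows in the table above.

```python
PREFIX = "abstract_inverted_index."


def parse_rows(rows):
    """Collect {token: [positions]} from flattened two-cell table rows."""
    index = {}
    for row in rows:
        cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
        if len(cells) != 2 or not cells[0].startswith(PREFIX):
            continue  # skip rows that are not part of the inverted index
        token = cells[0][len(PREFIX):]
        index[token] = [int(pos) for pos in cells[1].split(",")]
    return index


def rebuild_abstract(index):
    """Place every token at each of its positions and join them in order."""
    by_position = {}
    for token, positions in index.items():
        for pos in positions:
            by_position[pos] = token
    return " ".join(by_position[pos] for pos in sorted(by_position))


if __name__ == "__main__":
    sample = [
        "| abstract_inverted_index.The | 0 |",
        "| abstract_inverted_index.rapid | 1 |",
        "| abstract_inverted_index.growth | 2 |",
        "| abstract_inverted_index.of | 3, 74 |",
        "| abstract_inverted_index.LLMs | 4 |",
        "| language | en |",
    ]
    # Only five tokens are supplied, so position 74 follows position 4 directly.
    print(rebuild_abstract(parse_rows(sample)))  # The rapid growth of LLMs of
```

Feeding every `abstract_inverted_index.*` row to `parse_rows` fills positions 0 through 201, so the joined output should read as the abstract above.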