Efficient LLM Serving on Hybrid Real-time and Best-effort Requests
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2504.09590
Recent breakthroughs in large language models (LLMs) have enabled various generative tasks on a single model. Real-world services powered by an LLM (e.g., OpenAI's ChatGPT [27]) often concurrently support latency-critical requests for interactive applications (e.g., question-answering systems; referred to as real-time or RT requests) and throughput-oriented requests for back-of-house processing (e.g., batch document processing [28]; referred to as best-effort or BE requests), imposing complex hybrid inference workloads on the underlying model. State-of-the-art (SOTA) LLM serving systems dedicate separate machines to each type of request, optimizing for either low inference latency or high serving throughput, respectively. This practice simplifies request scheduling and management but suffers from poor resource utilization. We propose BROS, a hybrid LLM serving system that aims to colocate RT and BE requests, meeting RT requests' latency requirements while maintaining BE requests' throughput. BROS formulates the problem of hybrid RT/BE request scheduling and solves it with a dynamic priority-based algorithm. BROS also designs a bidirectional KV cache management mechanism that lets RT requests share KV memory with BE requests, removing the scheduling restrictions caused by insufficient KV memory and improving utilization. Extensive experiments validate that BROS achieves a good trade-off when serving hybrid RT and BE requests: it reduces the latency of RT requests by up to 74.20% and improves their fine-grained service level objective (SLO) attainment by up to 36.38x, with negligible throughput reduction for BE requests, showing significant advantages over SOTA systems such as vLLM and TGI.
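The abstract does not spell out the scheduling algorithm, so the following is only a minimal sketch of what a dynamic priority-based policy for mixed RT/BE queues might look like. It is not the BROS algorithm. Every name here (`Request`, `urgency`, `pick_batch`, the `be_aging` weight) and the priority formula itself (inverse deadline slack for RT requests, waiting-time aging for BE requests) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    req_id: str
    kind: str                         # "RT" (real-time) or "BE" (best-effort)
    arrival: float                    # arrival time, in seconds
    deadline: Optional[float] = None  # absolute deadline; only used for RT


def urgency(req: Request, now: float, be_aging: float = 0.1) -> float:
    """Hypothetical priority score; a higher value is scheduled sooner.

    The score depends on `now`, so it is recomputed at every scheduling
    step, which is what makes the priority dynamic in this sketch.
    """
    if req.kind == "RT":
        if req.deadline is None:
            raise ValueError("RT requests need a deadline in this sketch")
        # Less remaining slack means higher urgency; clamp to avoid division by zero.
        slack = max(req.deadline - now, 1e-6)
        return 1.0 / slack
    # BE requests slowly gain priority while waiting, so they are not starved.
    return be_aging * (now - req.arrival)


def pick_batch(queue: List[Request], now: float, batch_size: int) -> List[Request]:
    """Select the `batch_size` most urgent requests for the next iteration."""
    ranked = sorted(queue, key=lambda r: urgency(r, now), reverse=True)
    return ranked[:batch_size]


queue = [
    Request("be1", "BE", arrival=0.0),
    Request("rt1", "RT", arrival=9.0, deadline=10.5),
    Request("rt2", "RT", arrival=9.5, deadline=20.0),
]
batch = pick_batch(queue, now=10.0, batch_size=2)
print([r.req_id for r in batch])  # ['rt1', 'be1']
```

In this toy run the RT request close to its deadline goes first, while a BE request that has waited a long time is scheduled ahead of an RT request with plenty of slack. This is one way colocation could protect RT latency without collapsing BE throughput.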
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2504.09590, https://arxiv.org/pdf/2504.09590
- OA Status: green
- OpenAlex ID: https://openalex.org/W4415157826
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4415157826 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2504.09590 (Digital Object Identifier)
- Title: Efficient LLM Serving on Hybrid Real-time and Best-effort Requests (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025 (year of publication)
- Publication date: 2025-04-13 (full publication date if available)
- Authors: Borui Wan, Juntao Zhao, Chenyu Jiang, Chuanxiong Guo, Chuan Wu (list of authors in order)
- Landing page: https://arxiv.org/abs/2504.09590 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2504.09590 (direct link to full-text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2504.09590 (direct OA link when available)
- Cited by: 0 (total citation count in OpenAlex)
Full payload
| id | https://openalex.org/W4415157826 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2504.09590 |
| ids.doi | https://doi.org/10.48550/arxiv.2504.09590 |
| ids.openalex | https://openalex.org/W4415157826 |
| fwci | |
| type | preprint |
| title | Efficient LLM Serving on Hybrid Real-time and Best-effort Requests |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10444 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9861000180244446 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Context-Aware Activity Recognition Systems |
| topics[1].id | https://openalex.org/T10715 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.983299970626831 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1705 |
| topics[1].subfield.display_name | Computer Networks and Communications |
| topics[1].display_name | Distributed and Parallel Computing Systems |
| topics[2].id | https://openalex.org/T10933 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9749000072479248 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1708 |
| topics[2].subfield.display_name | Hardware and Architecture |
| topics[2].display_name | Real-Time Systems Scheduling |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2504.09590 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2504.09590 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2504.09590 |
| locations[1].id | doi:10.48550/arxiv.2504.09590 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2504.09590 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5001219659 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Wan Borui |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Borui, Wan |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5087869611 |
| authorships[1].author.orcid | https://orcid.org/0000-0003-3376-0607 |
| authorships[1].author.display_name | Juntao Zhao |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Juntao, Zhao |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5119989843 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | Jiang Chenyu |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Chenyu, Jiang |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5054205326 |
| authorships[3].author.orcid | https://orcid.org/0000-0002-0730-8468 |
| authorships[3].author.display_name | Chuanxiong Guo |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Chuanxiong, Guo |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5100368781 |
| authorships[4].author.orcid | https://orcid.org/0000-0001-8459-5678 |
| authorships[4].author.display_name | Chuansong Wu |
| authorships[4].author_position | last |
| authorships[4].raw_author_name | Chuan, Wu |
| authorships[4].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2504.09590 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-14T00:00:00 |
| display_name | Efficient LLM Serving on Hybrid Real-time and Best-effort Requests |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10444 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9861000180244446 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Context-Aware Activity Recognition Systems |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2504.09590 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2504.09590 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2504.09590 |
| primary_location.id | pmh:oai:arXiv.org:2504.09590 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2504.09590 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2504.09590 |
| publication_date | 2025-04-13 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 13, 108, 142, 148, 183 |
| abstract_inverted_index.BE | 59, 126, 162, 191, 220 |
| abstract_inverted_index.It | 193 |
| abstract_inverted_index.KV | 150, 159, 172 |
| abstract_inverted_index.RT | 42, 120, 155, 189, 199 |
| abstract_inverted_index.We | 105 |
| abstract_inverted_index.an | 24 |
| abstract_inverted_index.as | 39 |
| abstract_inverted_index.by | 23, 170 |
| abstract_inverted_index.in | 2 |
| abstract_inverted_index.it | 140 |
| abstract_inverted_index.of | 80, 133, 198 |
| abstract_inverted_index.on | 12 |
| abstract_inverted_index.or | 41, 58, 87 |
| abstract_inverted_index.to | 38, 56, 66, 77, 115, 157, 164, 202, 213 |
| abstract_inverted_index.(up | 201, 212 |
| abstract_inverted_index.LLM | 25, 72, 110 |
| abstract_inverted_index.and | 44, 97, 138, 174, 190, 230 |
| abstract_inverted_index.but | 99 |
| abstract_inverted_index.for | 31, 47, 219 |
| abstract_inverted_index.low | 84 |
| abstract_inverted_index.the | 67, 131, 166, 196 |
| abstract_inverted_index.BROS | 129, 146, 181 |
| abstract_inverted_index.SOTA | 226 |
| abstract_inverted_index.TGI. | 231 |
| abstract_inverted_index.This | 92 |
| abstract_inverted_index.aims | 114 |
| abstract_inverted_index.each | 78 |
| abstract_inverted_index.from | 101 |
| abstract_inverted_index.good | 184 |
| abstract_inverted_index.have | 7 |
| abstract_inverted_index.high | 88 |
| abstract_inverted_index.like | 228 |
| abstract_inverted_index.over | 225 |
| abstract_inverted_index.poor | 102 |
| abstract_inverted_index.that | 113, 180 |
| abstract_inverted_index.type | 79 |
| abstract_inverted_index.vLLM | 229 |
| abstract_inverted_index.when | 186 |
| abstract_inverted_index.with | 61, 141, 161, 215 |
| abstract_inverted_index.BROS, | 107 |
| abstract_inverted_index.RT/BE | 117, 135 |
| abstract_inverted_index.[27]) | 21 |
| abstract_inverted_index.[28], | 54 |
| abstract_inverted_index.batch | 52 |
| abstract_inverted_index.cache | 151 |
| abstract_inverted_index.large | 3 |
| abstract_inverted_index.level | 208 |
| abstract_inverted_index.often | 26 |
| abstract_inverted_index.share | 158 |
| abstract_inverted_index.tasks | 11 |
| abstract_inverted_index.their | 205 |
| abstract_inverted_index.while | 124 |
| abstract_inverted_index.(LLMs) | 6 |
| abstract_inverted_index.(SLOs) | 210 |
| abstract_inverted_index.(SOTA) | 71 |
| abstract_inverted_index.(e.g., | 18, 34, 50 |
| abstract_inverted_index.Models | 5 |
| abstract_inverted_index.Recent | 0 |
| abstract_inverted_index.caused | 169 |
| abstract_inverted_index.either | 83 |
| abstract_inverted_index.hybrid | 63, 109, 134, 188 |
| abstract_inverted_index.memory | 160, 173 |
| abstract_inverted_index.model. | 15, 69 |
| abstract_inverted_index.remove | 165 |
| abstract_inverted_index.single | 14 |
| abstract_inverted_index.solves | 139 |
| abstract_inverted_index.system | 112 |
| abstract_inverted_index.ChatGPT | 20 |
| abstract_inverted_index.complex | 62 |
| abstract_inverted_index.designs | 147 |
| abstract_inverted_index.dynamic | 143 |
| abstract_inverted_index.enabled | 8 |
| abstract_inverted_index.improve | 175 |
| abstract_inverted_index.latency | 86, 122, 197 |
| abstract_inverted_index.meeting | 119 |
| abstract_inverted_index.powered | 22 |
| abstract_inverted_index.problem | 132 |
| abstract_inverted_index.propose | 106 |
| abstract_inverted_index.reduces | 195 |
| abstract_inverted_index.request | 95, 136 |
| abstract_inverted_index.service | 207 |
| abstract_inverted_index.serving | 73, 89, 111, 187 |
| abstract_inverted_index.showing | 222 |
| abstract_inverted_index.suffers | 100 |
| abstract_inverted_index.support | 28 |
| abstract_inverted_index.systems | 74, 227 |
| abstract_inverted_index.towards | 82 |
| abstract_inverted_index.various | 9 |
| abstract_inverted_index.36.38x), | 214 |
| abstract_inverted_index.74.20%), | 203 |
| abstract_inverted_index.Language | 4 |
| abstract_inverted_index.OpenAI's | 19 |
| abstract_inverted_index.achieves | 182 |
| abstract_inverted_index.allowing | 154 |
| abstract_inverted_index.dedicate | 75 |
| abstract_inverted_index.machines | 76 |
| abstract_inverted_index.practice | 93 |
| abstract_inverted_index.referred | 37, 55 |
| abstract_inverted_index.request, | 81 |
| abstract_inverted_index.requests | 30, 46, 156, 163, 200 |
| abstract_inverted_index.resource | 103 |
| abstract_inverted_index.services | 17 |
| abstract_inverted_index.systems, | 36 |
| abstract_inverted_index.validate | 179 |
| abstract_inverted_index.Extensive | 177 |
| abstract_inverted_index.collocate | 116 |
| abstract_inverted_index.documents | 51 |
| abstract_inverted_index.improving | 204 |
| abstract_inverted_index.inference | 64, 85 |
| abstract_inverted_index.real-time | 40 |
| abstract_inverted_index.reduction | 218 |
| abstract_inverted_index.requests' | 121, 127 |
| abstract_inverted_index.requests) | 43 |
| abstract_inverted_index.requests, | 118, 221 |
| abstract_inverted_index.requests. | 192 |
| abstract_inverted_index.trade-off | 185 |
| abstract_inverted_index.workloads | 65 |
| abstract_inverted_index.Real-world | 16 |
| abstract_inverted_index.advantages | 224 |
| abstract_inverted_index.algorithm. | 145 |
| abstract_inverted_index.formulates | 130 |
| abstract_inverted_index.generative | 10 |
| abstract_inverted_index.management | 98, 152 |
| abstract_inverted_index.mechanism, | 153 |
| abstract_inverted_index.negligible | 216 |
| abstract_inverted_index.objectives | 209 |
| abstract_inverted_index.processing | 49, 53 |
| abstract_inverted_index.requests), | 60 |
| abstract_inverted_index.scheduling | 96, 137, 167 |
| abstract_inverted_index.simplifies | 94 |
| abstract_inverted_index.throughput | 217 |
| abstract_inverted_index.underlying | 68 |
| abstract_inverted_index.attainments | 211 |
| abstract_inverted_index.best-effort | 57 |
| abstract_inverted_index.experiments | 178 |
| abstract_inverted_index.interactive | 32 |
| abstract_inverted_index.maintaining | 125 |
| abstract_inverted_index.significant | 223 |
| abstract_inverted_index.throughput, | 90 |
| abstract_inverted_index.throughput. | 128 |
| abstract_inverted_index.applications | 33 |
| abstract_inverted_index.concurrently | 27 |
| abstract_inverted_index.fine-grained | 206 |
| abstract_inverted_index.insufficient | 171 |
| abstract_inverted_index.requirements | 123 |
| abstract_inverted_index.restrictions | 168 |
| abstract_inverted_index.utilization. | 104, 176 |
| abstract_inverted_index.back-of-house | 48 |
| abstract_inverted_index.bidirectional | 149 |
| abstract_inverted_index.breakthroughs | 1 |
| abstract_inverted_index.respectively. | 91 |
| abstract_inverted_index.significantly | 194 |
| abstract_inverted_index.priority-based | 144 |
| abstract_inverted_index.State-of-the-art | 70 |
| abstract_inverted_index.latency-critical | 29 |
| abstract_inverted_index.question-answering | 35 |
| abstract_inverted_index.throughput-oriented | 45 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 5 |
| citation_normalized_percentile |
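
The `abstract_inverted_index.*` rows in the table above store the abstract as a map from each token to the word positions where it occurs. A minimal sketch of turning such a map back into running text follows. The function name `rebuild_abstract` is an assumption made for illustration, and the sample uses only the first nine tokens from the rows above.

```python
from typing import Dict, List


def rebuild_abstract(inverted_index: Dict[str, List[int]]) -> str:
    """Rebuild text from a token -> positions map (inverted-index layout)."""
    by_position: Dict[int, str] = {}
    for token, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = token
    # Emit tokens in ascending position order, separated by single spaces.
    return " ".join(by_position[pos] for pos in sorted(by_position))


# Positions 0-8, copied from the payload rows above.
sample = {
    "Recent": [0],
    "breakthroughs": [1],
    "in": [2],
    "large": [3],
    "Language": [4],
    "Models": [5],
    "(LLMs)": [6],
    "have": [7],
    "enabled": [8],
}
print(rebuild_abstract(sample))
# Recent breakthroughs in large Language Models (LLMs) have enabled
```

Punctuation stays attached to its token (for example `(LLMs)`), so joining on spaces is enough to recover the original wording.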