DeepCompile: A Compiler-Driven Approach to Optimizing Distributed Deep Learning Training
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2504.09983
The increasing scale of deep learning models has led to the development of various parallelization strategies for distributed training across accelerators. For example, fully sharded approaches like DeepSpeed ZeRO-3 and FSDP partition the parameters of each layer across multiple GPUs and gather them through communication when needed. These methods rely on optimizations such as prefetching, which initiates communication early to overlap it with computation and reduce communication overhead, and unsharding, which retains as many parameters in their unsharded form as possible to reduce communication volume. Although the timing of prefetching should be adjusted in response to dynamic memory usage during execution, these systems lack the flexibility to control it, which limits the benefits of prefetching. Moreover, they cannot anticipate how memory usage will change after prefetching is applied, making it difficult to combine it effectively with other optimizations such as unsharding. We present DeepCompile, which compiles user-defined models into computation graphs and applies a sequence of profiling-guided optimization passes for distributed training. Taking dynamic memory usage into account, these passes flexibly insert, reorder, or remove operations to improve communication-computation overlap, reduce memory pressure, and coordinate multiple optimizations in a unified manner. To evaluate the effectiveness of this design, we implemented a fully sharded approach like ZeRO-3 and FSDP on top of DeepCompile, along with three optimizations: proactive prefetching, selective unsharding, and adaptive offloading. We evaluate DeepCompile on the training of Llama 3 70B and Mixtral 8x7B MoE models. DeepCompile achieves up to 1.28x and 1.54x performance improvements over ZeRO-3 and FSDP baselines, respectively, and up to a 7.01x throughput increase with limited GPU resources, using offloading.
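The tension the abstract describes, issuing gathers early for overlap while staying inside a memory limit that changes during execution, can be illustrated with a toy scheduler. This is a minimal sketch under stated assumptions and not DeepCompile's actual pass: the `Layer` record, the `schedule_prefetch` function, the greedy policy, and the numbers are all hypothetical. It assumes one step per layer, a profiled memory figure per step, and that a gathered parameter buffer stays live from the step where its gather is issued until its layer has run.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    param_bytes: int     # size of the layer's parameters once gathered (unsharded)
    profiled_bytes: int  # profiled memory in use while this layer runs, excluding gathered parameters


def schedule_prefetch(layers, budget):
    """Greedily pick, for each layer, the earliest step at which its gather fits in memory.

    Hypothetical model: step t runs layers[t]; a gather issued at step s for
    layer i occupies param_bytes during every step from s through i.
    """
    reserved = [0] * len(layers)  # gathered-parameter memory already committed per step
    issue_step = {}
    for i, layer in enumerate(layers):

        def fits(t):
            return layers[t].profiled_bytes + reserved[t] + layer.param_bytes <= budget

        if not fits(i):
            raise ValueError(f"{layer.name} does not fit even without prefetching")
        s = i
        # Move the gather earlier while every intermediate step stays within budget.
        while s > 0 and fits(s - 1):
            s -= 1
        for t in range(s, i + 1):
            reserved[t] += layer.param_bytes
        issue_step[layer.name] = s
    return issue_step


# Arbitrary units; step 1 has a memory spike that blocks earlier prefetching of l2 and l3.
layers = [Layer("l0", 4, 2), Layer("l1", 4, 6), Layer("l2", 4, 2), Layer("l3", 4, 2)]
plan = schedule_prefetch(layers, budget=12)
print(plan)  # {'l0': 0, 'l1': 0, 'l2': 2, 'l3': 2}
```

In this toy run, `l1` is gathered one step early, while the spike at step 1 prevents the gathers for `l2` and `l3` from moving ahead of step 2. A scheduler that ignored the profile would either prefetch too little or exceed the budget at that step.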
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2504.09983, https://arxiv.org/pdf/2504.09983
- OA Status: green
- OpenAlex ID: https://openalex.org/W4415158855
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4415158855 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2504.09983 (Digital Object Identifier)
- Title: DeepCompile: A Compiler-Driven Approach to Optimizing Distributed Deep Learning Training (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025 (year of publication)
- Publication date: 2025-04-14 (full publication date if available)
- Authors: Masahiro Tanaka, Du Li, Umesh Chand, Ali Zafar, Haiying Shen, Olatunji Ruwase (list of authors in order)
- Landing page: https://arxiv.org/abs/2504.09983 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2504.09983 (direct link to full-text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2504.09983 (direct OA link when available)
- Cited by: 0 (total citation count in OpenAlex)
Full payload
| id | https://openalex.org/W4415158855 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2504.09983 |
| ids.doi | https://doi.org/10.48550/arxiv.2504.09983 |
| ids.openalex | https://openalex.org/W4415158855 |
| fwci | |
| type | preprint |
| title | DeepCompile: A Compiler-Driven Approach to Optimizing Distributed Deep Learning Training |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11986 |
| topics[0].field.id | https://openalex.org/fields/18 |
| topics[0].field.display_name | Decision Sciences |
| topics[0].score | 0.7253999710083008 |
| topics[0].domain.id | https://openalex.org/domains/2 |
| topics[0].domain.display_name | Social Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1802 |
| topics[0].subfield.display_name | Information Systems and Management |
| topics[0].display_name | Scientific Computing and Data Management |
| topics[1].id | https://openalex.org/T10054 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.6784999966621399 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1708 |
| topics[1].subfield.display_name | Hardware and Architecture |
| topics[1].display_name | Parallel Computing and Optimization Techniques |
| topics[2].id | https://openalex.org/T13382 |
| topics[2].field.id | https://openalex.org/fields/22 |
| topics[2].field.display_name | Engineering |
| topics[2].score | 0.6499999761581421 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/2207 |
| topics[2].subfield.display_name | Control and Systems Engineering |
| topics[2].display_name | Robotics and Automated Systems |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2504.09983 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2504.09983 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2504.09983 |
| locations[1].id | doi:10.48550/arxiv.2504.09983 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2504.09983 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5103228553 |
| authorships[0].author.orcid | https://orcid.org/0009-0001-3972-8133 |
| authorships[0].author.display_name | Masahiro Tanaka |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Tanaka, Masahiro |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5100695977 |
| authorships[1].author.orcid | https://orcid.org/0000-0001-9116-7487 |
| authorships[1].author.display_name | Canbing Li |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Li, Du |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5057201706 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-9302-5965 |
| authorships[2].author.display_name | Umesh Chand |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Chand, Umesh |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5022377161 |
| authorships[3].author.orcid | https://orcid.org/0000-0002-6404-645X |
| authorships[3].author.display_name | Zafar Ali |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Zafar, Ali |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5050569064 |
| authorships[4].author.orcid | https://orcid.org/0000-0002-7548-6223 |
| authorships[4].author.display_name | Haiying Shen |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Shen, Haiying |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5022644245 |
| authorships[5].author.orcid | https://orcid.org/0000-0002-5508-0728 |
| authorships[5].author.display_name | Olatunji Ruwase |
| authorships[5].author_position | last |
| authorships[5].raw_author_name | Ruwase, Olatunji |
| authorships[5].is_corresponding | False |
| has_content.pdf | True |
| has_content.grobid_xml | True |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2504.09983 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-14T00:00:00 |
| display_name | DeepCompile: A Compiler-Driven Approach to Optimizing Distributed Deep Learning Training |
| has_fulltext | True |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11986 |
| primary_topic.field.id | https://openalex.org/fields/18 |
| primary_topic.field.display_name | Decision Sciences |
| primary_topic.score | 0.7253999710083008 |
| primary_topic.domain.id | https://openalex.org/domains/2 |
| primary_topic.domain.display_name | Social Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1802 |
| primary_topic.subfield.display_name | Information Systems and Management |
| primary_topic.display_name | Scientific Computing and Data Management |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2504.09983 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2504.09983 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2504.09983 |
| primary_location.id | pmh:oai:arXiv.org:2504.09983 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2504.09983 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2504.09983 |
| publication_date | 2025-04-14 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.3 | 231 |
| abstract_inverted_index.a | 153, 188, 200, 256 |
| abstract_inverted_index.To | 191 |
| abstract_inverted_index.We | 141, 223 |
| abstract_inverted_index.as | 53, 72, 79, 139 |
| abstract_inverted_index.be | 91 |
| abstract_inverted_index.in | 75, 93, 187 |
| abstract_inverted_index.is | 126 |
| abstract_inverted_index.it | 61, 129, 133 |
| abstract_inverted_index.of | 3, 12, 34, 88, 113, 155, 195, 210, 229 |
| abstract_inverted_index.on | 50, 208, 226 |
| abstract_inverted_index.or | 173 |
| abstract_inverted_index.to | 9, 59, 81, 95, 106, 131, 176, 241, 255 |
| abstract_inverted_index.up | 240, 254 |
| abstract_inverted_index.we | 198 |
| abstract_inverted_index.70B | 232 |
| abstract_inverted_index.For | 21 |
| abstract_inverted_index.GPU | 262 |
| abstract_inverted_index.MoE | 236 |
| abstract_inverted_index.The | 0 |
| abstract_inverted_index.and | 29, 40, 64, 68, 151, 183, 206, 220, 233, 243, 249, 253 |
| abstract_inverted_index.for | 16, 159 |
| abstract_inverted_index.has | 7 |
| abstract_inverted_index.how | 119 |
| abstract_inverted_index.it, | 108 |
| abstract_inverted_index.led | 8 |
| abstract_inverted_index.the | 10, 32, 86, 104, 111, 193, 227 |
| abstract_inverted_index.top | 209 |
| abstract_inverted_index.8x7B | 235 |
| abstract_inverted_index.FSDP | 30, 207, 250 |
| abstract_inverted_index.GPUs | 39 |
| abstract_inverted_index.deep | 4 |
| abstract_inverted_index.each | 35 |
| abstract_inverted_index.form | 78 |
| abstract_inverted_index.into | 148, 166 |
| abstract_inverted_index.lack | 103 |
| abstract_inverted_index.like | 26, 204 |
| abstract_inverted_index.many | 73 |
| abstract_inverted_index.over | 247 |
| abstract_inverted_index.rely | 49 |
| abstract_inverted_index.such | 52, 138 |
| abstract_inverted_index.them | 42 |
| abstract_inverted_index.they | 116 |
| abstract_inverted_index.this | 196 |
| abstract_inverted_index.when | 45 |
| abstract_inverted_index.will | 122 |
| abstract_inverted_index.with | 62, 135, 213, 260 |
| abstract_inverted_index.1.28x | 242 |
| abstract_inverted_index.1.54x | 244 |
| abstract_inverted_index.7.01x | 257 |
| abstract_inverted_index.Llama | 230 |
| abstract_inverted_index.These | 47 |
| abstract_inverted_index.after | 124 |
| abstract_inverted_index.along | 212 |
| abstract_inverted_index.early | 58 |
| abstract_inverted_index.fully | 23, 201 |
| abstract_inverted_index.layer | 36 |
| abstract_inverted_index.other | 136 |
| abstract_inverted_index.scale | 2 |
| abstract_inverted_index.their | 76 |
| abstract_inverted_index.these | 101, 168 |
| abstract_inverted_index.three | 214 |
| abstract_inverted_index.usage | 98, 121, 165 |
| abstract_inverted_index.using | 264 |
| abstract_inverted_index.which | 55, 70, 109, 144 |
| abstract_inverted_index.Taking | 162 |
| abstract_inverted_index.ZeRO-3 | 28, 205, 248 |
| abstract_inverted_index.across | 19, 37 |
| abstract_inverted_index.cannot | 117 |
| abstract_inverted_index.change | 123 |
| abstract_inverted_index.during | 99 |
| abstract_inverted_index.gather | 41 |
| abstract_inverted_index.graphs | 150 |
| abstract_inverted_index.limits | 110 |
| abstract_inverted_index.making | 128 |
| abstract_inverted_index.memory | 97, 120, 164, 181 |
| abstract_inverted_index.models | 6, 147 |
| abstract_inverted_index.passes | 158, 169 |
| abstract_inverted_index.reduce | 65, 82, 180 |
| abstract_inverted_index.remove | 174 |
| abstract_inverted_index.should | 90 |
| abstract_inverted_index.timing | 87 |
| abstract_inverted_index.Mixtral | 234 |
| abstract_inverted_index.applies | 152 |
| abstract_inverted_index.combine | 132 |
| abstract_inverted_index.control | 107 |
| abstract_inverted_index.design, | 197 |
| abstract_inverted_index.dynamic | 96, 163 |
| abstract_inverted_index.improve | 177 |
| abstract_inverted_index.insert, | 171 |
| abstract_inverted_index.limited | 261 |
| abstract_inverted_index.manner. | 190 |
| abstract_inverted_index.methods | 48 |
| abstract_inverted_index.models. | 237 |
| abstract_inverted_index.needed. | 46 |
| abstract_inverted_index.overlap | 60 |
| abstract_inverted_index.present | 142 |
| abstract_inverted_index.retains | 71 |
| abstract_inverted_index.sharded | 24, 202 |
| abstract_inverted_index.systems | 102 |
| abstract_inverted_index.through | 43 |
| abstract_inverted_index.unified | 189 |
| abstract_inverted_index.various | 13 |
| abstract_inverted_index.volume. | 84 |
| abstract_inverted_index.Although | 85 |
| abstract_inverted_index.account, | 167 |
| abstract_inverted_index.achieves | 239 |
| abstract_inverted_index.adaptive | 221 |
| abstract_inverted_index.adjusted | 92 |
| abstract_inverted_index.applied, | 127 |
| abstract_inverted_index.approach | 203 |
| abstract_inverted_index.benefits | 112 |
| abstract_inverted_index.compiles | 145 |
| abstract_inverted_index.evaluate | 192, 224 |
| abstract_inverted_index.example, | 22 |
| abstract_inverted_index.flexibly | 170 |
| abstract_inverted_index.increase | 259 |
| abstract_inverted_index.learning | 5 |
| abstract_inverted_index.multiple | 38, 185 |
| abstract_inverted_index.overlap, | 179 |
| abstract_inverted_index.possible | 80 |
| abstract_inverted_index.reorder, | 172 |
| abstract_inverted_index.response | 94 |
| abstract_inverted_index.sequence | 154 |
| abstract_inverted_index.training | 18, 228 |
| abstract_inverted_index.DeepSpeed | 27 |
| abstract_inverted_index.Moreover, | 115 |
| abstract_inverted_index.difficult | 130 |
| abstract_inverted_index.initiates | 56 |
| abstract_inverted_index.overhead, | 67 |
| abstract_inverted_index.partition | 31 |
| abstract_inverted_index.pressure, | 182 |
| abstract_inverted_index.proactive | 216 |
| abstract_inverted_index.selective | 218 |
| abstract_inverted_index.training. | 161 |
| abstract_inverted_index.unsharded | 77 |
| abstract_inverted_index.anticipate | 118 |
| abstract_inverted_index.approaches | 25 |
| abstract_inverted_index.baselines, | 251 |
| abstract_inverted_index.coordinate | 184 |
| abstract_inverted_index.execution, | 100 |
| abstract_inverted_index.increasing | 1 |
| abstract_inverted_index.operations | 175 |
| abstract_inverted_index.parameters | 33, 74 |
| abstract_inverted_index.resources, | 263 |
| abstract_inverted_index.strategies | 15 |
| abstract_inverted_index.throughput | 258 |
| abstract_inverted_index.DeepCompile | 225, 238 |
| abstract_inverted_index.computation | 63, 149 |
| abstract_inverted_index.development | 11 |
| abstract_inverted_index.distributed | 17, 160 |
| abstract_inverted_index.effectively | 134 |
| abstract_inverted_index.flexibility | 105 |
| abstract_inverted_index.implemented | 199 |
| abstract_inverted_index.offloading. | 222, 265 |
| abstract_inverted_index.performance | 245 |
| abstract_inverted_index.prefetching | 89, 125 |
| abstract_inverted_index.unsharding, | 69, 219 |
| abstract_inverted_index.unsharding. | 140 |
| abstract_inverted_index.DeepCompile, | 143, 211 |
| abstract_inverted_index.improvements | 246 |
| abstract_inverted_index.optimization | 157 |
| abstract_inverted_index.prefetching, | 54, 217 |
| abstract_inverted_index.prefetching. | 114 |
| abstract_inverted_index.user-defined | 146 |
| abstract_inverted_index.accelerators. | 20 |
| abstract_inverted_index.communication | 44, 57, 66, 83 |
| abstract_inverted_index.effectiveness | 194 |
| abstract_inverted_index.optimizations | 51, 137, 186 |
| abstract_inverted_index.respectively, | 252 |
| abstract_inverted_index.optimizations: | 215 |
| abstract_inverted_index.parallelization | 14 |
| abstract_inverted_index.profiling-guided | 156 |
| abstract_inverted_index.communication-computation | 178 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 6 |
| citation_normalized_percentile |
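
The `abstract_inverted_index.*` rows above store the abstract as a mapping from each token to the word positions where it occurs. A minimal sketch of turning such rows back into running text follows; `parse_row` and `rebuild_abstract` are hypothetical helper names, and the row layout (`| key | comma-separated positions |`) is assumed from the table as shown here. Only a handful of rows are copied in, so the demo keeps just the opening words.

```python
PREFIX = "abstract_inverted_index."


def parse_row(row):
    """Split one table row into (token, [positions])."""
    _, key, value, _ = (cell.strip() for cell in row.split("|"))
    if not key.startswith(PREFIX):
        raise ValueError(f"not an inverted-index row: {row!r}")
    return key[len(PREFIX):], [int(p) for p in value.split(",")]


def rebuild_abstract(index):
    """Place every token at each of its positions and join them in order."""
    by_position = {pos: token for token, positions in index.items() for pos in positions}
    return " ".join(by_position[pos] for pos in sorted(by_position))


# A few rows copied from the table above.
rows = [
    "| abstract_inverted_index.of | 3, 12, 34, 88, 113, 155, 195, 210, 229 |",
    "| abstract_inverted_index.The | 0 |",
    "| abstract_inverted_index.deep | 4 |",
    "| abstract_inverted_index.scale | 2 |",
    "| abstract_inverted_index.models | 6, 147 |",
    "| abstract_inverted_index.learning | 5 |",
    "| abstract_inverted_index.increasing | 1 |",
]
index = dict(parse_row(r) for r in rows)

# The index here is partial, so restrict it to the first seven word positions.
opening = {token: [p for p in positions if p < 7] for token, positions in index.items()}
print(rebuild_abstract(opening))  # The increasing scale of deep learning models
```

Applied to the complete set of rows, the same function would reproduce the full abstract, with punctuation attached to tokens exactly as stored in the index.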