Workload Failure Prediction for Data Centers
2023 · Open Access · DOI: https://doi.org/10.48550/arxiv.2301.05176
Failed workloads that consume significant computational resources, in both time and space, substantially reduce the efficiency of data centers and thus limit the amount of scientific work that can be achieved. While computational power has increased significantly over the years, the detection and prediction of workload failures have lagged far behind and will become increasingly critical as system scale and complexity continue to grow. In this study, we analyze workload traces collected from a production cluster and train machine learning models on a large number of data sets to predict workload failures. Our prediction models consist of a queue-time model that estimates the probability of workload failure before execution and a runtime model that predicts failures during execution. Evaluation results show that the queue-time model and the runtime model can predict workload failures with maximum precision scores of 90.61% and 97.75%, respectively. Integrating the runtime model with the job scheduler helps reduce CPU time and memory usage by up to 16.7% and 14.53%, respectively.
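The abstract reports precision scores and resource savings but does not say how they are computed. The following is a minimal sketch of those two quantities, not the authors' implementation. The `Job` record, its fields, the 0.5 threshold, and the assumption that a flagged job is terminated at the moment of prediction are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Job:
    """Hypothetical job record; field names are assumptions, not from the paper."""
    job_id: str
    failed: bool             # ground truth: did the job ultimately fail?
    fail_probability: float  # assumed model output in [0, 1]
    cpu_hours: float         # CPU time the job consumed over its full run
    elapsed_fraction: float  # fraction of the run completed when the prediction is made


def precision(jobs: Iterable[Job], threshold: float = 0.5) -> float:
    """Share of jobs flagged as failing that really failed: TP / (TP + FP)."""
    flagged = [j for j in jobs if j.fail_probability >= threshold]
    if not flagged:
        return 0.0  # no positive predictions, so precision is reported as 0 here
    true_positives = sum(1 for j in flagged if j.failed)
    return true_positives / len(flagged)


def cpu_hours_saved(jobs: Iterable[Job], threshold: float = 0.5) -> float:
    """CPU hours avoided if every flagged job is stopped when it is flagged.

    Only jobs that would really have failed count as savings; stopping a
    healthy job is a cost, which this sketch deliberately leaves out.
    """
    return sum(
        j.cpu_hours * (1.0 - j.elapsed_fraction)
        for j in jobs
        if j.fail_probability >= threshold and j.failed
    )
```

Precision is the natural headline metric in this setting because a false positive means terminating a healthy job, so a scheduler acting on the predictions needs flagged jobs to be failures with high probability.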
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2301.05176, https://arxiv.org/pdf/2301.05176
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4316135708
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4316135708 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2301.05176 (Digital Object Identifier)
- Title: Workload Failure Prediction for Data Centers (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2023 (year of publication)
- Publication date: 2023-01-12 (full publication date if available)
- Authors: Jie Li, Rui Wang, Ghazanfar Ali, Tommy Dang, A. Sill, Yong Chen (list of authors in order)
- Landing page: https://arxiv.org/abs/2301.05176 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2301.05176 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2301.05176 (direct OA link when available)
- Concepts: Workload, Computer science, Queue, Time limit, Limit (mathematics), Real-time computing, Distributed computing, Reliability engineering, Operating system, Computer network, Engineering, Systems engineering, Mathematics, Mathematical analysis (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4316135708 |
| doi | https://doi.org/10.48550/arxiv.2301.05176 |
| ids.doi | https://doi.org/10.48550/arxiv.2301.05176 |
| ids.openalex | https://openalex.org/W4316135708 |
| fwci | |
| type | preprint |
| title | Workload Failure Prediction for Data Centers |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10101 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9991999864578247 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1710 |
| topics[0].subfield.display_name | Information Systems |
| topics[0].display_name | Cloud Computing and Resource Management |
| topics[1].id | https://openalex.org/T10273 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9865999817848206 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1705 |
| topics[1].subfield.display_name | Computer Networks and Communications |
| topics[1].display_name | IoT and Edge/Fog Computing |
| topics[2].id | https://openalex.org/T12127 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9836000204086304 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1705 |
| topics[2].subfield.display_name | Computer Networks and Communications |
| topics[2].display_name | Software System Performance and Reliability |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C2778476105 |
| concepts[0].level | 2 |
| concepts[0].score | 0.9370556473731995 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q628539 |
| concepts[0].display_name | Workload |
| concepts[1].id | https://openalex.org/C41008148 |
| concepts[1].level | 0 |
| concepts[1].score | 0.7704778909683228 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[1].display_name | Computer science |
| concepts[2].id | https://openalex.org/C160403385 |
| concepts[2].level | 2 |
| concepts[2].score | 0.6558209657669067 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q220543 |
| concepts[2].display_name | Queue |
| concepts[3].id | https://openalex.org/C2781011336 |
| concepts[3].level | 2 |
| concepts[3].score | 0.46967893838882446 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q1465133 |
| concepts[3].display_name | Time limit |
| concepts[4].id | https://openalex.org/C151201525 |
| concepts[4].level | 2 |
| concepts[4].score | 0.4307026267051697 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q177239 |
| concepts[4].display_name | Limit (mathematics) |
| concepts[5].id | https://openalex.org/C79403827 |
| concepts[5].level | 1 |
| concepts[5].score | 0.35210323333740234 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q3988 |
| concepts[5].display_name | Real-time computing |
| concepts[6].id | https://openalex.org/C120314980 |
| concepts[6].level | 1 |
| concepts[6].score | 0.35003966093063354 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q180634 |
| concepts[6].display_name | Distributed computing |
| concepts[7].id | https://openalex.org/C200601418 |
| concepts[7].level | 1 |
| concepts[7].score | 0.3226650059223175 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q2193887 |
| concepts[7].display_name | Reliability engineering |
| concepts[8].id | https://openalex.org/C111919701 |
| concepts[8].level | 1 |
| concepts[8].score | 0.23446914553642273 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q9135 |
| concepts[8].display_name | Operating system |
| concepts[9].id | https://openalex.org/C31258907 |
| concepts[9].level | 1 |
| concepts[9].score | 0.10517892241477966 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q1301371 |
| concepts[9].display_name | Computer network |
| concepts[10].id | https://openalex.org/C127413603 |
| concepts[10].level | 0 |
| concepts[10].score | 0.08074626326560974 |
| concepts[10].wikidata | https://www.wikidata.org/wiki/Q11023 |
| concepts[10].display_name | Engineering |
| concepts[11].id | https://openalex.org/C201995342 |
| concepts[11].level | 1 |
| concepts[11].score | 0.0 |
| concepts[11].wikidata | https://www.wikidata.org/wiki/Q682496 |
| concepts[11].display_name | Systems engineering |
| concepts[12].id | https://openalex.org/C33923547 |
| concepts[12].level | 0 |
| concepts[12].score | 0.0 |
| concepts[12].wikidata | https://www.wikidata.org/wiki/Q395 |
| concepts[12].display_name | Mathematics |
| concepts[13].id | https://openalex.org/C134306372 |
| concepts[13].level | 1 |
| concepts[13].score | 0.0 |
| concepts[13].wikidata | https://www.wikidata.org/wiki/Q7754 |
| concepts[13].display_name | Mathematical analysis |
| keywords[0].id | https://openalex.org/keywords/workload |
| keywords[0].score | 0.9370556473731995 |
| keywords[0].display_name | Workload |
| keywords[1].id | https://openalex.org/keywords/computer-science |
| keywords[1].score | 0.7704778909683228 |
| keywords[1].display_name | Computer science |
| keywords[2].id | https://openalex.org/keywords/queue |
| keywords[2].score | 0.6558209657669067 |
| keywords[2].display_name | Queue |
| keywords[3].id | https://openalex.org/keywords/time-limit |
| keywords[3].score | 0.46967893838882446 |
| keywords[3].display_name | Time limit |
| keywords[4].id | https://openalex.org/keywords/limit |
| keywords[4].score | 0.4307026267051697 |
| keywords[4].display_name | Limit (mathematics) |
| keywords[5].id | https://openalex.org/keywords/real-time-computing |
| keywords[5].score | 0.35210323333740234 |
| keywords[5].display_name | Real-time computing |
| keywords[6].id | https://openalex.org/keywords/distributed-computing |
| keywords[6].score | 0.35003966093063354 |
| keywords[6].display_name | Distributed computing |
| keywords[7].id | https://openalex.org/keywords/reliability-engineering |
| keywords[7].score | 0.3226650059223175 |
| keywords[7].display_name | Reliability engineering |
| keywords[8].id | https://openalex.org/keywords/operating-system |
| keywords[8].score | 0.23446914553642273 |
| keywords[8].display_name | Operating system |
| keywords[9].id | https://openalex.org/keywords/computer-network |
| keywords[9].score | 0.10517892241477966 |
| keywords[9].display_name | Computer network |
| keywords[10].id | https://openalex.org/keywords/engineering |
| keywords[10].score | 0.08074626326560974 |
| keywords[10].display_name | Engineering |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2301.05176 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | cc-by-nc-sa |
| locations[0].pdf_url | https://arxiv.org/pdf/2301.05176 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | |
| locations[0].license_id | https://openalex.org/licenses/cc-by-nc-sa |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2301.05176 |
| locations[1].id | doi:10.48550/arxiv.2301.05176 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2301.05176 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5100428213 |
| authorships[0].author.orcid | https://orcid.org/0000-0002-1094-1563 |
| authorships[0].author.display_name | Jie Li |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Li, Jie |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5100431163 |
| authorships[1].author.orcid | https://orcid.org/0000-0001-9048-2979 |
| authorships[1].author.display_name | Rui Wang |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Wang, Rui |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5112864996 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | Ghazanfar Ali |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Ali, Ghazanfar |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5032280607 |
| authorships[3].author.orcid | https://orcid.org/0000-0001-8322-0014 |
| authorships[3].author.display_name | Tommy Dang |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Dang, Tommy |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5073805319 |
| authorships[4].author.orcid | https://orcid.org/0000-0003-2527-764X |
| authorships[4].author.display_name | A. Sill |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Sill, Alan |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5077521546 |
| authorships[5].author.orcid | https://orcid.org/0000-0002-9961-9051 |
| authorships[5].author.display_name | Yong Chen |
| authorships[5].author_position | last |
| authorships[5].raw_author_name | Chen, Yong |
| authorships[5].is_corresponding | False |
| has_content.pdf | True |
| has_content.grobid_xml | True |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2301.05176 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Workload Failure Prediction for Data Centers |
| has_fulltext | True |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10101 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9991999864578247 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1710 |
| primary_topic.subfield.display_name | Information Systems |
| primary_topic.display_name | Cloud Computing and Resource Management |
| related_works | https://openalex.org/W2087914290, https://openalex.org/W2945274325, https://openalex.org/W1916816090, https://openalex.org/W2375762310, https://openalex.org/W1225023418, https://openalex.org/W2367766247, https://openalex.org/W2363664274, https://openalex.org/W2379345400, https://openalex.org/W2350086347, https://openalex.org/W2361702302 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2301.05176 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | cc-by-nc-sa |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2301.05176 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | |
| best_oa_location.license_id | https://openalex.org/licenses/cc-by-nc-sa |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2301.05176 |
| primary_location.id | pmh:oai:arXiv.org:2301.05176 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | cc-by-nc-sa |
| primary_location.pdf_url | https://arxiv.org/pdf/2301.05176 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | |
| primary_location.license_id | https://openalex.org/licenses/cc-by-nc-sa |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2301.05176 |
| publication_date | 2023-01-12 |
| publication_year | 2023 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 72, 81, 96, 109, 132 |
| abstract_inverted_index.By | 141 |
| abstract_inverted_index.In | 63 |
| abstract_inverted_index.as | 55 |
| abstract_inverted_index.at | 115 |
| abstract_inverted_index.be | 28 |
| abstract_inverted_index.by | 158 |
| abstract_inverted_index.in | 7 |
| abstract_inverted_index.it | 150 |
| abstract_inverted_index.of | 14, 23, 43, 84, 95, 103, 136 |
| abstract_inverted_index.on | 80 |
| abstract_inverted_index.to | 87, 160 |
| abstract_inverted_index.up | 159 |
| abstract_inverted_index.we | 66 |
| abstract_inverted_index.CPU | 153 |
| abstract_inverted_index.Our | 91 |
| abstract_inverted_index.and | 9, 18, 41, 50, 59, 75, 108, 124, 138, 155, 162 |
| abstract_inverted_index.can | 27, 127 |
| abstract_inverted_index.far | 48 |
| abstract_inverted_index.has | 34 |
| abstract_inverted_index.job | 148 |
| abstract_inverted_index.the | 12, 21, 31, 38, 56, 101, 121, 143, 147 |
| abstract_inverted_index.data | 15, 85 |
| abstract_inverted_index.from | 71 |
| abstract_inverted_index.have | 46 |
| abstract_inverted_index.over | 37 |
| abstract_inverted_index.sets | 86 |
| abstract_inverted_index.show | 119 |
| abstract_inverted_index.that | 2, 26, 99, 112, 120 |
| abstract_inverted_index.this | 64 |
| abstract_inverted_index.thus | 19 |
| abstract_inverted_index.time | 8 |
| abstract_inverted_index.will | 51 |
| abstract_inverted_index.with | 131, 146 |
| abstract_inverted_index.work | 25 |
| abstract_inverted_index.16.7% | 161 |
| abstract_inverted_index.While | 30 |
| abstract_inverted_index.helps | 151 |
| abstract_inverted_index.large | 82 |
| abstract_inverted_index.limit | 20 |
| abstract_inverted_index.model | 98, 111, 123, 126, 145 |
| abstract_inverted_index.power | 33 |
| abstract_inverted_index.scale | 58 |
| abstract_inverted_index.score | 135 |
| abstract_inverted_index.space | 10 |
| abstract_inverted_index.time, | 154 |
| abstract_inverted_index.train | 76 |
| abstract_inverted_index.usage | 157 |
| abstract_inverted_index.90.61% | 137 |
| abstract_inverted_index.Failed | 0 |
| abstract_inverted_index.affect | 11 |
| abstract_inverted_index.amount | 22, 83 |
| abstract_inverted_index.become | 52 |
| abstract_inverted_index.before | 106 |
| abstract_inverted_index.behind | 49 |
| abstract_inverted_index.lagged | 47 |
| abstract_inverted_index.memory | 156 |
| abstract_inverted_index.models | 79, 93 |
| abstract_inverted_index.reduce | 152 |
| abstract_inverted_index.study, | 65 |
| abstract_inverted_index.system | 57 |
| abstract_inverted_index.traces | 69 |
| abstract_inverted_index.years, | 39 |
| abstract_inverted_index.14.53%, | 163 |
| abstract_inverted_index.97.75%, | 139 |
| abstract_inverted_index.analyze | 67 |
| abstract_inverted_index.centers | 16 |
| abstract_inverted_index.cluster | 74 |
| abstract_inverted_index.consist | 94 |
| abstract_inverted_index.further | 61 |
| abstract_inverted_index.machine | 77 |
| abstract_inverted_index.maximum | 133 |
| abstract_inverted_index.predict | 88, 128 |
| abstract_inverted_index.results | 118 |
| abstract_inverted_index.runtime | 110, 125, 144 |
| abstract_inverted_index.consumed | 3 |
| abstract_inverted_index.critical | 54 |
| abstract_inverted_index.failures | 45, 105, 114, 130 |
| abstract_inverted_index.learning | 78 |
| abstract_inverted_index.predicts | 113 |
| abstract_inverted_index.runtime. | 116 |
| abstract_inverted_index.workload | 44, 68, 89, 104, 129 |
| abstract_inverted_index.achieved. | 29 |
| abstract_inverted_index.collected | 70 |
| abstract_inverted_index.detection | 40 |
| abstract_inverted_index.estimates | 100 |
| abstract_inverted_index.execution | 107 |
| abstract_inverted_index.failures. | 90 |
| abstract_inverted_index.increase. | 62 |
| abstract_inverted_index.increased | 35 |
| abstract_inverted_index.precision | 134 |
| abstract_inverted_index.resources | 6 |
| abstract_inverted_index.workloads | 1 |
| abstract_inverted_index.Evaluation | 117 |
| abstract_inverted_index.complexity | 60 |
| abstract_inverted_index.efficiency | 13 |
| abstract_inverted_index.prediction | 42, 92 |
| abstract_inverted_index.production | 73 |
| abstract_inverted_index.queue-time | 97, 122 |
| abstract_inverted_index.scheduler, | 149 |
| abstract_inverted_index.scientific | 24 |
| abstract_inverted_index.integrating | 142 |
| abstract_inverted_index.probability | 102 |
| abstract_inverted_index.significant | 4 |
| abstract_inverted_index.increasingly | 53 |
| abstract_inverted_index.computational | 5, 32 |
| abstract_inverted_index.respectively. | 140, 164 |
| abstract_inverted_index.significantly | 17, 36 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 6 |
| sustainable_development_goals[0].id | https://metadata.un.org/sdg/8 |
| sustainable_development_goals[0].score | 0.49000000953674316 |
| sustainable_development_goals[0].display_name | Decent work and economic growth |
| citation_normalized_percentile | |
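The `abstract_inverted_index.*` rows in the table above store the abstract as a map from each word to the positions where it occurs. A minimal sketch of turning such a map back into readable text follows. The function name `rebuild_abstract` is hypothetical, and the sample dictionary is only an excerpt covering the first four positions, not the full index.

```python
from typing import Dict, List


def rebuild_abstract(inverted_index: Dict[str, List[int]]) -> str:
    """Rebuild text from a word -> positions map by sorting words by position."""
    word_at: Dict[int, str] = {}
    for word, positions in inverted_index.items():
        for position in positions:
            word_at[position] = word
    # Join the words in position order; gaps in the positions are simply skipped.
    return " ".join(word_at[i] for i in sorted(word_at))


# Excerpt of the index above, limited to positions 0 through 3.
excerpt = {
    "Failed": [0],
    "workloads": [1],
    "that": [2],
    "consumed": [3],
}

print(rebuild_abstract(excerpt))  # Failed workloads that consumed
```

Punctuation stays attached to its word in the index (for example `time,` and `runtime.`), so joining on single spaces is enough to recover the original sentence layout.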