TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2504.02107
Large Language Models (LLMs) trained on historical web data inevitably become outdated. We investigate evaluation strategies and update methods for LLMs as new data becomes available. We introduce a web-scale dataset for time-continual pretraining of LLMs derived from 114 dumps of Common Crawl (CC) - orders of magnitude larger than previous continual language modeling benchmarks. We also design time-stratified evaluations across both general CC data and specific domains (Wikipedia, StackExchange, and code documentation) to assess how well various continual learning methods adapt to new data while retaining past knowledge. Our findings demonstrate that, on general CC data, autoregressive meta-schedules combined with a fixed-ratio replay of older data can achieve comparable held-out loss to re-training from scratch, while requiring significantly less computation (2.6x). However, the optimal balance between incorporating new data and replaying old data differs as replay is crucial to avoid forgetting on generic web data but less so on specific domains.
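The abstract mentions a fixed-ratio replay of older data but gives no details on this page. The following is a minimal sketch of one way such a split could be computed, not the authors' implementation. The function name `replay_budget`, the `replay_ratio` parameter, and the even split of the replay budget across older dumps are all assumptions made for illustration.

```python
def replay_budget(total_tokens: int, num_old_dumps: int, replay_ratio: float) -> dict:
    """Split a token budget between the newest dump and older dumps.

    Assumption (hypothetical, not from the paper): a fraction `replay_ratio`
    of the budget is replayed from older dumps and divided evenly among them;
    the remainder is spent on the newest dump.
    """
    if not 0.0 <= replay_ratio <= 1.0:
        raise ValueError("replay_ratio must be in [0, 1]")
    if num_old_dumps < 0:
        raise ValueError("num_old_dumps must be non-negative")

    # With no older data available, the whole budget goes to the newest dump.
    if num_old_dumps == 0:
        return {"new": total_tokens, "old_per_dump": []}

    old_total = int(total_tokens * replay_ratio)
    new_tokens = total_tokens - old_total

    # Distribute the replay budget evenly; the first `rem` dumps get one extra token.
    base, rem = divmod(old_total, num_old_dumps)
    old_per_dump = [base + (1 if i < rem else 0) for i in range(num_old_dumps)]
    return {"new": new_tokens, "old_per_dump": old_per_dump}


if __name__ == "__main__":
    print(replay_budget(1000, 3, 0.5))
```

A ratio of 0 corresponds to training only on new data, and a ratio of 1 to training only on replayed data. The abstract suggests the best setting differs between generic web data and specific domains.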
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2504.02107, https://arxiv.org/pdf/2504.02107
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4409628940
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4409628940 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2504.02107 (Digital Object Identifier)
- Title: TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025 (year of publication)
- Publication date: 2025-04-02 (full publication date if available)
- Authors: Jeffrey Li, Mohammadreza Armandpour, Iman Mirzadeh, Sachin Mehta, Vaishaal Shankar, Raviteja Vemulapalli, Samy Bengio, Oncel Tuzel, Mehrdad Farajtabar, Hadi Pouransari, Fartash Faghri (list of authors in order)
- Landing page: https://arxiv.org/abs/2504.02107 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2504.02107 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2504.02107 (direct OA link when available)
- Concepts: Benchmark (surveying), Scale (ratio), Computer science, Artificial intelligence, Geography, Cartography (top concepts (fields/topics) attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| id | https://openalex.org/W4409628940 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2504.02107 |
| ids.doi | https://doi.org/10.48550/arxiv.2504.02107 |
| ids.openalex | https://openalex.org/W4409628940 |
| fwci | |
| type | preprint |
| title | TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11201 |
| topics[0].field.id | https://openalex.org/fields/22 |
| topics[0].field.display_name | Engineering |
| topics[0].score | 0.9301999807357788 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/2211 |
| topics[0].subfield.display_name | Mechanics of Materials |
| topics[0].display_name | Metallurgy and Material Forming |
| topics[1].id | https://openalex.org/T11301 |
| topics[1].field.id | https://openalex.org/fields/22 |
| topics[1].field.display_name | Engineering |
| topics[1].score | 0.9086999893188477 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/2204 |
| topics[1].subfield.display_name | Biomedical Engineering |
| topics[1].display_name | Advanced Surface Polishing Techniques |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C185798385 |
| concepts[0].level | 2 |
| concepts[0].score | 0.7529885768890381 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q1161707 |
| concepts[0].display_name | Benchmark (surveying) |
| concepts[1].id | https://openalex.org/C2778755073 |
| concepts[1].level | 2 |
| concepts[1].score | 0.5902706980705261 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q10858537 |
| concepts[1].display_name | Scale (ratio) |
| concepts[2].id | https://openalex.org/C41008148 |
| concepts[2].level | 0 |
| concepts[2].score | 0.5395940542221069 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[2].display_name | Computer science |
| concepts[3].id | https://openalex.org/C154945302 |
| concepts[3].level | 1 |
| concepts[3].score | 0.3306910991668701 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[3].display_name | Artificial intelligence |
| concepts[4].id | https://openalex.org/C205649164 |
| concepts[4].level | 0 |
| concepts[4].score | 0.11458861827850342 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q1071 |
| concepts[4].display_name | Geography |
| concepts[5].id | https://openalex.org/C58640448 |
| concepts[5].level | 1 |
| concepts[5].score | 0.10450336337089539 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q42515 |
| concepts[5].display_name | Cartography |
| keywords[0].id | https://openalex.org/keywords/benchmark |
| keywords[0].score | 0.7529885768890381 |
| keywords[0].display_name | Benchmark (surveying) |
| keywords[1].id | https://openalex.org/keywords/scale |
| keywords[1].score | 0.5902706980705261 |
| keywords[1].display_name | Scale (ratio) |
| keywords[2].id | https://openalex.org/keywords/computer-science |
| keywords[2].score | 0.5395940542221069 |
| keywords[2].display_name | Computer science |
| keywords[3].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[3].score | 0.3306910991668701 |
| keywords[3].display_name | Artificial intelligence |
| keywords[4].id | https://openalex.org/keywords/geography |
| keywords[4].score | 0.11458861827850342 |
| keywords[4].display_name | Geography |
| keywords[5].id | https://openalex.org/keywords/cartography |
| keywords[5].score | 0.10450336337089539 |
| keywords[5].display_name | Cartography |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2504.02107 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2504.02107 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2504.02107 |
| locations[1].id | doi:10.48550/arxiv.2504.02107 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2504.02107 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5111296494 |
| authorships[0].author.orcid | https://orcid.org/0009-0002-4393-5773 |
| authorships[0].author.display_name | Jeffrey Li |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Li, Jeffrey |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5053849494 |
| authorships[1].author.orcid | |
| authorships[1].author.display_name | Mohammadreza Armandpour |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Armandpour, Mohammadreza |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5079412282 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | Iman Mirzadeh |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Mirzadeh, Iman |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5074132108 |
| authorships[3].author.orcid | https://orcid.org/0000-0002-5420-4725 |
| authorships[3].author.display_name | Sachin Mehta |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Mehta, Sachin |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5112327867 |
| authorships[4].author.orcid | |
| authorships[4].author.display_name | Vaishaal Shankar |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Shankar, Vaishaal |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5071825172 |
| authorships[5].author.orcid | https://orcid.org/0000-0003-0425-7797 |
| authorships[5].author.display_name | Raviteja Vemulapalli |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Vemulapalli, Raviteja |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5017529415 |
| authorships[6].author.orcid | |
| authorships[6].author.display_name | Samy Bengio |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Bengio, Samy |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5028613002 |
| authorships[7].author.orcid | |
| authorships[7].author.display_name | Oncel Tuzel |
| authorships[7].author_position | middle |
| authorships[7].raw_author_name | Tuzel, Oncel |
| authorships[7].is_corresponding | False |
| authorships[8].author.id | https://openalex.org/A5050499655 |
| authorships[8].author.orcid | https://orcid.org/0000-0002-5510-518X |
| authorships[8].author.display_name | Mehrdad Farajtabar |
| authorships[8].author_position | middle |
| authorships[8].raw_author_name | Farajtabar, Mehrdad |
| authorships[8].is_corresponding | False |
| authorships[9].author.id | https://openalex.org/A5059295598 |
| authorships[9].author.orcid | |
| authorships[9].author.display_name | Hadi Pouransari |
| authorships[9].author_position | middle |
| authorships[9].raw_author_name | Pouransari, Hadi |
| authorships[9].is_corresponding | False |
| authorships[10].author.id | https://openalex.org/A5036601505 |
| authorships[10].author.orcid | https://orcid.org/0000-0001-5975-5158 |
| authorships[10].author.display_name | Fartash Faghri |
| authorships[10].author_position | last |
| authorships[10].raw_author_name | Faghri, Fartash |
| authorships[10].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2504.02107 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11201 |
| primary_topic.field.id | https://openalex.org/fields/22 |
| primary_topic.field.display_name | Engineering |
| primary_topic.score | 0.9301999807357788 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/2211 |
| primary_topic.subfield.display_name | Mechanics of Materials |
| primary_topic.display_name | Metallurgy and Material Forming |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2899084033, https://openalex.org/W2748952813, https://openalex.org/W2378211422, https://openalex.org/W4321353415, https://openalex.org/W2745001401, https://openalex.org/W2130974462, https://openalex.org/W2028665553, https://openalex.org/W2086519370, https://openalex.org/W4246352526 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2504.02107 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2504.02107 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2504.02107 |
| primary_location.id | pmh:oai:arXiv.org:2504.02107 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2504.02107 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2504.02107 |
| publication_date | 2025-04-02 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.- | 44 |
| abstract_inverted_index.a | 28, 101 |
| abstract_inverted_index.CC | 63, 95 |
| abstract_inverted_index.We | 12, 26, 55 |
| abstract_inverted_index.as | 21, 135 |
| abstract_inverted_index.is | 137 |
| abstract_inverted_index.of | 34, 40, 46, 104 |
| abstract_inverted_index.on | 5, 93, 142, 149 |
| abstract_inverted_index.so | 148 |
| abstract_inverted_index.to | 73, 82, 112, 139 |
| abstract_inverted_index.114 | 38 |
| abstract_inverted_index.Our | 89 |
| abstract_inverted_index.and | 16, 65, 70, 130 |
| abstract_inverted_index.but | 146 |
| abstract_inverted_index.can | 107 |
| abstract_inverted_index.for | 19, 31 |
| abstract_inverted_index.how | 75 |
| abstract_inverted_index.new | 22, 83, 128 |
| abstract_inverted_index.old | 132 |
| abstract_inverted_index.the | 123 |
| abstract_inverted_index.web | 7, 144 |
| abstract_inverted_index.(CC) | 43 |
| abstract_inverted_index.LLMs | 20, 35 |
| abstract_inverted_index.also | 56 |
| abstract_inverted_index.both | 61 |
| abstract_inverted_index.code | 71 |
| abstract_inverted_index.data | 8, 23, 64, 84, 106, 129, 133, 145 |
| abstract_inverted_index.from | 37, 114 |
| abstract_inverted_index.less | 119, 147 |
| abstract_inverted_index.loss | 111 |
| abstract_inverted_index.past | 87 |
| abstract_inverted_index.than | 49 |
| abstract_inverted_index.well | 76 |
| abstract_inverted_index.with | 100 |
| abstract_inverted_index.Crawl | 42 |
| abstract_inverted_index.Large | 0 |
| abstract_inverted_index.adapt | 81 |
| abstract_inverted_index.avoid | 140 |
| abstract_inverted_index.data, | 96 |
| abstract_inverted_index.dumps | 39 |
| abstract_inverted_index.older | 105 |
| abstract_inverted_index.that, | 92 |
| abstract_inverted_index.while | 85, 116 |
| abstract_inverted_index.(LLMs) | 3 |
| abstract_inverted_index.Common | 41 |
| abstract_inverted_index.Models | 2 |
| abstract_inverted_index.across | 60 |
| abstract_inverted_index.assess | 74 |
| abstract_inverted_index.become | 10 |
| abstract_inverted_index.design | 57 |
| abstract_inverted_index.larger | 48 |
| abstract_inverted_index.orders | 45 |
| abstract_inverted_index.replay | 103, 136 |
| abstract_inverted_index.update | 17 |
| abstract_inverted_index.(2.6x). | 121 |
| abstract_inverted_index.achieve | 108 |
| abstract_inverted_index.balance | 125 |
| abstract_inverted_index.becomes | 24 |
| abstract_inverted_index.between | 126 |
| abstract_inverted_index.crucial | 138 |
| abstract_inverted_index.dataset | 30 |
| abstract_inverted_index.derived | 36 |
| abstract_inverted_index.differs | 134 |
| abstract_inverted_index.domains | 67 |
| abstract_inverted_index.general | 62, 94 |
| abstract_inverted_index.generic | 143 |
| abstract_inverted_index.methods | 18, 80 |
| abstract_inverted_index.optimal | 124 |
| abstract_inverted_index.trained | 4 |
| abstract_inverted_index.various | 77 |
| abstract_inverted_index.However, | 122 |
| abstract_inverted_index.Language | 1 |
| abstract_inverted_index.combined | 99 |
| abstract_inverted_index.domains. | 151 |
| abstract_inverted_index.findings | 90 |
| abstract_inverted_index.held-out | 110 |
| abstract_inverted_index.language | 52 |
| abstract_inverted_index.learning | 79 |
| abstract_inverted_index.modeling | 53 |
| abstract_inverted_index.previous | 50 |
| abstract_inverted_index.scratch, | 115 |
| abstract_inverted_index.specific | 66, 150 |
| abstract_inverted_index.continual | 51, 78 |
| abstract_inverted_index.introduce | 27 |
| abstract_inverted_index.magnitude | 47 |
| abstract_inverted_index.outdated. | 11 |
| abstract_inverted_index.replaying | 131 |
| abstract_inverted_index.requiring | 117 |
| abstract_inverted_index.retaining | 86 |
| abstract_inverted_index.web-scale | 29 |
| abstract_inverted_index.available. | 25 |
| abstract_inverted_index.comparable | 109 |
| abstract_inverted_index.evaluation | 14 |
| abstract_inverted_index.forgetting | 141 |
| abstract_inverted_index.historical | 6 |
| abstract_inverted_index.inevitably | 9 |
| abstract_inverted_index.knowledge. | 88 |
| abstract_inverted_index.strategies | 15 |
| abstract_inverted_index.(Wikipedia, | 68 |
| abstract_inverted_index.benchmarks. | 54 |
| abstract_inverted_index.computation | 120 |
| abstract_inverted_index.demonstrate | 91 |
| abstract_inverted_index.evaluations | 59 |
| abstract_inverted_index.fixed-ratio | 102 |
| abstract_inverted_index.investigate | 13 |
| abstract_inverted_index.pretraining | 33 |
| abstract_inverted_index.re-training | 113 |
| abstract_inverted_index.incorporating | 127 |
| abstract_inverted_index.significantly | 118 |
| abstract_inverted_index.StackExchange, | 69 |
| abstract_inverted_index.autoregressive | 97 |
| abstract_inverted_index.documentation) | 72 |
| abstract_inverted_index.meta-schedules | 98 |
| abstract_inverted_index.time-continual | 32 |
| abstract_inverted_index.time-stratified | 58 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 11 |
| citation_normalized_percentile | |
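
The `abstract_inverted_index.*` rows in the payload map each word of the abstract to the zero-based positions where it occurs. As a minimal sketch, the plain-text abstract can be rebuilt by inverting that mapping and joining the words in position order. The helper name `rebuild_abstract` is hypothetical, and the sample dictionary is cut down to the first twelve positions of the payload above for brevity.

```python
def rebuild_abstract(inverted_index: dict[str, list[int]]) -> str:
    """Rebuild running text from a word -> positions mapping."""
    # Invert the mapping: position -> word.
    words_by_position: dict[int, str] = {}
    for word, positions in inverted_index.items():
        for pos in positions:
            words_by_position[pos] = word

    # Join the words in ascending position order.
    return " ".join(words_by_position[pos] for pos in sorted(words_by_position))


# Sample restricted to positions 0..11 (the first sentence of the abstract);
# words such as "on", "web", and "data" have further positions in the full payload.
sample_index = {
    "Large": [0],
    "Language": [1],
    "Models": [2],
    "(LLMs)": [3],
    "trained": [4],
    "on": [5],
    "historical": [6],
    "web": [7],
    "data": [8],
    "inevitably": [9],
    "become": [10],
    "outdated.": [11],
}

if __name__ == "__main__":
    print(rebuild_abstract(sample_index))
    # prints: Large Language Models (LLMs) trained on historical web data inevitably become outdated.
```

Punctuation stays attached to its word in the index (for example `outdated.`), so joining on single spaces restores the sentence without extra handling.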