DomURLs_BERT: Pre-trained BERT-based Model for Malicious Domains and URLs Detection and Classification
2024 · Open Access · DOI: https://doi.org/10.48550/arxiv.2409.09143
Detecting and classifying suspicious or malicious domain names and URLs is a fundamental task in cybersecurity. To leverage such indicators of compromise, cybersecurity vendors and practitioners often maintain and update blacklists of known malicious domains and URLs. However, blacklists frequently fail to identify emerging and obfuscated threats. Over the past few decades, there has been significant interest in developing machine learning models that automatically detect malicious domains and URLs, addressing the limitations of blacklist maintenance and updates. In this paper, we introduce DomURLs_BERT, a pre-trained BERT-based encoder adapted for detecting and classifying suspicious/malicious domains and URLs. DomURLs_BERT is pre-trained using the Masked Language Modeling (MLM) objective on a large multilingual corpus of URLs, domain names, and a Domain Generation Algorithm (DGA) dataset. To assess the performance of DomURLs_BERT, we conducted experiments on several binary and multi-class classification tasks involving domain names and URLs, covering phishing, malware, DGA, and DNS tunneling. The evaluation results show that the proposed encoder outperforms state-of-the-art character-based deep learning models and cybersecurity-focused BERT models across multiple tasks and datasets. The pre-training dataset, the pre-trained DomURLs_BERT encoder, and the experiment source code are publicly available.
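
The Masked Language Modeling objective mentioned in the abstract can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the authors' pipeline: the character-level tokenization, the `mask_tokens` helper, and the example domain are all hypothetical, and every selected token is replaced by `[MASK]`, whereas the original BERT recipe replaces only 80% of selected tokens with `[MASK]`, swaps 10% for random tokens, and leaves 10% unchanged.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for a simplified MLM example.

    labels[i] holds the original token at masked positions and None
    elsewhere, so a model would only be scored on the masked positions.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)  # hide the token from the model
            labels.append(tok)         # keep it as the prediction target
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Hypothetical character-level tokenization of an example domain name.
tokens = list("example.com")
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

During pre-training, the encoder would be trained to recover the tokens stored in `labels` from the surrounding unmasked context.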
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2409.09143, https://arxiv.org/pdf/2409.09143
- OA Status: green
- Cited By: 2
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4403666679
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4403666679 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2409.09143 (Digital Object Identifier)
- Title: DomURLs_BERT: Pre-trained BERT-based Model for Malicious Domains and URLs Detection and Classification (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2024 (year of publication)
- Publication date: 2024-09-13 (full publication date if available)
- Authors: Abdelkader El Mahdaouy, Salima Lamsiyah, Meryem Janati Idrissi, Hamza Alami, Zakaria Yartaoui, Ismaïl Berrada (list of authors in order)
- Landing page: https://arxiv.org/abs/2409.09143 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2409.09143 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2409.09143 (direct OA link when available)
- Concepts: Computer science, Artificial intelligence (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 2 (total citation count in OpenAlex)
- Citations by year (recent): 2025: 2 (per-year citation counts, last 5 years)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| id | https://openalex.org/W4403666679 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2409.09143 |
| ids.doi | https://doi.org/10.48550/arxiv.2409.09143 |
| ids.openalex | https://openalex.org/W4403666679 |
| fwci | |
| type | preprint |
| title | DomURLs_BERT: Pre-trained BERT-based Model for Malicious Domains and URLs Detection and Classification |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11644 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9916999936103821 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1710 |
| topics[0].subfield.display_name | Information Systems |
| topics[0].display_name | Spam and Phishing Detection |
| topics[1].id | https://openalex.org/T11241 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9768000245094299 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1711 |
| topics[1].subfield.display_name | Signal Processing |
| topics[1].display_name | Advanced Malware Detection Techniques |
| topics[2].id | https://openalex.org/T10400 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9765999913215637 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1705 |
| topics[2].subfield.display_name | Computer Networks and Communications |
| topics[2].display_name | Network Security and Intrusion Detection |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C41008148 |
| concepts[0].level | 0 |
| concepts[0].score | 0.6598128080368042 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[0].display_name | Computer science |
| concepts[1].id | https://openalex.org/C154945302 |
| concepts[1].level | 1 |
| concepts[1].score | 0.3894503712654114 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[1].display_name | Artificial intelligence |
| keywords[0].id | https://openalex.org/keywords/computer-science |
| keywords[0].score | 0.6598128080368042 |
| keywords[0].display_name | Computer science |
| keywords[1].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[1].score | 0.3894503712654114 |
| keywords[1].display_name | Artificial intelligence |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2409.09143 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2409.09143 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2409.09143 |
| locations[1].id | doi:10.48550/arxiv.2409.09143 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2409.09143 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5009342678 |
| authorships[0].author.orcid | https://orcid.org/0000-0003-4281-2472 |
| authorships[0].author.display_name | Abdelkader El Mahdaouy |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Mahdaouy, Abdelkader El |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5070022854 |
| authorships[1].author.orcid | https://orcid.org/0000-0001-8789-5713 |
| authorships[1].author.display_name | Salima Lamsiyah |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Lamsiyah, Salima |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5022321148 |
| authorships[2].author.orcid | https://orcid.org/0009-0004-8066-5439 |
| authorships[2].author.display_name | Meryem Janati Idrissi |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Idrissi, Meryem Janati |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5028888571 |
| authorships[3].author.orcid | https://orcid.org/0000-0001-6945-6098 |
| authorships[3].author.display_name | Hamza Alami |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Alami, Hamza |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5092532254 |
| authorships[4].author.orcid | |
| authorships[4].author.display_name | Zakaria Yartaoui |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Yartaoui, Zakaria |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5091770877 |
| authorships[5].author.orcid | https://orcid.org/0000-0003-4225-911X |
| authorships[5].author.display_name | Ismaïl Berrada |
| authorships[5].author_position | last |
| authorships[5].raw_author_name | Berrada, Ismail |
| authorships[5].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2409.09143 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | DomURLs_BERT: Pre-trained BERT-based Model for Malicious Domains and URLs Detection and Classification |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11644 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9916999936103821 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1710 |
| primary_topic.subfield.display_name | Information Systems |
| primary_topic.display_name | Spam and Phishing Detection |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2899084033, https://openalex.org/W2748952813, https://openalex.org/W2390279801, https://openalex.org/W4391913857, https://openalex.org/W2358668433, https://openalex.org/W4396701345, https://openalex.org/W2376932109, https://openalex.org/W2001405890, https://openalex.org/W4396696052 |
| cited_by_count | 2 |
| counts_by_year[0].year | 2025 |
| counts_by_year[0].cited_by_count | 2 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2409.09143 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2409.09143 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2409.09143 |
| primary_location.id | pmh:oai:arXiv.org:2409.09143 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2409.09143 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2409.09143 |
| publication_date | 2024-09-13 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 82, 106 |
| abstract_inverted_index.In | 76, 120 |
| abstract_inverted_index.To | 15 |
| abstract_inverted_index.in | 13, 56 |
| abstract_inverted_index.is | 10, 96 |
| abstract_inverted_index.of | 19, 30, 71, 110, 126 |
| abstract_inverted_index.on | 105, 132 |
| abstract_inverted_index.or | 4 |
| abstract_inverted_index.to | 40, 122 |
| abstract_inverted_index.we | 79, 128 |
| abstract_inverted_index.DNS | 149 |
| abstract_inverted_index.The | 151, 174 |
| abstract_inverted_index.and | 1, 8, 23, 27, 34, 43, 66, 74, 89, 93, 114, 135, 142, 148, 165, 172, 181 |
| abstract_inverted_index.are | 186 |
| abstract_inverted_index.few | 49 |
| abstract_inverted_index.for | 87 |
| abstract_inverted_index.has | 52 |
| abstract_inverted_index.the | 47, 69, 99, 124, 156, 177, 182 |
| abstract_inverted_index.BERT | 167 |
| abstract_inverted_index.DGA, | 147 |
| abstract_inverted_index.Over | 46 |
| abstract_inverted_index.URLs | 9 |
| abstract_inverted_index.been | 53 |
| abstract_inverted_index.code | 185 |
| abstract_inverted_index.deep | 162 |
| abstract_inverted_index.fail | 39 |
| abstract_inverted_index.have | 129 |
| abstract_inverted_index.past | 48 |
| abstract_inverted_index.show | 154 |
| abstract_inverted_index.such | 17 |
| abstract_inverted_index.task | 12 |
| abstract_inverted_index.that | 61, 155 |
| abstract_inverted_index.this | 77 |
| abstract_inverted_index.(DGA) | 118 |
| abstract_inverted_index.(MLM) | 103 |
| abstract_inverted_index.URLs, | 67, 111, 143 |
| abstract_inverted_index.URLs. | 35, 94 |
| abstract_inverted_index.known | 31 |
| abstract_inverted_index.large | 107 |
| abstract_inverted_index.names | 7, 141 |
| abstract_inverted_index.often | 25 |
| abstract_inverted_index.order | 121 |
| abstract_inverted_index.tasks | 138, 171 |
| abstract_inverted_index.there | 51 |
| abstract_inverted_index.using | 98 |
| abstract_inverted_index.Domain | 115 |
| abstract_inverted_index.Masked | 100 |
| abstract_inverted_index.across | 169 |
| abstract_inverted_index.assess | 123 |
| abstract_inverted_index.binary | 134 |
| abstract_inverted_index.corpus | 109 |
| abstract_inverted_index.detect | 63 |
| abstract_inverted_index.domain | 6, 112, 140 |
| abstract_inverted_index.models | 60, 164, 168 |
| abstract_inverted_index.names, | 113 |
| abstract_inverted_index.paper, | 78 |
| abstract_inverted_index.source | 184 |
| abstract_inverted_index.update | 28 |
| abstract_inverted_index.adapted | 86 |
| abstract_inverted_index.domains | 33, 65, 92 |
| abstract_inverted_index.encoder | 85, 158 |
| abstract_inverted_index.machine | 58 |
| abstract_inverted_index.results | 153 |
| abstract_inverted_index.several | 133 |
| abstract_inverted_index.vendors | 22 |
| abstract_inverted_index.However, | 36 |
| abstract_inverted_index.Language | 101 |
| abstract_inverted_index.Modeling | 102 |
| abstract_inverted_index.covering | 144 |
| abstract_inverted_index.dataset, | 176 |
| abstract_inverted_index.dataset. | 119 |
| abstract_inverted_index.decades, | 50 |
| abstract_inverted_index.emerging | 42 |
| abstract_inverted_index.encoder, | 180 |
| abstract_inverted_index.identify | 41 |
| abstract_inverted_index.interest | 55 |
| abstract_inverted_index.learning | 59, 163 |
| abstract_inverted_index.leverage | 16 |
| abstract_inverted_index.maintain | 26 |
| abstract_inverted_index.malware, | 146 |
| abstract_inverted_index.multiple | 170 |
| abstract_inverted_index.proposed | 157 |
| abstract_inverted_index.publicly | 187 |
| abstract_inverted_index.threats. | 45 |
| abstract_inverted_index.updates. | 75 |
| abstract_inverted_index.Detecting | 0 |
| abstract_inverted_index.conducted | 130 |
| abstract_inverted_index.datasets. | 173 |
| abstract_inverted_index.detecting | 88 |
| abstract_inverted_index.introduce | 80 |
| abstract_inverted_index.involving | 139 |
| abstract_inverted_index.malicious | 5, 32, 64 |
| abstract_inverted_index.objective | 104 |
| abstract_inverted_index.phishing, | 145 |
| abstract_inverted_index.Algorithms | 117 |
| abstract_inverted_index.BERT-based | 84 |
| abstract_inverted_index.Generation | 116 |
| abstract_inverted_index.addressing | 68 |
| abstract_inverted_index.available. | 188 |
| abstract_inverted_index.blacklists | 29, 37, 72 |
| abstract_inverted_index.developing | 57 |
| abstract_inverted_index.frequently | 38 |
| abstract_inverted_index.indicators | 18 |
| abstract_inverted_index.obfuscated | 44 |
| abstract_inverted_index.suspicious | 3 |
| abstract_inverted_index.tunneling. | 150 |
| abstract_inverted_index.classifying | 2, 90 |
| abstract_inverted_index.compromise, | 20 |
| abstract_inverted_index.evaluations | 152 |
| abstract_inverted_index.experiments | 131, 183 |
| abstract_inverted_index.fundamental | 11 |
| abstract_inverted_index.limitations | 70 |
| abstract_inverted_index.maintenance | 73 |
| abstract_inverted_index.multi-class | 136 |
| abstract_inverted_index.outperforms | 159 |
| abstract_inverted_index.performance | 125 |
| abstract_inverted_index.pre-trained | 83, 97, 178 |
| abstract_inverted_index.significant | 54 |
| abstract_inverted_index.DomURLs_BERT | 95, 179 |
| abstract_inverted_index.multilingual | 108 |
| abstract_inverted_index.pre-training | 175 |
| abstract_inverted_index.DomURLs_BERT, | 81, 127 |
| abstract_inverted_index.automatically | 62 |
| abstract_inverted_index.cybersecurity | 21 |
| abstract_inverted_index.practitioners | 24 |
| abstract_inverted_index.classification | 137 |
| abstract_inverted_index.cybersecurity. | 14 |
| abstract_inverted_index.character-based | 161 |
| abstract_inverted_index.state-of-the-art | 160 |
| abstract_inverted_index.suspicious/malicious | 91 |
| abstract_inverted_index.cybersecurity-focused | 166 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 6 |
| citation_normalized_percentile |
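
The `abstract_inverted_index` rows above map each word to the positions where it occurs in the abstract. The following sketch shows how such an index can be turned back into running text. It assumes the index is available as a dictionary from word to a list of integer positions (the table shows the same positions as comma-separated strings); the `rebuild_abstract` helper is a hypothetical name, and the `sample` dictionary copies only the entries for positions 0 to 14 from the table.

```python
def rebuild_abstract(inverted_index):
    """Rebuild text from a word -> list-of-positions mapping."""
    positions = {}
    for word, indexes in inverted_index.items():
        for i in indexes:
            positions[i] = word  # each position holds exactly one word
    # Join the words in position order to recover the original sequence.
    return " ".join(positions[i] for i in sorted(positions))


# Subset of the index above covering the first sentence (positions 0-14).
sample = {
    "Detecting": [0],
    "and": [1, 8],
    "classifying": [2],
    "suspicious": [3],
    "or": [4],
    "malicious": [5],
    "domain": [6],
    "names": [7],
    "URLs": [9],
    "is": [10],
    "fundamental": [11],
    "task": [12],
    "in": [13],
    "cybersecurity.": [14],
}

print(rebuild_abstract(sample))
```

The output reproduces the wording exactly as indexed, including punctuation attached to words, since the index stores whitespace-separated tokens rather than normalized terms.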