Croatian word2vec embeddings trained on OpenSubtitles Part 1
· 2025
· Open Access
· DOI: https://doi.org/10.5281/zenodo.17507138
This dataset contains the subs2vec embeddings for Croatian, as presented in https://zenodo.org/records/17243814. The embeddings were trained on large-scale subtitle corpora and represent semantic vector spaces derived from naturalistic language use in films and television, drawn from the OpenSubtitles 2018 datasets: https://opus.nlpl.eu/OpenSubtitles/corpus/version/OpenSubtitles.

For this language, we provide all embedding variants explored in the study. Specifically, the dataset includes vectors generated under different combinations of:
- Dimensionality: multiple vector sizes (e.g., 100, 200, 300, …)
- Window size: varying context windows (e.g., 2, 5, 10, …)

Each file corresponds to a unique configuration (dimension × window size). Each file contains the vocabulary for that language (column 1) and then the embedding values (columns 2 through dimension size + 1).

If you use this dataset, please cite:
- Manuscript: https://doi.org/10.5281/zenodo.17243812
- Data: This Zenodo dataset (using the DOI provided here)
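
The column layout described above (word in column 1, embedding values in the remaining columns) can be read with a few lines of Python. The following is a minimal sketch, not the authors' loading code: the description does not state the delimiter, the file names, or whether a header row is present, so `delimiter` and `has_header` are assumed parameters, and the three sample rows are invented toy values rather than real vectors from the dataset.

```python
import io
import math


def load_embeddings(handle, delimiter=None, has_header=False):
    """Parse rows of `word v1 v2 ... vD` into a {word: [floats]} dict.

    delimiter=None splits on any whitespace; pass "," or "\t" if the
    actual files turn out to use a different separator (assumption).
    """
    vectors = {}
    expected_dim = None
    for line_no, line in enumerate(handle, start=1):
        if has_header and line_no == 1:
            continue  # skip a header row, if the file has one
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        word = fields[0]
        values = [float(x) for x in fields[1:]]
        if expected_dim is None:
            expected_dim = len(values)
        elif len(values) != expected_dim:
            raise ValueError(
                f"line {line_no}: expected {expected_dim} values, got {len(values)}"
            )
        vectors[word] = values
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy 3-dimensional rows, invented for illustration only.
sample = io.StringIO("pas 0.1 0.2 0.3\nmačka 0.1 0.2 0.25\nauto -0.3 0.1 0.0\n")
emb = load_embeddings(sample)
print(len(emb), len(emb["pas"]))
print(cosine(emb["pas"], emb["mačka"]) > cosine(emb["pas"], emb["auto"]))
```

For a real file, the same function should work on an open file handle, for example `open(path, encoding="utf-8")`, once the delimiter has been checked against the first few lines.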
- Type: dataset
- Landing Page: https://doi.org/10.5281/zenodo.17507138
- OA Status: green
- OpenAlex ID: https://openalex.org/W7104516732
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W7104516732 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.5281/zenodo.17507138 (Digital Object Identifier)
- Title: Croatian word2vec embeddings trained on OpenSubtitles Part 1 (work title)
- Type: dataset (OpenAlex work type)
- Publication year: 2025 (year of publication)
- Publication date: 2025-11-02 (full publication date if available)
- Authors: Grim, Philip; Buchanan, Erin (list of authors in order)
- Landing page: https://doi.org/10.5281/zenodo.17507138 (publisher landing page)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://doi.org/10.5281/zenodo.17507138 (direct OA link when available)
- Concepts: Word2vec, Computer science, Dimension (graph theory), Embedding, Window (computing), Context (archaeology), Artificial intelligence, Natural language processing, Vocabulary, Word embedding, Semantics (computer science), Vector space, Support vector machine, Language model, Word (group theory), Core (optical fiber), Information retrieval, Subtitle, Theoretical computer science (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
Full payload
| id | https://openalex.org/W7104516732 |
|---|---|
| doi | https://doi.org/10.5281/zenodo.17507138 |
| ids.doi | https://doi.org/10.5281/zenodo.17507138 |
| ids.openalex | https://openalex.org/W7104516732 |
| fwci | |
| type | dataset |
| title | Croatian word2vec embeddings trained on OpenSubtitles Part 1 |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C2776461190 |
| concepts[0].level | 3 |
| concepts[0].score | 0.9110686182975769 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q22673982 |
| concepts[0].display_name | Word2vec |
| concepts[1].id | https://openalex.org/C41008148 |
| concepts[1].level | 0 |
| concepts[1].score | 0.6527162790298462 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[1].display_name | Computer science |
| concepts[2].id | https://openalex.org/C33676613 |
| concepts[2].level | 2 |
| concepts[2].score | 0.6388274431228638 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q13415176 |
| concepts[2].display_name | Dimension (graph theory) |
| concepts[3].id | https://openalex.org/C41608201 |
| concepts[3].level | 2 |
| concepts[3].score | 0.6198755502700806 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q980509 |
| concepts[3].display_name | Embedding |
| concepts[4].id | https://openalex.org/C2778751112 |
| concepts[4].level | 2 |
| concepts[4].score | 0.6045960783958435 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q835016 |
| concepts[4].display_name | Window (computing) |
| concepts[5].id | https://openalex.org/C2779343474 |
| concepts[5].level | 2 |
| concepts[5].score | 0.5962949991226196 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q3109175 |
| concepts[5].display_name | Context (archaeology) |
| concepts[6].id | https://openalex.org/C154945302 |
| concepts[6].level | 1 |
| concepts[6].score | 0.5455676317214966 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[6].display_name | Artificial intelligence |
| concepts[7].id | https://openalex.org/C204321447 |
| concepts[7].level | 1 |
| concepts[7].score | 0.5275846123695374 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q30642 |
| concepts[7].display_name | Natural language processing |
| concepts[8].id | https://openalex.org/C2777601683 |
| concepts[8].level | 2 |
| concepts[8].score | 0.46453380584716797 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q6499736 |
| concepts[8].display_name | Vocabulary |
| concepts[9].id | https://openalex.org/C2777462759 |
| concepts[9].level | 3 |
| concepts[9].score | 0.41262388229370117 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q18395344 |
| concepts[9].display_name | Word embedding |
| concepts[10].id | https://openalex.org/C184337299 |
| concepts[10].level | 2 |
| concepts[10].score | 0.37688785791397095 |
| concepts[10].wikidata | https://www.wikidata.org/wiki/Q1437428 |
| concepts[10].display_name | Semantics (computer science) |
| concepts[11].id | https://openalex.org/C13336665 |
| concepts[11].level | 2 |
| concepts[11].score | 0.33106496930122375 |
| concepts[11].wikidata | https://www.wikidata.org/wiki/Q125977 |
| concepts[11].display_name | Vector space |
| concepts[12].id | https://openalex.org/C12267149 |
| concepts[12].level | 2 |
| concepts[12].score | 0.3037644028663635 |
| concepts[12].wikidata | https://www.wikidata.org/wiki/Q282453 |
| concepts[12].display_name | Support vector machine |
| concepts[13].id | https://openalex.org/C137293760 |
| concepts[13].level | 2 |
| concepts[13].score | 0.30296558141708374 |
| concepts[13].wikidata | https://www.wikidata.org/wiki/Q3621696 |
| concepts[13].display_name | Language model |
| concepts[14].id | https://openalex.org/C90805587 |
| concepts[14].level | 2 |
| concepts[14].score | 0.30058518052101135 |
| concepts[14].wikidata | https://www.wikidata.org/wiki/Q10944557 |
| concepts[14].display_name | Word (group theory) |
| concepts[15].id | https://openalex.org/C2164484 |
| concepts[15].level | 2 |
| concepts[15].score | 0.2907838821411133 |
| concepts[15].wikidata | https://www.wikidata.org/wiki/Q5170150 |
| concepts[15].display_name | Core (optical fiber) |
| concepts[16].id | https://openalex.org/C23123220 |
| concepts[16].level | 1 |
| concepts[16].score | 0.28034213185310364 |
| concepts[16].wikidata | https://www.wikidata.org/wiki/Q816826 |
| concepts[16].display_name | Information retrieval |
| concepts[17].id | https://openalex.org/C2780364048 |
| concepts[17].level | 2 |
| concepts[17].score | 0.2559911608695984 |
| concepts[17].wikidata | https://www.wikidata.org/wiki/Q204028 |
| concepts[17].display_name | Subtitle |
| concepts[18].id | https://openalex.org/C80444323 |
| concepts[18].level | 1 |
| concepts[18].score | 0.25512078404426575 |
| concepts[18].wikidata | https://www.wikidata.org/wiki/Q2878974 |
| concepts[18].display_name | Theoretical computer science |
| keywords[0].id | https://openalex.org/keywords/word2vec |
| keywords[0].score | 0.9110686182975769 |
| keywords[0].display_name | Word2vec |
| keywords[1].id | https://openalex.org/keywords/dimension |
| keywords[1].score | 0.6388274431228638 |
| keywords[1].display_name | Dimension (graph theory) |
| keywords[2].id | https://openalex.org/keywords/embedding |
| keywords[2].score | 0.6198755502700806 |
| keywords[2].display_name | Embedding |
| keywords[3].id | https://openalex.org/keywords/window |
| keywords[3].score | 0.6045960783958435 |
| keywords[3].display_name | Window (computing) |
| keywords[4].id | https://openalex.org/keywords/context |
| keywords[4].score | 0.5962949991226196 |
| keywords[4].display_name | Context (archaeology) |
| keywords[5].id | https://openalex.org/keywords/vocabulary |
| keywords[5].score | 0.46453380584716797 |
| keywords[5].display_name | Vocabulary |
| keywords[6].id | https://openalex.org/keywords/word-embedding |
| keywords[6].score | 0.41262388229370117 |
| keywords[6].display_name | Word embedding |
| language | |
| locations[0].id | doi:10.5281/zenodo.17507138 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400562 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | Zenodo (CERN European Organization for Nuclear Research) |
| locations[0].source.host_organization | https://openalex.org/I67311998 |
| locations[0].source.host_organization_name | European Organization for Nuclear Research |
| locations[0].source.host_organization_lineage | https://openalex.org/I67311998 |
| locations[0].license | cc-by |
| locations[0].pdf_url | |
| locations[0].version | |
| locations[0].raw_type | dataset |
| locations[0].license_id | https://openalex.org/licenses/cc-by |
| locations[0].is_accepted | False |
| locations[0].is_published | |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | https://doi.org/10.5281/zenodo.17507138 |
| indexed_in | datacite |
| authorships[0].author.id | |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Grim, Philip |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Grim, Philip |
| authorships[0].is_corresponding | True |
| authorships[1].author.id | |
| authorships[1].author.orcid | |
| authorships[1].author.display_name | Buchanan, Erin |
| authorships[1].countries | US |
| authorships[1].affiliations[0].institution_ids | https://openalex.org/I153151563 |
| authorships[1].affiliations[0].raw_affiliation_string | Harrisburg University of Science and Technology |
| authorships[1].institutions[0].id | https://openalex.org/I153151563 |
| authorships[1].institutions[0].ror | https://ror.org/02g0s4z48 |
| authorships[1].institutions[0].type | education |
| authorships[1].institutions[0].lineage | https://openalex.org/I153151563 |
| authorships[1].institutions[0].country_code | US |
| authorships[1].institutions[0].display_name | Harrisburg University of Science and Technology |
| authorships[1].author_position | last |
| authorships[1].raw_author_name | Buchanan, Erin |
| authorships[1].is_corresponding | False |
| authorships[1].raw_affiliation_strings | Harrisburg University of Science and Technology |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://doi.org/10.5281/zenodo.17507138 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-11-10T00:00:00 |
| display_name | Croatian word2vec embeddings trained on OpenSubtitles Part 1 |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-10T23:18:03.357015 |
| primary_topic | |
| cited_by_count | 0 |
| locations_count | 1 |
| best_oa_location.id | doi:10.5281/zenodo.17507138 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400562 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | Zenodo (CERN European Organization for Nuclear Research) |
| best_oa_location.source.host_organization | https://openalex.org/I67311998 |
| best_oa_location.source.host_organization_name | European Organization for Nuclear Research |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I67311998 |
| best_oa_location.license | cc-by |
| best_oa_location.pdf_url | |
| best_oa_location.version | |
| best_oa_location.raw_type | dataset |
| best_oa_location.license_id | https://openalex.org/licenses/cc-by |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | https://doi.org/10.5281/zenodo.17507138 |
| primary_location.id | doi:10.5281/zenodo.17507138 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400562 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | Zenodo (CERN European Organization for Nuclear Research) |
| primary_location.source.host_organization | https://openalex.org/I67311998 |
| primary_location.source.host_organization_name | European Organization for Nuclear Research |
| primary_location.source.host_organization_lineage | https://openalex.org/I67311998 |
| primary_location.license | cc-by |
| primary_location.pdf_url | |
| primary_location.version | |
| primary_location.raw_type | dataset |
| primary_location.license_id | https://openalex.org/licenses/cc-by |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | https://doi.org/10.5281/zenodo.17507138 |
| publication_date | 2025-11-02 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.+ | 112 |
| abstract_inverted_index.2 | 108 |
| abstract_inverted_index.a | 85 |
| abstract_inverted_index.1) | 101 |
| abstract_inverted_index.2, | 77 |
| abstract_inverted_index.5, | 78 |
| abstract_inverted_index.If | 114 |
| abstract_inverted_index.as | 8 |
| abstract_inverted_index.in | 10, 30, 49 |
| abstract_inverted_index.on | 16 |
| abstract_inverted_index.to | 84 |
| abstract_inverted_index.we | 43 |
| abstract_inverted_index.× | 89 |
| abstract_inverted_index.1). | 113 |
| abstract_inverted_index.10, | 79 |
| abstract_inverted_index.DOI | 129 |
| abstract_inverted_index.For | 40 |
| abstract_inverted_index.The | 12 |
| abstract_inverted_index.all | 45 |
| abstract_inverted_index.and | 20, 32, 102 |
| abstract_inverted_index.for | 6, 97 |
| abstract_inverted_index.of: | 61 |
| abstract_inverted_index.the | 3, 35, 50, 53, 95, 104, 128 |
| abstract_inverted_index.use | 29, 116 |
| abstract_inverted_index.you | 115 |
| abstract_inverted_index.100, | 67 |
| abstract_inverted_index.200, | 68 |
| abstract_inverted_index.2018 | 37 |
| abstract_inverted_index.300, | 69 |
| abstract_inverted_index.Each | 81, 92 |
| abstract_inverted_index.This | 0, 124 |
| abstract_inverted_index.file | 82, 93 |
| abstract_inverted_index.from | 26, 34 |
| abstract_inverted_index.size | 111 |
| abstract_inverted_index.that | 98 |
| abstract_inverted_index.then | 103 |
| abstract_inverted_index.this | 41, 117 |
| abstract_inverted_index.were | 14 |
| abstract_inverted_index.…) | 70, 80 |
| abstract_inverted_index.Data: | 123 |
| abstract_inverted_index.cite: | 120 |
| abstract_inverted_index.films | 31 |
| abstract_inverted_index.here) | 131 |
| abstract_inverted_index.size: | 72 |
| abstract_inverted_index.sizes | 65 |
| abstract_inverted_index.under | 58 |
| abstract_inverted_index.(e.g., | 66, 76 |
| abstract_inverted_index.(using | 127 |
| abstract_inverted_index.Window | 71 |
| abstract_inverted_index.Zenodo | 125 |
| abstract_inverted_index.please | 119 |
| abstract_inverted_index.size). | 91 |
| abstract_inverted_index.spaces | 24 |
| abstract_inverted_index.study. | 51 |
| abstract_inverted_index.unique | 86 |
| abstract_inverted_index.values | 106 |
| abstract_inverted_index.vector | 23, 64 |
| abstract_inverted_index.window | 90 |
| abstract_inverted_index.(column | 100 |
| abstract_inverted_index.context | 74 |
| abstract_inverted_index.corpora | 19 |
| abstract_inverted_index.dataset | 1, 54, 126 |
| abstract_inverted_index.derived | 25 |
| abstract_inverted_index.provide | 44 |
| abstract_inverted_index.through | 109 |
| abstract_inverted_index.trained | 15 |
| abstract_inverted_index.varying | 73 |
| abstract_inverted_index.vectors | 56 |
| abstract_inverted_index.windows | 75 |
| abstract_inverted_index.(columns | 107 |
| abstract_inverted_index.contains | 2, 94 |
| abstract_inverted_index.dataset, | 118 |
| abstract_inverted_index.explored | 48 |
| abstract_inverted_index.includes | 55 |
| abstract_inverted_index.language | 28, 99 |
| abstract_inverted_index.multiple | 63 |
| abstract_inverted_index.provided | 130 |
| abstract_inverted_index.semantic | 22 |
| abstract_inverted_index.subs2vec | 4 |
| abstract_inverted_index.subtitle | 18 |
| abstract_inverted_index.variants | 47 |
| abstract_inverted_index.Croatian, | 7 |
| abstract_inverted_index.datasets: | 38 |
| abstract_inverted_index.different | 59 |
| abstract_inverted_index.dimension | 110 |
| abstract_inverted_index.embedding | 46, 105 |
| abstract_inverted_index.generated | 57 |
| abstract_inverted_index.language, | 42 |
| abstract_inverted_index.presented | 9 |
| abstract_inverted_index.represent | 21 |
| abstract_inverted_index.(dimension | 88 |
| abstract_inverted_index.embeddings | 5, 13 |
| abstract_inverted_index.television | 33 |
| abstract_inverted_index.vocabulary | 96 |
| abstract_inverted_index.Manuscript: | 121 |
| abstract_inverted_index.corresponds | 83 |
| abstract_inverted_index.large-scale | 17 |
| abstract_inverted_index.combinations | 60 |
| abstract_inverted_index.naturalistic | 27 |
| abstract_inverted_index.OpenSubtitles | 36 |
| abstract_inverted_index.Specifically, | 52 |
| abstract_inverted_index.configuration | 87 |
| abstract_inverted_index.Dimensionality: | 62 |
| abstract_inverted_index.https://zenodo.org/records/17243814. | 11 |
| abstract_inverted_index.https://doi.org/10.5281/zenodo.17243812 | 122 |
| abstract_inverted_index.https://opus.nlpl.eu/OpenSubtitles/corpus/version/OpenSubtitles. | 39 |
| cited_by_percentile_year | |
| countries_distinct_count | 1 |
| institutions_distinct_count | 2 |
| citation_normalized_percentile | |
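
The `abstract_inverted_index.*` rows in the payload store the abstract as a map from each token to the word positions where it occurs. A minimal sketch of turning such a map back into running text is shown below; `rebuild_abstract` is a hypothetical helper name, and `sample_index` copies only the entries for positions 0 through 7 from the table above, so the output is just the opening words of the abstract.

```python
def rebuild_abstract(inverted_index):
    """Rebuild text from a {token: [positions]} inverted index."""
    by_position = {}
    for token, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = token
    # Join tokens in ascending position order.
    return " ".join(by_position[pos] for pos in sorted(by_position))


# Subset of the payload rows (positions 0-7 only).
sample_index = {
    "This": [0],
    "dataset": [1],
    "contains": [2],
    "the": [3],
    "subs2vec": [4],
    "embeddings": [5],
    "for": [6],
    "Croatian,": [7],
}
print(rebuild_abstract(sample_index))
```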