Score Combination for Improved Parallel Corpus Filtering for Low Resource Conditions
Muhammad ElNokrashy, Amr Hendy, Mohamed Abdelghaffar, Mohamed Afify, Ahmed Tawfik, Hany Hassan Awadalla
2020 · Open Access · DOI: https://doi.org/10.48550/arxiv.2011.07933
This paper describes our submission to the WMT20 sentence filtering task. We combine scores from (1) a custom LASER built for each source language, (2) a classifier built to distinguish positive and negative pairs by semantic alignment, and (3) the original scores included in the task devkit. For the mBART finetuning setup, provided by the organizers, our method shows 7% and 5% relative improvement over baseline, in sacreBLEU score on the test set for Pashto and Khmer respectively.
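The abstract names three score sources but does not say how they are combined. The following is a minimal sketch, assuming a weighted average of min-max-normalized scores followed by keeping the top-scoring fraction of sentence pairs. The function names (`min_max`, `combine_scores`, `keep_top`), the weights, and the normalization are illustrative assumptions, not the authors' method.

```python
from typing import List, Sequence


def min_max(xs: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]


def combine_scores(score_lists: Sequence[Sequence[float]],
                   weights: Sequence[float]) -> List[float]:
    """Weighted average of per-source scores after min-max normalization.

    Assumption: each inner sequence holds one score per sentence pair,
    in the same order across sources.
    """
    if len(score_lists) != len(weights):
        raise ValueError("need exactly one weight per score source")
    normalized = [min_max(scores) for scores in score_lists]
    n_pairs = len(normalized[0])
    total_weight = sum(weights)
    return [
        sum(w * col[i] for w, col in zip(weights, normalized)) / total_weight
        for i in range(n_pairs)
    ]


def keep_top(pairs: Sequence[str], scores: Sequence[float],
             fraction: float) -> List[str]:
    """Keep the highest-scoring fraction of pairs, preserving corpus order."""
    k = max(1, int(len(pairs) * fraction))
    ranked = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)
    return [pairs[i] for i in sorted(ranked[:k])]


# Hypothetical scores for three sentence pairs from the three sources.
laser_scores = [0.2, 0.8, 0.5]       # (1) custom LASER similarity
classifier_scores = [0.1, 0.9, 0.4]  # (2) alignment classifier
devkit_scores = [1.0, 3.0, 2.0]      # (3) scores shipped with the devkit

combined = combine_scores(
    [laser_scores, classifier_scores, devkit_scores],
    weights=[1.0, 1.0, 1.0],
)
selected = keep_top(["pair_a", "pair_b", "pair_c"], combined, fraction=0.5)
```

Normalizing before averaging keeps a source with a wider numeric range (here the devkit scores) from dominating the sum; in practice the weights would be tuned on held-out data.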
Metadata
- Type: preprint
- Landing Page: http://arxiv.org/abs/2011.07933, https://arxiv.org/pdf/2011.07933
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4287597277
All OpenAlex metadata
- OpenAlex ID: https://openalex.org/W4287597277 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2011.07933 (Digital Object Identifier)
- Title: Score Combination for Improved Parallel Corpus Filtering for Low Resource Conditions (work title)
- Type: preprint (OpenAlex work type)
- Publication year: 2020 (year of publication)
- Publication date: 2020-11-16 (full publication date if available)
- Authors: Muhammad ElNokrashy, Amr Hendy, Mohamed Abdelghaffar, Mohamed Afify, Ahmed Tawfik, Hany Hassan Awadalla (list of authors in order)
- Landing page: https://arxiv.org/abs/2011.07933 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2011.07933 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2011.07933 (direct OA link when available)
- Concepts: Computer science, Sentence, Baseline (sea), Classifier (UML), Natural language processing, Task (project management), Artificial intelligence, Test set, Set (abstract data type), F1 score, Resource (disambiguation), Programming language, Engineering, Computer network, Oceanography, Geology, Systems engineering (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| field | value |
|---|---|
| id | https://openalex.org/W4287597277 |
| doi | https://doi.org/10.48550/arxiv.2011.07933 |
| ids.openalex | https://openalex.org/W4287597277 |
| fwci | 0.0 |
| type | preprint |
| title | Score Combination for Improved Parallel Corpus Filtering for Low\n Resource Conditions |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10181 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9980999827384949 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Natural Language Processing Techniques |
| topics[1].id | https://openalex.org/T10028 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9879999756813049 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | Topic Modeling |
| topics[2].id | https://openalex.org/T10601 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9574999809265137 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1707 |
| topics[2].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[2].display_name | Handwritten Text Recognition Techniques |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C41008148 |
| concepts[0].level | 0 |
| concepts[0].score | 0.7075404524803162 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[0].display_name | Computer science |
| concepts[1].id | https://openalex.org/C2777530160 |
| concepts[1].level | 2 |
| concepts[1].score | 0.6722940802574158 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q41796 |
| concepts[1].display_name | Sentence |
| concepts[2].id | https://openalex.org/C12725497 |
| concepts[2].level | 2 |
| concepts[2].score | 0.6579276323318481 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q810247 |
| concepts[2].display_name | Baseline (sea) |
| concepts[3].id | https://openalex.org/C95623464 |
| concepts[3].level | 2 |
| concepts[3].score | 0.6396942734718323 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q1096149 |
| concepts[3].display_name | Classifier (UML) |
| concepts[4].id | https://openalex.org/C204321447 |
| concepts[4].level | 1 |
| concepts[4].score | 0.6380475759506226 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q30642 |
| concepts[4].display_name | Natural language processing |
| concepts[5].id | https://openalex.org/C2780451532 |
| concepts[5].level | 2 |
| concepts[5].score | 0.5978789329528809 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q759676 |
| concepts[5].display_name | Task (project management) |
| concepts[6].id | https://openalex.org/C154945302 |
| concepts[6].level | 1 |
| concepts[6].score | 0.5418070554733276 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[6].display_name | Artificial intelligence |
| concepts[7].id | https://openalex.org/C169903167 |
| concepts[7].level | 2 |
| concepts[7].score | 0.507166862487793 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q3985153 |
| concepts[7].display_name | Test set |
| concepts[8].id | https://openalex.org/C177264268 |
| concepts[8].level | 2 |
| concepts[8].score | 0.491565078496933 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q1514741 |
| concepts[8].display_name | Set (abstract data type) |
| concepts[9].id | https://openalex.org/C148524875 |
| concepts[9].level | 2 |
| concepts[9].score | 0.4415281414985657 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q6975395 |
| concepts[9].display_name | F1 score |
| concepts[10].id | https://openalex.org/C206345919 |
| concepts[10].level | 2 |
| concepts[10].score | 0.41892650723457336 |
| concepts[10].wikidata | https://www.wikidata.org/wiki/Q20380951 |
| concepts[10].display_name | Resource (disambiguation) |
| concepts[11].id | https://openalex.org/C199360897 |
| concepts[11].level | 1 |
| concepts[11].score | 0.0848625898361206 |
| concepts[11].wikidata | https://www.wikidata.org/wiki/Q9143 |
| concepts[11].display_name | Programming language |
| concepts[12].id | https://openalex.org/C127413603 |
| concepts[12].level | 0 |
| concepts[12].score | 0.08393099904060364 |
| concepts[12].wikidata | https://www.wikidata.org/wiki/Q11023 |
| concepts[12].display_name | Engineering |
| concepts[13].id | https://openalex.org/C31258907 |
| concepts[13].level | 1 |
| concepts[13].score | 0.0 |
| concepts[13].wikidata | https://www.wikidata.org/wiki/Q1301371 |
| concepts[13].display_name | Computer network |
| concepts[14].id | https://openalex.org/C111368507 |
| concepts[14].level | 1 |
| concepts[14].score | 0.0 |
| concepts[14].wikidata | https://www.wikidata.org/wiki/Q43518 |
| concepts[14].display_name | Oceanography |
| concepts[15].id | https://openalex.org/C127313418 |
| concepts[15].level | 0 |
| concepts[15].score | 0.0 |
| concepts[15].wikidata | https://www.wikidata.org/wiki/Q1069 |
| concepts[15].display_name | Geology |
| concepts[16].id | https://openalex.org/C201995342 |
| concepts[16].level | 1 |
| concepts[16].score | 0.0 |
| concepts[16].wikidata | https://www.wikidata.org/wiki/Q682496 |
| concepts[16].display_name | Systems engineering |
| keywords[0].id | https://openalex.org/keywords/computer-science |
| keywords[0].score | 0.7075404524803162 |
| keywords[0].display_name | Computer science |
| keywords[1].id | https://openalex.org/keywords/sentence |
| keywords[1].score | 0.6722940802574158 |
| keywords[1].display_name | Sentence |
| keywords[2].id | https://openalex.org/keywords/baseline |
| keywords[2].score | 0.6579276323318481 |
| keywords[2].display_name | Baseline (sea) |
| keywords[3].id | https://openalex.org/keywords/classifier |
| keywords[3].score | 0.6396942734718323 |
| keywords[3].display_name | Classifier (UML) |
| keywords[4].id | https://openalex.org/keywords/natural-language-processing |
| keywords[4].score | 0.6380475759506226 |
| keywords[4].display_name | Natural language processing |
| keywords[5].id | https://openalex.org/keywords/task |
| keywords[5].score | 0.5978789329528809 |
| keywords[5].display_name | Task (project management) |
| keywords[6].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[6].score | 0.5418070554733276 |
| keywords[6].display_name | Artificial intelligence |
| keywords[7].id | https://openalex.org/keywords/test-set |
| keywords[7].score | 0.507166862487793 |
| keywords[7].display_name | Test set |
| keywords[8].id | https://openalex.org/keywords/set |
| keywords[8].score | 0.491565078496933 |
| keywords[8].display_name | Set (abstract data type) |
| keywords[9].id | https://openalex.org/keywords/f1-score |
| keywords[9].score | 0.4415281414985657 |
| keywords[9].display_name | F1 score |
| keywords[10].id | https://openalex.org/keywords/resource |
| keywords[10].score | 0.41892650723457336 |
| keywords[10].display_name | Resource (disambiguation) |
| keywords[11].id | https://openalex.org/keywords/programming-language |
| keywords[11].score | 0.0848625898361206 |
| keywords[11].display_name | Programming language |
| keywords[12].id | https://openalex.org/keywords/engineering |
| keywords[12].score | 0.08393099904060364 |
| keywords[12].display_name | Engineering |
| language | |
| locations[0].id | pmh:oai:arXiv.org:2011.07933 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2011.07933 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2011.07933 |
| indexed_in | arxiv |
| authorships[0].author.id | https://openalex.org/A5073249931 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Muhammad ElNokrashy |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | ElNokrashy, Muhammad N. |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5007758583 |
| authorships[1].author.orcid | |
| authorships[1].author.display_name | Amr Hendy |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Hendy, Amr |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5101190269 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | Mohamed Abdelghaffar |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Abdelghaffar, Mohamed |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5021938376 |
| authorships[3].author.orcid | https://orcid.org/0000-0002-4445-9767 |
| authorships[3].author.display_name | Mohamed Afify |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Afify, Mohamed |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5101182096 |
| authorships[4].author.orcid | |
| authorships[4].author.display_name | Ahmed Tawfik |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Tawfik, Ahmed |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5030937723 |
| authorships[5].author.orcid | |
| authorships[5].author.display_name | Hany Hassan Awadalla |
| authorships[5].author_position | last |
| authorships[5].raw_author_name | Awadalla, Hany Hassan |
| authorships[5].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2011.07933 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2022-07-25T00:00:00 |
| display_name | Score Combination for Improved Parallel Corpus Filtering for Low\n Resource Conditions |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T03:46:38.306776 |
| primary_topic.id | https://openalex.org/T10181 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9980999827384949 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Natural Language Processing Techniques |
| related_works | https://openalex.org/W2383111961, https://openalex.org/W2365952365, https://openalex.org/W2352448290, https://openalex.org/W2380820513, https://openalex.org/W4386566933, https://openalex.org/W2135277937, https://openalex.org/W4223433861, https://openalex.org/W3099922831, https://openalex.org/W4385571733, https://openalex.org/W4388419449 |
| cited_by_count | 0 |
| locations_count | 1 |
| best_oa_location.id | pmh:oai:arXiv.org:2011.07933 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2011.07933 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2011.07933 |
| primary_location.id | pmh:oai:arXiv.org:2011.07933 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2011.07933 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2011.07933 |
| publication_date | 2020-11-16 |
| publication_year | 2020 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 15 |
| abstract_inverted_index.7% | 55 |
| abstract_inverted_index.by | 32, 49 |
| abstract_inverted_index.in | 40, 61 |
| abstract_inverted_index.on | 64 |
| abstract_inverted_index.to | 5, 26 |
| abstract_inverted_index.(1) | 14 |
| abstract_inverted_index.(2) | 23 |
| abstract_inverted_index.(3) | 35 |
| abstract_inverted_index.For | 44 |
| abstract_inverted_index.and | 29, 34, 56, 69 |
| abstract_inverted_index.for | 19 |
| abstract_inverted_index.our | 3, 52 |
| abstract_inverted_index.set | 67 |
| abstract_inverted_index.the | 6, 36, 41, 50, 65 |
| abstract_inverted_index.This | 0 |
| abstract_inverted_index.each | 20 |
| abstract_inverted_index.from | 13 |
| abstract_inverted_index.over | 59 |
| abstract_inverted_index.task | 42 |
| abstract_inverted_index.test | 66 |
| abstract_inverted_index.Khmer | 70 |
| abstract_inverted_index.LASER | 17 |
| abstract_inverted_index.WMT20 | 7 |
| abstract_inverted_index.built | 18, 25 |
| abstract_inverted_index.pairs | 31 |
| abstract_inverted_index.paper | 1 |
| abstract_inverted_index.score | 63 |
| abstract_inverted_index.shows | 54 |
| abstract_inverted_index.task. | 10 |
| abstract_inverted_index.custom | 16 |
| abstract_inverted_index.method | 53 |
| abstract_inverted_index.scores | 12, 38 |
| abstract_inverted_index.setup, | 47 |
| abstract_inverted_index.source | 21 |
| abstract_inverted_index.devkit. | 43 |
| abstract_inverted_index.included | 39 |
| abstract_inverted_index.negative | 30 |
| abstract_inverted_index.original | 37 |
| abstract_inverted_index.positive | 28 |
| abstract_inverted_index.provided | 48 |
| abstract_inverted_index.sentence | 8 |
| abstract_inverted_index.baseline, | 60 |
| abstract_inverted_index.describes | 2 |
| abstract_inverted_index.filtering | 9 |
| abstract_inverted_index.language, | 22 |
| abstract_inverted_index.sacreBLEU | 62 |
| abstract_inverted_index.finetuning | 46 |
| abstract_inverted_index.submission | 4 |
| abstract_inverted_index.the\nmBART | 45 |
| abstract_inverted_index.We\ncombine | 11 |
| abstract_inverted_index.distinguish | 27 |
| abstract_inverted_index.for\nPashto | 68 |
| abstract_inverted_index.improvement | 58 |
| abstract_inverted_index.organizers, | 51 |
| abstract_inverted_index.5%\nrelative | 57 |
| abstract_inverted_index.a\nclassifier | 24 |
| abstract_inverted_index.respectively.\n | 71 |
| abstract_inverted_index.semantic\nalignment, | 33 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 6 |
| citation_normalized_percentile.value | 0.28831627 |
| citation_normalized_percentile.is_in_top_1_percent | False |
| citation_normalized_percentile.is_in_top_10_percent | False |
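The `abstract_inverted_index` rows in the payload above map each token to the zero-based positions where it occurs in the abstract. A minimal sketch of rebuilding the text from that structure follows; the function name `rebuild_abstract` is hypothetical, and `sample` is a hand-copied subset of the rows covering positions 0 through 10 only, not the full index.

```python
from typing import Dict, List


def rebuild_abstract(inverted_index: Dict[str, List[int]]) -> str:
    """Place each token at its recorded positions and join with spaces."""
    by_position: Dict[int, str] = {}
    for token, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = token
    return " ".join(by_position[pos] for pos in sorted(by_position))


# Subset of the payload rows (positions 0-10), truncated for illustration.
sample = {
    "This": [0],
    "paper": [1],
    "describes": [2],
    "our": [3],
    "submission": [4],
    "to": [5],
    "the": [6],
    "WMT20": [7],
    "sentence": [8],
    "filtering": [9],
    "task.": [10],
}

print(rebuild_abstract(sample))
# prints: This paper describes our submission to the WMT20 sentence filtering task.
```

Some keys in the payload contain a literal `\n` (for example `We\ncombine`) because the source abstract was split on spaces only; a caller that wants clean text would need to replace those line breaks with spaces after joining.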