Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2505.12973
Homograph disambiguation remains a significant challenge in grapheme-to-phoneme (G2P) conversion, especially for low-resource languages. This challenge is twofold: (1) creating balanced and comprehensive homograph datasets is labor-intensive and costly, and (2) specific disambiguation strategies introduce additional latency, making them unsuitable for real-time applications such as screen readers and other accessibility tools. In this paper, we address both issues. First, we propose a semi-automated pipeline for constructing homograph-focused datasets, introduce the HomoRich dataset generated through this pipeline, and demonstrate its effectiveness by applying it to enhance a state-of-the-art deep learning-based G2P system for Persian. Second, we advocate for a paradigm shift - utilizing rich offline datasets to inform the development of fast, rule-based methods suitable for latency-sensitive accessibility applications like screen readers. To this end, we improve one of the most well-known rule-based G2P systems, eSpeak, into a fast homograph-aware version, HomoFast eSpeak. Our results show an approximate 30% improvement in homograph disambiguation accuracy for the deep learning-based and eSpeak systems.
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2505.12973 (PDF: https://arxiv.org/pdf/2505.12973)
- OA Status: green
- OpenAlex ID: https://openalex.org/W4417286717
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4417286717 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2505.12973 (Digital Object Identifier)
- Title: Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025
- Publication date: 2025-05-19
- Authors: Mahta Fetrat Qharabagh, Zahra Dehghanian, Hamid R. Rabiee (in author order)
- Landing page: https://arxiv.org/abs/2505.12973 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2505.12973 (direct link to full-text PDF)
- Open access: Yes (a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2505.12973 (direct OA link)
- Cited by: 0 (total citation count in OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4417286717 |
| doi | https://doi.org/10.48550/arxiv.2505.12973 |
| ids.doi | https://doi.org/10.48550/arxiv.2505.12973 |
| ids.openalex | https://openalex.org/W4417286717 |
| fwci | |
| type | preprint |
| title | Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2505.12973 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2505.12973 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2505.12973 |
| locations[1].id | doi:10.48550/arxiv.2505.12973 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2505.12973 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5114353445 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Mahta Fetrat Qharabagh |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Qharabagh, Mahta Fetrat |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5114353446 |
| authorships[1].author.orcid | |
| authorships[1].author.display_name | Zahra Dehghanian |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Dehghanian, Zahra |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5063512925 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-9835-4493 |
| authorships[2].author.display_name | Hamid R. Rabiee |
| authorships[2].author_position | last |
| authorships[2].raw_author_name | Rabiee, Hamid R. |
| authorships[2].is_corresponding | False |
| has_content.pdf | True |
| has_content.grobid_xml | True |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2505.12973 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models |
| has_fulltext | True |
| is_retracted | False |
| updated_date | 2025-12-12T22:30:26.305170 |
| primary_topic | |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2505.12973 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2505.12973 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2505.12973 |
| primary_location.id | pmh:oai:arXiv.org:2505.12973 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2505.12973 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2505.12973 |
| publication_date | 2025-05-19 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.- | 100 |
| abstract_inverted_index.a | 3, 61, 85, 97, 136 |
| abstract_inverted_index.In | 51 |
| abstract_inverted_index.To | 121 |
| abstract_inverted_index.an | 145 |
| abstract_inverted_index.as | 44 |
| abstract_inverted_index.by | 80 |
| abstract_inverted_index.in | 6, 149 |
| abstract_inverted_index.is | 16, 25 |
| abstract_inverted_index.it | 82 |
| abstract_inverted_index.of | 109, 127 |
| abstract_inverted_index.to | 83, 105 |
| abstract_inverted_index.we | 54, 59, 94, 124 |
| abstract_inverted_index.(1) | 18 |
| abstract_inverted_index.(2) | 30 |
| abstract_inverted_index.30% | 147 |
| abstract_inverted_index.G2P | 89, 132 |
| abstract_inverted_index.Our | 142 |
| abstract_inverted_index.and | 21, 27, 29, 47, 76, 157 |
| abstract_inverted_index.for | 11, 40, 64, 91, 96, 114, 153 |
| abstract_inverted_index.its | 78 |
| abstract_inverted_index.one | 126 |
| abstract_inverted_index.the | 69, 107, 128, 154 |
| abstract_inverted_index.This | 14 |
| abstract_inverted_index.both | 56 |
| abstract_inverted_index.deep | 87, 155 |
| abstract_inverted_index.end, | 123 |
| abstract_inverted_index.fast | 137 |
| abstract_inverted_index.into | 135 |
| abstract_inverted_index.like | 118 |
| abstract_inverted_index.most | 129 |
| abstract_inverted_index.rich | 102 |
| abstract_inverted_index.show | 144 |
| abstract_inverted_index.such | 43 |
| abstract_inverted_index.them | 38 |
| abstract_inverted_index.this | 52, 74, 122 |
| abstract_inverted_index.(G2P) | 8 |
| abstract_inverted_index.fast, | 110 |
| abstract_inverted_index.other | 48 |
| abstract_inverted_index.shift | 99 |
| abstract_inverted_index.First, | 58 |
| abstract_inverted_index.eSpeak | 158 |
| abstract_inverted_index.inform | 106 |
| abstract_inverted_index.making | 37 |
| abstract_inverted_index.paper, | 53 |
| abstract_inverted_index.screen | 45, 119 |
| abstract_inverted_index.system | 90 |
| abstract_inverted_index.tools. | 50 |
| abstract_inverted_index.Second, | 93 |
| abstract_inverted_index.address | 55 |
| abstract_inverted_index.costly, | 28 |
| abstract_inverted_index.dataset | 71 |
| abstract_inverted_index.eSpeak, | 134 |
| abstract_inverted_index.eSpeak. | 141 |
| abstract_inverted_index.enhance | 84 |
| abstract_inverted_index.improve | 125 |
| abstract_inverted_index.issues. | 57 |
| abstract_inverted_index.methods | 112 |
| abstract_inverted_index.offline | 103 |
| abstract_inverted_index.propose | 60 |
| abstract_inverted_index.readers | 46 |
| abstract_inverted_index.remains | 2 |
| abstract_inverted_index.results | 143 |
| abstract_inverted_index.through | 73 |
| abstract_inverted_index.HomoFast | 140 |
| abstract_inverted_index.HomoRich | 70 |
| abstract_inverted_index.Persian. | 92 |
| abstract_inverted_index.accuracy | 152 |
| abstract_inverted_index.advocate | 95 |
| abstract_inverted_index.applying | 81 |
| abstract_inverted_index.balanced | 20 |
| abstract_inverted_index.creating | 19 |
| abstract_inverted_index.datasets | 24, 104 |
| abstract_inverted_index.latency, | 36 |
| abstract_inverted_index.paradigm | 98 |
| abstract_inverted_index.pipeline | 63 |
| abstract_inverted_index.readers. | 120 |
| abstract_inverted_index.specific | 31 |
| abstract_inverted_index.suitable | 113 |
| abstract_inverted_index.systems, | 133 |
| abstract_inverted_index.systems. | 159 |
| abstract_inverted_index.twofold: | 17 |
| abstract_inverted_index.version, | 139 |
| abstract_inverted_index.Homograph | 0 |
| abstract_inverted_index.challenge | 5, 15 |
| abstract_inverted_index.datasets, | 67 |
| abstract_inverted_index.generated | 72 |
| abstract_inverted_index.homograph | 23, 150 |
| abstract_inverted_index.introduce | 34, 68 |
| abstract_inverted_index.pipeline, | 75 |
| abstract_inverted_index.real-time | 41 |
| abstract_inverted_index.utilizing | 101 |
| abstract_inverted_index.additional | 35 |
| abstract_inverted_index.especially | 10 |
| abstract_inverted_index.languages. | 13 |
| abstract_inverted_index.rule-based | 111, 131 |
| abstract_inverted_index.strategies | 33 |
| abstract_inverted_index.unsuitable | 39 |
| abstract_inverted_index.well-known | 130 |
| abstract_inverted_index.approximate | 146 |
| abstract_inverted_index.conversion, | 9 |
| abstract_inverted_index.demonstrate | 77 |
| abstract_inverted_index.development | 108 |
| abstract_inverted_index.improvement | 148 |
| abstract_inverted_index.significant | 4 |
| abstract_inverted_index.applications | 42, 117 |
| abstract_inverted_index.constructing | 65 |
| abstract_inverted_index.low-resource | 12 |
| abstract_inverted_index.accessibility | 49, 116 |
| abstract_inverted_index.comprehensive | 22 |
| abstract_inverted_index.effectiveness | 79 |
| abstract_inverted_index.disambiguation | 1, 32, 151 |
| abstract_inverted_index.learning-based | 88, 156 |
| abstract_inverted_index.semi-automated | 62 |
| abstract_inverted_index.homograph-aware | 138 |
| abstract_inverted_index.labor-intensive | 26 |
| abstract_inverted_index.state-of-the-art | 86 |
| abstract_inverted_index.homograph-focused | 66 |
| abstract_inverted_index.latency-sensitive | 115 |
| abstract_inverted_index.grapheme-to-phoneme | 7 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 3 |
| citation_normalized_percentile | |
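
The `abstract_inverted_index.*` rows in the table above map each token of the abstract to the zero-based word positions at which it occurs. The sketch below shows one way to turn such a mapping back into running text. It assumes the rows have already been loaded into a Python dict; the helper name `rebuild_abstract` and the `sample` dict are illustrative only and are not part of any OpenAlex client.

```python
from typing import Dict, List


def rebuild_abstract(inverted_index: Dict[str, List[int]]) -> str:
    """Reassemble text from a token -> word-positions mapping."""
    by_position: Dict[int, str] = {}
    for token, indices in inverted_index.items():
        for i in indices:
            by_position[i] = token
    # Join tokens in ascending position order, separated by single spaces.
    return " ".join(by_position[i] for i in sorted(by_position))


# Tokens at positions 0-9, copied from the table rows above.
# Position lists are truncated to that range for brevity
# (e.g. "disambiguation" also occurs at 32 and 151).
sample = {
    "Homograph": [0],
    "disambiguation": [1],
    "remains": [2],
    "a": [3],
    "significant": [4],
    "challenge": [5],
    "in": [6],
    "grapheme-to-phoneme": [7],
    "(G2P)": [8],
    "conversion,": [9],
}

print(rebuild_abstract(sample))
```

Applied to the full set of rows, the same function should reproduce the abstract shown near the top of this record, since punctuation is stored as part of each token.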