SYNAuG: Exploiting Synthetic Data for Data Imbalance Problems
2023 · Open Access · DOI: https://doi.org/10.48550/arxiv.2308.00994
Data imbalance in training data often leads to biased predictions from trained models, which in turn causes ethical and social issues. A straightforward solution is to carefully curate training data, but given the enormous scale of modern neural networks, this is prohibitively labor-intensive and thus impractical. Inspired by recent developments in generative models, this paper explores the potential of synthetic data to address the data imbalance problem. To be specific, our method, dubbed SYNAuG, leverages synthetic data to equalize the unbalanced distribution of training data. Our experiments demonstrate that, although a domain gap between real and synthetic data exists, training with SYNAuG followed by fine-tuning with a few real samples allows to achieve impressive performance on diverse tasks with different data imbalance issues, surpassing existing task-specific methods for the same purpose.
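
The equalization step described in the abstract can be illustrated with a minimal sketch. The abstract only says that synthetic data is used to "equalize the unbalanced distribution of training data"; the choice of the largest class as the target size, the function name `synthetic_counts_needed`, and the toy labels are assumptions made for illustration, not details taken from the paper. The generation of the synthetic samples and the later fine-tuning on real samples are not shown.

```python
from collections import Counter


def synthetic_counts_needed(labels):
    """Return, per class, how many synthetic samples would bring it
    up to the size of the largest class (an assumed target)."""
    counts = Counter(labels)
    target = max(counts.values())
    return {cls: target - n for cls, n in counts.items()}


# Toy, hypothetical label list with an imbalanced class distribution.
labels = ["cat"] * 5 + ["dog"] * 2 + ["bird"] * 1
needed = synthetic_counts_needed(labels)
print(needed)
```

Under these assumptions, the majority class needs no synthetic samples and each minority class is topped up to the same count, which yields a uniform class distribution before any fine-tuning on real data.
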
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2308.00994, https://arxiv.org/pdf/2308.00994
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4385963790
Raw OpenAlex JSON
| Field | Value | Description |
|---|---|---|
| OpenAlex ID | https://openalex.org/W4385963790 | Canonical identifier for this work in OpenAlex |
| DOI | https://doi.org/10.48550/arxiv.2308.00994 | Digital Object Identifier |
| Title | SYNAuG: Exploiting Synthetic Data for Data Imbalance Problems | Work title |
| Type | preprint | OpenAlex work type |
| Language | en | Primary language |
| Publication year | 2023 | Year of publication |
| Publication date | 2023-08-02 | Full publication date if available |
| Authors | Moon Ye-Bin, Nam Hyeon-Woo, Wonseok Choi, Nayeong Kim, Suha Kwak, Tae-Hyun Oh | List of authors in order |
| Landing page | https://arxiv.org/abs/2308.00994 | Publisher landing page |
| PDF URL | https://arxiv.org/pdf/2308.00994 | Direct link to full text PDF |
| Open access | Yes | Whether a free full text is available |
| OA status | green | Open access status per OpenAlex |
| OA URL | https://arxiv.org/pdf/2308.00994 | Direct OA link when available |
| Concepts | Leverage (statistics), Synthetic data, Computer science, Data quality, Artificial intelligence, Domain (mathematical analysis), Generative grammar, Generative model, Machine learning, Data mining, Engineering, Mathematics, Mathematical analysis, Operations management, Metric (unit) | Top concepts (fields/topics) attached by OpenAlex |
| Cited by | 0 | Total citation count in OpenAlex |
| Related works (count) | 10 | Other works algorithmically related by OpenAlex |
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4385963790 |
| doi | https://doi.org/10.48550/arxiv.2308.00994 |
| ids.doi | https://doi.org/10.48550/arxiv.2308.00994 |
| ids.openalex | https://openalex.org/W4385963790 |
| fwci | |
| type | preprint |
| title | SYNAuG: Exploiting Synthetic Data for Data Imbalance Problems |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10775 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9894000291824341 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Generative Adversarial Networks and Image Synthesis |
| topics[1].id | https://openalex.org/T10862 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9805999994277954 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | AI in cancer detection |
| topics[2].id | https://openalex.org/T11652 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9538999795913696 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1702 |
| topics[2].subfield.display_name | Artificial Intelligence |
| topics[2].display_name | Imbalanced Data Classification Techniques |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C153083717 |
| concepts[0].level | 2 |
| concepts[0].score | 0.7762671709060669 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q6535263 |
| concepts[0].display_name | Leverage (statistics) |
| concepts[1].id | https://openalex.org/C160920958 |
| concepts[1].level | 2 |
| concepts[1].score | 0.7400339841842651 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q7662746 |
| concepts[1].display_name | Synthetic data |
| concepts[2].id | https://openalex.org/C41008148 |
| concepts[2].level | 0 |
| concepts[2].score | 0.7328596711158752 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[2].display_name | Computer science |
| concepts[3].id | https://openalex.org/C24756922 |
| concepts[3].level | 3 |
| concepts[3].score | 0.47420862317085266 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q1757694 |
| concepts[3].display_name | Data quality |
| concepts[4].id | https://openalex.org/C154945302 |
| concepts[4].level | 1 |
| concepts[4].score | 0.4488150477409363 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[4].display_name | Artificial intelligence |
| concepts[5].id | https://openalex.org/C36503486 |
| concepts[5].level | 2 |
| concepts[5].score | 0.4476255774497986 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q11235244 |
| concepts[5].display_name | Domain (mathematical analysis) |
| concepts[6].id | https://openalex.org/C39890363 |
| concepts[6].level | 2 |
| concepts[6].score | 0.43422698974609375 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q36108 |
| concepts[6].display_name | Generative grammar |
| concepts[7].id | https://openalex.org/C167966045 |
| concepts[7].level | 3 |
| concepts[7].score | 0.4160422682762146 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q5532625 |
| concepts[7].display_name | Generative model |
| concepts[8].id | https://openalex.org/C119857082 |
| concepts[8].level | 1 |
| concepts[8].score | 0.40722566843032837 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q2539 |
| concepts[8].display_name | Machine learning |
| concepts[9].id | https://openalex.org/C124101348 |
| concepts[9].level | 1 |
| concepts[9].score | 0.35741353034973145 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q172491 |
| concepts[9].display_name | Data mining |
| concepts[10].id | https://openalex.org/C127413603 |
| concepts[10].level | 0 |
| concepts[10].score | 0.07868444919586182 |
| concepts[10].wikidata | https://www.wikidata.org/wiki/Q11023 |
| concepts[10].display_name | Engineering |
| concepts[11].id | https://openalex.org/C33923547 |
| concepts[11].level | 0 |
| concepts[11].score | 0.07302379608154297 |
| concepts[11].wikidata | https://www.wikidata.org/wiki/Q395 |
| concepts[11].display_name | Mathematics |
| concepts[12].id | https://openalex.org/C134306372 |
| concepts[12].level | 1 |
| concepts[12].score | 0.0 |
| concepts[12].wikidata | https://www.wikidata.org/wiki/Q7754 |
| concepts[12].display_name | Mathematical analysis |
| concepts[13].id | https://openalex.org/C21547014 |
| concepts[13].level | 1 |
| concepts[13].score | 0.0 |
| concepts[13].wikidata | https://www.wikidata.org/wiki/Q1423657 |
| concepts[13].display_name | Operations management |
| concepts[14].id | https://openalex.org/C176217482 |
| concepts[14].level | 2 |
| concepts[14].score | 0.0 |
| concepts[14].wikidata | https://www.wikidata.org/wiki/Q860554 |
| concepts[14].display_name | Metric (unit) |
| keywords[0].id | https://openalex.org/keywords/leverage |
| keywords[0].score | 0.7762671709060669 |
| keywords[0].display_name | Leverage (statistics) |
| keywords[1].id | https://openalex.org/keywords/synthetic-data |
| keywords[1].score | 0.7400339841842651 |
| keywords[1].display_name | Synthetic data |
| keywords[2].id | https://openalex.org/keywords/computer-science |
| keywords[2].score | 0.7328596711158752 |
| keywords[2].display_name | Computer science |
| keywords[3].id | https://openalex.org/keywords/data-quality |
| keywords[3].score | 0.47420862317085266 |
| keywords[3].display_name | Data quality |
| keywords[4].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[4].score | 0.4488150477409363 |
| keywords[4].display_name | Artificial intelligence |
| keywords[5].id | https://openalex.org/keywords/domain |
| keywords[5].score | 0.4476255774497986 |
| keywords[5].display_name | Domain (mathematical analysis) |
| keywords[6].id | https://openalex.org/keywords/generative-grammar |
| keywords[6].score | 0.43422698974609375 |
| keywords[6].display_name | Generative grammar |
| keywords[7].id | https://openalex.org/keywords/generative-model |
| keywords[7].score | 0.4160422682762146 |
| keywords[7].display_name | Generative model |
| keywords[8].id | https://openalex.org/keywords/machine-learning |
| keywords[8].score | 0.40722566843032837 |
| keywords[8].display_name | Machine learning |
| keywords[9].id | https://openalex.org/keywords/data-mining |
| keywords[9].score | 0.35741353034973145 |
| keywords[9].display_name | Data mining |
| keywords[10].id | https://openalex.org/keywords/engineering |
| keywords[10].score | 0.07868444919586182 |
| keywords[10].display_name | Engineering |
| keywords[11].id | https://openalex.org/keywords/mathematics |
| keywords[11].score | 0.07302379608154297 |
| keywords[11].display_name | Mathematics |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2308.00994 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2308.00994 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2308.00994 |
| locations[1].id | doi:10.48550/arxiv.2308.00994 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2308.00994 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5007559554 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Moon Ye-Bin |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Ye-Bin, Moon |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5090138474 |
| authorships[1].author.orcid | https://orcid.org/0000-0001-9543-3770 |
| authorships[1].author.display_name | Nam Hyeon-Woo |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Hyeon-Woo, Nam |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5006397694 |
| authorships[2].author.orcid | https://orcid.org/0000-0003-4110-4386 |
| authorships[2].author.display_name | Wonseok Choi |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Choi, Wonseok |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5039999823 |
| authorships[3].author.orcid | https://orcid.org/0000-0002-5452-1085 |
| authorships[3].author.display_name | Nayeong Kim |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Kim, Nayeong |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5060343759 |
| authorships[4].author.orcid | https://orcid.org/0000-0002-4567-9091 |
| authorships[4].author.display_name | Suha Kwak |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Kwak, Suha |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5078114111 |
| authorships[5].author.orcid | https://orcid.org/0000-0003-0468-1571 |
| authorships[5].author.display_name | Tae-Hyun Oh |
| authorships[5].author_position | last |
| authorships[5].raw_author_name | Oh, Tae-Hyun |
| authorships[5].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2308.00994 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | SYNAuG: Exploiting Synthetic Data for Data Imbalance Problems |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10775 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9894000291824341 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Generative Adversarial Networks and Image Synthesis |
| related_works | https://openalex.org/W4365211920, https://openalex.org/W3014948380, https://openalex.org/W4380551139, https://openalex.org/W4317695495, https://openalex.org/W4387506531, https://openalex.org/W4238433571, https://openalex.org/W3174044702, https://openalex.org/W2967848559, https://openalex.org/W4299831724, https://openalex.org/W4283803360 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2308.00994 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2308.00994 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2308.00994 |
| primary_location.id | pmh:oai:arXiv.org:2308.00994 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2308.00994 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2308.00994 |
| publication_date | 2023-08-02 |
| publication_year | 2023 |
| referenced_works_count | 0 |
| abstract_inverted_index.A | 21 |
| abstract_inverted_index.a | 90, 106 |
| abstract_inverted_index.To | 67 |
| abstract_inverted_index.be | 68 |
| abstract_inverted_index.by | 47, 103 |
| abstract_inverted_index.in | 2, 14, 50 |
| abstract_inverted_index.is | 24, 40 |
| abstract_inverted_index.of | 35, 58, 82 |
| abstract_inverted_index.on | 115 |
| abstract_inverted_index.to | 7, 25, 61, 77, 111 |
| abstract_inverted_index.Our | 85 |
| abstract_inverted_index.and | 18, 43, 95 |
| abstract_inverted_index.but | 30 |
| abstract_inverted_index.few | 107 |
| abstract_inverted_index.for | 127 |
| abstract_inverted_index.gap | 92 |
| abstract_inverted_index.our | 70 |
| abstract_inverted_index.the | 32, 56, 63, 79, 128 |
| abstract_inverted_index.Data | 0 |
| abstract_inverted_index.data | 4, 60, 64, 76, 97, 120 |
| abstract_inverted_index.from | 10 |
| abstract_inverted_index.real | 94, 108 |
| abstract_inverted_index.same | 129 |
| abstract_inverted_index.this | 39, 53 |
| abstract_inverted_index.thus | 44 |
| abstract_inverted_index.turn | 15 |
| abstract_inverted_index.with | 100, 105, 118 |
| abstract_inverted_index.data, | 29 |
| abstract_inverted_index.data. | 84 |
| abstract_inverted_index.given | 31 |
| abstract_inverted_index.leads | 6 |
| abstract_inverted_index.often | 5 |
| abstract_inverted_index.paper | 54 |
| abstract_inverted_index.scale | 34 |
| abstract_inverted_index.tasks | 117 |
| abstract_inverted_index.that, | 88 |
| abstract_inverted_index.which | 13 |
| abstract_inverted_index.SYNAuG | 101 |
| abstract_inverted_index.allows | 110 |
| abstract_inverted_index.biased | 8 |
| abstract_inverted_index.causes | 16 |
| abstract_inverted_index.curate | 27 |
| abstract_inverted_index.domain | 91 |
| abstract_inverted_index.dubbed | 72 |
| abstract_inverted_index.modern | 36 |
| abstract_inverted_index.neural | 37 |
| abstract_inverted_index.recent | 48 |
| abstract_inverted_index.social | 19 |
| abstract_inverted_index.SYNAuG, | 73 |
| abstract_inverted_index.achieve | 112 |
| abstract_inverted_index.address | 62 |
| abstract_inverted_index.between | 93 |
| abstract_inverted_index.diverse | 116 |
| abstract_inverted_index.ethical | 17 |
| abstract_inverted_index.exists, | 98 |
| abstract_inverted_index.issues, | 122 |
| abstract_inverted_index.issues. | 20 |
| abstract_inverted_index.method, | 71 |
| abstract_inverted_index.methods | 126 |
| abstract_inverted_index.models, | 12, 52 |
| abstract_inverted_index.samples | 109 |
| abstract_inverted_index.trained | 11 |
| abstract_inverted_index.Inspired | 46 |
| abstract_inverted_index.although | 89 |
| abstract_inverted_index.enormous | 33 |
| abstract_inverted_index.equalize | 78 |
| abstract_inverted_index.existing | 124 |
| abstract_inverted_index.explores | 55 |
| abstract_inverted_index.followed | 102 |
| abstract_inverted_index.problem. | 66 |
| abstract_inverted_index.purpose. | 130 |
| abstract_inverted_index.solution | 23 |
| abstract_inverted_index.training | 3, 28, 83, 99 |
| abstract_inverted_index.carefully | 26 |
| abstract_inverted_index.different | 119 |
| abstract_inverted_index.imbalance | 1, 65, 121 |
| abstract_inverted_index.leverages | 74 |
| abstract_inverted_index.networks, | 38 |
| abstract_inverted_index.potential | 57 |
| abstract_inverted_index.specific, | 69 |
| abstract_inverted_index.synthetic | 59, 75, 96 |
| abstract_inverted_index.generative | 51 |
| abstract_inverted_index.impressive | 113 |
| abstract_inverted_index.surpassing | 123 |
| abstract_inverted_index.unbalanced | 80 |
| abstract_inverted_index.demonstrate | 87 |
| abstract_inverted_index.experiments | 86 |
| abstract_inverted_index.fine-tuning | 104 |
| abstract_inverted_index.performance | 114 |
| abstract_inverted_index.predictions | 9 |
| abstract_inverted_index.developments | 49 |
| abstract_inverted_index.distribution | 81 |
| abstract_inverted_index.impractical. | 45 |
| abstract_inverted_index.prohibitively | 41 |
| abstract_inverted_index.task-specific | 125 |
| abstract_inverted_index.labor-intensive | 42 |
| abstract_inverted_index.straightforward | 22 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 6 |
| citation_normalized_percentile |
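
The `abstract_inverted_index.*` rows above map each word of the abstract to the zero-based positions where it occurs. A minimal sketch of how such an index can be turned back into running text follows. The function name `rebuild_abstract` is hypothetical, and the `sample` dictionary is a hand-copied subset of the rows above, limited to positions 0 through 9, so it rebuilds only the opening words rather than the full abstract.

```python
def rebuild_abstract(inverted_index):
    """Rebuild text from a {word: [positions]} inverted index."""
    positions = {}
    for word, indices in inverted_index.items():
        for i in indices:
            positions[i] = word
    # Emit words in position order, separated by single spaces.
    return " ".join(positions[i] for i in sorted(positions))


# Subset of the index above, restricted to positions 0-9 (illustrative only).
sample = {
    "Data": [0],
    "imbalance": [1],
    "in": [2],
    "training": [3],
    "data": [4],
    "often": [5],
    "leads": [6],
    "to": [7],
    "biased": [8],
    "predictions": [9],
}
text = rebuild_abstract(sample)
print(text)
```

Applied to the complete index, the same approach should reproduce the abstract word for word, since punctuation is stored as part of each token (for example `data,` and `data.` are separate keys).
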