Big Data Model Building Using Dimension Reduction and Sample Selection
· 2023
· Open Access
· DOI: https://doi.org/10.6084/m9.figshare.24233113
It is difficult to handle the extraordinary data volume generated in many fields with current computational resources and techniques. This is especially challenging when applying conventional statistical methods to big data. A common approach is to partition the full data into smaller subdata for purposes such as training, testing, and validation. The primary purpose of the training data is to represent the full data. To achieve this goal, the selection of training subdata becomes pivotal in retaining the essential characteristics of the full data. Recently, several procedures have been proposed to select “optimal design points” as training subdata under pre-specified models, such as linear regression and logistic regression. However, these subdata will not be “optimal” if the assumed model is not appropriate. Furthermore, such subdata are not useful for building alternative models because they are not an appropriate representative sample of the full data. In this article, we propose a novel algorithm for better model building and prediction via a process of selecting a “good” training sample. The proposed subdata can retain most characteristics of the original big data. It is also more robust, in that one can fit various response models and select the optimal one. Supplementary materials for this article are available online.
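The conventional baseline mentioned in the abstract (partitioning the full data into training, testing, and validation subdata) can be sketched as a simple random split. This is a minimal illustration, not the algorithm proposed by the authors; the function name `partition_indices`, the 60/20/20 fractions, and the seed are assumptions chosen for the example.

```python
import random


def partition_indices(n, fractions=(0.6, 0.2, 0.2), seed=0):
    """Randomly split row indices 0..n-1 into train/test/validation lists.

    Hypothetical helper: the fractions and seed are illustrative defaults,
    not values taken from the article.
    """
    rng = random.Random(seed)      # local generator, so the split is reproducible
    idx = list(range(n))
    rng.shuffle(idx)               # uniform random permutation of the row indices
    n_train = round(fractions[0] * n)
    n_test = round(fractions[1] * n)
    train = idx[:n_train]
    test = idx[n_train:n_train + n_test]
    valid = idx[n_train + n_test:]  # whatever remains goes to validation
    return train, test, valid


train, test, valid = partition_indices(1000)
```

A purely random split like this makes no attempt to preserve the characteristics of the full data, which is the gap the abstract says a carefully selected training sample is meant to address.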
Related Topics
- Type
- dataset
- Language
- en
- Landing Page
- https://doi.org/10.6084/m9.figshare.24233113
- OA Status
- gold
- Related Works
- 10
- OpenAlex ID
- https://openalex.org/W4394182682
Raw OpenAlex JSON
- OpenAlex ID (canonical identifier for this work in OpenAlex)
- https://openalex.org/W4394182682
- DOI (Digital Object Identifier)
- https://doi.org/10.6084/m9.figshare.24233113
- Title (work title)
- Big Data Model Building Using Dimension Reduction and Sample Selection
- Type (OpenAlex work type)
- dataset
- Language (primary language)
- en
- Publication year (year of publication)
- 2023
- Publication date (full publication date if available)
- 2023-01-01
- Authors (list of authors in order)
- Lih‐Yuan Deng, Ching‐Chi Yang, Dale Bowman, Dennis K. J. Lin, Henry Horng‐Shing Lu
- Landing page (publisher landing page)
- https://doi.org/10.6084/m9.figshare.24233113
- Open access (whether a free full text is available)
- Yes
- OA status (open access status per OpenAlex)
- gold
- OA URL (direct OA link when available)
- https://doi.org/10.6084/m9.figshare.24233113
- Concepts (top fields/topics attached by OpenAlex)
- Dimensionality reduction, Reduction (mathematics), Dimension (graph theory), Selection (genetic algorithm), Sample (material), Data reduction, Big data, Computer science, Model selection, Data mining, Statistics, Mathematics, Artificial intelligence, Chromatography, Chemistry, Geometry, Pure mathematics
- Cited by (total citation count in OpenAlex)
- 0
- Related works (count) (other works algorithmically related by OpenAlex)
- 10
Full payload
| id | https://openalex.org/W4394182682 |
|---|---|
| doi | https://doi.org/10.6084/m9.figshare.24233113 |
| ids.doi | https://doi.org/10.6084/m9.figshare.24233113 |
| ids.openalex | https://openalex.org/W4394182682 |
| fwci | |
| type | dataset |
| title | Big Data Model Building Using Dimension Reduction and Sample Selection |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10057 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.7821000218391418 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Face and Expression Recognition |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C70518039 |
| concepts[0].level | 2 |
| concepts[0].score | 0.6902843117713928 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q16000077 |
| concepts[0].display_name | Dimensionality reduction |
| concepts[1].id | https://openalex.org/C111335779 |
| concepts[1].level | 2 |
| concepts[1].score | 0.6743491888046265 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q3454686 |
| concepts[1].display_name | Reduction (mathematics) |
| concepts[2].id | https://openalex.org/C33676613 |
| concepts[2].level | 2 |
| concepts[2].score | 0.6284178495407104 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q13415176 |
| concepts[2].display_name | Dimension (graph theory) |
| concepts[3].id | https://openalex.org/C81917197 |
| concepts[3].level | 2 |
| concepts[3].score | 0.6232971549034119 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q628760 |
| concepts[3].display_name | Selection (genetic algorithm) |
| concepts[4].id | https://openalex.org/C198531522 |
| concepts[4].level | 2 |
| concepts[4].score | 0.6149148941040039 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q485146 |
| concepts[4].display_name | Sample (material) |
| concepts[5].id | https://openalex.org/C153914771 |
| concepts[5].level | 2 |
| concepts[5].score | 0.5188475847244263 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q5227343 |
| concepts[5].display_name | Data reduction |
| concepts[6].id | https://openalex.org/C75684735 |
| concepts[6].level | 2 |
| concepts[6].score | 0.47976216673851013 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q858810 |
| concepts[6].display_name | Big data |
| concepts[7].id | https://openalex.org/C41008148 |
| concepts[7].level | 0 |
| concepts[7].score | 0.4621067941188812 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[7].display_name | Computer science |
| concepts[8].id | https://openalex.org/C93959086 |
| concepts[8].level | 2 |
| concepts[8].score | 0.44270262122154236 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q6888345 |
| concepts[8].display_name | Model selection |
| concepts[9].id | https://openalex.org/C124101348 |
| concepts[9].level | 1 |
| concepts[9].score | 0.3506966829299927 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q172491 |
| concepts[9].display_name | Data mining |
| concepts[10].id | https://openalex.org/C105795698 |
| concepts[10].level | 1 |
| concepts[10].score | 0.3446066975593567 |
| concepts[10].wikidata | https://www.wikidata.org/wiki/Q12483 |
| concepts[10].display_name | Statistics |
| concepts[11].id | https://openalex.org/C33923547 |
| concepts[11].level | 0 |
| concepts[11].score | 0.2860097289085388 |
| concepts[11].wikidata | https://www.wikidata.org/wiki/Q395 |
| concepts[11].display_name | Mathematics |
| concepts[12].id | https://openalex.org/C154945302 |
| concepts[12].level | 1 |
| concepts[12].score | 0.2671310305595398 |
| concepts[12].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[12].display_name | Artificial intelligence |
| concepts[13].id | https://openalex.org/C43617362 |
| concepts[13].level | 1 |
| concepts[13].score | 0.08689111471176147 |
| concepts[13].wikidata | https://www.wikidata.org/wiki/Q170050 |
| concepts[13].display_name | Chromatography |
| concepts[14].id | https://openalex.org/C185592680 |
| concepts[14].level | 0 |
| concepts[14].score | 0.07536444067955017 |
| concepts[14].wikidata | https://www.wikidata.org/wiki/Q2329 |
| concepts[14].display_name | Chemistry |
| concepts[15].id | https://openalex.org/C2524010 |
| concepts[15].level | 1 |
| concepts[15].score | 0.0 |
| concepts[15].wikidata | https://www.wikidata.org/wiki/Q8087 |
| concepts[15].display_name | Geometry |
| concepts[16].id | https://openalex.org/C202444582 |
| concepts[16].level | 1 |
| concepts[16].score | 0.0 |
| concepts[16].wikidata | https://www.wikidata.org/wiki/Q837863 |
| concepts[16].display_name | Pure mathematics |
| keywords[0].id | https://openalex.org/keywords/dimensionality-reduction |
| keywords[0].score | 0.6902843117713928 |
| keywords[0].display_name | Dimensionality reduction |
| keywords[1].id | https://openalex.org/keywords/reduction |
| keywords[1].score | 0.6743491888046265 |
| keywords[1].display_name | Reduction (mathematics) |
| keywords[2].id | https://openalex.org/keywords/dimension |
| keywords[2].score | 0.6284178495407104 |
| keywords[2].display_name | Dimension (graph theory) |
| keywords[3].id | https://openalex.org/keywords/selection |
| keywords[3].score | 0.6232971549034119 |
| keywords[3].display_name | Selection (genetic algorithm) |
| keywords[4].id | https://openalex.org/keywords/sample |
| keywords[4].score | 0.6149148941040039 |
| keywords[4].display_name | Sample (material) |
| keywords[5].id | https://openalex.org/keywords/data-reduction |
| keywords[5].score | 0.5188475847244263 |
| keywords[5].display_name | Data reduction |
| keywords[6].id | https://openalex.org/keywords/big-data |
| keywords[6].score | 0.47976216673851013 |
| keywords[6].display_name | Big data |
| keywords[7].id | https://openalex.org/keywords/computer-science |
| keywords[7].score | 0.4621067941188812 |
| keywords[7].display_name | Computer science |
| keywords[8].id | https://openalex.org/keywords/model-selection |
| keywords[8].score | 0.44270262122154236 |
| keywords[8].display_name | Model selection |
| keywords[9].id | https://openalex.org/keywords/data-mining |
| keywords[9].score | 0.3506966829299927 |
| keywords[9].display_name | Data mining |
| keywords[10].id | https://openalex.org/keywords/statistics |
| keywords[10].score | 0.3446066975593567 |
| keywords[10].display_name | Statistics |
| keywords[11].id | https://openalex.org/keywords/mathematics |
| keywords[11].score | 0.2860097289085388 |
| keywords[11].display_name | Mathematics |
| keywords[12].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[12].score | 0.2671310305595398 |
| keywords[12].display_name | Artificial intelligence |
| keywords[13].id | https://openalex.org/keywords/chromatography |
| keywords[13].score | 0.08689111471176147 |
| keywords[13].display_name | Chromatography |
| keywords[14].id | https://openalex.org/keywords/chemistry |
| keywords[14].score | 0.07536444067955017 |
| keywords[14].display_name | Chemistry |
| language | en |
| locations[0].id | doi:10.6084/m9.figshare.24233113 |
| locations[0].is_oa | True |
| locations[0].source | |
| locations[0].license | cc-by |
| locations[0].pdf_url | |
| locations[0].version | |
| locations[0].raw_type | dataset |
| locations[0].license_id | https://openalex.org/licenses/cc-by |
| locations[0].is_accepted | False |
| locations[0].is_published | |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | https://doi.org/10.6084/m9.figshare.24233113 |
| indexed_in | datacite |
| authorships[0].author.id | https://openalex.org/A5030657372 |
| authorships[0].author.orcid | https://orcid.org/0000-0002-7350-9751 |
| authorships[0].author.display_name | Lih‐Yuan Deng |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Lih-Yuan Deng |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5086031117 |
| authorships[1].author.orcid | https://orcid.org/0000-0003-2517-2709 |
| authorships[1].author.display_name | Ching‐Chi Yang |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Ching-Chi Yang |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5073224585 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-3874-5257 |
| authorships[2].author.display_name | Dale Bowman |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Dale Bowman |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5057669385 |
| authorships[3].author.orcid | https://orcid.org/0000-0003-2552-7709 |
| authorships[3].author.display_name | Dennis K. J. Lin |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Dennis K. J. Lin |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5052673069 |
| authorships[4].author.orcid | https://orcid.org/0000-0002-4392-3361 |
| authorships[4].author.display_name | Henry Horng‐Shing Lu |
| authorships[4].author_position | last |
| authorships[4].raw_author_name | Henry Horng-Shing Lu |
| authorships[4].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://doi.org/10.6084/m9.figshare.24233113 |
| open_access.oa_status | gold |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Big Data Model Building Using Dimension Reduction and Sample Selection |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10057 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.7821000218391418 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Face and Expression Recognition |
| related_works | https://openalex.org/W4390608645, https://openalex.org/W4247566972, https://openalex.org/W2960264696, https://openalex.org/W3090563135, https://openalex.org/W2497432351, https://openalex.org/W4206777497, https://openalex.org/W2910064364, https://openalex.org/W4255224757, https://openalex.org/W1993351602, https://openalex.org/W2170142494 |
| cited_by_count | 0 |
| locations_count | 1 |
| best_oa_location.id | doi:10.6084/m9.figshare.24233113 |
| best_oa_location.is_oa | True |
| best_oa_location.source | |
| best_oa_location.license | cc-by |
| best_oa_location.pdf_url | |
| best_oa_location.version | |
| best_oa_location.raw_type | dataset |
| best_oa_location.license_id | https://openalex.org/licenses/cc-by |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | https://doi.org/10.6084/m9.figshare.24233113 |
| primary_location.id | doi:10.6084/m9.figshare.24233113 |
| primary_location.is_oa | True |
| primary_location.source | |
| primary_location.license | cc-by |
| primary_location.pdf_url | |
| primary_location.version | |
| primary_location.raw_type | dataset |
| primary_location.license_id | https://openalex.org/licenses/cc-by |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | https://doi.org/10.6084/m9.figshare.24233113 |
| publication_date | 2023-01-01 |
| publication_year | 2023 |
| referenced_works_count | 0 |
| abstract_inverted_index.A | 31 |
| abstract_inverted_index.a | 146, 156, 160 |
| abstract_inverted_index.In | 141 |
| abstract_inverted_index.It | 0, 176 |
| abstract_inverted_index.To | 62 |
| abstract_inverted_index.an | 133 |
| abstract_inverted_index.as | 45, 92, 99 |
| abstract_inverted_index.be | 110, 123 |
| abstract_inverted_index.if | 112 |
| abstract_inverted_index.in | 10, 73 |
| abstract_inverted_index.is | 1, 20, 34, 56, 116, 131, 177 |
| abstract_inverted_index.it | 130 |
| abstract_inverted_index.of | 53, 68, 77, 137, 158, 171 |
| abstract_inverted_index.to | 3, 28, 35, 57, 87, 125 |
| abstract_inverted_index.we | 144 |
| abstract_inverted_index.The | 50, 164 |
| abstract_inverted_index.and | 17, 48, 102, 153, 188 |
| abstract_inverted_index.are | 198 |
| abstract_inverted_index.big | 29, 174 |
| abstract_inverted_index.can | 167, 183 |
| abstract_inverted_index.fit | 184 |
| abstract_inverted_index.for | 42, 149, 195 |
| abstract_inverted_index.not | 109, 117, 132 |
| abstract_inverted_index.one | 182 |
| abstract_inverted_index.the | 5, 59, 66, 78, 113, 138, 172, 190 |
| abstract_inverted_index.via | 155 |
| abstract_inverted_index.This | 19 |
| abstract_inverted_index.also | 178 |
| abstract_inverted_index.been | 85 |
| abstract_inverted_index.data | 7, 38, 55 |
| abstract_inverted_index.full | 37, 60, 79, 139 |
| abstract_inverted_index.have | 84 |
| abstract_inverted_index.into | 39 |
| abstract_inverted_index.many | 11 |
| abstract_inverted_index.more | 179 |
| abstract_inverted_index.most | 169 |
| abstract_inverted_index.such | 44, 98, 120 |
| abstract_inverted_index.that | 181 |
| abstract_inverted_index.this | 64, 142, 196 |
| abstract_inverted_index.very | 21 |
| abstract_inverted_index.when | 23 |
| abstract_inverted_index.will | 108 |
| abstract_inverted_index.with | 13 |
| abstract_inverted_index.build | 126 |
| abstract_inverted_index.data. | 30, 61, 80, 140, 175 |
| abstract_inverted_index.goal, | 65 |
| abstract_inverted_index.model | 115, 151, 187 |
| abstract_inverted_index.novel | 147 |
| abstract_inverted_index.these | 106 |
| abstract_inverted_index.under | 95 |
| abstract_inverted_index.better | 150 |
| abstract_inverted_index.cannot | 122 |
| abstract_inverted_index.common | 32 |
| abstract_inverted_index.design | 90 |
| abstract_inverted_index.fields | 12 |
| abstract_inverted_index.handle | 4 |
| abstract_inverted_index.linear | 100 |
| abstract_inverted_index.model. | 192 |
| abstract_inverted_index.models | 128 |
| abstract_inverted_index.retain | 168 |
| abstract_inverted_index.robust | 180 |
| abstract_inverted_index.sample | 136 |
| abstract_inverted_index.select | 88, 189 |
| abstract_inverted_index.useful | 124 |
| abstract_inverted_index.volume | 8 |
| abstract_inverted_index.achieve | 63 |
| abstract_inverted_index.article | 197 |
| abstract_inverted_index.assumed | 114 |
| abstract_inverted_index.because | 129 |
| abstract_inverted_index.becomes | 71 |
| abstract_inverted_index.current | 14 |
| abstract_inverted_index.methods | 27 |
| abstract_inverted_index.models, | 97 |
| abstract_inverted_index.online. | 200 |
| abstract_inverted_index.optimal | 191 |
| abstract_inverted_index.pivotal | 72 |
| abstract_inverted_index.primary | 51 |
| abstract_inverted_index.process | 157 |
| abstract_inverted_index.propose | 145 |
| abstract_inverted_index.purpose | 52 |
| abstract_inverted_index.sample. | 163 |
| abstract_inverted_index.several | 82 |
| abstract_inverted_index.smaller | 40 |
| abstract_inverted_index.subdata | 41, 70, 94, 107, 121, 166 |
| abstract_inverted_index.various | 185 |
| abstract_inverted_index.However, | 105 |
| abstract_inverted_index.applying | 24 |
| abstract_inverted_index.approach | 33 |
| abstract_inverted_index.article, | 143 |
| abstract_inverted_index.building | 152 |
| abstract_inverted_index.logistic | 103 |
| abstract_inverted_index.original | 173 |
| abstract_inverted_index.proposed | 86, 165 |
| abstract_inverted_index.purposes | 43 |
| abstract_inverted_index.response | 186 |
| abstract_inverted_index.testing, | 47 |
| abstract_inverted_index.training | 54, 69, 93, 162 |
| abstract_inverted_index.Recently, | 81 |
| abstract_inverted_index.algorithm | 148 |
| abstract_inverted_index.available | 199 |
| abstract_inverted_index.difficult | 2 |
| abstract_inverted_index.essential | 75 |
| abstract_inverted_index.generated | 9 |
| abstract_inverted_index.materials | 194 |
| abstract_inverted_index.partition | 36 |
| abstract_inverted_index.points” | 91 |
| abstract_inverted_index.represent | 58 |
| abstract_inverted_index.resources | 16 |
| abstract_inverted_index.retaining | 74 |
| abstract_inverted_index.selecting | 159 |
| abstract_inverted_index.selection | 67 |
| abstract_inverted_index.training, | 46 |
| abstract_inverted_index.prediction | 154 |
| abstract_inverted_index.procedures | 83 |
| abstract_inverted_index.regression | 101 |
| abstract_inverted_index.“good” | 161 |
| abstract_inverted_index.“optimal | 89 |
| abstract_inverted_index.alternative | 127 |
| abstract_inverted_index.appropriate | 134 |
| abstract_inverted_index.challenging | 22 |
| abstract_inverted_index.regression. | 104 |
| abstract_inverted_index.statistical | 26 |
| abstract_inverted_index.techniques. | 18 |
| abstract_inverted_index.validation. | 49 |
| abstract_inverted_index.Furthermore, | 119 |
| abstract_inverted_index.appropriate. | 118 |
| abstract_inverted_index.conventional | 25 |
| abstract_inverted_index.Supplementary | 193 |
| abstract_inverted_index.computational | 15 |
| abstract_inverted_index.extraordinary | 6 |
| abstract_inverted_index.pre-specified | 96 |
| abstract_inverted_index.“optimal” | 111 |
| abstract_inverted_index.representative | 135 |
| abstract_inverted_index.characteristics | 76, 170 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 5 |
| citation_normalized_percentile | |
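
The `abstract_inverted_index.*` rows in the payload store the abstract as a mapping from each word to the zero-based positions at which it occurs. A minimal sketch of how the text can be rebuilt from such a mapping follows; the function name `rebuild_abstract` is hypothetical, and `sample` is a hand-truncated excerpt that keeps only the first nine positions from the table above.

```python
def rebuild_abstract(inverted_index):
    """Rebuild running text from a word -> list-of-positions mapping."""
    by_position = {}
    for word, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = word   # each position holds exactly one word
    # Emit the words in position order, separated by single spaces.
    return " ".join(by_position[pos] for pos in sorted(by_position))


# Truncated excerpt (positions 0-8 only), for illustration.
sample = {
    "It": [0],
    "is": [1],
    "difficult": [2],
    "to": [3],
    "handle": [4],
    "the": [5],
    "extraordinary": [6],
    "data": [7],
    "volume": [8],
}

print(rebuild_abstract(sample))
# prints: It is difficult to handle the extraordinary data volume
```

Punctuation is stored attached to the word it follows (for example `data.` and `goal,` are separate keys in the table), so joining on spaces is enough to recover the original sentence boundaries.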