Introduction To Data Preprocessing: A Review Article
2022 · Open Access
· DOI: https://doi.org/10.36227/techrxiv.21068668
This study begins with an overview of data preprocessing, focusing on real-world data challenges. These are the first problems that must be understood and resolved before any data analysis method is applied. In this work, the authors discuss preprocessing techniques such as standardization and normalization, including feature scaling, that make data classification easier. The goal of preprocessing is to find the most informative set of features and thereby boost the classifier's performance. Feature standardization is generally expected to remove the effect of quantitative features measured on different scales. Feature scaling, in turn, is used to normalize differing numeric values to a properly scaled range. The aim of this article is to help analysts choose a suitable preprocessing procedure for data analysis. The basic preprocessing methods used for data classification are then addressed. Python functions suited to various data applications are shown as concrete examples at the end of each section.
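To make the two techniques named in the abstract concrete, here is a minimal sketch in pure Python. It is not the authors' code: it assumes the common textbook definitions, z-score standardization, z = (x − mean) / std, and min-max normalization to the range [0, 1]. The function names `standardize` and `min_max_scale` and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score standardization: subtract the mean, divide by the population std."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature has no spread; map it to zeros instead of dividing by 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def min_max_scale(values, low=0.0, high=1.0):
    """Min-max normalization: rescale values linearly into [low, high]."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        # A constant feature collapses to the lower bound.
        return [low for _ in values]
    span = largest - smallest
    return [low + (v - smallest) * (high - low) / span for v in values]


# Hypothetical features measured on very different scales.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
incomes = [20000.0, 35000.0, 50000.0, 65000.0, 80000.0]

z_heights = standardize(heights_cm)
z_incomes = standardize(incomes)
scaled_incomes = min_max_scale(incomes)

print(z_heights)
print(z_incomes)
print(scaled_incomes)
```

After standardization both features have mean 0 and unit variance, so neither dominates a distance-based classifier merely because of its unit of measurement. In practice the mean, standard deviation, minimum, and maximum would usually be computed on the training split only and then reused for the test split.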
Related Topics
- Type: review
- Language: en
- Landing Page: https://doi.org/10.36227/techrxiv.21068668
- OA Status: gold
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4297777949
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4297777949 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.36227/techrxiv.21068668 (Digital Object Identifier)
- Title: Introduction To Data Preprocessing: A Review (work title)
- Type: review (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2022 (year of publication)
- Publication date: 2022-09-12 (full publication date if available)
- Authors: Aditya Singh, Navneet Kaur (list of authors in order)
- Landing page: https://doi.org/10.36227/techrxiv.21068668 (publisher landing page)
- Open access: Yes (whether a free full text is available)
- OA status: gold (open access status per OpenAlex)
- OA URL: https://doi.org/10.36227/techrxiv.21068668 (direct OA link when available)
- Concepts: Preprocessor, Standardization, Computer science, Data pre-processing, Normalization (sociology), Python (programming language), Database normalization, Data mining, Classifier (UML), Data point, Data collection, Artificial intelligence, Pattern recognition (psychology), Statistics, Mathematics, Programming language, Anthropology, Sociology, Operating system (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| id | https://openalex.org/W4297777949 |
|---|---|
| doi | https://doi.org/10.36227/techrxiv.21068668 |
| ids.doi | https://doi.org/10.36227/techrxiv.21068668 |
| ids.openalex | https://openalex.org/W4297777949 |
| fwci | 0.0 |
| type | review |
| title | Introduction To Data Preprocessing: A Review |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11396 |
| topics[0].field.id | https://openalex.org/fields/36 |
| topics[0].field.display_name | Health Professions |
| topics[0].score | 0.8011999726295471 |
| topics[0].domain.id | https://openalex.org/domains/4 |
| topics[0].domain.display_name | Health Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/3605 |
| topics[0].subfield.display_name | Health Information Management |
| topics[0].display_name | Artificial Intelligence in Healthcare |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C34736171 |
| concepts[0].level | 2 |
| concepts[0].score | 0.8699131011962891 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q918333 |
| concepts[0].display_name | Preprocessor |
| concepts[1].id | https://openalex.org/C188087704 |
| concepts[1].level | 2 |
| concepts[1].score | 0.8597285747528076 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q369577 |
| concepts[1].display_name | Standardization |
| concepts[2].id | https://openalex.org/C41008148 |
| concepts[2].level | 0 |
| concepts[2].score | 0.7414952516555786 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[2].display_name | Computer science |
| concepts[3].id | https://openalex.org/C10551718 |
| concepts[3].level | 2 |
| concepts[3].score | 0.7029426097869873 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q5227332 |
| concepts[3].display_name | Data pre-processing |
| concepts[4].id | https://openalex.org/C136886441 |
| concepts[4].level | 2 |
| concepts[4].score | 0.6671150326728821 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q926129 |
| concepts[4].display_name | Normalization (sociology) |
| concepts[5].id | https://openalex.org/C519991488 |
| concepts[5].level | 2 |
| concepts[5].score | 0.5730875134468079 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q28865 |
| concepts[5].display_name | Python (programming language) |
| concepts[6].id | https://openalex.org/C162984825 |
| concepts[6].level | 3 |
| concepts[6].score | 0.5550343990325928 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q339072 |
| concepts[6].display_name | Database normalization |
| concepts[7].id | https://openalex.org/C124101348 |
| concepts[7].level | 1 |
| concepts[7].score | 0.5392907857894897 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q172491 |
| concepts[7].display_name | Data mining |
| concepts[8].id | https://openalex.org/C95623464 |
| concepts[8].level | 2 |
| concepts[8].score | 0.48263177275657654 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q1096149 |
| concepts[8].display_name | Classifier (UML) |
| concepts[9].id | https://openalex.org/C21080849 |
| concepts[9].level | 2 |
| concepts[9].score | 0.428743451833725 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q13611879 |
| concepts[9].display_name | Data point |
| concepts[10].id | https://openalex.org/C133462117 |
| concepts[10].level | 2 |
| concepts[10].score | 0.4104929566383362 |
| concepts[10].wikidata | https://www.wikidata.org/wiki/Q4929239 |
| concepts[10].display_name | Data collection |
| concepts[11].id | https://openalex.org/C154945302 |
| concepts[11].level | 1 |
| concepts[11].score | 0.36809241771698 |
| concepts[11].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[11].display_name | Artificial intelligence |
| concepts[12].id | https://openalex.org/C153180895 |
| concepts[12].level | 2 |
| concepts[12].score | 0.2734466791152954 |
| concepts[12].wikidata | https://www.wikidata.org/wiki/Q7148389 |
| concepts[12].display_name | Pattern recognition (psychology) |
| concepts[13].id | https://openalex.org/C105795698 |
| concepts[13].level | 1 |
| concepts[13].score | 0.10628834366798401 |
| concepts[13].wikidata | https://www.wikidata.org/wiki/Q12483 |
| concepts[13].display_name | Statistics |
| concepts[14].id | https://openalex.org/C33923547 |
| concepts[14].level | 0 |
| concepts[14].score | 0.09627345204353333 |
| concepts[14].wikidata | https://www.wikidata.org/wiki/Q395 |
| concepts[14].display_name | Mathematics |
| concepts[15].id | https://openalex.org/C199360897 |
| concepts[15].level | 1 |
| concepts[15].score | 0.06310683488845825 |
| concepts[15].wikidata | https://www.wikidata.org/wiki/Q9143 |
| concepts[15].display_name | Programming language |
| concepts[16].id | https://openalex.org/C19165224 |
| concepts[16].level | 1 |
| concepts[16].score | 0.0 |
| concepts[16].wikidata | https://www.wikidata.org/wiki/Q23404 |
| concepts[16].display_name | Anthropology |
| concepts[17].id | https://openalex.org/C144024400 |
| concepts[17].level | 0 |
| concepts[17].score | 0.0 |
| concepts[17].wikidata | https://www.wikidata.org/wiki/Q21201 |
| concepts[17].display_name | Sociology |
| concepts[18].id | https://openalex.org/C111919701 |
| concepts[18].level | 1 |
| concepts[18].score | 0.0 |
| concepts[18].wikidata | https://www.wikidata.org/wiki/Q9135 |
| concepts[18].display_name | Operating system |
| keywords[0].id | https://openalex.org/keywords/preprocessor |
| keywords[0].score | 0.8699131011962891 |
| keywords[0].display_name | Preprocessor |
| keywords[1].id | https://openalex.org/keywords/standardization |
| keywords[1].score | 0.8597285747528076 |
| keywords[1].display_name | Standardization |
| keywords[2].id | https://openalex.org/keywords/computer-science |
| keywords[2].score | 0.7414952516555786 |
| keywords[2].display_name | Computer science |
| keywords[3].id | https://openalex.org/keywords/data-pre-processing |
| keywords[3].score | 0.7029426097869873 |
| keywords[3].display_name | Data pre-processing |
| keywords[4].id | https://openalex.org/keywords/normalization |
| keywords[4].score | 0.6671150326728821 |
| keywords[4].display_name | Normalization (sociology) |
| keywords[5].id | https://openalex.org/keywords/python |
| keywords[5].score | 0.5730875134468079 |
| keywords[5].display_name | Python (programming language) |
| keywords[6].id | https://openalex.org/keywords/database-normalization |
| keywords[6].score | 0.5550343990325928 |
| keywords[6].display_name | Database normalization |
| keywords[7].id | https://openalex.org/keywords/data-mining |
| keywords[7].score | 0.5392907857894897 |
| keywords[7].display_name | Data mining |
| keywords[8].id | https://openalex.org/keywords/classifier |
| keywords[8].score | 0.48263177275657654 |
| keywords[8].display_name | Classifier (UML) |
| keywords[9].id | https://openalex.org/keywords/data-point |
| keywords[9].score | 0.428743451833725 |
| keywords[9].display_name | Data point |
| keywords[10].id | https://openalex.org/keywords/data-collection |
| keywords[10].score | 0.4104929566383362 |
| keywords[10].display_name | Data collection |
| keywords[11].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[11].score | 0.36809241771698 |
| keywords[11].display_name | Artificial intelligence |
| keywords[12].id | https://openalex.org/keywords/pattern-recognition |
| keywords[12].score | 0.2734466791152954 |
| keywords[12].display_name | Pattern recognition (psychology) |
| keywords[13].id | https://openalex.org/keywords/statistics |
| keywords[13].score | 0.10628834366798401 |
| keywords[13].display_name | Statistics |
| keywords[14].id | https://openalex.org/keywords/mathematics |
| keywords[14].score | 0.09627345204353333 |
| keywords[14].display_name | Mathematics |
| keywords[15].id | https://openalex.org/keywords/programming-language |
| keywords[15].score | 0.06310683488845825 |
| keywords[15].display_name | Programming language |
| language | en |
| locations[0].id | doi:10.36227/techrxiv.21068668 |
| locations[0].is_oa | True |
| locations[0].source | |
| locations[0].license | cc-by |
| locations[0].pdf_url | |
| locations[0].version | acceptedVersion |
| locations[0].raw_type | posted-content |
| locations[0].license_id | https://openalex.org/licenses/cc-by |
| locations[0].is_accepted | True |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | https://doi.org/10.36227/techrxiv.21068668 |
| indexed_in | crossref |
| authorships[0].author.id | https://openalex.org/A5101897906 |
| authorships[0].author.orcid | https://orcid.org/0000-0001-9089-5570 |
| authorships[0].author.display_name | Aditya Singh |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Aditya Pratap Singh |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5002219999 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-0012-6151 |
| authorships[1].author.display_name | Navneet Kaur |
| authorships[1].author_position | last |
| authorships[1].raw_author_name | Navneet Kaur |
| authorships[1].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://doi.org/10.36227/techrxiv.21068668 |
| open_access.oa_status | gold |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Introduction To Data Preprocessing: A Review |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T03:46:38.306776 |
| primary_topic.id | https://openalex.org/T11396 |
| primary_topic.field.id | https://openalex.org/fields/36 |
| primary_topic.field.display_name | Health Professions |
| primary_topic.score | 0.8011999726295471 |
| primary_topic.domain.id | https://openalex.org/domains/4 |
| primary_topic.domain.display_name | Health Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/3605 |
| primary_topic.subfield.display_name | Health Information Management |
| primary_topic.display_name | Artificial Intelligence in Healthcare |
| related_works | https://openalex.org/W4239030218, https://openalex.org/W4230140792, https://openalex.org/W2006956706, https://openalex.org/W2690313894, https://openalex.org/W2605786708, https://openalex.org/W3026954374, https://openalex.org/W4297777949, https://openalex.org/W2054080489, https://openalex.org/W3107700252, https://openalex.org/W4297777938 |
| cited_by_count | 0 |
| locations_count | 1 |
| best_oa_location.id | doi:10.36227/techrxiv.21068668 |
| best_oa_location.is_oa | True |
| best_oa_location.source | |
| best_oa_location.license | cc-by |
| best_oa_location.pdf_url | |
| best_oa_location.version | acceptedVersion |
| best_oa_location.raw_type | posted-content |
| best_oa_location.license_id | https://openalex.org/licenses/cc-by |
| best_oa_location.is_accepted | True |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | https://doi.org/10.36227/techrxiv.21068668 |
| primary_location.id | doi:10.36227/techrxiv.21068668 |
| primary_location.is_oa | True |
| primary_location.source | |
| primary_location.license | cc-by |
| primary_location.pdf_url | |
| primary_location.version | acceptedVersion |
| primary_location.raw_type | posted-content |
| primary_location.license_id | https://openalex.org/licenses/cc-by |
| primary_location.is_accepted | True |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | https://doi.org/10.36227/techrxiv.21068668 |
| publication_date | 2022-09-12 |
| publication_year | 2022 |
| referenced_works_count | 0 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 2 |
| citation_normalized_percentile.value | 0.19185379 |
| citation_normalized_percentile.is_in_top_1_percent | False |
| citation_normalized_percentile.is_in_top_10_percent | False |