Context binning, model clustering and adaptivity for data compression of genetic data
The rapid growth of genetic databases means that improvements in their data compression yield huge savings, which requires better inexpensive statistical models. This article proposes automated optimizations, e.g. of Markov-like models, focusing on context binning and model clustering. While it is popular to simply remove the low bits of the context, the proposed context binning automatically optimizes such a reduction as a lookup table: state = bin[context] determines the probability distribution, this way extracting nearly all useful information, even from very large contexts, into a relatively small number of states. The second proposed approach, model clustering, uses k-means clustering in the space of general statistical models, allowing optimization of a few models (as cluster centroids), one of which can then be chosen e.g. separately for each read. Some adaptivity techniques for handling data non-stationarity are also briefly discussed.
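The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the article's implementation: the abstract only states that `state = bin[context]` is an optimized table and that models are clustered with k-means. The sketch assumes a k-means-style loop that assigns each count vector to the centroid model costing the fewest bits and re-estimates centroids from pooled counts. All names (`cluster_counts`, `build_bins`, `cluster_reads`), the additive smoothing constant, and the deterministic initialization are hypothetical choices made for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def empirical(counts, eps=0.5):
    """Smoothed probability distribution over ALPHABET (eps is an assumed constant)."""
    total = sum(counts[s] for s in ALPHABET) + eps * len(ALPHABET)
    return [(counts[s] + eps) / total for s in ALPHABET]

def cost_bits(counts, dist):
    """Bits needed to encode the symbols in `counts` with model `dist`."""
    return -sum(counts[s] * math.log2(p) for s, p in zip(ALPHABET, dist))

def cluster_counts(count_list, k, iters=20):
    """k-means-style clustering of count vectors in the space of distributions."""
    # Deterministic initialization from evenly spaced items (an assumption).
    centroids = [empirical(count_list[i * len(count_list) // k]) for i in range(k)]
    assign = [0] * len(count_list)
    for _ in range(iters):
        # Assignment step: pick the centroid model that is cheapest in bits.
        assign = [min(range(k), key=lambda j: cost_bits(c, centroids[j]))
                  for c in count_list]
        # Update step: re-estimate each centroid from its pooled counts.
        pooled = [Counter() for _ in range(k)]
        for c, j in zip(count_list, assign):
            pooled[j].update(c)
        centroids = [empirical(p) for p in pooled]
    return assign, centroids

def build_bins(seq, order, n_states):
    """Context binning: map every order-`order` context to one of n_states states."""
    ctx = {}
    for i in range(order, len(seq)):
        ctx.setdefault(seq[i - order:i], Counter())[seq[i]] += 1
    keys = sorted(ctx)
    assign, models = cluster_counts([ctx[key] for key in keys], n_states)
    return dict(zip(keys, assign)), models

def cluster_reads(reads, k):
    """Model clustering: pick one of k order-0 models separately for each read."""
    return cluster_counts([Counter(r) for r in reads], k)

# Usage: look up the state for a context, then the distribution for that state.
bin_table, models = build_bins("ACGT" * 50 + "AACCGGTT" * 25, order=2, n_states=4)
state = bin_table["AC"]   # state = bin[context]
next_symbol_probs = models[state]
```

In this sketch the same clustering routine serves both purposes: applied to per-context counts it yields the bin table, and applied to per-read counts it yields a small set of centroid models, so each read only needs to store the index of its chosen model.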
Related Topics
- Type
- preprint
- Language
- en
- Landing Page
- http://arxiv.org/abs/2201.05028
- https://arxiv.org/pdf/2201.05028
- OA Status
- green
- Cited By
- 1
- Related Works
- 10
- OpenAlex ID
- https://openalex.org/W4229007260
Raw OpenAlex JSON
- OpenAlex ID
- https://openalex.org/W4229007260
- DOI
- https://doi.org/10.48550/arxiv.2201.05028
- Title
- Context binning, model clustering and adaptivity for data compression of genetic data
- Type
- preprint
- Language
- en
- Publication year
- 2022
- Publication date
- 2022-01-13
- Authors
- Jarek Duda
- Landing page
- https://arxiv.org/abs/2201.05028
- PDF URL
- https://arxiv.org/pdf/2201.05028
- Open access
- Yes
- OA status
- green
- OA URL
- https://arxiv.org/pdf/2201.05028
- Concepts
- Cluster analysis, Computer science, Data mining, Context (archaeology), Centroid, Bin, Data compression, Statistical model, Artificial intelligence, Algorithm, Biology, Paleontology
- Cited by
- 1
- Citations by year (recent)
- 2022: 1
- Related works (count)
- 10
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4229007260 |
| doi | https://doi.org/10.48550/arxiv.2201.05028 |
| ids.doi | https://doi.org/10.48550/arxiv.2201.05028 |
| ids.openalex | https://openalex.org/W4229007260 |
| fwci | |
| type | preprint |
| title | Context binning, model clustering and adaptivity for data compression of genetic data |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11269 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9955999851226807 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Algorithms and Data Compression |
| topics[1].id | https://openalex.org/T10885 |
| topics[1].field.id | https://openalex.org/fields/13 |
| topics[1].field.display_name | Biochemistry, Genetics and Molecular Biology |
| topics[1].score | 0.9904999732971191 |
| topics[1].domain.id | https://openalex.org/domains/1 |
| topics[1].domain.display_name | Life Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1312 |
| topics[1].subfield.display_name | Molecular Biology |
| topics[1].display_name | Gene expression and cancer classification |
| topics[2].id | https://openalex.org/T12814 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9297000169754028 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1702 |
| topics[2].subfield.display_name | Artificial Intelligence |
| topics[2].display_name | Gaussian Processes and Bayesian Inference |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C73555534 |
| concepts[0].level | 2 |
| concepts[0].score | 0.841984748840332 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q622825 |
| concepts[0].display_name | Cluster analysis |
| concepts[1].id | https://openalex.org/C41008148 |
| concepts[1].level | 0 |
| concepts[1].score | 0.7958203554153442 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[1].display_name | Computer science |
| concepts[2].id | https://openalex.org/C124101348 |
| concepts[2].level | 1 |
| concepts[2].score | 0.6176714301109314 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q172491 |
| concepts[2].display_name | Data mining |
| concepts[3].id | https://openalex.org/C2779343474 |
| concepts[3].level | 2 |
| concepts[3].score | 0.6003673076629639 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q3109175 |
| concepts[3].display_name | Context (archaeology) |
| concepts[4].id | https://openalex.org/C146599234 |
| concepts[4].level | 2 |
| concepts[4].score | 0.4796174466609955 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q511093 |
| concepts[4].display_name | Centroid |
| concepts[5].id | https://openalex.org/C156273044 |
| concepts[5].level | 2 |
| concepts[5].score | 0.44903871417045593 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q4913766 |
| concepts[5].display_name | Bin |
| concepts[6].id | https://openalex.org/C78548338 |
| concepts[6].level | 2 |
| concepts[6].score | 0.43014127016067505 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q2493 |
| concepts[6].display_name | Data compression |
| concepts[7].id | https://openalex.org/C114289077 |
| concepts[7].level | 2 |
| concepts[7].score | 0.4119323790073395 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q3284399 |
| concepts[7].display_name | Statistical model |
| concepts[8].id | https://openalex.org/C154945302 |
| concepts[8].level | 1 |
| concepts[8].score | 0.3331141471862793 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[8].display_name | Artificial intelligence |
| concepts[9].id | https://openalex.org/C11413529 |
| concepts[9].level | 1 |
| concepts[9].score | 0.27677375078201294 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q8366 |
| concepts[9].display_name | Algorithm |
| concepts[10].id | https://openalex.org/C86803240 |
| concepts[10].level | 0 |
| concepts[10].score | 0.0 |
| concepts[10].wikidata | https://www.wikidata.org/wiki/Q420 |
| concepts[10].display_name | Biology |
| concepts[11].id | https://openalex.org/C151730666 |
| concepts[11].level | 1 |
| concepts[11].score | 0.0 |
| concepts[11].wikidata | https://www.wikidata.org/wiki/Q7205 |
| concepts[11].display_name | Paleontology |
| keywords[0].id | https://openalex.org/keywords/cluster-analysis |
| keywords[0].score | 0.841984748840332 |
| keywords[0].display_name | Cluster analysis |
| keywords[1].id | https://openalex.org/keywords/computer-science |
| keywords[1].score | 0.7958203554153442 |
| keywords[1].display_name | Computer science |
| keywords[2].id | https://openalex.org/keywords/data-mining |
| keywords[2].score | 0.6176714301109314 |
| keywords[2].display_name | Data mining |
| keywords[3].id | https://openalex.org/keywords/context |
| keywords[3].score | 0.6003673076629639 |
| keywords[3].display_name | Context (archaeology) |
| keywords[4].id | https://openalex.org/keywords/centroid |
| keywords[4].score | 0.4796174466609955 |
| keywords[4].display_name | Centroid |
| keywords[5].id | https://openalex.org/keywords/bin |
| keywords[5].score | 0.44903871417045593 |
| keywords[5].display_name | Bin |
| keywords[6].id | https://openalex.org/keywords/data-compression |
| keywords[6].score | 0.43014127016067505 |
| keywords[6].display_name | Data compression |
| keywords[7].id | https://openalex.org/keywords/statistical-model |
| keywords[7].score | 0.4119323790073395 |
| keywords[7].display_name | Statistical model |
| keywords[8].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[8].score | 0.3331141471862793 |
| keywords[8].display_name | Artificial intelligence |
| keywords[9].id | https://openalex.org/keywords/algorithm |
| keywords[9].score | 0.27677375078201294 |
| keywords[9].display_name | Algorithm |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2201.05028 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2201.05028 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2201.05028 |
| locations[1].id | doi:10.48550/arxiv.2201.05028 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2201.05028 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5109420627 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Jarek Duda |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Duda, Jarek |
| authorships[0].is_corresponding | True |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2201.05028 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2022-05-08T00:00:00 |
| display_name | Context binning, model clustering and adaptivity for data compression of genetic data |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11269 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9955999851226807 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Algorithms and Data Compression |
| related_works | https://openalex.org/W2107701374, https://openalex.org/W2950072893, https://openalex.org/W1616588898, https://openalex.org/W4249504934, https://openalex.org/W2183416055, https://openalex.org/W2381926679, https://openalex.org/W2568867011, https://openalex.org/W1994114538, https://openalex.org/W2413205705, https://openalex.org/W2735644334 |
| cited_by_count | 1 |
| counts_by_year[0].year | 2022 |
| counts_by_year[0].cited_by_count | 1 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2201.05028 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2201.05028 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2201.05028 |
| primary_location.id | pmh:oai:arXiv.org:2201.05028 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2201.05028 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2201.05028 |
| publication_date | 2022-01-13 |
| publication_year | 2022 |
| referenced_works_count | 0 |
| cited_by_percentile_year | |
| corresponding_author_ids | https://openalex.org/A5109420627 |
| countries_distinct_count | 0 |
| institutions_distinct_count | 1 |
| citation_normalized_percentile | |