An Extensive Study of Similarity and Dissimilarity Measures Used for Text Document Clustering using K-means Algorithm
2018 · Open Access · DOI: https://doi.org/10.5815/ijitcs.2018.09.08
In today's world, a tremendous amount of unstructured data, especially text, is being generated through various sources. This massive amount of data has led researchers to focus on employing data mining techniques to analyse and cluster it for efficient browsing and searching mechanisms. Clustering methods like the k-means algorithm work by measuring the relationship between data objects. Accurate clustering is based on the similarity or dissimilarity measure that is defined to evaluate the homogeneity of the documents. A variety of measures have been proposed to date. However, not all of them are suitable for use in the k-means algorithm. In this paper, an extensive study is done to compare and analyse the performance of eight well-known similarity and dissimilarity measures that are applicable to the k-means clustering approach. For experimental purposes, four text document data sets are used and the results are reported.
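To make the role of the measure concrete, here is a minimal Python sketch of a k-means assignment step with a pluggable dissimilarity function. The abstract does not name the eight measures studied, so Euclidean distance and cosine similarity are used here only as familiar assumptions, and the function names, toy term-count vectors, and centroids are all hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Dissimilarity: straight-line distance between two term vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Similarity: cosine of the angle between two term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

def cosine_dissimilarity(a, b):
    """Turn the similarity into a dissimilarity so it can be minimised."""
    return 1.0 - cosine_similarity(a, b)

def assign_to_nearest(docs, centroids, dissimilarity):
    """k-means assignment step: index of the least dissimilar centroid per document."""
    return [
        min(range(len(centroids)), key=lambda j: dissimilarity(doc, centroids[j]))
        for doc in docs
    ]

# Hypothetical term-count vectors over a three-word vocabulary.
docs = [[2, 0, 1], [4, 0, 2], [0, 3, 0]]
centroids = [[1, 0, 0], [0, 1, 0]]

labels_euclid = assign_to_nearest(docs, centroids, euclidean_distance)
labels_cosine = assign_to_nearest(docs, centroids, cosine_dissimilarity)

# A short and a long document with the same word proportions:
short_doc, long_doc = [1, 1, 0], [10, 10, 0]
same_direction = cosine_similarity(short_doc, long_doc)
far_apart = euclidean_distance(short_doc, long_doc)
```

In this toy example the short and long documents point in the same direction, so cosine similarity treats them as identical while Euclidean distance places them far apart. That sensitivity to document length is one reason the choice of measure can change the clusters k-means produces.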
Related Topics
- Type
- article
- Language
- en
- Landing Page
- https://doi.org/10.5815/ijitcs.2018.09.08
- http://www.mecs-press.org/ijitcs/ijitcs-v10-n9/IJITCS-V10-N9-8.pdf
- OA Status
- diamond
- Cited By
- 4
- References
- 38
- Related Works
- 10
- OpenAlex ID
- https://openalex.org/W2891254276
Raw OpenAlex JSON
- OpenAlex ID (canonical identifier for this work in OpenAlex)
- https://openalex.org/W2891254276
- DOI (Digital Object Identifier)
- https://doi.org/10.5815/ijitcs.2018.09.08
- Title (work title)
- An Extensive Study of Similarity and Dissimilarity Measures Used for Text Document Clustering using K-means Algorithm
- Type (OpenAlex work type)
- article
- Language (primary language)
- en
- Publication year (year of publication)
- 2018
- Publication date (full publication date if available)
- 2018-09-08
- Authors (list of authors in order)
- Maedeh Afzali, Suresh Kumar
- Landing page (publisher landing page)
- https://doi.org/10.5815/ijitcs.2018.09.08
- PDF URL (direct link to full text PDF)
- https://www.mecs-press.org/ijitcs/ijitcs-v10-n9/IJITCS-V10-N9-8.pdf
- Open access (whether a free full text is available)
- Yes
- OA status (open access status per OpenAlex)
- diamond
- OA URL (direct OA link when available)
- https://www.mecs-press.org/ijitcs/ijitcs-v10-n9/IJITCS-V10-N9-8.pdf
- Concepts (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cluster analysis, Computer science, Similarity (geometry), Data mining, Homogeneity (statistics), k-means clustering, Similarity measure, Document clustering, Information retrieval, Artificial intelligence, Machine learning, Image (mathematics)
- Cited by (total citation count in OpenAlex)
- 4
- Citations by year, recent (per-year citation counts, last 5 years)
- 2021: 3, 2019: 1
- References, count (number of works referenced by this work)
- 38
- Related works, count (other works algorithmically related by OpenAlex)
- 10
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W2891254276 |
| doi | https://doi.org/10.5815/ijitcs.2018.09.08 |
| ids.doi | https://doi.org/10.5815/ijitcs.2018.09.08 |
| ids.mag | 2891254276 |
| ids.openalex | https://openalex.org/W2891254276 |
| fwci | 0.79424984 |
| type | article |
| title | An Extensive Study of Similarity and Dissimilarity Measures Used for Text Document Clustering using K-means Algorithm |
| biblio.issue | 9 |
| biblio.volume | 10 |
| biblio.last_page | 73 |
| biblio.first_page | 64 |
| topics[0].id | https://openalex.org/T10637 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.998199999332428 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Advanced Clustering Algorithms Research |
| topics[1].id | https://openalex.org/T11550 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9970999956130981 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | Text and Document Classification Technologies |
| topics[2].id | https://openalex.org/T10538 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9968000054359436 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1710 |
| topics[2].subfield.display_name | Information Systems |
| topics[2].display_name | Data Mining Algorithms and Applications |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C73555534 |
| concepts[0].level | 2 |
| concepts[0].score | 0.8315404653549194 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q622825 |
| concepts[0].display_name | Cluster analysis |
| concepts[1].id | https://openalex.org/C41008148 |
| concepts[1].level | 0 |
| concepts[1].score | 0.8229950666427612 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[1].display_name | Computer science |
| concepts[2].id | https://openalex.org/C103278499 |
| concepts[2].level | 3 |
| concepts[2].score | 0.7006394267082214 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q254465 |
| concepts[2].display_name | Similarity (geometry) |
| concepts[3].id | https://openalex.org/C124101348 |
| concepts[3].level | 1 |
| concepts[3].score | 0.5717145204544067 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q172491 |
| concepts[3].display_name | Data mining |
| concepts[4].id | https://openalex.org/C142259097 |
| concepts[4].level | 2 |
| concepts[4].score | 0.5438533425331116 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q5891314 |
| concepts[4].display_name | Homogeneity (statistics) |
| concepts[5].id | https://openalex.org/C207968372 |
| concepts[5].level | 3 |
| concepts[5].score | 0.5105053186416626 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q310401 |
| concepts[5].display_name | k-means clustering |
| concepts[6].id | https://openalex.org/C2776517306 |
| concepts[6].level | 2 |
| concepts[6].score | 0.502363920211792 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q29017317 |
| concepts[6].display_name | Similarity measure |
| concepts[7].id | https://openalex.org/C177937566 |
| concepts[7].level | 3 |
| concepts[7].score | 0.4537234604358673 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q4223102 |
| concepts[7].display_name | Document clustering |
| concepts[8].id | https://openalex.org/C23123220 |
| concepts[8].level | 1 |
| concepts[8].score | 0.441837877035141 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q816826 |
| concepts[8].display_name | Information retrieval |
| concepts[9].id | https://openalex.org/C154945302 |
| concepts[9].level | 1 |
| concepts[9].score | 0.26398760080337524 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[9].display_name | Artificial intelligence |
| concepts[10].id | https://openalex.org/C119857082 |
| concepts[10].level | 1 |
| concepts[10].score | 0.18040496110916138 |
| concepts[10].wikidata | https://www.wikidata.org/wiki/Q2539 |
| concepts[10].display_name | Machine learning |
| concepts[11].id | https://openalex.org/C115961682 |
| concepts[11].level | 2 |
| concepts[11].score | 0.14322620630264282 |
| concepts[11].wikidata | https://www.wikidata.org/wiki/Q860623 |
| concepts[11].display_name | Image (mathematics) |
| keywords[0].id | https://openalex.org/keywords/cluster-analysis |
| keywords[0].score | 0.8315404653549194 |
| keywords[0].display_name | Cluster analysis |
| keywords[1].id | https://openalex.org/keywords/computer-science |
| keywords[1].score | 0.8229950666427612 |
| keywords[1].display_name | Computer science |
| keywords[2].id | https://openalex.org/keywords/similarity |
| keywords[2].score | 0.7006394267082214 |
| keywords[2].display_name | Similarity (geometry) |
| keywords[3].id | https://openalex.org/keywords/data-mining |
| keywords[3].score | 0.5717145204544067 |
| keywords[3].display_name | Data mining |
| keywords[4].id | https://openalex.org/keywords/homogeneity |
| keywords[4].score | 0.5438533425331116 |
| keywords[4].display_name | Homogeneity (statistics) |
| keywords[5].id | https://openalex.org/keywords/k-means-clustering |
| keywords[5].score | 0.5105053186416626 |
| keywords[5].display_name | k-means clustering |
| keywords[6].id | https://openalex.org/keywords/similarity-measure |
| keywords[6].score | 0.502363920211792 |
| keywords[6].display_name | Similarity measure |
| keywords[7].id | https://openalex.org/keywords/document-clustering |
| keywords[7].score | 0.4537234604358673 |
| keywords[7].display_name | Document clustering |
| keywords[8].id | https://openalex.org/keywords/information-retrieval |
| keywords[8].score | 0.441837877035141 |
| keywords[8].display_name | Information retrieval |
| keywords[9].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[9].score | 0.26398760080337524 |
| keywords[9].display_name | Artificial intelligence |
| keywords[10].id | https://openalex.org/keywords/machine-learning |
| keywords[10].score | 0.18040496110916138 |
| keywords[10].display_name | Machine learning |
| keywords[11].id | https://openalex.org/keywords/image |
| keywords[11].score | 0.14322620630264282 |
| keywords[11].display_name | Image (mathematics) |
| language | en |
| locations[0].id | doi:10.5815/ijitcs.2018.09.08 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S2483513740 |
| locations[0].source.issn | 2074-9007, 2074-9015 |
| locations[0].source.type | journal |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | 2074-9007 |
| locations[0].source.is_core | True |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | International Journal of Information Technology and Computer Science |
| locations[0].source.host_organization | |
| locations[0].source.host_organization_name | |
| locations[0].license | |
| locations[0].pdf_url | http://www.mecs-press.org/ijitcs/ijitcs-v10-n9/IJITCS-V10-N9-8.pdf |
| locations[0].version | publishedVersion |
| locations[0].raw_type | journal-article |
| locations[0].license_id | |
| locations[0].is_accepted | True |
| locations[0].is_published | True |
| locations[0].raw_source_name | International Journal of Information Technology and Computer Science |
| locations[0].landing_page_url | https://doi.org/10.5815/ijitcs.2018.09.08 |
| indexed_in | crossref |
| authorships[0].author.id | https://openalex.org/A5065564498 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Maedeh Afzali |
| authorships[0].countries | IN |
| authorships[0].affiliations[0].institution_ids | https://openalex.org/I55016150 |
| authorships[0].affiliations[0].raw_affiliation_string | Manav Rachna International Institute of Research and Studies, Faridabad, 121004, India |
| authorships[0].institutions[0].id | https://openalex.org/I55016150 |
| authorships[0].institutions[0].ror | https://ror.org/02kf4r633 |
| authorships[0].institutions[0].type | education |
| authorships[0].institutions[0].lineage | https://openalex.org/I4411591084, https://openalex.org/I55016150 |
| authorships[0].institutions[0].country_code | IN |
| authorships[0].institutions[0].display_name | Manav Rachna International Institute of Research and Studies |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Maedeh Afzali |
| authorships[0].is_corresponding | False |
| authorships[0].raw_affiliation_strings | Manav Rachna International Institute of Research and Studies, Faridabad, 121004, India |
| authorships[1].author.id | https://openalex.org/A5084921480 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-7774-7052 |
| authorships[1].author.display_name | Suresh Kumar |
| authorships[1].countries | IN |
| authorships[1].affiliations[0].institution_ids | https://openalex.org/I55016150 |
| authorships[1].affiliations[0].raw_affiliation_string | Manav Rachna International Institute of Research and Studies, Faridabad, 121004, India |
| authorships[1].institutions[0].id | https://openalex.org/I55016150 |
| authorships[1].institutions[0].ror | https://ror.org/02kf4r633 |
| authorships[1].institutions[0].type | education |
| authorships[1].institutions[0].lineage | https://openalex.org/I4411591084, https://openalex.org/I55016150 |
| authorships[1].institutions[0].country_code | IN |
| authorships[1].institutions[0].display_name | Manav Rachna International Institute of Research and Studies |
| authorships[1].author_position | last |
| authorships[1].raw_author_name | Suresh Kumar |
| authorships[1].is_corresponding | False |
| authorships[1].raw_affiliation_strings | Manav Rachna International Institute of Research and Studies, Faridabad, 121004, India |
| has_content.pdf | True |
| has_content.grobid_xml | True |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | http://www.mecs-press.org/ijitcs/ijitcs-v10-n9/IJITCS-V10-N9-8.pdf |
| open_access.oa_status | diamond |
| open_access.any_repository_has_fulltext | False |
| created_date | 2018-09-27T00:00:00 |
| display_name | An Extensive Study of Similarity and Dissimilarity Measures Used for Text Document Clustering using K-means Algorithm |
| has_fulltext | True |
| is_retracted | False |
| updated_date | 2025-11-06T03:46:38.306776 |
| primary_topic.id | https://openalex.org/T10637 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.998199999332428 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Advanced Clustering Algorithms Research |
| related_works | https://openalex.org/W2319693127, https://openalex.org/W308539617, https://openalex.org/W2072263576, https://openalex.org/W2474567666, https://openalex.org/W1940044583, https://openalex.org/W2806903871, https://openalex.org/W4320802053, https://openalex.org/W2185976384, https://openalex.org/W2841402245, https://openalex.org/W2904779692 |
| cited_by_count | 4 |
| counts_by_year[0].year | 2021 |
| counts_by_year[0].cited_by_count | 3 |
| counts_by_year[1].year | 2019 |
| counts_by_year[1].cited_by_count | 1 |
| locations_count | 1 |
| best_oa_location.id | doi:10.5815/ijitcs.2018.09.08 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S2483513740 |
| best_oa_location.source.issn | 2074-9007, 2074-9015 |
| best_oa_location.source.type | journal |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | 2074-9007 |
| best_oa_location.source.is_core | True |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | International Journal of Information Technology and Computer Science |
| best_oa_location.source.host_organization | |
| best_oa_location.source.host_organization_name | |
| best_oa_location.license | |
| best_oa_location.pdf_url | http://www.mecs-press.org/ijitcs/ijitcs-v10-n9/IJITCS-V10-N9-8.pdf |
| best_oa_location.version | publishedVersion |
| best_oa_location.raw_type | journal-article |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | True |
| best_oa_location.is_published | True |
| best_oa_location.raw_source_name | International Journal of Information Technology and Computer Science |
| best_oa_location.landing_page_url | https://doi.org/10.5815/ijitcs.2018.09.08 |
| primary_location.id | doi:10.5815/ijitcs.2018.09.08 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S2483513740 |
| primary_location.source.issn | 2074-9007, 2074-9015 |
| primary_location.source.type | journal |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | 2074-9007 |
| primary_location.source.is_core | True |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | International Journal of Information Technology and Computer Science |
| primary_location.source.host_organization | |
| primary_location.source.host_organization_name | |
| primary_location.license | |
| primary_location.pdf_url | http://www.mecs-press.org/ijitcs/ijitcs-v10-n9/IJITCS-V10-N9-8.pdf |
| primary_location.version | publishedVersion |
| primary_location.raw_type | journal-article |
| primary_location.license_id | |
| primary_location.is_accepted | True |
| primary_location.is_published | True |
| primary_location.raw_source_name | International Journal of Information Technology and Computer Science |
| primary_location.landing_page_url | https://doi.org/10.5815/ijitcs.2018.09.08 |
| publication_date | 2018-09-08 |
| publication_year | 2018 |
| referenced_works | https://openalex.org/W416578099, https://openalex.org/W6680704940, https://openalex.org/W6681405209, https://openalex.org/W2805194070, https://openalex.org/W6604108467, https://openalex.org/W6637231022, https://openalex.org/W2884644959, https://openalex.org/W6674809819, https://openalex.org/W6683154208, https://openalex.org/W6734652691, https://openalex.org/W6684620663, https://openalex.org/W1978394996, https://openalex.org/W2125214008, https://openalex.org/W2596841986, https://openalex.org/W2309564524, https://openalex.org/W1989898761, https://openalex.org/W2884573272, https://openalex.org/W6632088959, https://openalex.org/W2992603957, https://openalex.org/W2308071406, https://openalex.org/W2806253405, https://openalex.org/W3176183162, https://openalex.org/W2145252566, https://openalex.org/W2027499890, https://openalex.org/W6677759223, https://openalex.org/W2129250947, https://openalex.org/W1651093245, https://openalex.org/W2011430131, https://openalex.org/W2156741031, https://openalex.org/W2165612380, https://openalex.org/W2142827986, https://openalex.org/W4285719527, https://openalex.org/W100415715, https://openalex.org/W2140190241, https://openalex.org/W2098162425, https://openalex.org/W2597485909, https://openalex.org/W2273660186, https://openalex.org/W2117594971 |
| referenced_works_count | 38 |
| abstract_inverted_index.In | 0 |
| abstract_inverted_index.an | 37, 101 |
| abstract_inverted_index.be | 93 |
| abstract_inverted_index.in | 95 |
| abstract_inverted_index.is | 10, 58, 67, 104 |
| abstract_inverted_index.of | 5, 18, 73, 77, 87, 112 |
| abstract_inverted_index.on | 26, 60 |
| abstract_inverted_index.or | 63 |
| abstract_inverted_index.to | 24, 31, 69, 83, 92, 106, 122 |
| abstract_inverted_index.up | 82 |
| abstract_inverted_index.all | 86 |
| abstract_inverted_index.and | 33, 40, 108, 116, 136 |
| abstract_inverted_index.are | 89, 120, 134, 139 |
| abstract_inverted_index.for | 36 |
| abstract_inverted_index.has | 20 |
| abstract_inverted_index.not | 90 |
| abstract_inverted_index.the | 22, 51, 54, 61, 71, 74, 96, 110, 123, 137 |
| abstract_inverted_index.been | 80 |
| abstract_inverted_index.data | 19, 28, 55, 132 |
| abstract_inverted_index.done | 105 |
| abstract_inverted_index.four | 129 |
| abstract_inverted_index.have | 79 |
| abstract_inverted_index.lead | 21 |
| abstract_inverted_index.like | 45 |
| abstract_inverted_index.sets | 133 |
| abstract_inverted_index.text | 130 |
| abstract_inverted_index.that | 66, 119 |
| abstract_inverted_index.them | 35, 88 |
| abstract_inverted_index.this | 84, 99 |
| abstract_inverted_index.used | 94, 135 |
| abstract_inverted_index.based | 59 |
| abstract_inverted_index.being | 11 |
| abstract_inverted_index.data, | 7 |
| abstract_inverted_index.eight | 113 |
| abstract_inverted_index.focus | 25 |
| abstract_inverted_index.study | 103 |
| abstract_inverted_index.text, | 9 |
| abstract_inverted_index.world | 2 |
| abstract_inverted_index.amount | 4, 17 |
| abstract_inverted_index.kmeans | 124 |
| abstract_inverted_index.mining | 29 |
| abstract_inverted_index.paper, | 100 |
| abstract_inverted_index.analyse | 32, 109 |
| abstract_inverted_index.between | 53 |
| abstract_inverted_index.cluster | 34 |
| abstract_inverted_index.compare | 107 |
| abstract_inverted_index.defined | 68 |
| abstract_inverted_index.k-means | 46, 97 |
| abstract_inverted_index.massive | 16 |
| abstract_inverted_index.measure | 65 |
| abstract_inverted_index.methods | 44 |
| abstract_inverted_index.perform | 48 |
| abstract_inverted_index.results | 138 |
| abstract_inverted_index.through | 13, 49 |
| abstract_inverted_index.today's | 1 |
| abstract_inverted_index.variety | 76 |
| abstract_inverted_index.various | 14 |
| abstract_inverted_index.browsing | 39 |
| abstract_inverted_index.document | 131 |
| abstract_inverted_index.evaluate | 70 |
| abstract_inverted_index.measures | 78, 118 |
| abstract_inverted_index.proposed | 81 |
| abstract_inverted_index.purpose, | 128 |
| abstract_inverted_index.suitable | 91 |
| abstract_inverted_index.algorithm | 47 |
| abstract_inverted_index.efficient | 38 |
| abstract_inverted_index.employing | 27 |
| abstract_inverted_index.extensive | 102 |
| abstract_inverted_index.generated | 12 |
| abstract_inverted_index.measuring | 50 |
| abstract_inverted_index.reported. | 140 |
| abstract_inverted_index.searching | 41 |
| abstract_inverted_index.applicable | 121 |
| abstract_inverted_index.clustering | 43, 57, 125 |
| abstract_inverted_index.especially | 8 |
| abstract_inverted_index.experiment | 127 |
| abstract_inverted_index.similarity | 62, 115 |
| abstract_inverted_index.techniques | 30 |
| abstract_inverted_index.tremendous | 3 |
| abstract_inverted_index.well-known | 114 |
| abstract_inverted_index.documents.A | 75 |
| abstract_inverted_index.homogeneity | 72 |
| abstract_inverted_index.performance | 111 |
| abstract_inverted_index.researchers | 23 |
| abstract_inverted_index.algorithm.In | 98 |
| abstract_inverted_index.approach.For | 126 |
| abstract_inverted_index.relationship | 52 |
| abstract_inverted_index.sources.This | 15 |
| abstract_inverted_index.unstructured | 6 |
| abstract_inverted_index.date.However, | 85 |
| abstract_inverted_index.dissimilarity | 64, 117 |
| abstract_inverted_index.mechanisms.The | 42 |
| abstract_inverted_index.objects.Accurate | 56 |
| cited_by_percentile_year.max | 97 |
| cited_by_percentile_year.min | 90 |
| countries_distinct_count | 1 |
| institutions_distinct_count | 2 |
| citation_normalized_percentile.value | 0.77357787 |
| citation_normalized_percentile.is_in_top_1_percent | False |
| citation_normalized_percentile.is_in_top_10_percent | False |
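
The `abstract_inverted_index.*` rows above store the abstract as a map from each token to the positions where it occurs. As a minimal sketch, assuming the payload is available as a Python dict of token to list of integer positions, the text can be rebuilt by sorting tokens by position. The function name `reconstruct_abstract` is hypothetical, and the sample dict is a hand-copied excerpt truncated to positions 0 through 7, so later positions listed in the table (for example 17 for "amount") are omitted.

```python
def reconstruct_abstract(inverted_index):
    """Rebuild plain text from a token -> positions mapping."""
    position_to_token = {}
    for token, positions in inverted_index.items():
        for pos in positions:
            position_to_token[pos] = token
    # Join tokens in position order; missing positions are simply skipped.
    return " ".join(position_to_token[p] for p in sorted(position_to_token))

# Excerpt of the rows above, limited to the first eight positions.
sample_index = {
    "In": [0],
    "today's": [1],
    "world": [2],
    "tremendous": [3],
    "amount": [4],
    "of": [5],
    "unstructured": [6],
    "data,": [7],
}

opening = reconstruct_abstract(sample_index)
print(opening)  # In today's world tremendous amount of unstructured data,
```

Tokens keep whatever punctuation and spacing defects the source text had, which is why entries such as `sources.This` appear in the index as single tokens.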