Clustering FunFams using sequence embeddings improves EC purity
2021 · Open Access · DOI: https://doi.org/10.1101/2021.01.21.427551
Motivation: Classifying proteins into functional families can improve our understanding of protein function and allows annotations to be transferred within a family. For this, functional families need to be “pure”, i.e., contain only proteins with identical function. Functional Families (FunFams) cluster proteins within CATH superfamilies into such groups of proteins sharing function. 11% of all FunFams (22,830 of 203,639) contain EC annotations, and of those, 7% (1,526 of 22,830) have inconsistent functional annotations.
Results: We propose an approach to further cluster FunFams into functionally more consistent sub-families by encoding their sequences through embeddings. These embeddings originate from language models that transfer knowledge gained from predicting missing amino acids in a sequence (ProtBERT) and have been further optimized to distinguish between proteins belonging to the same or a different CATH superfamily (PB-Tucker). Using distances between embeddings and DBSCAN to cluster FunFams and identify outliers doubled the number of pure clusters per FunFam compared to random clustering. Our approach was not limited to FunFams but also succeeded on families created using sequence similarity alone. Complementing EC annotations, we observed similar results for binding annotations. Thus, we expect increased purity for other aspects of function as well. Our results can help generate FunFams; the resulting clusters, with their improved functional consistency, allow more reliable inference of annotations. We expect this approach to succeed equally for any other grouping of proteins by their phenotypes.
Availability: Code and embeddings are available via GitHub: https://github.com/Rostlab/FunFamsClustering
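The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (that lives in the GitHub repository above). The toy 2-D points stand in for real ProtBERT or PB-Tucker embeddings, and the Euclidean distance, the `eps` and `min_samples` values, the EC labels, and the simplified purity definition (a cluster is pure when all its annotated members share one EC number) are all assumptions made for illustration.

```python
import math
from collections import deque


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dbscan(points, eps, min_samples):
    """Plain DBSCAN. Returns one label per point; -1 marks outliers."""
    n = len(points)
    # Each neighborhood includes the point itself.
    neighbors = [
        [j for j in range(n) if euclidean(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional outlier; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # outlier reachable from a core point
            if labels[j] is not None:
                continue  # already assigned; border points are not expanded
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep growing
    return labels


def pure_fraction(labels, ec_numbers):
    """Fraction of clusters whose annotated members share a single EC number."""
    clusters = {}
    for label, ec in zip(labels, ec_numbers):
        if label == -1 or ec is None:
            continue  # skip outliers and unannotated proteins
        clusters.setdefault(label, set()).add(ec)
    if not clusters:
        return 0.0
    return sum(len(ecs) == 1 for ecs in clusters.values()) / len(clusters)


# Hypothetical family: two functional groups plus one stray sequence.
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),   # group annotated 1.1.1.1
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),   # group annotated 2.7.1.1
    (10.0, 0.0),                          # isolated sequence
]
ecs = ["1.1.1.1"] * 3 + ["2.7.1.1"] * 3 + ["3.1.1.1"]

labels = dbscan(embeddings, eps=0.5, min_samples=3)
print(labels)                      # [0, 0, 0, 1, 1, 1, -1]
print(pure_fraction(labels, ecs))  # 1.0
```

In this toy case the unsplit family mixes three EC numbers, while both DBSCAN clusters are pure and the stray sequence is flagged as an outlier. With real embeddings, `eps` would need to be chosen from the observed distance distribution rather than fixed by hand.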
- Type: preprint
- Language: en
- Landing Page: https://doi.org/10.1101/2021.01.21.427551
- Full-text PDF: https://www.biorxiv.org/content/biorxiv/early/2021/04/01/2021.01.21.427551.full.pdf
- OA Status: green
- Cited By: 4
- References: 40
- Related Works: 10
- OpenAlex ID: https://openalex.org/W3125128655
Raw OpenAlex JSON
| Field | Value | Description |
|---|---|---|
| OpenAlex ID | https://openalex.org/W3125128655 | Canonical identifier for this work in OpenAlex |
| DOI | https://doi.org/10.1101/2021.01.21.427551 | Digital Object Identifier |
| Title | Clustering FunFams using sequence embeddings improves EC purity | Work title |
| Type | preprint | OpenAlex work type |
| Language | en | Primary language |
| Publication year | 2021 | Year of publication |
| Publication date | 2021-01-21 | Full publication date if available |
| Authors | Maria Littmann, Nicola Bordin, Michael Heinzinger, Konstantin Schütze, Christian Dallago, Christine Orengo, Burkhard Rost | List of authors in order |
| Landing page | https://doi.org/10.1101/2021.01.21.427551 | Publisher landing page |
| PDF URL | https://www.biorxiv.org/content/biorxiv/early/2021/04/01/2021.01.21.427551.full.pdf | Direct link to full text PDF |
| Open access | Yes | Whether a free full text is available |
| OA status | green | Open access status per OpenAlex |
| OA URL | https://www.biorxiv.org/content/biorxiv/early/2021/04/01/2021.01.21.427551.full.pdf | Direct OA link when available |
| Concepts | Consistency (knowledge bases), Cluster analysis, Function (biology), Inference, Similarity (geometry), Sequence (biology), Computer science, Computational biology, Cluster (spacecraft), Code (set theory), Outlier, Encoding (memory), Data mining, Theoretical computer science, Biology, Artificial intelligence, Set (abstract data type), Genetics, Image (mathematics), Programming language | Top concepts (fields/topics) attached by OpenAlex |
| Cited by | 4 | Total citation count in OpenAlex |
| Citations by year (recent) | 2023: 2, 2022: 1, 2021: 1 | Per-year citation counts (last 5 years) |
| References (count) | 40 | Number of works referenced by this work |
| Related works (count) | 10 | Other works algorithmically related by OpenAlex |
Full payload
| id | https://openalex.org/W3125128655 |
|---|---|
| doi | https://doi.org/10.1101/2021.01.21.427551 |
| ids.doi | https://doi.org/10.1101/2021.01.21.427551 |
| ids.mag | 3125128655 |
| ids.openalex | https://openalex.org/W3125128655 |
| fwci | 0.39012971 |
| type | preprint |
| title | Clustering FunFams using sequence embeddings improves EC purity |
| awards[0].id | https://openalex.org/G1305978807 |
| awards[0].funder_id | https://openalex.org/F4320334629 |
| awards[0].display_name | |
| awards[0].funder_award_id | BB/R009597/1 |
| awards[0].funder_display_name | Biotechnology and Biological Sciences Research Council |
| awards[1].id | https://openalex.org/G1271904115 |
| awards[1].funder_id | https://openalex.org/F4320334629 |
| awards[1].display_name | |
| awards[1].funder_award_id | BB/R014892/1 |
| awards[1].funder_display_name | Biotechnology and Biological Sciences Research Council |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T12254 |
| topics[0].field.id | https://openalex.org/fields/13 |
| topics[0].field.display_name | Biochemistry, Genetics and Molecular Biology |
| topics[0].score | 0.998199999332428 |
| topics[0].domain.id | https://openalex.org/domains/1 |
| topics[0].domain.display_name | Life Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1312 |
| topics[0].subfield.display_name | Molecular Biology |
| topics[0].display_name | Machine Learning in Bioinformatics |
| topics[1].id | https://openalex.org/T10887 |
| topics[1].field.id | https://openalex.org/fields/13 |
| topics[1].field.display_name | Biochemistry, Genetics and Molecular Biology |
| topics[1].score | 0.9970999956130981 |
| topics[1].domain.id | https://openalex.org/domains/1 |
| topics[1].domain.display_name | Life Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1312 |
| topics[1].subfield.display_name | Molecular Biology |
| topics[1].display_name | Bioinformatics and Genomic Networks |
| topics[2].id | https://openalex.org/T10044 |
| topics[2].field.id | https://openalex.org/fields/13 |
| topics[2].field.display_name | Biochemistry, Genetics and Molecular Biology |
| topics[2].score | 0.9948999881744385 |
| topics[2].domain.id | https://openalex.org/domains/1 |
| topics[2].domain.display_name | Life Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1312 |
| topics[2].subfield.display_name | Molecular Biology |
| topics[2].display_name | Protein Structure and Dynamics |
| funders[0].id | https://openalex.org/F4320334629 |
| funders[0].ror | https://ror.org/00cwqg982 |
| funders[0].display_name | Biotechnology and Biological Sciences Research Council |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C2776436953 |
| concepts[0].level | 2 |
| concepts[0].score | 0.6983429193496704 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q5163215 |
| concepts[0].display_name | Consistency (knowledge bases) |
| concepts[1].id | https://openalex.org/C73555534 |
| concepts[1].level | 2 |
| concepts[1].score | 0.6187466382980347 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q622825 |
| concepts[1].display_name | Cluster analysis |
| concepts[2].id | https://openalex.org/C14036430 |
| concepts[2].level | 2 |
| concepts[2].score | 0.6117562055587769 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q3736076 |
| concepts[2].display_name | Function (biology) |
| concepts[3].id | https://openalex.org/C2776214188 |
| concepts[3].level | 2 |
| concepts[3].score | 0.6000620126724243 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q408386 |
| concepts[3].display_name | Inference |
| concepts[4].id | https://openalex.org/C103278499 |
| concepts[4].level | 3 |
| concepts[4].score | 0.5896692872047424 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q254465 |
| concepts[4].display_name | Similarity (geometry) |
| concepts[5].id | https://openalex.org/C2778112365 |
| concepts[5].level | 2 |
| concepts[5].score | 0.5857795476913452 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q3511065 |
| concepts[5].display_name | Sequence (biology) |
| concepts[6].id | https://openalex.org/C41008148 |
| concepts[6].level | 0 |
| concepts[6].score | 0.5813824534416199 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[6].display_name | Computer science |
| concepts[7].id | https://openalex.org/C70721500 |
| concepts[7].level | 1 |
| concepts[7].score | 0.5348467230796814 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q177005 |
| concepts[7].display_name | Computational biology |
| concepts[8].id | https://openalex.org/C164866538 |
| concepts[8].level | 2 |
| concepts[8].score | 0.4811747968196869 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q367351 |
| concepts[8].display_name | Cluster (spacecraft) |
| concepts[9].id | https://openalex.org/C2776760102 |
| concepts[9].level | 3 |
| concepts[9].score | 0.4752923250198364 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q5139990 |
| concepts[9].display_name | Code (set theory) |
| concepts[10].id | https://openalex.org/C79337645 |
| concepts[10].level | 2 |
| concepts[10].score | 0.4594508111476898 |
| concepts[10].wikidata | https://www.wikidata.org/wiki/Q779824 |
| concepts[10].display_name | Outlier |
| concepts[11].id | https://openalex.org/C125411270 |
| concepts[11].level | 2 |
| concepts[11].score | 0.43799299001693726 |
| concepts[11].wikidata | https://www.wikidata.org/wiki/Q18653 |
| concepts[11].display_name | Encoding (memory) |
| concepts[12].id | https://openalex.org/C124101348 |
| concepts[12].level | 1 |
| concepts[12].score | 0.3703354597091675 |
| concepts[12].wikidata | https://www.wikidata.org/wiki/Q172491 |
| concepts[12].display_name | Data mining |
| concepts[13].id | https://openalex.org/C80444323 |
| concepts[13].level | 1 |
| concepts[13].score | 0.3238376975059509 |
| concepts[13].wikidata | https://www.wikidata.org/wiki/Q2878974 |
| concepts[13].display_name | Theoretical computer science |
| concepts[14].id | https://openalex.org/C86803240 |
| concepts[14].level | 0 |
| concepts[14].score | 0.312841534614563 |
| concepts[14].wikidata | https://www.wikidata.org/wiki/Q420 |
| concepts[14].display_name | Biology |
| concepts[15].id | https://openalex.org/C154945302 |
| concepts[15].level | 1 |
| concepts[15].score | 0.2584765553474426 |
| concepts[15].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[15].display_name | Artificial intelligence |
| concepts[16].id | https://openalex.org/C177264268 |
| concepts[16].level | 2 |
| concepts[16].score | 0.20885774493217468 |
| concepts[16].wikidata | https://www.wikidata.org/wiki/Q1514741 |
| concepts[16].display_name | Set (abstract data type) |
| concepts[17].id | https://openalex.org/C54355233 |
| concepts[17].level | 1 |
| concepts[17].score | 0.16356855630874634 |
| concepts[17].wikidata | https://www.wikidata.org/wiki/Q7162 |
| concepts[17].display_name | Genetics |
| concepts[18].id | https://openalex.org/C115961682 |
| concepts[18].level | 2 |
| concepts[18].score | 0.08839893341064453 |
| concepts[18].wikidata | https://www.wikidata.org/wiki/Q860623 |
| concepts[18].display_name | Image (mathematics) |
| concepts[19].id | https://openalex.org/C199360897 |
| concepts[19].level | 1 |
| concepts[19].score | 0.0 |
| concepts[19].wikidata | https://www.wikidata.org/wiki/Q9143 |
| concepts[19].display_name | Programming language |
| keywords[0].id | https://openalex.org/keywords/consistency |
| keywords[0].score | 0.6983429193496704 |
| keywords[0].display_name | Consistency (knowledge bases) |
| keywords[1].id | https://openalex.org/keywords/cluster-analysis |
| keywords[1].score | 0.6187466382980347 |
| keywords[1].display_name | Cluster analysis |
| keywords[2].id | https://openalex.org/keywords/function |
| keywords[2].score | 0.6117562055587769 |
| keywords[2].display_name | Function (biology) |
| keywords[3].id | https://openalex.org/keywords/inference |
| keywords[3].score | 0.6000620126724243 |
| keywords[3].display_name | Inference |
| keywords[4].id | https://openalex.org/keywords/similarity |
| keywords[4].score | 0.5896692872047424 |
| keywords[4].display_name | Similarity (geometry) |
| keywords[5].id | https://openalex.org/keywords/sequence |
| keywords[5].score | 0.5857795476913452 |
| keywords[5].display_name | Sequence (biology) |
| keywords[6].id | https://openalex.org/keywords/computer-science |
| keywords[6].score | 0.5813824534416199 |
| keywords[6].display_name | Computer science |
| keywords[7].id | https://openalex.org/keywords/computational-biology |
| keywords[7].score | 0.5348467230796814 |
| keywords[7].display_name | Computational biology |
| keywords[8].id | https://openalex.org/keywords/cluster |
| keywords[8].score | 0.4811747968196869 |
| keywords[8].display_name | Cluster (spacecraft) |
| keywords[9].id | https://openalex.org/keywords/code |
| keywords[9].score | 0.4752923250198364 |
| keywords[9].display_name | Code (set theory) |
| keywords[10].id | https://openalex.org/keywords/outlier |
| keywords[10].score | 0.4594508111476898 |
| keywords[10].display_name | Outlier |
| keywords[11].id | https://openalex.org/keywords/encoding |
| keywords[11].score | 0.43799299001693726 |
| keywords[11].display_name | Encoding (memory) |
| keywords[12].id | https://openalex.org/keywords/data-mining |
| keywords[12].score | 0.3703354597091675 |
| keywords[12].display_name | Data mining |
| keywords[13].id | https://openalex.org/keywords/theoretical-computer-science |
| keywords[13].score | 0.3238376975059509 |
| keywords[13].display_name | Theoretical computer science |
| keywords[14].id | https://openalex.org/keywords/biology |
| keywords[14].score | 0.312841534614563 |
| keywords[14].display_name | Biology |
| keywords[15].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[15].score | 0.2584765553474426 |
| keywords[15].display_name | Artificial intelligence |
| keywords[16].id | https://openalex.org/keywords/set |
| keywords[16].score | 0.20885774493217468 |
| keywords[16].display_name | Set (abstract data type) |
| keywords[17].id | https://openalex.org/keywords/genetics |
| keywords[17].score | 0.16356855630874634 |
| keywords[17].display_name | Genetics |
| keywords[18].id | https://openalex.org/keywords/image |
| keywords[18].score | 0.08839893341064453 |
| keywords[18].display_name | Image (mathematics) |
| language | en |
| locations[0].id | doi:10.1101/2021.01.21.427551 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306402567 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | False |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | bioRxiv (Cold Spring Harbor Laboratory) |
| locations[0].source.host_organization | https://openalex.org/I2750212522 |
| locations[0].source.host_organization_name | Cold Spring Harbor Laboratory |
| locations[0].source.host_organization_lineage | https://openalex.org/I2750212522 |
| locations[0].license | cc-by-nc-nd |
| locations[0].pdf_url | https://www.biorxiv.org/content/biorxiv/early/2021/04/01/2021.01.21.427551.full.pdf |
| locations[0].version | acceptedVersion |
| locations[0].raw_type | posted-content |
| locations[0].license_id | https://openalex.org/licenses/cc-by-nc-nd |
| locations[0].is_accepted | True |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | https://doi.org/10.1101/2021.01.21.427551 |
| indexed_in | crossref |
| authorships[0].author.id | https://openalex.org/A5025018140 |
| authorships[0].author.orcid | https://orcid.org/0000-0001-8533-8163 |
| authorships[0].author.display_name | Maria Littmann |
| authorships[0].countries | DE |
| authorships[0].affiliations[0].institution_ids | https://openalex.org/I62916508 |
| authorships[0].affiliations[0].raw_affiliation_string | TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748 Garching, Germany |
| authorships[0].affiliations[1].institution_ids | https://openalex.org/I62916508 |
| authorships[0].affiliations[1].raw_affiliation_string | TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i2, Boltzmannstr. 3, 85748 Garching/Munich, Germany |
| authorships[0].institutions[0].id | https://openalex.org/I62916508 |
| authorships[0].institutions[0].ror | https://ror.org/02kkvpp62 |
| authorships[0].institutions[0].type | education |
| authorships[0].institutions[0].lineage | https://openalex.org/I62916508 |
| authorships[0].institutions[0].country_code | DE |
| authorships[0].institutions[0].display_name | Technical University of Munich |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Maria Littmann |
| authorships[0].is_corresponding | True |
| authorships[0].raw_affiliation_strings | TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i2, Boltzmannstr. 3, 85748 Garching/Munich, Germany, TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748 Garching, Germany |
| authorships[1].author.id | https://openalex.org/A5038645988 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-6568-9035 |
| authorships[1].author.display_name | Nicola Bordin |
| authorships[1].countries | GB |
| authorships[1].affiliations[0].institution_ids | https://openalex.org/I4210157240, https://openalex.org/I45129253 |
| authorships[1].affiliations[0].raw_affiliation_string | Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK |
| authorships[1].institutions[0].id | https://openalex.org/I4210157240 |
| authorships[1].institutions[0].ror | https://ror.org/05wsetc54 |
| authorships[1].institutions[0].type | facility |
| authorships[1].institutions[0].lineage | https://openalex.org/I124357947, https://openalex.org/I4210157240, https://openalex.org/I45129253, https://openalex.org/I98259816 |
| authorships[1].institutions[0].country_code | GB |
| authorships[1].institutions[0].display_name | Institute of Structural and Molecular Biology |
| authorships[1].institutions[1].id | https://openalex.org/I45129253 |
| authorships[1].institutions[1].ror | https://ror.org/02jx3x895 |
| authorships[1].institutions[1].type | education |
| authorships[1].institutions[1].lineage | https://openalex.org/I124357947, https://openalex.org/I45129253 |
| authorships[1].institutions[1].country_code | GB |
| authorships[1].institutions[1].display_name | University College London |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Nicola Bordin |
| authorships[1].is_corresponding | False |
| authorships[1].raw_affiliation_strings | Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK |
| authorships[2].author.id | https://openalex.org/A5075726670 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-9601-3580 |
| authorships[2].author.display_name | Michael Heinzinger |
| authorships[2].countries | DE |
| authorships[2].affiliations[0].institution_ids | https://openalex.org/I62916508 |
| authorships[2].affiliations[0].raw_affiliation_string | TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748 Garching, Germany |
| authorships[2].affiliations[1].institution_ids | https://openalex.org/I62916508 |
| authorships[2].affiliations[1].raw_affiliation_string | TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i2, Boltzmannstr. 3, 85748 Garching/Munich, Germany |
| authorships[2].institutions[0].id | https://openalex.org/I62916508 |
| authorships[2].institutions[0].ror | https://ror.org/02kkvpp62 |
| authorships[2].institutions[0].type | education |
| authorships[2].institutions[0].lineage | https://openalex.org/I62916508 |
| authorships[2].institutions[0].country_code | DE |
| authorships[2].institutions[0].display_name | Technical University of Munich |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Michael Heinzinger |
| authorships[2].is_corresponding | False |
| authorships[2].raw_affiliation_strings | TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i2, Boltzmannstr. 3, 85748 Garching/Munich, Germany, TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748 Garching, Germany |
| authorships[3].author.id | https://openalex.org/A5062361890 |
| authorships[3].author.orcid | https://orcid.org/0000-0002-3957-412X |
| authorships[3].author.display_name | Konstantin Schütze |
| authorships[3].countries | DE |
| authorships[3].affiliations[0].institution_ids | https://openalex.org/I62916508 |
| authorships[3].affiliations[0].raw_affiliation_string | TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i2, Boltzmannstr. 3, 85748 Garching/Munich, Germany |
| authorships[3].institutions[0].id | https://openalex.org/I62916508 |
| authorships[3].institutions[0].ror | https://ror.org/02kkvpp62 |
| authorships[3].institutions[0].type | education |
| authorships[3].institutions[0].lineage | https://openalex.org/I62916508 |
| authorships[3].institutions[0].country_code | DE |
| authorships[3].institutions[0].display_name | Technical University of Munich |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Konstantin Schütze |
| authorships[3].is_corresponding | False |
| authorships[3].raw_affiliation_strings | TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i2, Boltzmannstr. 3, 85748 Garching/Munich, Germany |
| authorships[4].author.id | https://openalex.org/A5088531553 |
| authorships[4].author.orcid | https://orcid.org/0000-0003-4650-6181 |
| authorships[4].author.display_name | Christian Dallago |
| authorships[4].countries | DE |
| authorships[4].affiliations[0].institution_ids | https://openalex.org/I62916508 |
| authorships[4].affiliations[0].raw_affiliation_string | TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748 Garching, Germany |
| authorships[4].affiliations[1].institution_ids | https://openalex.org/I62916508 |
| authorships[4].affiliations[1].raw_affiliation_string | TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i2, Boltzmannstr. 3, 85748 Garching/Munich, Germany |
| authorships[4].institutions[0].id | https://openalex.org/I62916508 |
| authorships[4].institutions[0].ror | https://ror.org/02kkvpp62 |
| authorships[4].institutions[0].type | education |
| authorships[4].institutions[0].lineage | https://openalex.org/I62916508 |
| authorships[4].institutions[0].country_code | DE |
| authorships[4].institutions[0].display_name | Technical University of Munich |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Christian Dallago |
| authorships[4].is_corresponding | False |
| authorships[4].raw_affiliation_strings | TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i2, Boltzmannstr. 3, 85748 Garching/Munich, Germany, TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748 Garching, Germany |
| authorships[5].author.id | https://openalex.org/A5053149072 |
| authorships[5].author.orcid | https://orcid.org/0000-0002-7141-8936 |
| authorships[5].author.display_name | Christine Orengo |
| authorships[5].countries | GB |
| authorships[5].affiliations[0].institution_ids | https://openalex.org/I4210157240, https://openalex.org/I45129253 |
| authorships[5].affiliations[0].raw_affiliation_string | Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK |
| authorships[5].institutions[0].id | https://openalex.org/I4210157240 |
| authorships[5].institutions[0].ror | https://ror.org/05wsetc54 |
| authorships[5].institutions[0].type | facility |
| authorships[5].institutions[0].lineage | https://openalex.org/I124357947, https://openalex.org/I4210157240, https://openalex.org/I45129253, https://openalex.org/I98259816 |
| authorships[5].institutions[0].country_code | GB |
| authorships[5].institutions[0].display_name | Institute of Structural and Molecular Biology |
| authorships[5].institutions[1].id | https://openalex.org/I45129253 |
| authorships[5].institutions[1].ror | https://ror.org/02jx3x895 |
| authorships[5].institutions[1].type | education |
| authorships[5].institutions[1].lineage | https://openalex.org/I124357947, https://openalex.org/I45129253 |
| authorships[5].institutions[1].country_code | GB |
| authorships[5].institutions[1].display_name | University College London |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Christine Orengo |
| authorships[5].is_corresponding | True |
| authorships[5].raw_affiliation_strings | Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK |
| authorships[6].author.id | https://openalex.org/A5064905883 |
| authorships[6].author.orcid | https://orcid.org/0000-0003-0179-8424 |
| authorships[6].author.display_name | Burkhard Rost |
| authorships[6].countries | DE |
| authorships[6].affiliations[0].institution_ids | https://openalex.org/I62916508 |
| authorships[6].affiliations[0].raw_affiliation_string | TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i2, Boltzmannstr. 3, 85748 Garching/Munich, Germany |
| authorships[6].affiliations[1].institution_ids | https://openalex.org/I4210137766 |
| authorships[6].affiliations[1].raw_affiliation_string | Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 8578 Garching/Munich, Germany & TUM School of Life Sciences Weihenstephan (WZW), Alte Akademie 8, Freising, Germany |
| authorships[6].institutions[0].id | https://openalex.org/I4210137766 |
| authorships[6].institutions[0].ror | https://ror.org/03xg85719 |
| authorships[6].institutions[0].type | facility |
| authorships[6].institutions[0].lineage | https://openalex.org/I4210137766 |
| authorships[6].institutions[0].country_code | DE |
| authorships[6].institutions[0].display_name | Institute for Advanced Study |
| authorships[6].institutions[1].id | https://openalex.org/I62916508 |
| authorships[6].institutions[1].ror | https://ror.org/02kkvpp62 |
| authorships[6].institutions[1].type | education |
| authorships[6].institutions[1].lineage | https://openalex.org/I62916508 |
| authorships[6].institutions[1].country_code | DE |
| authorships[6].institutions[1].display_name | Technical University of Munich |
| authorships[6].author_position | last |
| authorships[6].raw_author_name | Burkhard Rost |
| authorships[6].is_corresponding | False |
| authorships[6].raw_affiliation_strings | Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 8578 Garching/Munich, Germany & TUM School of Life Sciences Weihenstephan (WZW), Alte Akademie 8, Freising, Germany, TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i2, Boltzmannstr. 3, 85748 Garching/Munich, Germany |
| has_content.pdf | True |
| has_content.grobid_xml | True |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://www.biorxiv.org/content/biorxiv/early/2021/04/01/2021.01.21.427551.full.pdf |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Clustering FunFams using sequence embeddings improves EC purity |
| has_fulltext | True |
| is_retracted | False |
| updated_date | 2025-11-06T03:46:38.306776 |
| primary_topic.id | https://openalex.org/T12254 |
| primary_topic.field.id | https://openalex.org/fields/13 |
| primary_topic.field.display_name | Biochemistry, Genetics and Molecular Biology |
| primary_topic.score | 0.998199999332428 |
| primary_topic.domain.id | https://openalex.org/domains/1 |
| primary_topic.domain.display_name | Life Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1312 |
| primary_topic.subfield.display_name | Molecular Biology |
| primary_topic.display_name | Machine Learning in Bioinformatics |
| related_works | https://openalex.org/W3006513224, https://openalex.org/W2046456988, https://openalex.org/W2357409937, https://openalex.org/W2510582230, https://openalex.org/W2978674666, https://openalex.org/W2074430941, https://openalex.org/W2113096305, https://openalex.org/W1977636359, https://openalex.org/W2772305933, https://openalex.org/W2580722822 |
| cited_by_count | 4 |
| counts_by_year[0].year | 2023 |
| counts_by_year[0].cited_by_count | 2 |
| counts_by_year[1].year | 2022 |
| counts_by_year[1].cited_by_count | 1 |
| counts_by_year[2].year | 2021 |
| counts_by_year[2].cited_by_count | 1 |
| locations_count | 1 |
| best_oa_location.id | doi:10.1101/2021.01.21.427551 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306402567 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | False |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | bioRxiv (Cold Spring Harbor Laboratory) |
| best_oa_location.source.host_organization | https://openalex.org/I2750212522 |
| best_oa_location.source.host_organization_name | Cold Spring Harbor Laboratory |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I2750212522 |
| best_oa_location.license | cc-by-nc-nd |
| best_oa_location.pdf_url | https://www.biorxiv.org/content/biorxiv/early/2021/04/01/2021.01.21.427551.full.pdf |
| best_oa_location.version | acceptedVersion |
| best_oa_location.raw_type | posted-content |
| best_oa_location.license_id | https://openalex.org/licenses/cc-by-nc-nd |
| best_oa_location.is_accepted | True |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | https://doi.org/10.1101/2021.01.21.427551 |
| primary_location.id | doi:10.1101/2021.01.21.427551 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306402567 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | False |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | bioRxiv (Cold Spring Harbor Laboratory) |
| primary_location.source.host_organization | https://openalex.org/I2750212522 |
| primary_location.source.host_organization_name | Cold Spring Harbor Laboratory |
| primary_location.source.host_organization_lineage | https://openalex.org/I2750212522 |
| primary_location.license | cc-by-nc-nd |
| primary_location.pdf_url | https://www.biorxiv.org/content/biorxiv/early/2021/04/01/2021.01.21.427551.full.pdf |
| primary_location.version | acceptedVersion |
| primary_location.raw_type | posted-content |
| primary_location.license_id | https://openalex.org/licenses/cc-by-nc-nd |
| primary_location.is_accepted | True |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | https://doi.org/10.1101/2021.01.21.427551 |
| publication_date | 2021-01-21 |
| publication_year | 2021 |
| referenced_works | https://openalex.org/W2139919097, https://openalex.org/W2108067237, https://openalex.org/W3106745904, https://openalex.org/W4239849280, https://openalex.org/W2101335101, https://openalex.org/W2989608901, https://openalex.org/W1892752174, https://openalex.org/W3013303091, https://openalex.org/W2959593662, https://openalex.org/W2050748701, https://openalex.org/W2995514860, https://openalex.org/W6763868836, https://openalex.org/W2980789587, https://openalex.org/W3040739508, https://openalex.org/W3010387158, https://openalex.org/W6748634344, https://openalex.org/W2896457183, https://openalex.org/W2943495267, https://openalex.org/W3083095175, https://openalex.org/W1989840001, https://openalex.org/W2972411752, https://openalex.org/W2104601544, https://openalex.org/W3136918052, https://openalex.org/W2133564696, https://openalex.org/W2803763037, https://openalex.org/W1965092590, https://openalex.org/W2126751256, https://openalex.org/W2018661561, https://openalex.org/W2793397305, https://openalex.org/W2951599627, https://openalex.org/W1673310716, https://openalex.org/W3146944767, https://openalex.org/W2963341956, https://openalex.org/W2962739339, https://openalex.org/W2964301648, https://openalex.org/W2101234009, https://openalex.org/W2100383158, https://openalex.org/W1604281503, https://openalex.org/W2964308564, https://openalex.org/W3158236124 |
| referenced_works_count | 40 |
| abstract_inverted_index.a | 108, 125 |
| abstract_inverted_index.7% | 65 |
| abstract_inverted_index.EC | 60, 172 |
| abstract_inverted_index.We | 74, 212 |
| abstract_inverted_index.an | 76, 184 |
| abstract_inverted_index.be | 28 |
| abstract_inverted_index.by | 87, 225 |
| abstract_inverted_index.in | 107 |
| abstract_inverted_index.of | 11, 48, 53, 57, 63, 67, 145, 191, 210, 223 |
| abstract_inverted_index.on | 164 |
| abstract_inverted_index.or | 124 |
| abstract_inverted_index.to | 27, 78, 116, 121, 136, 151, 159, 216 |
| abstract_inverted_index.we | 174, 182 |
| abstract_inverted_index.11% | 52 |
| abstract_inverted_index.For | 22 |
| abstract_inverted_index.Our | 154, 193 |
| abstract_inverted_index.all | 54 |
| abstract_inverted_index.and | 14, 62, 111, 134, 139, 230 |
| abstract_inverted_index.any | 220 |
| abstract_inverted_index.are | 232 |
| abstract_inverted_index.but | 161 |
| abstract_inverted_index.can | 7, 15, 195 |
| abstract_inverted_index.for | 178, 188, 219 |
| abstract_inverted_index.not | 157 |
| abstract_inverted_index.one | 20 |
| abstract_inverted_index.our | 9 |
| abstract_inverted_index.per | 148 |
| abstract_inverted_index.the | 122, 143, 199 |
| abstract_inverted_index.via | 234 |
| abstract_inverted_index.was | 156 |
| abstract_inverted_index.CATH | 43, 127 |
| abstract_inverted_index.Code | 229 |
| abstract_inverted_index.also | 162, 187 |
| abstract_inverted_index.been | 113 |
| abstract_inverted_index.from | 96, 102 |
| abstract_inverted_index.have | 69, 112 |
| abstract_inverted_index.help | 196 |
| abstract_inverted_index.into | 4, 45, 82 |
| abstract_inverted_index.more | 84, 207 |
| abstract_inverted_index.need | 26 |
| abstract_inverted_index.only | 32 |
| abstract_inverted_index.pure | 146 |
| abstract_inverted_index.same | 123 |
| abstract_inverted_index.such | 46 |
| abstract_inverted_index.this | 214 |
| abstract_inverted_index.with | 34, 202 |
| abstract_inverted_index.These | 93 |
| abstract_inverted_index.Thus, | 181 |
| abstract_inverted_index.Using | 130 |
| abstract_inverted_index.acids | 106 |
| abstract_inverted_index.allow | 16, 206 |
| abstract_inverted_index.amino | 105 |
| abstract_inverted_index.i.e., | 30 |
| abstract_inverted_index.other | 189, 221 |
| abstract_inverted_index.their | 89, 226 |
| abstract_inverted_index.this, | 23 |
| abstract_inverted_index.using | 167 |
| abstract_inverted_index.(1,526 | 66 |
| abstract_inverted_index.DBSCAN | 135 |
| abstract_inverted_index.FunFam | 149 |
| abstract_inverted_index.alone. | 170 |
| abstract_inverted_index.expect | 183, 213 |
| abstract_inverted_index.gained | 101 |
| abstract_inverted_index.groups | 47 |
| abstract_inverted_index.models | 98 |
| abstract_inverted_index.number | 144 |
| abstract_inverted_index.purity | 186 |
| abstract_inverted_index.random | 152 |
| abstract_inverted_index.those, | 64 |
| abstract_inverted_index.within | 19, 42 |
| abstract_inverted_index.(22,830 | 56 |
| abstract_inverted_index.22,830) | 68 |
| abstract_inverted_index.FunFams | 55, 81, 138, 160 |
| abstract_inverted_index.GitHub: | 235 |
| abstract_inverted_index.Results | 73 |
| abstract_inverted_index.aspects | 190 |
| abstract_inverted_index.between | 118, 132 |
| abstract_inverted_index.binding | 179 |
| abstract_inverted_index.cluster | 40, 80, 137 |
| abstract_inverted_index.contain | 31, 59 |
| abstract_inverted_index.created | 166 |
| abstract_inverted_index.doubled | 142 |
| abstract_inverted_index.equally | 218 |
| abstract_inverted_index.family. | 21 |
| abstract_inverted_index.further | 79, 114 |
| abstract_inverted_index.improve | 8 |
| abstract_inverted_index.limited | 158 |
| abstract_inverted_index.missing | 104 |
| abstract_inverted_index.propose | 75 |
| abstract_inverted_index.protein | 12 |
| abstract_inverted_index.results | 177, 194 |
| abstract_inverted_index.sharing | 50 |
| abstract_inverted_index.similar | 176 |
| abstract_inverted_index.succeed | 217 |
| abstract_inverted_index.through | 91 |
| abstract_inverted_index.203,639) | 58 |
| abstract_inverted_index.Abstract | 0 |
| abstract_inverted_index.Families | 38 |
| abstract_inverted_index.FunFams; | 198 |
| abstract_inverted_index.approach | 77, 155, 215 |
| abstract_inverted_index.clusters | 147, 201 |
| abstract_inverted_index.compared | 150 |
| abstract_inverted_index.encoding | 88 |
| abstract_inverted_index.families | 6, 25, 165 |
| abstract_inverted_index.function | 13 |
| abstract_inverted_index.grouping | 222 |
| abstract_inverted_index.identify | 140 |
| abstract_inverted_index.improved | 203 |
| abstract_inverted_index.language | 97 |
| abstract_inverted_index.observed | 175 |
| abstract_inverted_index.proteins | 3, 33, 41, 49, 119, 224 |
| abstract_inverted_index.reliable | 208 |
| abstract_inverted_index.sequence | 109, 168 |
| abstract_inverted_index.(FunFams) | 39 |
| abstract_inverted_index.available | 233 |
| abstract_inverted_index.belonging | 120 |
| abstract_inverted_index.different | 126 |
| abstract_inverted_index.distances | 131 |
| abstract_inverted_index.function. | 36, 51, 192 |
| abstract_inverted_index.identical | 35 |
| abstract_inverted_index.increased | 185 |
| abstract_inverted_index.inference | 209 |
| abstract_inverted_index.knowledge | 100 |
| abstract_inverted_index.optimized | 115 |
| abstract_inverted_index.originate | 95 |
| abstract_inverted_index.outliers, | 141 |
| abstract_inverted_index.resulting | 200 |
| abstract_inverted_index.sequences | 90 |
| abstract_inverted_index.succeeded | 163 |
| abstract_inverted_index.(ProtBERT) | 110 |
| abstract_inverted_index.Functional | 37 |
| abstract_inverted_index.Motivation | 1 |
| abstract_inverted_index.consistent | 85 |
| abstract_inverted_index.embeddings | 94, 133, 231 |
| abstract_inverted_index.functional | 5, 24, 71, 204 |
| abstract_inverted_index.generating | 197 |
| abstract_inverted_index.predicting | 103 |
| abstract_inverted_index.similarity | 169 |
| abstract_inverted_index.Classifying | 2 |
| abstract_inverted_index.annotations | 18, 61 |
| abstract_inverted_index.clustering. | 153 |
| abstract_inverted_index.consistency | 205 |
| abstract_inverted_index.distinguish | 117 |
| abstract_inverted_index.embeddings. | 92 |
| abstract_inverted_index.phenotypes. | 227 |
| abstract_inverted_index.superfamily | 128 |
| abstract_inverted_index.“pure”, | 29 |
| abstract_inverted_index.(PB-Tucker). | 129 |
| abstract_inverted_index.Availability | 228 |
| abstract_inverted_index.annotations, | 173 |
| abstract_inverted_index.annotations. | 72, 180, 211 |
| abstract_inverted_index.functionally | 83 |
| abstract_inverted_index.inconsistent | 70 |
| abstract_inverted_index.sub-families | 86 |
| abstract_inverted_index.transferring | 17, 99 |
| abstract_inverted_index.Complementing | 171 |
| abstract_inverted_index.superfamilies | 44 |
| abstract_inverted_index.understanding | 10 |
| abstract_inverted_index.https://github.com/Rostlab/FunFamsClustering | 236 |
| cited_by_percentile_year.max | 96 |
| cited_by_percentile_year.min | 89 |
| corresponding_author_ids | https://openalex.org/A5053149072, https://openalex.org/A5025018140 |
| countries_distinct_count | 2 |
| institutions_distinct_count | 7 |
| corresponding_institution_ids | https://openalex.org/I4210157240, https://openalex.org/I45129253, https://openalex.org/I62916508 |
| citation_normalized_percentile.value | 0.56870713 |
| citation_normalized_percentile.is_in_top_1_percent | False |
| citation_normalized_percentile.is_in_top_10_percent | False |
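
The `abstract_inverted_index.*` rows above store the abstract as a word-to-positions mapping rather than as running text. A minimal sketch of how such rows could be turned back into readable text is shown here. It assumes each row has the shape `| abstract_inverted_index.<word> | <comma-separated positions> |`, as in this table. The helper names `parse_rows` and `rebuild_text` are hypothetical, and the sample rows are an excerpt whose position lists are truncated to positions 7 through 15.

```python
PREFIX = "abstract_inverted_index."


def parse_rows(rows):
    """Collect {word: [positions]} from flattened table rows.

    Rows that do not carry the inverted-index prefix are ignored.
    """
    index = {}
    for row in rows:
        cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
        if len(cells) != 2 or not cells[0].startswith(PREFIX):
            continue
        # Strip only the prefix, so words containing dots (e.g. "i.e.,") survive.
        word = cells[0][len(PREFIX):]
        index[word] = [int(pos) for pos in cells[1].split(",")]
    return index


def rebuild_text(index):
    """Order every (position, word) pair and join the words with spaces."""
    pairs = sorted(
        (pos, word) for word, positions in index.items() for pos in positions
    )
    return " ".join(word for _, word in pairs)


# Excerpt only: position lists are truncated to the range 7..15 for illustration.
sample_rows = [
    "| abstract_inverted_index.of | 11 |",
    "| abstract_inverted_index.and | 14 |",
    "| abstract_inverted_index.can | 7, 15 |",
    "| abstract_inverted_index.our | 9 |",
    "| abstract_inverted_index.improve | 8 |",
    "| abstract_inverted_index.protein | 12 |",
    "| abstract_inverted_index.function | 13 |",
    "| abstract_inverted_index.understanding | 10 |",
    "| publication_year | 2021 |",  # unrelated row, skipped by parse_rows
]

excerpt = rebuild_text(parse_rows(sample_rows))
print(excerpt)  # can improve our understanding of protein function and can
```

Sorting by position handles words that occur more than once, since each occurrence contributes its own pair. Missing positions are simply skipped, so a truncated index yields a shortened text instead of an error.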