Fragment-based Pretraining and Finetuning on Molecular Graphs
2023 · Open Access · DOI: https://doi.org/10.48550/arxiv.2310.03274
Property prediction on molecular graphs is an important application of Graph Neural Networks (GNNs). Recently, unlabeled molecular data has become abundant, which facilitates the rapid development of self-supervised learning for GNNs in the chemical domain. In this work, we propose pretraining GNNs at the fragment level, a promising middle ground that overcomes the limitations of node-level and graph-level pretraining. Borrowing techniques from recent work on principal subgraph mining, we obtain a compact vocabulary of prevalent fragments from a large pretraining dataset. From the extracted vocabulary, we introduce several fragment-based contrastive and predictive pretraining tasks. The contrastive learning task jointly pretrains two different GNNs: one on molecular graphs and the other on fragment graphs, which represent higher-order connectivity within molecules. By enforcing consistency between each fragment embedding and the aggregated embedding of the corresponding atoms from the molecular graph, we ensure that the embeddings capture structural information at multiple resolutions. The structural information of fragment graphs is further exploited to extract auxiliary labels for graph-level predictive pretraining. We employ both the pretrained molecular-based and fragment-based GNNs for downstream prediction, thus utilizing the fragment information during finetuning. Our graph fragment-based pretraining (GraphFP) advances performance on 5 out of 8 common molecular benchmarks and improves performance on long-range biological benchmarks by at least 11.5%. Code is available at: https://github.com/lvkd84/GraphFP.
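The consistency objective described in the abstract can be illustrated with a small sketch. This is not the GraphFP implementation: the mean pooling, the InfoNCE-style loss, the temperature value, and every name below (`aggregate_atoms`, `info_nce`, `atom_to_fragment`) are assumptions chosen for illustration, and random vectors stand in for the outputs of the two GNNs.

```python
import numpy as np

def aggregate_atoms(atom_emb, atom_to_fragment, num_fragments):
    """Mean-pool atom embeddings into one vector per fragment (assumed aggregator)."""
    sums = np.zeros((num_fragments, atom_emb.shape[1]))
    counts = np.zeros(num_fragments)
    np.add.at(sums, atom_to_fragment, atom_emb)   # scatter-add atom rows per fragment
    np.add.at(counts, atom_to_fragment, 1.0)      # number of atoms in each fragment
    return sums / np.maximum(counts, 1.0)[:, None]

def info_nce(pooled, frag_emb, temperature=0.1, eps=1e-12):
    """InfoNCE-style loss: pooled[i] should match frag_emb[i] and no other row."""
    a = pooled / (np.linalg.norm(pooled, axis=1, keepdims=True) + eps)
    b = frag_emb / (np.linalg.norm(frag_emb, axis=1, keepdims=True) + eps)
    logits = a @ b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

# Toy molecule: 6 atoms grouped into 3 fragments (hypothetical assignment).
rng = np.random.default_rng(0)
atom_emb = rng.normal(size=(6, 4))            # stand-in for molecular-GNN output
atom_to_fragment = np.array([0, 0, 0, 1, 1, 2])
pooled = aggregate_atoms(atom_emb, atom_to_fragment, num_fragments=3)

# Stand-ins for fragment-GNN output: perfectly consistent vs. mismatched.
aligned_loss = info_nce(pooled, pooled)
shuffled_loss = info_nce(pooled, pooled[[1, 2, 0]])
print(aligned_loss < shuffled_loss)
```

The loss is lowest when each fragment embedding agrees with the pooled embedding of its own atoms, which is the sense in which the two views are kept consistent; the objective actually used in the paper may differ in its details.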
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2310.03274, https://arxiv.org/pdf/2310.03274
- OA Status: green
- Cited By: 3
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4387432253
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4387432253 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2310.03274 (Digital Object Identifier)
- Title: Fragment-based Pretraining and Finetuning on Molecular Graphs (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2023 (year of publication)
- Publication date: 2023-10-05 (full publication date if available)
- Authors: Kha-Dinh Luong, Ambuj K. Singh (list of authors in order)
- Landing page: https://arxiv.org/abs/2310.03274 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2310.03274 (direct link to full-text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2310.03274 (direct OA link when available)
- Concepts: Fragment (logic), Computer science, Embedding, Molecular graph, Graph, Theoretical computer science, Artificial intelligence, Algorithm (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 3 (total citation count in OpenAlex)
- Citations by year (recent): 2025: 2, 2024: 1 (per-year citation counts, last 5 years)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4387432253 |
| doi | https://doi.org/10.48550/arxiv.2310.03274 |
| ids.doi | https://doi.org/10.48550/arxiv.2310.03274 |
| ids.openalex | https://openalex.org/W4387432253 |
| fwci | |
| type | preprint |
| title | Fragment-based Pretraining and Finetuning on Molecular Graphs |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10211 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9995999932289124 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1703 |
| topics[0].subfield.display_name | Computational Theory and Mathematics |
| topics[0].display_name | Computational Drug Discovery Methods |
| topics[1].id | https://openalex.org/T11948 |
| topics[1].field.id | https://openalex.org/fields/25 |
| topics[1].field.display_name | Materials Science |
| topics[1].score | 0.9983999729156494 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/2505 |
| topics[1].subfield.display_name | Materials Chemistry |
| topics[1].display_name | Machine Learning in Materials Science |
| topics[2].id | https://openalex.org/T11273 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.987500011920929 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1702 |
| topics[2].subfield.display_name | Artificial Intelligence |
| topics[2].display_name | Advanced Graph Neural Networks |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C2776235265 |
| concepts[0].level | 2 |
| concepts[0].score | 0.8022372722625732 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q18392052 |
| concepts[0].display_name | Fragment (logic) |
| concepts[1].id | https://openalex.org/C41008148 |
| concepts[1].level | 0 |
| concepts[1].score | 0.7668827176094055 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[1].display_name | Computer science |
| concepts[2].id | https://openalex.org/C41608201 |
| concepts[2].level | 2 |
| concepts[2].score | 0.5686712861061096 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q980509 |
| concepts[2].display_name | Embedding |
| concepts[3].id | https://openalex.org/C2780022179 |
| concepts[3].level | 3 |
| concepts[3].score | 0.48552775382995605 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q1986794 |
| concepts[3].display_name | Molecular graph |
| concepts[4].id | https://openalex.org/C132525143 |
| concepts[4].level | 2 |
| concepts[4].score | 0.4608132243156433 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q141488 |
| concepts[4].display_name | Graph |
| concepts[5].id | https://openalex.org/C80444323 |
| concepts[5].level | 1 |
| concepts[5].score | 0.4045037627220154 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q2878974 |
| concepts[5].display_name | Theoretical computer science |
| concepts[6].id | https://openalex.org/C154945302 |
| concepts[6].level | 1 |
| concepts[6].score | 0.39177289605140686 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[6].display_name | Artificial intelligence |
| concepts[7].id | https://openalex.org/C11413529 |
| concepts[7].level | 1 |
| concepts[7].score | 0.19222700595855713 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q8366 |
| concepts[7].display_name | Algorithm |
| keywords[0].id | https://openalex.org/keywords/fragment |
| keywords[0].score | 0.8022372722625732 |
| keywords[0].display_name | Fragment (logic) |
| keywords[1].id | https://openalex.org/keywords/computer-science |
| keywords[1].score | 0.7668827176094055 |
| keywords[1].display_name | Computer science |
| keywords[2].id | https://openalex.org/keywords/embedding |
| keywords[2].score | 0.5686712861061096 |
| keywords[2].display_name | Embedding |
| keywords[3].id | https://openalex.org/keywords/molecular-graph |
| keywords[3].score | 0.48552775382995605 |
| keywords[3].display_name | Molecular graph |
| keywords[4].id | https://openalex.org/keywords/graph |
| keywords[4].score | 0.4608132243156433 |
| keywords[4].display_name | Graph |
| keywords[5].id | https://openalex.org/keywords/theoretical-computer-science |
| keywords[5].score | 0.4045037627220154 |
| keywords[5].display_name | Theoretical computer science |
| keywords[6].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[6].score | 0.39177289605140686 |
| keywords[6].display_name | Artificial intelligence |
| keywords[7].id | https://openalex.org/keywords/algorithm |
| keywords[7].score | 0.19222700595855713 |
| keywords[7].display_name | Algorithm |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2310.03274 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2310.03274 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2310.03274 |
| locations[1].id | doi:10.48550/arxiv.2310.03274 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2310.03274 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5093023199 |
| authorships[0].author.orcid | https://orcid.org/0009-0003-6919-4528 |
| authorships[0].author.display_name | Kha-Dinh Luong |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Luong, Kha-Dinh |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5036639779 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-1997-7140 |
| authorships[1].author.display_name | Ambuj K. Singh |
| authorships[1].author_position | last |
| authorships[1].raw_author_name | Singh, Ambuj |
| authorships[1].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2310.03274 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Fragment-based Pretraining and Finetuning on Molecular Graphs |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10211 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9995999932289124 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1703 |
| primary_topic.subfield.display_name | Computational Theory and Mathematics |
| primary_topic.display_name | Computational Drug Discovery Methods |
| related_works | https://openalex.org/W1973480752, https://openalex.org/W2805502594, https://openalex.org/W3132641048, https://openalex.org/W4253208712, https://openalex.org/W31439402, https://openalex.org/W2217679042, https://openalex.org/W2932872266, https://openalex.org/W2013842271, https://openalex.org/W2969553894, https://openalex.org/W2808877228 |
| cited_by_count | 3 |
| counts_by_year[0].year | 2025 |
| counts_by_year[0].cited_by_count | 2 |
| counts_by_year[1].year | 2024 |
| counts_by_year[1].cited_by_count | 1 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2310.03274 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2310.03274 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2310.03274 |
| primary_location.id | pmh:oai:arXiv.org:2310.03274 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2310.03274 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2310.03274 |
| publication_date | 2023-10-05 |
| publication_year | 2023 |
| referenced_works_count | 0 |
| abstract_inverted_index.5 | 193 |
| abstract_inverted_index.8 | 196 |
| abstract_inverted_index.a | 45, 69, 76 |
| abstract_inverted_index.By | 118 |
| abstract_inverted_index.In | 34 |
| abstract_inverted_index.We | 165 |
| abstract_inverted_index.an | 6 |
| abstract_inverted_index.at | 41, 145, 209 |
| abstract_inverted_index.by | 208 |
| abstract_inverted_index.in | 30 |
| abstract_inverted_index.is | 5, 154, 213 |
| abstract_inverted_index.of | 9, 25, 53, 72, 129, 151, 195 |
| abstract_inverted_index.on | 2, 63, 103, 109, 192, 204 |
| abstract_inverted_index.to | 49, 157 |
| abstract_inverted_index.we | 37, 67, 84, 137 |
| abstract_inverted_index.Our | 184 |
| abstract_inverted_index.The | 93, 148 |
| abstract_inverted_index.and | 55, 89, 106, 125, 171, 200 |
| abstract_inverted_index.at: | 215 |
| abstract_inverted_index.for | 28, 161, 174 |
| abstract_inverted_index.has | 17 |
| abstract_inverted_index.one | 102 |
| abstract_inverted_index.out | 194 |
| abstract_inverted_index.the | 22, 31, 42, 51, 81, 107, 122, 126, 130, 134, 140, 168, 179, 190, 202 |
| abstract_inverted_index.two | 99 |
| abstract_inverted_index.Code | 212 |
| abstract_inverted_index.From | 80 |
| abstract_inverted_index.GNNs | 29, 40, 173 |
| abstract_inverted_index.both | 167 |
| abstract_inverted_index.data | 16 |
| abstract_inverted_index.from | 60, 75, 133 |
| abstract_inverted_index.task | 96 |
| abstract_inverted_index.that | 139 |
| abstract_inverted_index.this | 35 |
| abstract_inverted_index.thus | 177 |
| abstract_inverted_index.work | 62 |
| abstract_inverted_index.GNNs: | 101 |
| abstract_inverted_index.Graph | 10 |
| abstract_inverted_index.atoms | 132 |
| abstract_inverted_index.graph | 185 |
| abstract_inverted_index.large | 77 |
| abstract_inverted_index.least | 210 |
| abstract_inverted_index.other | 108 |
| abstract_inverted_index.rapid | 23 |
| abstract_inverted_index.which | 20, 112 |
| abstract_inverted_index.work, | 36 |
| abstract_inverted_index.11.5%. | 211 |
| abstract_inverted_index.Neural | 11 |
| abstract_inverted_index.become | 18 |
| abstract_inverted_index.common | 197 |
| abstract_inverted_index.during | 182 |
| abstract_inverted_index.employ | 166 |
| abstract_inverted_index.ensure | 138 |
| abstract_inverted_index.graphs | 4, 105, 153 |
| abstract_inverted_index.ground | 48 |
| abstract_inverted_index.labels | 160 |
| abstract_inverted_index.level, | 44 |
| abstract_inverted_index.middle | 47 |
| abstract_inverted_index.obtain | 68 |
| abstract_inverted_index.recent | 61 |
| abstract_inverted_index.tasks. | 92 |
| abstract_inverted_index.within | 116 |
| abstract_inverted_index.between | 121 |
| abstract_inverted_index.capture | 142 |
| abstract_inverted_index.compact | 70 |
| abstract_inverted_index.domain. | 33 |
| abstract_inverted_index.extract | 158 |
| abstract_inverted_index.further | 155 |
| abstract_inverted_index.graphs, | 111, 136 |
| abstract_inverted_index.jointly | 97 |
| abstract_inverted_index.mining, | 66 |
| abstract_inverted_index.propose | 38 |
| abstract_inverted_index.several | 86 |
| abstract_inverted_index.Property | 0 |
| abstract_inverted_index.advances | 189 |
| abstract_inverted_index.chemical | 32 |
| abstract_inverted_index.dataset. | 79 |
| abstract_inverted_index.fragment | 43, 110, 123, 152, 180 |
| abstract_inverted_index.improves | 201 |
| abstract_inverted_index.learning | 27, 95 |
| abstract_inverted_index.multiple | 146 |
| abstract_inverted_index.overcome | 50 |
| abstract_inverted_index.subgraph | 65 |
| abstract_inverted_index.(GraphFP) | 188 |
| abstract_inverted_index.Borrowing | 58 |
| abstract_inverted_index.Networks. | 12 |
| abstract_inverted_index.Recently, | 13 |
| abstract_inverted_index.abundant, | 19 |
| abstract_inverted_index.auxiliary | 159 |
| abstract_inverted_index.available | 214 |
| abstract_inverted_index.different | 100 |
| abstract_inverted_index.embedding | 124, 128 |
| abstract_inverted_index.enforcing | 119 |
| abstract_inverted_index.exploited | 156 |
| abstract_inverted_index.extracted | 82 |
| abstract_inverted_index.fragments | 74 |
| abstract_inverted_index.important | 7 |
| abstract_inverted_index.introduce | 85 |
| abstract_inverted_index.molecular | 3, 15, 104, 135, 198 |
| abstract_inverted_index.pretrains | 98 |
| abstract_inverted_index.prevalent | 73 |
| abstract_inverted_index.principal | 64 |
| abstract_inverted_index.promising | 46 |
| abstract_inverted_index.unlabeled | 14 |
| abstract_inverted_index.utilizing | 178 |
| abstract_inverted_index.aggregated | 127 |
| abstract_inverted_index.benchmarks | 199, 207 |
| abstract_inverted_index.biological | 206 |
| abstract_inverted_index.downstream | 175 |
| abstract_inverted_index.embeddings | 141 |
| abstract_inverted_index.long-range | 205 |
| abstract_inverted_index.molecules. | 117 |
| abstract_inverted_index.node-level | 54 |
| abstract_inverted_index.prediction | 1 |
| abstract_inverted_index.predictive | 90, 163 |
| abstract_inverted_index.pretrained | 169 |
| abstract_inverted_index.represents | 113 |
| abstract_inverted_index.structural | 143, 149 |
| abstract_inverted_index.techniques | 59 |
| abstract_inverted_index.vocabulary | 71 |
| abstract_inverted_index.application | 8 |
| abstract_inverted_index.consistency | 120 |
| abstract_inverted_index.contrastive | 88, 94 |
| abstract_inverted_index.development | 24 |
| abstract_inverted_index.facilitates | 21 |
| abstract_inverted_index.finetuning. | 183 |
| abstract_inverted_index.graph-level | 56, 162 |
| abstract_inverted_index.information | 144, 150, 181 |
| abstract_inverted_index.limitations | 52 |
| abstract_inverted_index.prediction, | 176 |
| abstract_inverted_index.pretraining | 39, 78, 91, 187 |
| abstract_inverted_index.vocabulary, | 83 |
| abstract_inverted_index.connectivity | 115 |
| abstract_inverted_index.higher-order | 114 |
| abstract_inverted_index.performances | 191, 203 |
| abstract_inverted_index.pretraining. | 57, 164 |
| abstract_inverted_index.resolutions. | 147 |
| abstract_inverted_index.corresponding | 131 |
| abstract_inverted_index.fragment-based | 87, 172, 186 |
| abstract_inverted_index.molecular-based | 170 |
| abstract_inverted_index.self-supervised | 26 |
| abstract_inverted_index.https://github.com/lvkd84/GraphFP. | 216 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 2 |
| sustainable_development_goals[0].id | https://metadata.un.org/sdg/4 |
| sustainable_development_goals[0].score | 0.41999998688697815 |
| sustainable_development_goals[0].display_name | Quality Education |
| citation_normalized_percentile | |
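The `abstract_inverted_index` rows in the payload above map each token to the positions where it occurs in the abstract. A minimal sketch of turning such an index back into running text follows; the function name `rebuild_abstract` is hypothetical, and `sample` contains only the tokens at positions 0 to 12 copied from the table, not the full index.

```python
def rebuild_abstract(inverted_index):
    """Rebuild text from a {token: [positions]} inverted index."""
    by_position = {}
    for token, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = token
    # Emit tokens in ascending position order, separated by single spaces.
    return " ".join(by_position[pos] for pos in sorted(by_position))

# Subset of the payload rows (positions 0-12 only).
sample = {
    "Property": [0],
    "prediction": [1],
    "on": [2],
    "molecular": [3],
    "graphs": [4],
    "is": [5],
    "an": [6],
    "important": [7],
    "application": [8],
    "of": [9],
    "Graph": [10],
    "Neural": [11],
    "Networks.": [12],
}

first_sentence = rebuild_abstract(sample)
print(first_sentence)
# Property prediction on molecular graphs is an important application of Graph Neural Networks.
```

Running the same function on the complete index (positions 0 to 216) would reproduce the full abstract, with punctuation preserved because it is stored as part of each token.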