Gradient Knowledge Distillation for Pre-trained Language Models
2022 · Open Access · DOI: https://doi.org/10.48550/arxiv.2211.01071
Knowledge distillation (KD) is an effective framework to transfer knowledge from a large-scale teacher to a compact yet well-performing student. Previous KD practices for pre-trained language models mainly transfer knowledge by aligning instance-wise outputs between the teacher and student, while neglecting an important knowledge source, i.e., the gradient of the teacher. The gradient characterizes how the teacher responds to changes in inputs, which we assume is beneficial for the student to better approximate the underlying mapping function of the teacher. Therefore, we propose Gradient Knowledge Distillation (GKD) to incorporate the gradient alignment objective into the distillation process. Experimental results show that GKD outperforms previous KD methods regarding student performance. Further analysis shows that incorporating gradient knowledge makes the student behave more consistently with the teacher, improving the interpretability greatly.
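To make the idea of gradient alignment concrete, here is a minimal sketch. It is not the paper's objective, which the abstract does not spell out. It assumes a toy logistic model `p = sigmoid(w . x)`, squared error for both terms, and a hypothetical weighting parameter `grad_weight`; all function names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, x):
    """Probability from a toy logistic model p = sigmoid(w . x)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def input_gradient(w, x):
    """Analytic gradient of p with respect to the input x: p * (1 - p) * w."""
    p = predict(w, x)
    return [p * (1.0 - p) * wi for wi in w]


def gkd_style_loss(w_teacher, w_student, x, grad_weight=1.0):
    """Output-alignment term plus a weighted gradient-alignment term.

    Both terms use squared error here; that choice is an assumption
    made for illustration, not taken from the paper.
    """
    p_t, p_s = predict(w_teacher, x), predict(w_student, x)
    output_term = (p_s - p_t) ** 2
    g_t, g_s = input_gradient(w_teacher, x), input_gradient(w_student, x)
    gradient_term = sum((a - b) ** 2 for a, b in zip(g_s, g_t)) / len(x)
    return output_term + grad_weight * gradient_term


teacher = [1.0, -1.0]
student = [2.0, -2.0]  # same output as the teacher at x, different sensitivity
x = [1.0, 1.0]

print(gkd_style_loss(teacher, student, x, grad_weight=0.0))  # output term only
print(gkd_style_loss(teacher, student, x, grad_weight=1.0))  # with gradient term
```

In this toy case the student matches the teacher's output at `x` exactly, so an output-only loss is zero, while the gradient term is non-zero because the two models respond differently to changes in the input.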
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2211.01071, https://arxiv.org/pdf/2211.01071
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4308167517
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4308167517 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2211.01071 (Digital Object Identifier)
- Title: Gradient Knowledge Distillation for Pre-trained Language Models (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2022 (year of publication)
- Publication date: 2022-11-02 (full publication date if available)
- Authors: Lean Wang, Lei Li, Xu Sun (list of authors in order)
- Landing page: https://arxiv.org/abs/2211.01071 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2211.01071 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2211.01071 (direct OA link when available)
- Concepts: Interpretability, Distillation, Computer science, Function (biology), Process (computing), Knowledge transfer, Artificial intelligence, Chemistry, Knowledge management, Chromatography, Biology, Evolutionary biology, Operating system (top concepts attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4308167517 |
| doi | https://doi.org/10.48550/arxiv.2211.01071 |
| ids.doi | https://doi.org/10.48550/arxiv.2211.01071 |
| ids.openalex | https://openalex.org/W4308167517 |
| fwci | |
| type | preprint |
| title | Gradient Knowledge Distillation for Pre-trained Language Models |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10028 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.998199999332428 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Topic Modeling |
| topics[1].id | https://openalex.org/T10181 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9957000017166138 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | Natural Language Processing Techniques |
| topics[2].id | https://openalex.org/T11714 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9815999865531921 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1707 |
| topics[2].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[2].display_name | Multimodal Machine Learning Applications |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C2781067378 |
| concepts[0].level | 2 |
| concepts[0].score | 0.8487157225608826 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q17027399 |
| concepts[0].display_name | Interpretability |
| concepts[1].id | https://openalex.org/C204030448 |
| concepts[1].level | 2 |
| concepts[1].score | 0.7603917121887207 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q101017 |
| concepts[1].display_name | Distillation |
| concepts[2].id | https://openalex.org/C41008148 |
| concepts[2].level | 0 |
| concepts[2].score | 0.5788539052009583 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[2].display_name | Computer science |
| concepts[3].id | https://openalex.org/C14036430 |
| concepts[3].level | 2 |
| concepts[3].score | 0.4743184745311737 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q3736076 |
| concepts[3].display_name | Function (biology) |
| concepts[4].id | https://openalex.org/C98045186 |
| concepts[4].level | 2 |
| concepts[4].score | 0.4354100823402405 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q205663 |
| concepts[4].display_name | Process (computing) |
| concepts[5].id | https://openalex.org/C2776960227 |
| concepts[5].level | 2 |
| concepts[5].score | 0.41311943531036377 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q2586354 |
| concepts[5].display_name | Knowledge transfer |
| concepts[6].id | https://openalex.org/C154945302 |
| concepts[6].level | 1 |
| concepts[6].score | 0.3444150388240814 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[6].display_name | Artificial intelligence |
| concepts[7].id | https://openalex.org/C185592680 |
| concepts[7].level | 0 |
| concepts[7].score | 0.2518218159675598 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q2329 |
| concepts[7].display_name | Chemistry |
| concepts[8].id | https://openalex.org/C56739046 |
| concepts[8].level | 1 |
| concepts[8].score | 0.20594826340675354 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q192060 |
| concepts[8].display_name | Knowledge management |
| concepts[9].id | https://openalex.org/C43617362 |
| concepts[9].level | 1 |
| concepts[9].score | 0.19083502888679504 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q170050 |
| concepts[9].display_name | Chromatography |
| concepts[10].id | https://openalex.org/C86803240 |
| concepts[10].level | 0 |
| concepts[10].score | 0.0 |
| concepts[10].wikidata | https://www.wikidata.org/wiki/Q420 |
| concepts[10].display_name | Biology |
| concepts[11].id | https://openalex.org/C78458016 |
| concepts[11].level | 1 |
| concepts[11].score | 0.0 |
| concepts[11].wikidata | https://www.wikidata.org/wiki/Q840400 |
| concepts[11].display_name | Evolutionary biology |
| concepts[12].id | https://openalex.org/C111919701 |
| concepts[12].level | 1 |
| concepts[12].score | 0.0 |
| concepts[12].wikidata | https://www.wikidata.org/wiki/Q9135 |
| concepts[12].display_name | Operating system |
| keywords[0].id | https://openalex.org/keywords/interpretability |
| keywords[0].score | 0.8487157225608826 |
| keywords[0].display_name | Interpretability |
| keywords[1].id | https://openalex.org/keywords/distillation |
| keywords[1].score | 0.7603917121887207 |
| keywords[1].display_name | Distillation |
| keywords[2].id | https://openalex.org/keywords/computer-science |
| keywords[2].score | 0.5788539052009583 |
| keywords[2].display_name | Computer science |
| keywords[3].id | https://openalex.org/keywords/function |
| keywords[3].score | 0.4743184745311737 |
| keywords[3].display_name | Function (biology) |
| keywords[4].id | https://openalex.org/keywords/process |
| keywords[4].score | 0.4354100823402405 |
| keywords[4].display_name | Process (computing) |
| keywords[5].id | https://openalex.org/keywords/knowledge-transfer |
| keywords[5].score | 0.41311943531036377 |
| keywords[5].display_name | Knowledge transfer |
| keywords[6].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[6].score | 0.3444150388240814 |
| keywords[6].display_name | Artificial intelligence |
| keywords[7].id | https://openalex.org/keywords/chemistry |
| keywords[7].score | 0.2518218159675598 |
| keywords[7].display_name | Chemistry |
| keywords[8].id | https://openalex.org/keywords/knowledge-management |
| keywords[8].score | 0.20594826340675354 |
| keywords[8].display_name | Knowledge management |
| keywords[9].id | https://openalex.org/keywords/chromatography |
| keywords[9].score | 0.19083502888679504 |
| keywords[9].display_name | Chromatography |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2211.01071 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2211.01071 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2211.01071 |
| locations[1].id | doi:10.48550/arxiv.2211.01071 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2211.01071 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5034611488 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Lean Wang |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Wang, Lean |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5100440277 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-0688-9619 |
| authorships[1].author.display_name | Lei Li |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Li, Lei |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5111863979 |
| authorships[2].author.orcid | https://orcid.org/0000-0001-5389-7251 |
| authorships[2].author.display_name | Xu Sun |
| authorships[2].author_position | last |
| authorships[2].raw_author_name | Sun, Xu |
| authorships[2].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2211.01071 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Gradient Knowledge Distillation for Pre-trained Language Models |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10028 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.998199999332428 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Topic Modeling |
| related_works | https://openalex.org/W2905433371, https://openalex.org/W4390569940, https://openalex.org/W2888392564, https://openalex.org/W4361193272, https://openalex.org/W4310278675, https://openalex.org/W4388422664, https://openalex.org/W2806259446, https://openalex.org/W2963326959, https://openalex.org/W4312407344, https://openalex.org/W2894289927 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2211.01071 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2211.01071 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2211.01071 |
| primary_location.id | pmh:oai:arXiv.org:2211.01071 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2211.01071 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2211.01071 |
| publication_date | 2022-11-02 |
| publication_year | 2022 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 11, 15 |
| abstract_inverted_index.KD | 21, 104 |
| abstract_inverted_index.an | 4, 41 |
| abstract_inverted_index.by | 30 |
| abstract_inverted_index.in | 60 |
| abstract_inverted_index.is | 3, 65 |
| abstract_inverted_index.of | 48, 77 |
| abstract_inverted_index.to | 7, 14, 58, 70, 87 |
| abstract_inverted_index.we | 63, 81 |
| abstract_inverted_index.GKD | 101 |
| abstract_inverted_index.The | 51 |
| abstract_inverted_index.and | 37 |
| abstract_inverted_index.for | 23, 67 |
| abstract_inverted_index.how | 54 |
| abstract_inverted_index.the | 35, 46, 49, 55, 68, 73, 78, 89, 94, 117, 123, 126 |
| abstract_inverted_index.yet | 17 |
| abstract_inverted_index.(KD) | 2 |
| abstract_inverted_index.from | 10 |
| abstract_inverted_index.into | 93 |
| abstract_inverted_index.more | 120 |
| abstract_inverted_index.show | 99 |
| abstract_inverted_index.that | 100, 112 |
| abstract_inverted_index.with | 122 |
| abstract_inverted_index.(GKD) | 86 |
| abstract_inverted_index.i.e., | 45 |
| abstract_inverted_index.makes | 116 |
| abstract_inverted_index.shows | 111 |
| abstract_inverted_index.which | 62 |
| abstract_inverted_index.while | 39 |
| abstract_inverted_index.assume | 64 |
| abstract_inverted_index.behave | 119 |
| abstract_inverted_index.better | 71 |
| abstract_inverted_index.mainly | 27 |
| abstract_inverted_index.models | 26 |
| abstract_inverted_index.Further | 109 |
| abstract_inverted_index.between | 34 |
| abstract_inverted_index.changes | 59 |
| abstract_inverted_index.compact | 16 |
| abstract_inverted_index.inputs, | 61 |
| abstract_inverted_index.mapping | 75 |
| abstract_inverted_index.methods | 105 |
| abstract_inverted_index.outputs | 33 |
| abstract_inverted_index.propose | 82 |
| abstract_inverted_index.results | 98 |
| abstract_inverted_index.source, | 44 |
| abstract_inverted_index.student | 69, 107, 118 |
| abstract_inverted_index.teacher | 13, 36, 56 |
| abstract_inverted_index.Gradient | 83 |
| abstract_inverted_index.Previous | 20 |
| abstract_inverted_index.aligning | 31 |
| abstract_inverted_index.analysis | 110 |
| abstract_inverted_index.function | 76 |
| abstract_inverted_index.gradient | 47, 52, 90, 114 |
| abstract_inverted_index.greatly. | 128 |
| abstract_inverted_index.language | 25 |
| abstract_inverted_index.previous | 103 |
| abstract_inverted_index.process. | 96 |
| abstract_inverted_index.responds | 57 |
| abstract_inverted_index.student, | 38 |
| abstract_inverted_index.student. | 19 |
| abstract_inverted_index.teacher, | 124 |
| abstract_inverted_index.teacher. | 50, 79 |
| abstract_inverted_index.transfer | 8, 28 |
| abstract_inverted_index.Knowledge | 0, 84 |
| abstract_inverted_index.alignment | 91 |
| abstract_inverted_index.effective | 5 |
| abstract_inverted_index.framework | 6 |
| abstract_inverted_index.important | 42 |
| abstract_inverted_index.improving | 125 |
| abstract_inverted_index.knowledge | 9, 29, 43, 115 |
| abstract_inverted_index.objective | 92 |
| abstract_inverted_index.practices | 22 |
| abstract_inverted_index.regarding | 106 |
| abstract_inverted_index.Therefore, | 80 |
| abstract_inverted_index.beneficial | 66 |
| abstract_inverted_index.neglecting | 40 |
| abstract_inverted_index.underlying | 74 |
| abstract_inverted_index.approximate | 72 |
| abstract_inverted_index.incorporate | 88 |
| abstract_inverted_index.large-scale | 12 |
| abstract_inverted_index.outperforms | 102 |
| abstract_inverted_index.pre-trained | 24 |
| abstract_inverted_index.Distillation | 85 |
| abstract_inverted_index.Experimental | 97 |
| abstract_inverted_index.consistently | 121 |
| abstract_inverted_index.distillation | 1, 95 |
| abstract_inverted_index.performance. | 108 |
| abstract_inverted_index.characterizes | 53 |
| abstract_inverted_index.incorporating | 113 |
| abstract_inverted_index.instance-wise | 32 |
| abstract_inverted_index.well-performing | 18 |
| abstract_inverted_index.interpretability | 127 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 3 |
| sustainable_development_goals[0].id | https://metadata.un.org/sdg/4 |
| sustainable_development_goals[0].score | 0.8799999952316284 |
| sustainable_development_goals[0].display_name | Quality Education |
| citation_normalized_percentile | |
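
The `abstract_inverted_index` rows in the payload table store the abstract as a map from each token to the positions where it occurs. The sketch below shows one way to turn such a map back into running text. It assumes the rows have already been parsed into a Python dict of token to list of integer positions; the function name `rebuild_abstract` is hypothetical, and the sample is a hand-picked excerpt limited to positions 0 through 6.

```python
def rebuild_abstract(inverted_index):
    """Place every token at each of its positions, then join in position order."""
    positions = {}
    for token, indexes in inverted_index.items():
        for i in indexes:
            positions[i] = token
    return " ".join(positions[i] for i in sorted(positions))


# Excerpt of the payload rows, restricted to positions 0-6 for brevity.
sample = {
    "Knowledge": [0],
    "distillation": [1],
    "(KD)": [2],
    "is": [3],
    "an": [4],
    "effective": [5],
    "framework": [6],
}

print(rebuild_abstract(sample))
# prints: Knowledge distillation (KD) is an effective framework
```

Tokens that occur more than once simply list several positions, so the same function handles the full index without changes.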