Large Language Models to Estimate CGI-S From Clinical Notes as a Measure of Depression Severity (Preprint)
2025 · Open Access · DOI: https://doi.org/10.2196/preprints.86906
BACKGROUND
Real-world psychiatric care is marked by wide heterogeneity in clinical presentations and outcomes, underscoring the need for systematic approaches to outcome measurement. The Clinical Global Impression–Severity (CGI-S) scale is a brief, clinician-rated measure of overall illness severity that is widely used in psychiatric research but rarely documented in routine care. Large language models (LLMs) may enable automated extraction of CGI-S scores from narrative clinical notes, providing scalable outcome measures for real-world clinical care and research.

OBJECTIVE
This study aimed to evaluate whether LLMs can estimate CGI-S scores from psychiatric clinical notes in patients with major depressive disorder (MDD) by first generating a clinician consensus gold standard dataset and then comparing model-generated scores against it for validation.

METHODS
We used data from the Johns Hopkins electronic health record. Three psychiatrists independently rated 77 clinical notes using a validated depression-specific CGI rubric. Weighted Cohen’s kappa (κ) coefficients were calculated to assess interrater reliability and model–human agreement. Two prompting strategies, zero-shot and few-shot, were tested using GPT-4o, and model ratings were compared against the average human ratings. Exploratory analyses evaluated whether agreement varied by patient demographics, care setting, or note length.

RESULTS
Interrater reliability among psychiatrists was high (κ = 0.77–0.78). Agreement between model-generated and average human ratings was similarly strong (κ = 0.85) and was even higher for notes on which all three raters were in complete agreement (κ = 0.88). Weighted κ values remained consistently high across all subgroups (0.82–0.89), with no significant differences by age, sex, race, treatment location, or note length.

CONCLUSIONS
LLMs can accurately estimate clinician-rated CGI-S scores from psychiatric clinical notes, achieving reliability comparable to expert raters. This approach may enable scalable outcome measurement and support the implementation of measurement-based care in real-world psychiatric practice.
Related Topics
- Type: article
- Landing Page: https://doi.org/10.2196/preprints.86906
- OA Status: gold
- OpenAlex ID: https://openalex.org/W4415787470
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4415787470 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.2196/preprints.86906 (Digital Object Identifier)
- Title: Large Language Models to Estimate CGI-S From Clinical Notes as a Measure of Depression Severity (Preprint)
- Type: article (OpenAlex work type)
- Publication year: 2025
- Publication date: 2025-11-02
- Authors (in order): Kevin C. Li, Ayah Zirikly, Sarah C. Collica, Fernando S. Goes, Congwen Zhao, Trang Quynh Nguyen, Jane P. Gagliardi, Benjamin A. Goldstein, Hwanhee Hong, Elizabeth A. Stuart, Peter P. Zandi
- Landing page: https://doi.org/10.2196/preprints.86906 (publisher landing page)
- Open access: Yes (a free full text is available)
- OA status: gold (open access status per OpenAlex)
- OA URL: https://doi.org/10.2196/preprints.86906 (direct OA link)
- Cited by: 0 (total citation count in OpenAlex)
Full payload
| id | https://openalex.org/W4415787470 |
|---|---|
| doi | https://doi.org/10.2196/preprints.86906 |
| ids.doi | https://doi.org/10.2196/preprints.86906 |
| ids.openalex | https://openalex.org/W4415787470 |
| fwci | 0.0 |
| type | article |
| title | Large Language Models to Estimate CGI-S From Clinical Notes as a Measure of Depression Severity (Preprint) |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | |
| locations[0].id | doi:10.2196/preprints.86906 |
| locations[0].is_oa | True |
| locations[0].source | |
| locations[0].license | cc-by |
| locations[0].pdf_url | |
| locations[0].version | acceptedVersion |
| locations[0].raw_type | posted-content |
| locations[0].license_id | https://openalex.org/licenses/cc-by |
| locations[0].is_accepted | True |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | https://doi.org/10.2196/preprints.86906 |
| indexed_in | crossref |
| authorships[0].author.id | https://openalex.org/A5055479780 |
| authorships[0].author.orcid | https://orcid.org/0000-0003-3004-2087 |
| authorships[0].author.display_name | Kevin C. Li |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Kevin Li |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5074138666 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-8441-1741 |
| authorships[1].author.display_name | Ayah Zirikly |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Ayah Zirikly |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5120228255 |
| authorships[2].author.orcid | https://orcid.org/0009-0007-0554-9296 |
| authorships[2].author.display_name | Sarah C. Collica |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Sarah C. Collica |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5051850972 |
| authorships[3].author.orcid | https://orcid.org/0000-0001-6262-8264 |
| authorships[3].author.display_name | Fernando S. Goes |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Fernando S. Goes |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5005386609 |
| authorships[4].author.orcid | https://orcid.org/0000-0002-8540-6880 |
| authorships[4].author.display_name | Congwen Zhao |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Congwen Zhao |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5083690232 |
| authorships[5].author.orcid | https://orcid.org/0000-0003-1653-5491 |
| authorships[5].author.display_name | Trang Quynh Nguyen |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Trang Nguyen |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5033890726 |
| authorships[6].author.orcid | https://orcid.org/0000-0003-4667-6607 |
| authorships[6].author.display_name | Jane P. Gagliardi |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Jane P. Gagliardi |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5067727757 |
| authorships[7].author.orcid | https://orcid.org/0000-0001-5261-3632 |
| authorships[7].author.display_name | Benjamin A. Goldstein |
| authorships[7].author_position | middle |
| authorships[7].raw_author_name | Benjamin A. Goldstein |
| authorships[7].is_corresponding | False |
| authorships[8].author.id | https://openalex.org/A5066385470 |
| authorships[8].author.orcid | https://orcid.org/0000-0002-3736-6327 |
| authorships[8].author.display_name | Hwanhee Hong |
| authorships[8].author_position | middle |
| authorships[8].raw_author_name | Hwanhee Hong |
| authorships[8].is_corresponding | False |
| authorships[9].author.id | https://openalex.org/A5064807501 |
| authorships[9].author.orcid | https://orcid.org/0000-0002-9042-8611 |
| authorships[9].author.display_name | Elizabeth A. Stuart |
| authorships[9].author_position | middle |
| authorships[9].raw_author_name | Elizabeth A. Stuart |
| authorships[9].is_corresponding | False |
| authorships[10].author.id | https://openalex.org/A5065109317 |
| authorships[10].author.orcid | https://orcid.org/0000-0001-8423-2623 |
| authorships[10].author.display_name | Peter P. Zandi |
| authorships[10].author_position | last |
| authorships[10].raw_author_name | Peter P. Zandi |
| authorships[10].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://doi.org/10.2196/preprints.86906 |
| open_access.oa_status | gold |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-11-03T00:00:00 |
| display_name | Large Language Models to Estimate CGI-S From Clinical Notes as a Measure of Depression Severity (Preprint) |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T03:46:38.306776 |
| primary_topic | |
| cited_by_count | 0 |
| locations_count | 1 |
| best_oa_location.id | doi:10.2196/preprints.86906 |
| best_oa_location.is_oa | True |
| best_oa_location.source | |
| best_oa_location.license | cc-by |
| best_oa_location.pdf_url | |
| best_oa_location.version | acceptedVersion |
| best_oa_location.raw_type | posted-content |
| best_oa_location.license_id | https://openalex.org/licenses/cc-by |
| best_oa_location.is_accepted | True |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | https://doi.org/10.2196/preprints.86906 |
| primary_location.id | doi:10.2196/preprints.86906 |
| primary_location.is_oa | True |
| primary_location.source | |
| primary_location.license | cc-by |
| primary_location.pdf_url | |
| primary_location.version | acceptedVersion |
| primary_location.raw_type | posted-content |
| primary_location.license_id | https://openalex.org/licenses/cc-by |
| primary_location.is_accepted | True |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | https://doi.org/10.2196/preprints.86906 |
| publication_date | 2025-11-02 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 11 |
| citation_normalized_percentile | |