Vetle I. Torvik
PubMed knowledge graph 2.0: Connecting papers, patents, and clinical trials in biomedical science
Papers, patents, and clinical trials are essential scientific resources in biomedicine, crucial for knowledge sharing and dissemination. However, these documents are often stored in disparate databases with varying management standards and…
Patterns of diversity in biomedical coauthorships: An analysis across authors’ ethnicity, gender, age, and expertise
Multiple studies have linked diversity in scientific collaborations to innovative and impactful research. Here, we explore how different diversity indices—ethnicity, gender, academic age, and topical expertise—interact and thereby influenc…
AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation and Benchmark
Data cleaning is a time-consuming and error-prone manual process, even with modern workflow tools such as OpenRefine. We present AutoDCWorkflow, an LLM-based pipeline for automatically generating data-cleaning workflows. The pipeline takes…
T-KAER: Transparency-enhanced Knowledge-Augmented Entity Resolution Framework
Entity resolution (ER) is the process of determining whether two representations refer to the same real-world entity and plays a crucial role in data curation and data cleaning. Recent studies have introduced the KAER framework, aiming to …
What Do LLMs Need to Understand Graphs: A Survey of Parametric Representation of Graphs
Graphs, as a relational data structure, have been widely used in application scenarios such as molecule design and recommender systems. Recently, large language models (LLMs) have gained recognition in the AI community for their expected r…
OpCitance: Citation contexts identified from the PubMed Central open access articles
OpCitance contains all the sentences from 2 million PubMed Central open-access (PMCOA) articles, with 137 million inline citations annotated (i.e., the “citation contexts”). Parsing out the references and citation contexts from the PMCOA X…
KAER: A Knowledge Augmented Pre-Trained Language Model for Entity Resolution
Entity resolution has been an essential and well-studied task in data cleaning research for decades. Existing work has discussed the feasibility of utilizing pre-trained language models to perform entity resolution and achieved promising r…
The Citation Cloud of a biomedical article: a free, public, web-based tool enabling citation analysis
Background: An article’s citations are useful for finding related articles that may not be readily found by keyword searches or textual similarity. Citation analysis is also important for analyzing scientific innovation and the structure o…
ORCIDs mapped to PubMed authors
The dataset is based on a snapshot of PubMed taken in December 2018 (NLM's baseline 2018 plus updates throughout 2018) and, for ORCIDs, primarily on the 2019 ORCID Public Data File (https://orcid.org/). Matching an ORCID to an individual author…
MapAffil 2018 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide with extracted disciplines, inferred GRIDs, and assigned ORCIDs
Prepared by Vetle Torvik, 2021-05-07. The dataset comes as a single tab-delimited Latin-1 encoded file (only the City column uses non-ASCII characters). • How was the dataset created? The dataset is based on a snapshot of PubMed (which inclu…
Author-ity 2018 - PubMed author name disambiguated dataset
Author-ity 2018 dataset. Prepared by Vetle Torvik, Apr. 22, 2021. The dataset is based on a snapshot of PubMed taken in December 2018 (NLM's baseline 2018 plus updates throughout 2018). A total of 29.1 million Article records and 114.2 million…
The Citation Cloud of a Biomedical Article: Enabling Citation Analysis
Using open citations provided by iCite and other sources, we have built an extension to PubMed that allows any user to visualize and analyze the “citation cloud” around any target article A: the set of articles cited by A; those which cite…
Building a PubMed knowledge graph
PubMed is an essential resource for the medical domain, but useful concepts are either difficult to extract or ambiguous, which has significantly hindered knowledge discovery. To address this issue, we constructed a PubMed knowledge g…
Datasets from "Predicting Controlled Vocabulary Based on Text and Citations: Case Studies in Medical Subject Headings in MEDLINE and Patents"
Overview: These datasets were created in conjunction with the dissertation "Predicting Controlled Vocabulary Based on Text and Citations: Case Studies in Medical Subject Headings in MEDLINE and Patents," by Adam Kehoe. The datasets consis…
Expertise as an aspect of author contributions
Authors contribute a wide variety of intellectual efforts to a research paper, ranging from initial conceptualization to final analysis and reporting, and many journals today publish the allocated responsibilities and credits with the pape…
Self-citation is the hallmark of productive authors, of any gender
It was recently reported that men self-cite >50% more often than women across a wide variety of disciplines in the bibliographic database JSTOR. Here, we replicate this finding in a sample of 1.6 million papers from Author-ity, a version o…
Examining Scientific Writing Styles from the Perspective of Linguistic Complexity
Publishing articles in high-impact English journals is difficult for scholars around the world, especially for non-native English-speaking scholars (NNESs), most of whom struggle with proficiency in English. In order to uncover the differe…
Genni + Ethnea for the Author-ity 2009 dataset
Prepared by Vetle Torvik, 2018-04-15. The dataset comes as a single tab-delimited ASCII-encoded file and should be about 717MB uncompressed. • How was the dataset created? First and last names of authors in the Author-ity 2009 dataset were p…
Conceptual novelty scores for PubMed articles
Conceptual novelty analysis data based on PubMed Medical Subject Headings. Created by Shubhanshu Mishra and Vetle I. Torvik on April 16th, 2018. Introduction: This is …
MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide
MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. Prepared by Vetle Torvik, 2018-04-05. The dataset comes as a single tab-delimited Latin-1 encoded file (only the City column uses non-ASCII ch…
Author-ity 2009 - PubMed author name disambiguated dataset
Author-ity 2009 baseline dataset. Prepared by Vetle Torvik, 2009-12-03. The dataset comes in the form of 18 compressed (.gz) Linux text files named authority2009.part00.gz - authority2009.part17.gz. The total size should be ~17.4GB uncompres…
Self-citation analysis data based on PubMed Central subset (2002-2005)
Self-citation analysis data based on PubMed Central subset (2002-2005). Created by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik on April 5th, 2018 …
Author-Linked data for Author-ity 2009
Provides links to Author-ity 2009, including records from principal investigators (on NIH and NSF grants), inventors on USPTO patents, and students/advisors on ProQuest dissertations. Note that NIH and NSF differ in the type of fields they…
Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009
Contains a series of datasets that score pairs of tokens (words, journal names, and controlled vocabulary terms) based on how often they co-occur within versus across authors' collections of papers. The tokens derive from four different fi…
Geographical Distribution of Biomedical Research in the USA and China
We analyze nearly 20 million geocoded PubMed articles with author affiliations. Using K-means clustering for the lower 48 US states and mainland China, we find that the average published paper is within a relatively short distance of a few…