Susan B. Davidson
Multi-IaC-Eval: Benchmarking Cloud Infrastructure as Code Across Multiple Formats
Infrastructure as Code (IaC) is fundamental to modern cloud computing, enabling teams to define and manage infrastructure through machine-readable configuration files. However, different cloud service providers utilize diverse IaC formats.…
Developing Domain-Specific Language Models for Legal Terminology Using the Vosk Speech Recognition Toolkit
This study investigates the creation of domain-specific language models for legal terminology using the Vosk speech recognition toolkit. As the legal field increasingly adopts technology for transcription and documentation, the need for ac…
SHARQ: Explainability Framework for Association Rules on Relational Data
Association rules are an important technique for gaining insights over large relational datasets consisting of tuples of elements (i.e. attribute-value pairs). However, it is difficult to explain the relative importance of data elements wi…
ASQP-RL Demo: Learning Approximation Sets for Exploratory Queries
We demonstrate the Approximate Selection Query Processing (ASQP-RL) system, which uses Reinforcement Learning to select a subset of a large external dataset to process locally in a notebook during data exploration. Given a query workload o…
Learning Approximation Sets for Exploratory Queries
In data exploration, executing complex non-aggregate queries over large databases can be time-consuming. Our paper introduces a novel approach to address this challenge, focusing on finding an optimized subset of data, referred to as the a…
Credit distribution in relational scientific databases
Digital data is a basic form of research product for which citation, and the generation of credit or recognition for authors, are still not well understood. The notion of data credit has therefore recently emerged as a new measure, defined…
Selecting Sub-tables for Data Exploration
We present a framework for creating small, informative sub-tables of large data tables to facilitate the first step of data science: data exploration. Given a large data table T, the goal is to create a sub-table of small, fixed dime…
Solon: Communication-efficient Byzantine-resilient Distributed Training via Redundant Gradients
There has been a growing need to provide Byzantine-resilience in distributed model training. Existing robust distributed learning algorithms focus on developing sophisticated robust aggregators at the parameter servers, but pay less attent…
Chef: a cheap and fast pipeline for iteratively cleaning label uncertainties
High-quality labels are expensive to obtain for many machine learning tasks, such as medical image classification tasks. Therefore, probabilistic (weak) labels produced by weak supervision tools are used to seed a process in which influent…
Dynamic Gaussian Mixture based Deep Generative Model For Robust Forecasting on Sparse Multivariate Time Series
Forecasting on sparse multivariate time series (MTS) aims to model the predictors of future values of time series given their incomplete past, which is important for many emerging applications. However, most existing methods process MTS’s …
Web-based access to data for >600 disinfection by-products via the EPA CompTox Chemicals Dashboard
The US EPA’s CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard) is a freely available web-based application providing access to data for ~900,000 chemical substances, the majority of these represented as chemical structures. T…
DeltaGrad: Rapid retraining of machine learning models
Machine learning models are not static and may need to be retrained on slightly changed datasets, for instance, with the addition or deletion of a set of data points. This has many applications, including privacy, robustness, bias reductio…
Data Provenance for Attributes: Attribute Lineage.
In this paper we define a new kind of data provenance for database management systems, called attribute lineage for SPJRU queries, building on previous works on data provenance for tuples. We take inspiration from the classical lineag…
Alawinia/Provclustering: Discovering Similar Workflows Via Provenance Clustering
Several workflow management systems and scripting languages have adopted provenance tracking, yet many researchers choose to manually capture or instrument their processing scripts to write provenance information to files. The Next Generat…
DBLP-NSF dataset SQL dump
This dataset, called DBLP-NSF, is a PostgreSQL database dump file that connects computer science publications (extracted from DBLP) to their NSF funding grants (extracted from the National Science Foundation grant dataset). This datase…
Automating data citation: the eagle-i experience.
Data citation is of growing concern for owners of curated databases, who wish to give credit to the contributors and curators responsible for portions of the dataset and enable the data retrieved by a query to be later examined. While seve…
Data Citation
Data citation is an interesting computational challenge, whose solution draws on several well-studied problems in database theory: query answering using views, and provenance. We describe the problem, suggest an approach to its solution, a…
Why data citation is a computational problem
Using database views to define citable units is the key to specifying and generating citations to data.
PROX: Approximated Summarization of Data Provenance
Many modern applications involve collecting large amounts of data from multiple sources, and then aggregating and manipulating it in intricate ways. The complexity of such applications, combined with the size of the collected data, makes i…