Fatemeh Nargesian
Approximating Opaque Top-k Queries
Combining query answering and data science workloads has become prevalent. An important class of such workloads is top-k queries with a scoring function implemented as an opaque UDF - a black box whose internal structure and scores on the …
Causal Dataset Discovery with Large Language Models
Causal data discovery is crucial in scientific research because it uncovers causal links among a variety of observed variables. Causal dataset discovery is the task of identifying datasets that contain columns that have causal relationships with…
PLUTUS: Understanding Data Distribution Tailoring for Machine Learning
Existing data debugging tools allow users to trace model performance problems all the way to the data by efficiently identifying slices (conjunctions of features and values) for which a trained model performs significantly worse than the e…
TrustLOG: The Second Workshop on Trustworthy Learning on Graphs
Learning on graphs (LOG) has a profound impact on various high-impact domains, such as information retrieval, social network analysis, computational chemistry and transportation. Despite decades of theoretical development, algorithmic adva…
FairEM360: A Suite for Responsible Entity Matching
Entity matching is one of the earliest tasks that occur in the big data pipeline and is alarmingly exposed to unintentional biases that affect the quality of data. Identifying and mitigating the biases that exist in the data or are introduced…
Through the Fairness Lens: Experimental Analysis and Evaluation of Entity Matching
Entity matching (EM) is a challenging problem studied by different communities for over half a century. Algorithmic fairness has also become a timely topic to address machine bias and its societal impacts. Despite extensive research on the…
KOIOS: Top-k Semantic Overlap Set Search
We study the top-k set similarity search problem using semantic overlap. While vanilla overlap requires exact matches between set elements, semantic overlap allows elements that are syntactically different but semantically related to incre…
Sampling over Union of Joins
Data scientists often draw on multiple relational data sources for analysis. A standard assumption in learning and approximate query answering is that the data is a uniform and independent sample of the underlying distribution. To avoid th…
Pylon: Semantic Table Union Search in Data Lakes
The large size and fast growth of data repositories, such as data lakes, have spurred the need for data discovery to help analysts find related data. The problem has become challenging as (i) a user typically does not know what datasets exi…
TSUBASA: Climate Network Construction on Historical and Real-Time Data
A climate network represents the global climate system by the interactions of a set of anomaly time-series. Network science has been applied to climate data to study the dynamics of a climate network. The core task and first step to enable…
Data Lake Organization
We consider the problem of creating a navigation structure that allows a user to most effectively navigate a data lake. We define an organization as a graph that contains nodes representing sets of attributes within a data lake and edges i…
Tailoring data source distributions for fairness-aware data integration
Data scientists often develop data sets for analysis by drawing upon sources of data available to them. A major challenge is to ensure that the data set used for analysis has an appropriate representation of relevant (demographic) groups: …
AWLCO: All-Window Length Co-Occurrence
Analyzing patterns in a sequence of events has applications in text analysis, computer programming, and genomics research. In this paper, we consider the all-window-length analysis model which analyzes a sequence of events with respect to …
Knowledge Translation: Extended Technical Report
We introduce Kensho, a tool for generating mapping rules between two Knowledge Bases (KBs). To create the mapping rules, Kensho starts with a set of correspondences and enriches them with additional semantic information automatically ident…
JOSIE
We present a new solution for finding joinable tables in massive data lakes: given a table and one join column, find tables that can be joined with the given table on the largest number of distinct values. The problem can be formulated as …
Optimizing Organizations for Navigating Data Lakes.
We consider the problem of creating a navigation structure that allows a user to most effectively navigate a data lake. We define an organization as a graph that contains nodes representing sets of attributes within a data lake and edges i…
Dataset Evolver: An Interactive Feature Engineering Notebook
We present DATASET EVOLVER, an interactive Jupyter notebook-based tool to help data scientists perform feature engineering for classification tasks. It provides users with suggestions on new features to construct, based on automated fea…
Learning Feature Engineering for Classification
Feature engineering is the task of improving predictive modelling performance on a dataset by transforming its feature space. Existing approaches to automate this process rely on either transformed feature space exploration through evaluat…
LSH Ensemble: Internet-Scale Domain Search
We study the problem of domain search where a domain is a set of distinct values from an unspecified universe. We use Jaccard set containment, defined as $|Q \cap X|/|Q|$, as the relevance measure of a domain $X$ to a query domain $Q$. Our…