Inverted index ≈ Inverted index
SourcererCC: Scaling Code Clone Detection to Big Code Open
Despite a decade of active research, there is a marked lack in clone detectors that scale to very large repositories of source code, in particular for detecting near-miss clones where significant editing activities may take place in the cl…
Simple, proven approaches to text retrieval Open
This technical note describes straightforward techniques for document indexing and retrieval that have been solidly established through extensive testing and are easy to apply. They are useful for many different types of text material, are…
COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List Open
Classical information retrieval systems such as BM25 rely on exact lexical match and can carry out search efficiently with inverted list index. Recent neural IR models shift towards soft matching all query document terms, but they lose th…
Learning Passage Impacts for Inverted Indexes Open
Neural information retrieval systems typically use a cascading pipeline, in which a first-stage model retrieves a candidate set of documents and one or more subsequent stages re-rank this set using contextualized language models such as BE…
Context-Aware Document Term Weighting for Ad-Hoc Search Open
Bag-of-words document representations play a fundamental role in modern search engines, but their power is limited by the shallow frequency-based term weighting scheme. This paper proposes HDCT, a context-aware document term weighting fram…
Context-Aware Term Weighting For First Stage Passage Retrieval Open
Term frequency is a common method for identifying the importance of a term in a document. But term frequency ignores how a term interacts with its text context, which is key to estimating document-specific term weights. This paper proposes…
JOSIE Open
We present a new solution for finding joinable tables in massive data lakes: given a table and one join column, find tables that can be joined with the given table on the largest number of distinct values. The problem can be formulated as …
A Practical q-Gram Index for Text Retrieval Allowing Errors Open
We propose an indexing technique for approximate text searching, which is practical and powerful, and especially optimized for natural language text. Unlike other indices of this kind, it is able to retrieve any string that approximately m…
BitFunnel Open
Since the mid-90s there has been a widely-held belief that signature files are inferior to inverted files for text indexing. In recent years the Bing search engine has developed and deployed an index based on bit-sliced signatures. This in…
GIFT: A Real-time and Scalable 3D Shape Search Engine Open
Projective analysis is an important solution for 3D shape retrieval, since human visual perceptions of 3D shapes rely on various 2D observations from different view points. Although multiple informative and discriminative views are utilize…
Real-time structural motif searching in proteins using an inverted index strategy Open
Biochemical and biological functions of proteins are the product of both the overall fold of the polypeptide chain, and, typically, structural motifs made up of smaller numbers of amino acids constituting a catalytic center or a binding si…
A Unified Index for Spatio-Temporal Keyword Queries Open
From tweets to urban data sets, there has been an explosion in the volume of textual data that is associated with both temporal and spatial components. Efficiently evaluating queries over these data is challenging. Previous approaches have…
FSSE: An Effective Fuzzy Semantic Searchable Encryption Scheme Over Encrypted Cloud Data Open
Currently, searchable encryption has attracted considerable attention in the field of cloud computing. The existing research mainly focuses on keyword-based search schemes, most of which support the exact matching of keywords. However, key…
Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations Open
Learned sparse representations form an attractive class of contextual embeddings for text retrieval. That is so because they are effective models of relevance and are interpretable by design. Despite their apparent compatibility with inver…
Longest Common Extensions with Recompression Open
Given two positions i and j in a string T of length N, a longest common extension (LCE) query asks for the length of the longest common prefix between suffixes beginning at i and j. A compressed LCE data structure stores T in a compressed …
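The LCE query defined in this abstract can be illustrated with a naive, uncompressed baseline. The sketch below is not the paper's compressed data structure; it simply scans both suffixes character by character. The function name `lce` and the use of 0-based positions are assumptions made for illustration.

```python
def lce(text: str, i: int, j: int) -> int:
    """Return the length of the longest common prefix of text[i:] and text[j:].

    Naive baseline: compares characters one at a time, so a query costs
    time proportional to the answer. Positions are 0-based (an assumption).
    """
    n = len(text)
    length = 0
    # Advance while both suffixes still have characters and those characters agree.
    while i + length < n and j + length < n and text[i + length] == text[j + length]:
        length += 1
    return length


# Suffixes "abracadabra" and "abra" share the prefix "abra".
print(lce("abracadabra", 0, 7))
```

A compressed LCE structure aims to answer the same query without storing the text explicitly; this baseline only fixes what the query returns.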
Fast Dictionary-Based Compression for Inverted Indexes Open
Dictionary-based compression schemes provide fast decoding operation, typically at the expense of reduced compression effectiveness compared to statistical or probability-based approaches. In this work, we apply dictionary-based techniques…
Pavo: A RNN-Based Learned Inverted Index, Supervised or Unsupervised? Open
The booms of big data and graphic processing unit technologies have allowed us to explore more appropriate data structures and algorithms with smaller time complexity. However, the application of machine learning as a potential alternative…
Clustered Elias-Fano Indexes Open
State-of-the-art encoders for inverted indexes compress each posting list individually. Encoding clusters of posting lists offers the possibility of reducing the redundancy of the lists while maintaining a noticeable query processing spee…
An Effective Scholarly Search by Combining Inverted Indices and Structured Search With Citation Networks Analysis Open
The rapid growth in the number of scholarly documents on the Web and in other digital platforms makes it challenging for researchers to find research publications most relevant to their information needs. This challenge has been mitigated …
SourcererCC and SourcererCC-I Open
Given the availability of large source-code repositories, there has been a large number of applications for large-scale clone detection. Unfortunately, despite a decade of active research, there is a marked lack in clone detectors that sca…
The Potential of Learned Index Structures for Index Compression Open
Inverted indexes are vital in providing fast key-word-based search. For every term in the document collection, a list of identifiers of documents in which the term appears is stored, along with auxiliary information such as term frequency,…
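The structure described in this abstract (per term, a list of document identifiers plus auxiliary information such as term frequency) can be sketched in a few lines. This is a minimal illustration, not the paper's method: the function name `build_inverted_index`, the whitespace tokenizer, and the use of list positions as document identifiers are all assumptions.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to a posting list of (doc_id, term_frequency) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # Naive lowercase + whitespace tokenization (an illustrative assumption).
        counts = Counter(text.lower().split())
        for term, tf in counts.items():
            # Documents are visited in increasing doc_id order,
            # so each posting list stays sorted by doc_id.
            index[term].append((doc_id, tf))
    return dict(index)


docs = ["the cat sat", "the dog sat on the log", "cat and dog"]
index = build_inverted_index(docs)
print(index["the"])
```

Keeping posting lists sorted by document identifier is what makes the gap-based compression schemes discussed in this literature applicable.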
ELII: A novel inverted index for fast temporal query, with application to a large Covid-19 EHR dataset Open
Fast temporal query on large EHR-derived data sources presents an emerging big data challenge, as this query modality is intractable using conventional strategies that have not focused on addressing Covid-19-related research needs at scale…
Content-based Image Retrieval using Tesseract OCR Engine and Levenshtein Algorithm Open
Image Retrieval Systems (IRSs) are applications that allow one to retrieve images saved at any location on a network. Most IRSs make use of reverse lookup to find images stored on the network based on image properties such as size, filenam…
Dynamic Set kNN Self-Join Open
In many applications, data objects can be represented as sets. For example, in video on-demand and social network services, the user data consists of a set of movies that have been watched and a set of users (friends), respectively, and th…
Learning a Complete Image Indexing Pipeline Open
To work at scale, a complete image indexing system comprises two components: An inverted file index to restrict the actual search to only a subset that should contain most of the items relevant to the query; An approximate distance comp…
Hybrid compression of inverted lists for reordered document collections Open
Text search engines are a fundamental tool nowadays. Their efficiency relies on a popular and simple data structure: inverted indexes. They store an inverted list per term of the vocabulary. The inverted list of a given term stores, among …
Indexing of CNN Features for Large Scale Image Search Open
The convolutional neural network (CNN) features can give a good description of image content, which usually represent images with unique global vectors. Although they are compact compared to local descriptors, they still cannot efficiently…
Reconfigurable Inverted Index Open
Existing approximate nearest neighbor search systems suffer from two fundamental problems that are of practical importance but have not received sufficient attention from the research community. First, although existing systems perform wel…
Matching Reads to Many Genomes with the r-Index Open
The r-index is a tool for compressed indexing of genomic databases for exact pattern matching, which can be used to completely align reads that perfectly match some part of a genome in the database or to find seeds for reads that do not. T…
FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search Open
Approximate nearest neighbor search (ANNS) is a fundamental building block in information retrieval with graph-based indices being the current state-of-the-art and widely used in the industry. Recent advances in graph-based indices have ma…