Petr Sojka
Negation: A Pink Elephant in the Large Language Models' Room?
Negations are key to determining sentence meaning, making them essential for logical reasoning. Despite their importance, negations pose a substantial challenge for large language models (LLMs) and remain underexplored. We constructed and …
Concept-aware Data Construction Improves In-context Learning of Language Models
Many recent language models (LMs) are capable of in-context learning (ICL), manifested in the LMs' ability to perform a new task solely from natural-language instruction. Previous work curating in-context learners assumes that ICL emerges …
Think Twice: Measuring the Efficiency of Eliminating Prediction Shortcuts of Question Answering Models
While the Large Language Models (LLMs) dominate a majority of language understanding tasks, previous work shows that some of these results are supported by modelling spurious correlations of training datasets. Authors commonly assess model…
Resources and Few-shot Learners for In-context Learning in Slavic Languages
Despite the rapid recent progress in creating accurate and compact in-context learners, most recent work focuses on in-context learning (ICL) for tasks in English. However, the ability to interact with users of languages outside English pr…
Soft Alignment Objectives for Robust Adaptation of Language Generation
Domain adaptation allows generative language models to address specific flaws caused by the domain shift of their application. However, the traditional adaptation by further training on in-domain data rapidly weakens the model’s ability to …
A roadmap for universal syllabic segmentation
By replacing the internal hyphenation engine of TeX by an external Omega 2 module, we are able to solve all shortcomings related to hyphenation and to add new features: segmentation of compound words, excentricity, preferential hyphenati…
Interpretable Gait Recognition by Granger Causality
Which joint interactions in the human gait cycle can be used as biometric characteristics? Most current methods on gait recognition suffer from the lack of interpretability. We propose an interpretable feature representation of gait sequen…
Adaptor: Objective-Centric Adaptation Framework for Language Models
Progress in natural language processing research is catalyzed by the possibilities given by the widespread software frameworks. This paper introduces Adaptor library that transposes the traditional model-centric approach composed of pre-tr…
When FastText Pays Attention: Efficient Estimation of Word Representations using Constrained Positional Weighting
In 2018, Mikolov et al. introduced the positional language model, which has characteristics of attention-based neural machine translation models and which achieved state-of-the-art performance on the intrinsic word analogy task. However, t…
Adaptor: Objective-Centric Adaptation Framework for Language Models
This paper introduces Adaptor library, which transposes traditional model-centric approach composed of pre-training + fine-tuning steps to objective-centric approach, composing the training process by applications of selected objectives. We…
Towards Math-Aware Automated Classification and Similarity Search of Scientific Publications: Methods of Mathematical Content Representations
In this paper, we investigate mathematical content representations suitable for the automated classification of and the similarity search in STEM documents using standard machine learning algorithms: the Latent Dirichlet Allocation (LDA) a…
Regressive Ensemble for Machine Translation Quality Evaluation
This work introduces a simple regressive ensemble for evaluating machine translation quality based on a set of novel and established metrics. We evaluate the ensemble using a correlation to expert-based MQM scores of the WMT 2021 Metrics w…
WebMIaS on Docker: Deploying Math-Aware Search in a Single Line of Code
Math informational retrieval (MIR) search engines are absent in the wide-spread production use, even though documents in the STEM fields contain many mathematical formulae, which are sometimes more important than text for understanding. We…
One Size Does Not Fit All: Finding the Optimal Subword Sizes for FastText Models across Languages
Unsupervised representation learning of words from large multilingual corpora is useful for downstream tasks such as word sense disambiguation, semantic text similarity, and information retrieval. The representation precision of log-bilinea…
New Czechoslovak hyphenation patterns, word lists, and workflow
Space- and time-effective segmentation and hyphenation of natural languages stay at the core of every document preparation system, web browser, or mobile rendering system. We use the unreasonable effectiveness of pattern generation with pa…
Text classification with word embedding regularization and soft similarity measure
Since the seminal work of Mikolov et al., word embeddings have become the preferred word representations for many natural language processing tasks. Document similarity measures extracted from word embeddings, such as the soft cosine measu…
Gait Recognition from Motion Capture Data
Gait recognition from motion capture data, as a pattern classification discipline, can be improved by the use of machine learning. This article contributes to the state of the art with a statistical approach for extracting robust gait feat…
You are how you walk: Uncooperative MoCap gait identification for video surveillance with incomplete and noisy data
This work offers a design of a video surveillance system based on a soft biometric -- gait identification from MoCap data. The main focus is on two substantial issues of the video surveillance scenario: (1) the walkers do not cooperate in …
Gait Recognition from Motion Capture Data
Gait recognition from motion capture data, as a pattern classification discipline, can be improved by the use of machine learning. This paper contributes to the state-of-the-art with a statistical approach for extracting robust gait featur…
An Evaluation Framework and Database for MoCap-Based Gait Recognition Methods
As a contribution to reproducible research, this paper presents a framework and a database to improve the development, evaluation and comparison of methods for gait recognition from motion capture (MoCap) data. The evaluation framework pro…
Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines
Vector representations and vector space modeling (VSM) play a central role in modern machine learning. We propose a novel approach to `vector similarity searching' over dense semantic representations of words and documents that can be depl…