Explanipedia

Negation: A Pink Elephant in the Large Language Models' Room? Open

Tereza Vrabcová, Marek Kadlčík, Petr Sojka, Michal Štefánik, Michal Spiegel · 2025

Negations are key to determining sentence meaning, making them essential for logical reasoning. Despite their importance, negations pose a substantial challenge for large language models (LLMs) and remain underexplored. We constructed and …

Concept-aware Data Construction Improves In-context Learning of Language Models Open

Michal Štefánik, Marek Kadlčík, Petr Sojka · 2024

Computer science Geology

Many recent language models (LMs) are capable of in-context learning (ICL), manifested in the LMs' ability to perform a new task solely from natural-language instruction. Previous work curating in-context learners assumes that ICL emerges …

Think Twice: Measuring the Efficiency of Eliminating Prediction Shortcuts of Question Answering Models Open

Lukáš Mikula, Michal Štefánik, Marek Petrovič, Petr Sojka · 2023

Computer science Psychology Philosophy

While the Large Language Models (LLMs) dominate a majority of language understanding tasks, previous work shows that some of these results are supported by modelling spurious correlations of training datasets. Authors commonly assess model…

Resources and Few-shot Learners for In-context Learning in Slavic Languages Open

Michal Štefánik, Marek Kadlčík, Piotr Gramacki, Petr Sojka · 2023

Computer science Economics Biology

Despite the rapid recent progress in creating accurate and compact in-context learners, most recent work focuses on in-context learning (ICL) for tasks in English. However, the ability to interact with users of languages outside English pr…

Soft Alignment Objectives for Robust Adaptation of Language Generation Open

Michal Štefánik, Marek Kadlčík, Petr Sojka · 2023

Computer science Philosophy Biology

Domain adaptation allows generative language models to address specific flaws caused by the domain shift of their application. However, the traditional adaptation by further training on in-domain data rapidly weakens the model’s ability to …

A roadmap for universal syllabic segmentation Open

Ondřej Sojka, Petr Sojka, Jakub Máca · 2023

Computer science

By replacing the internal hyphenation engine of TeX by an external Omega 2 module, we are able to solve all shortcomings related to hyphenation and to add new features: segmentation of compound words, excentricity, preferential hyphenati…

Resources and Few-shot Learners for In-context Learning in Slavic Languages Open

Michal Štefánik, Marek Kadlčík, Piotr Gramacki, Petr Sojka · 2023

Computer science Business Philosophy

Despite the rapid recent progress in creating accurate and compact in-context learners, most recent work focuses on in-context learning (ICL) for tasks in English. However, the ability to interact with users of languages outside English pr…

Soft Alignment Objectives for Robust Adaptation of Language Generation Open

Michal Štefánik, Marek Kadlčík, Petr Sojka · 2022

Computer science Mathematics Psychology

Domain adaptation allows generative language models to address specific flaws caused by the domain shift of their application. However, the traditional adaptation by further training on in-domain data rapidly weakens the model's ability to…

Interpretable Gait Recognition by Granger Causality Open

Michal Balážia, Kateřina Hlaváčková‐Schindler, Petr Sojka, Claudia Plant · 2022

Computer science Mathematics Philosophy

Which joint interactions in the human gait cycle can be used as biometric characteristics? Most current methods on gait recognition suffer from the lack of interpretability. We propose an interpretable feature representation of gait sequen…

Adaptor: Objective-Centric Adaptation Framework for Language Models Open

Michal Štefánik, Vít Novotný, Nikola Groverová, Petr Sojka · 2022

Computer science Psychology Mathematics

Progress in natural language processing research is catalyzed by the possibilities given by the widespread software frameworks. This paper introduces Adaptor library that transposes the traditional model-centric approach composed of pre-tr…

When FastText Pays Attention: Efficient Estimation of Word Representations using Constrained Positional Weighting Open

Vít Novotný, Michal Štefánik, Eniafe Festus Ayetiran, Petr Sojka, Radim Řehůřek · 2022

Computer science Economics Philosophy

In 2018, Mikolov et al. introduced the positional language model, which has characteristics of attention-based neural machine translation models and which achieved state-of-the-art performance on the intrinsic word analogy task. However, t…

Adaptor: Objective-Centric Adaptation Framework for Language Models Open

Michal Štefánik, Vít Novotný, Nikola Groverová, Petr Sojka · 2022

Computer science Psychology Mathematics

This paper introduces Adaptor library, which transposes traditional model-centric approach composed of pre-training + fine-tuning steps to objective-centric approach, composing the training process by applications of selected objectives. We…

Towards Math-Aware Automated Classification and Similarity Search of Scientific Publications: Methods of Mathematical Content Representations Open

Michal Růžička, Petr Sojka · 2021

Computer science Mathematics Political science

In this paper, we investigate mathematical content representations suitable for the automated classification of and the similarity search in STEM documents using standard machine learning algorithms: the Latent Dirichlet Allocation (LDA) a…

Regressive Ensemble for Machine Translation Quality Evaluation Open

Michal Štefánik, Vít Novotný, Petr Sojka · 2021

Computer science Chemistry Physics

This work introduces a simple regressive ensemble for evaluating machine translation quality based on a set of novel and established metrics. We evaluate the ensemble using a correlation to expert-based MQM scores of the WMT 2021 Metrics w…

WebMIaS on Docker: Deploying Math-Aware Search in a Single Line of Code Open

Dávid Lupták, Vít Novotný, Michal Štefánik, Petr Sojka · 2021

Computer science Mathematics

Math informational retrieval (MIR) search engines are absent in the wide-spread production use, even though documents in the STEM fields contain many mathematical formulae, which are sometimes more important than text for understanding. We…

When FastText Pays Attention: Efficient Estimation of Word Representations using Constrained Positional Weighting Open

Vít Novotný, Michal Štefánik, Eniafe Festus Ayetiran, Petr Sojka · 2021

Computer science Medicine Philosophy

In 2018, Mikolov et al. introduced the positional language model, which has characteristics of attention-based neural machine translation models and which achieved state-of-the-art performance on the intrinsic word analogy task. However, t…

One Size Does Not Fit All: Finding the Optimal Subword Sizes for FastText Models across Languages Open

Vít Novotný, Eniafe Festus Ayetiran, Dalibor Bačovský, Dávid Lupták, Michal Štefánik, et al. · 2021

Computer science

Unsupervised representation learning of words from large multilingual corpora is useful for downstream tasks such as word sense disambiguation, semantic text similarity, and information retrieval. The representation precision of log-bilinea…

New Czechoslovak hyphenation patterns, word lists, and workflow Open

Petr Sojka, Ondřej Sojka · 2021

Computer science Philosophy

Space- and time-effective segmentation and hyphenation of natural languages stay at the core of every document preparation system, web browser, or mobile rendering system. We use the unreasonable effectiveness of pattern generation with pa…

Text classification with word embedding regularization and soft similarity measure Open

Vít Novotný, Eniafe Festus Ayetiran, Michal Štefánik, Petr Sojka · 2020

Computer science Mathematics

Since the seminal work of Mikolov et al., word embeddings have become the preferred word representations for many natural language processing tasks. Document similarity measures extracted from word embeddings, such as the soft cosine me…

Text classification with word embedding regularization and soft similarity measure Open

Vít Novotný, Eniafe Festus Ayetiran, Michal Štefánik, Petr Sojka · 2020

Computer science Mathematics

Since the seminal work of Mikolov et al., word embeddings have become the preferred word representations for many natural language processing tasks. Document similarity measures extracted from word embeddings, such as the soft cosine measu…

Gait Recognition from Motion Capture Data Open

Michal Balážia, Petr Sojka · 2018

Computer science Biology Philosophy

Gait recognition from motion capture data, as a pattern classification discipline, can be improved by the use of machine learning. This article contributes to the state of the art with a statistical approach for extracting robust gait feat…

You are how you walk: Uncooperative MoCap gait identification for video surveillance with incomplete and noisy data Open

Michal Balážia, Petr Sojka · 2017

Computer science Biology Physics

This work offers a design of a video surveillance system based on a soft biometric -- gait identification from MoCap data. The main focus is on two substantial issues of the video surveillance scenario: (1) the walkers do not cooperate in …

Gait Recognition from Motion Capture Data Open

Michal Balážia, Petr Sojka · 2017

Computer science Biology Philosophy

Gait recognition from motion capture data, as a pattern classification discipline, can be improved by the use of machine learning. This paper contributes to the state-of-the-art with a statistical approach for extracting robust gait featur…

Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines Open

Jan Rygl, Jan Pomikálek, Radim Řehůřek, Michal Růžička, Vít Novotný, et al. · 2017

Computer science Political science Chemistry

Vector representations and vector space modeling (VSM) play a central role in modern machine learning. We propose a novel approach to `vector similarity searching' over dense semantic representations of words and documents that can be d…

An Evaluation Framework and Database for MoCap-Based Gait Recognition Methods Open

Michal Balážia, Petr Sojka · 2017

Computer science

As a contribution to reproducible research, this paper presents a framework and a database to improve the development, evaluation and comparison of methods for gait recognition from motion capture (MoCap) data. The evaluation framework pro…

An Evaluation Framework and Database for MoCap-Based Gait Recognition Methods Open

Michal Balážia, Petr Sojka · 2017

Computer science Biology

As a contribution to reproducible research, this paper presents a framework and a database to improve the development, evaluation and comparison of methods for gait recognition from motion capture (MoCap) data. The evaluation framework…

Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines Open

Jan Rygl, Jan Pomikálek, Radim Řehůřek, Michal Růžička, Vít Novotný, et al. · 2017

Computer science Chemistry Political science

Vector representations and vector space modeling (VSM) play a central role in modern machine learning. We propose a novel approach to `vector similarity searching' over dense semantic representations of words and documents that can be depl…
