Malte Ostendorff
MMTEB: Massive Multilingual Text Embedding Benchmark
Text embeddings are typically evaluated on a limited set of tasks, which are constrained by language, domain, and task diversity. To address these limitations and provide a more comprehensive evaluation, we introduce the Massive Multilingu…
Table Understanding and (Multimodal) LLMs: A Cross-Domain Case Study on Scientific vs. Non-Scientific Data
Tables are among the most widely used tools for representing structured data in research, business, medicine, and education. Although LLMs demonstrate strong performance in downstream tasks, their efficiency in processing tabular data rema…
Rapidly developing NLP applications for content curation
Time and again we are faced, in a number of collaborative research projects, with the challenge of interconnecting various language processing tools to implement certain industry-driven use cases focusing, for the most part, upon digital c…
Reward Modeling with Weak Supervision for Language Models
Recent advancements in large language models (LLMs) have led to their increased application across various tasks, with reinforcement learning from human feedback (RLHF) being a crucial part of their training to align responses with user in…
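Reward models for RLHF are commonly trained with a pairwise Bradley–Terry objective over preference pairs. As a minimal sketch of that objective (the paper's weak-supervision setup, where pairs would come from noisy heuristic labelers rather than human raters, is assumed here and not reproduced):

```python
import numpy as np

def bradley_terry_loss(r_chosen, r_rejected):
    """Pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected).

    log1p(exp(-m)) is a numerically stable form of -log(sigmoid(m)).
    """
    margin = np.asarray(r_chosen) - np.asarray(r_rejected)
    return float(np.mean(np.log1p(np.exp(-margin))))

# Scalar rewards for three hypothetical preference pairs.
good = np.array([2.0, 1.5, 0.5])
bad = np.array([0.0, 1.0, 0.4])
loss = bradley_terry_loss(good, bad)
```

The loss shrinks as the reward model assigns larger margins to the preferred responses; swapping the arguments (preferring the rejected answers) increases it.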
Data Processing for the OpenGPT-X Model Family
This paper presents a comprehensive overview of the data preparation pipeline developed for the OpenGPT-X project, a large-scale initiative aimed at creating open and high-performance multilingual large language models (LLMs). The project …
Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs
We present two multilingual LLMs, Teuken 7B-base and Teuken 7B-instruct, designed to embrace Europe's linguistic diversity by supporting all 24 official languages of the European Union. Trained on a dataset comprising around 60% non-Englis…
Symmetric Dot-Product Attention for Efficient Training of BERT Language Models
Initially introduced as a machine translation model, the Transformer architecture has now become the foundation for modern deep learning architecture, with applications in a wide range of fields, from computer vision to natural language pr…
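One way to make dot-product attention symmetric is to share the query and key projection, so the raw score matrix satisfies scores[i, j] == scores[j, i] while using one fewer weight matrix. The sketch below illustrates that shared-projection variant; the paper's exact formulation may differ, so treat this as an illustrative assumption rather than the authors' method:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def symmetric_attention(x, w_qk, w_v):
    """Scaled dot-product attention with a shared query/key projection.

    Sharing w_qk makes the pre-softmax score matrix symmetric and
    removes one projection compared to standard attention.
    """
    qk = x @ w_qk                                 # shared projection
    scores = qk @ qk.T / np.sqrt(qk.shape[-1])    # symmetric scores
    return softmax(scores, axis=-1) @ (x @ w_v)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))        # 4 tokens, model dimension 8
w_qk = rng.normal(size=(8, 8))
w_v = rng.normal(size=(8, 8))
out = symmetric_attention(x, w_qk, w_v)
```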
Investigating Gender Bias in Turkish Language Models
Language models are trained mostly on Web data, which often contains social stereotypes and biases that the models can inherit. This has potentially negative consequences, as models can amplify these biases in downstream tasks or applicati…
Tokenizer Choice For LLM Training: Negligible or Crucial?
The recent success of Large Language Models (LLMs) has been predominantly driven by curating the training dataset composition, scaling of model architectures and dataset sizes and advancements in pretraining objectives, leaving tokenizer i…
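A common metric for comparing tokenizers is fertility, the average number of tokens produced per word: higher fertility means longer sequences, and thus higher training and inference cost, for the same text. A minimal sketch with two toy tokenizers (the paper's actual tokenizers and corpora are not reproduced here):

```python
def fertility(tokenize, texts):
    """Average number of tokens produced per whitespace-separated word."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

texts = ["tokenizer choice matters", "large language models"]

word_level = lambda t: t.split()                 # one token per word
char_level = lambda t: list(t.replace(" ", ""))  # one token per character

print(fertility(word_level, texts))  # 1.0 by construction
print(fertility(char_level, texts))  # much higher: avg word length
```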
Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning
Most Transformer language models are primarily pretrained on English text, limiting their use for other languages. As the model sizes grow, the performance gap between English and other languages with fewer compute and data resources incre…
Aspect-based Document Similarity for Literature Recommender Systems
Literature recommender systems support readers in finding relevant documents. Content-based systems use so-called document similarity measures for this purpose. Distinguishing only between similar and dissimilar documents neglects…
AspectCSE: Sentence Embeddings for Aspect-based Semantic Textual Similarity Using Contrastive Learning and Structured Knowledge
Generic sentence embeddings provide a coarse-grained approximation of semantic textual similarity but ignore specific aspects that make texts similar. Conversely, aspect-based sentence embeddings provide similarities between texts based…
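The core idea of aspect-conditioned similarity can be illustrated with deliberately simplified embeddings in which different dimensions carry different aspects; restricting the comparison to one aspect's dimensions changes which texts count as similar. This is a toy illustration only, not the paper's contrastive training with structured knowledge:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-d vectors: dims 0-1 stand for topic, dims 2-3 for methodology.
doc_a = np.array([1.0, 0.9, 0.1, 0.0])
doc_b = np.array([1.0, 0.8, 0.0, 0.1])   # shares A's topic, not its method
topic = np.array([1.0, 1.0, 0.0, 0.0])   # aspect selector: topic dims only
method = np.array([0.0, 0.0, 1.0, 1.0])  # aspect selector: method dims only

sim_topic = cosine(doc_a * topic, doc_b * topic)     # high: same topic
sim_method = cosine(doc_a * method, doc_b * method)  # low: different method
```

A generic (unmasked) cosine would blur these two judgments into one score; aspect-based embeddings keep them separate.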
Specialized Document Embeddings for Aspect-based Similarity of Research Papers
Document embeddings and similarity measures underpin content-based recommender systems, whereby a document is commonly represented as a single generic embedding. However, similarity computed on single vector representations provides only o…
HiStruct+: Improving Extractive Text Summarization with Hierarchical Structure Information
Transformer-based language models usually treat texts as linear sequences. However, most texts also have an inherent hierarchical structure, i.e., parts of a text can be identified using their position in this hierarchy. In addition, secti…
Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings
Learning scientific document representations can be substantially improved through contrastive learning objectives, where the challenge lies in creating positive and negative training samples that encode the desired similarity semantics. P…
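The contrastive objective behind such document representations is typically InfoNCE with in-batch negatives, where each anchor's positive would be drawn from its citation-graph neighborhood. The neighborhood sampling itself is assumed and omitted in this minimal sketch:

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE with in-batch negatives.

    anchors[i] and positives[i] form a positive pair (e.g. a paper and
    a citation-graph neighbor); the other positives in the batch act as
    negatives for anchor i.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature             # (batch, batch) similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_probs).mean())   # diagonal = positive pairs

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
# Well-aligned positives (small perturbations) yield a low loss;
# randomly paired "positives" yield a high one.
aligned = info_nce_loss(emb, emb + 0.01 * rng.normal(size=(8, 16)))
random_pairs = info_nce_loss(emb, rng.normal(size=(8, 16)))
```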
ARQMath Lab: An Incubator for Semantic Formula Search in zbMATH Open?
The zbMATH database contains more than 4 million bibliographic entries. We aim to provide easy access to these entries. Therefore, we maintain different index structures, including a formula index. To optimize the findability of the entrie…
A Qualitative Evaluation of User Preference for Link-based vs. Text-based Recommendations of Wikipedia Articles
Literature recommendation systems (LRS) assist readers in the discovery of relevant content from the overwhelming amount of literature available. Despite the widespread adoption of LRS, there is a lack of research on the user-perceived rec…
Ordering sentences and paragraphs with pre-trained encoder-decoder transformers and pointer ensembles
Passage ordering aims to maximise discourse coherence in document generation or document modification tasks such as summarisation or storytelling. This paper extends the passage ordering task from sentences to paragraphs, i.e., passages wi…
Evaluating document representations for content-based legal literature recommendations
Recommender systems assist legal professionals in finding relevant literature for supporting their case. Despite its importance for the profession, legal applications do not reflect the latest advances in recommender systems and representa…
Fine-grained Classification of Political Bias in German News: A Data Set and Initial Experiments
We present a data set consisting of German news articles labeled for political bias on a five-point scale in a semi-supervised way. While earlier work on hyperpartisan news detection uses binary classification (i.e., hyperpartisan or not) …
Aspect-based Document Similarity for Research Papers (Dataset, Models & Code)
Traditional document similarity measures provide a coarse-grained distinction between similar and dissimilar documents. Typically, they do not consider in what aspects two documents are similar. This limits the granularity of applications …
Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles (Dataset, Models & Code)
Many digital libraries recommend literature to their users considering the similarity between a query document and their repository. However, they often fail to distinguish what is the relationship that makes two documents alike. In this p…