Raviraj Joshi
L3Cube-MahaSTS: A Marathi Sentence Similarity Dataset and Models
We present MahaSTS, a human-annotated Sentence Textual Similarity (STS) dataset for Marathi, along with MahaSBERT-STS-v2, a fine-tuned Sentence-BERT model optimized for regression-based similarity scoring. The MahaSTS dataset consists of 1…
L3Cube-MahaEmotions: A Marathi Emotion Recognition Dataset with Synthetic Annotations using CoTR prompting and Large Language Models
Emotion recognition in low-resource languages like Marathi remains challenging due to limited annotated data. We present L3Cube-MahaEmotions, a high-quality Marathi emotion recognition dataset with 11 fine-grained emotion labels. The train…
Topic Modeling in Marathi
While topic modeling in English has become a prevalent and well-explored area, venturing into topic modeling for Indic languages remains relatively rare. The limited availability of resources, diverse linguistic structures, and unique chal…
On Importance of Layer Pruning for Smaller BERT Models and Low Resource Languages
On Importance of Code-Mixed Embeddings for Hate Speech Identification
Code-mixing is the practice of using two or more languages in a single sentence, which often occurs in multilingual communities such as India where people commonly speak multiple languages. Classic NLP tools, trained on monolingual data, f…
Challenges in Adapting Multilingual LLMs to Low-Resource Languages using LoRA PEFT Tuning
Large Language Models (LLMs) have demonstrated remarkable multilingual capabilities, yet challenges persist in adapting these models for low-resource languages. In this study, we investigate the effects of Low-Rank Adaptation (LoRA) Parame…
On Limitations of LLM as Annotator for Low Resource Languages
Low-resource languages face significant challenges due to the lack of sufficient linguistic data, resources, and tools for tasks such as supervised learning, annotation, and classification. This shortage hinders the development of accurate…
Non-Contextual BERT or FastText? A Comparative Analysis
Natural Language Processing (NLP) for low-resource languages, which lack large annotated datasets, faces significant challenges due to limited high-quality data and linguistic resources. The selection of embeddings plays a critical role in…
Adapting Multilingual LLMs to Low-Resource Languages using Continued Pre-training and Synthetic Corpus
Multilingual LLMs support a variety of languages; however, their performance is suboptimal for low-resource languages. In this work, we emphasize the importance of continued pre-training of multilingual LLMs and the use of translation-base…
Long Range Named Entity Recognition for Marathi Documents
The demand for sophisticated natural language processing (NLP) methods, particularly Named Entity Recognition (NER), has increased due to the exponential growth of Marathi-language digital content. In particular, NER is essential for recog…
L3Cube-MahaSum: A Comprehensive Dataset and BART Models for Abstractive Text Summarization in Marathi
We present the MahaSUM dataset, a large-scale collection of diverse news articles in Marathi, designed to facilitate the training and evaluation of models for abstractive summarization tasks in Indic languages. The dataset, containing 25k …
Automated Assessment of Multimodal Answer Sheets in the STEM domain
In the domain of education, the integration of technology has led to a transformative era, reshaping traditional learning paradigms. Central to this evolution is the automation of grading processes, particularly within the STEM domain enco…
On Importance of Pruning and Distillation for Efficient Low Resource NLP
The rise of large transformer models has revolutionized Natural Language Processing, leading to significant advances in tasks like text classification. However, this progress demands substantial computational resources, escalating training…
L3Cube-IndicQuest: A Benchmark Question Answering Dataset for Evaluating Knowledge of LLMs in Indic Context
Large Language Models (LLMs) have made significant progress in incorporating Indic languages within multilingual models. However, it is crucial to quantitatively assess whether these languages perform comparably to globally dominant ones, …
Chain-of-Translation Prompting (CoTR): A Novel Prompting Technique for Low Resource Languages
This paper introduces Chain of Translation Prompting (CoTR), a novel strategy designed to enhance the performance of language models in low-resource languages. CoTR restructures prompts to first translate the input context from a low-resou…
Compact Language Models via Pruning and Knowledge Distillation
Large language models (LLMs) targeting different deployment scales and sizes are currently produced by training each variant from scratch; this is extremely compute-intensive. In this paper, we investigate if pruning an existing LLM and th…
MahaSQuAD: Bridging Linguistic Divides in Marathi Question-Answering
Question-answering systems have revolutionized information retrieval, but linguistic and cultural boundaries limit their widespread accessibility. This research endeavors to bridge the gap of the absence of efficient QnA datasets in low-re…
A Data Selection Approach for Enhancing Low Resource Machine Translation Using Cross Lingual Sentence Representations
Machine translation in low-resource language pairs faces significant challenges due to the scarcity of parallel corpora and linguistic resources. This study focuses on the case of English-Marathi language pairs, where existing datasets …
Curating Stopwords in Marathi - A TF-IDF Approach
Stopwords are commonly used words in a language that are often considered to be of little value in determining the meaning or significance of a document. These words occur frequently in most texts and don't provide much useful information …
Leveraging Parameter Efficient Training Methods for Low Resource Text Classification: A case study in Marathi
With the surge in digital content in low-resource languages, there is an escalating demand for advanced Natural Language Processing (NLP) techniques tailored to these languages. BERT (Bidirectional Encoder Representations from Transform…
L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic Languages
In this work, we introduce L3Cube-IndicNews, a multilingual text classification corpus aimed at curating a high-quality dataset for Indian regional languages, with a specific focus on news headlines and articles. We have centered our work …
TextGram: Towards a Better Domain-Adaptive Pretraining
L3Cube-MahaNews: News-Based Short Text and Long Document Classification Datasets in Marathi
Rapid Speaker Adaptation in Low Resource Text to Speech Systems using Synthetic Data and Transfer learning
Text-to-speech (TTS) systems are being built using end-to-end deep learning approaches. However, these systems require huge amounts of training data. We present our approach to build production-quality TTS and perform speaker adaptation in…
SenTest: Evaluating Robustness of Sentence Encoders
Contrastive learning has proven to be an effective method for pre-training models using weakly labeled data in the vision domain. Sentence transformers are the NLP counterparts to this architecture, and have been growing in popularity due …
Language augmentation approach for code-mixed text classification
The usage of more than one language in the same text is referred to as Code Mixed. It is evident that there is a growing degree of adaption of the use of code-mixed data, especially English with a regional language, on social media platfor…
mahaNLP: A Marathi Natural Language Processing Library
We present mahaNLP, an open-source natural language processing (NLP) library specifically built for the Marathi language. It aims to enhance the support for the low-resource Indian language Marathi in the field of NLP. It is an easy-to-use…
Harnessing Pre-Trained Sentence Transformers for Offensive Language Detection in Indian Languages
In our increasingly interconnected digital world, social media platforms have emerged as powerful channels for the dissemination of hate speech and offensive content. This work delves into the domain of hate speech detection, placing speci…
Robust Sentiment Analysis for Low Resource languages Using Data Augmentation Approaches: A Case Study in Marathi
Sentiment analysis plays a crucial role in understanding the sentiment expressed in text data. While sentiment analysis research has been extensively conducted in English and other Western languages, there exists a significant gap in resea…