Explanipedia

L3Cube-MahaSTS: A Marathi Sentence Similarity Dataset and Models

Aishwarya Mirashi, Ananya Joshi, Raviraj Joshi · 2025

We present MahaSTS, a human-annotated Sentence Textual Similarity (STS) dataset for Marathi, along with MahaSBERT-STS-v2, a fine-tuned Sentence-BERT model optimized for regression-based similarity scoring. The MahaSTS dataset consists of 1…

L3Cube-MahaEmotions: A Marathi Emotion Recognition Dataset with Synthetic Annotations using CoTR prompting and Large Language Models

Nidhi Kowtal, Raviraj Joshi · 2025

Emotion recognition in low-resource languages like Marathi remains challenging due to limited annotated data. We present L3Cube-MahaEmotions, a high-quality Marathi emotion recognition dataset with 11 fine-grained emotion labels. The train…

Topic Modeling in Marathi

S. B. Shinde, Raviraj Joshi · 2025

While topic modeling in English has become a prevalent and well-explored area, venturing into topic modeling for Indic languages remains relatively rare. The limited availability of resources, diverse linguistic structures, and unique chal…

L3Cube-MahaSum: A Comprehensive Dataset and BART Models for Abstractive Text Summarization in Marathi

Nikita Kulkarni, Kareena Manghani, Sanhita Kulkarni, Pranita Deshmukh, Raviraj Joshi · 2024

On Importance of Layer Pruning for Smaller BERT Models and Low Resource Languages

Raviraj Joshi, Amey Shembade, Mayur Shirke, Madhushri Wagh, Pavan Thorat · 2024

On Importance of Code-Mixed Embeddings for Hate Speech Identification

Shruti Jagdale, Omkar Khade, Gauri Takalikar, Mihir Inamdar, Raviraj Joshi · 2024

Code-mixing is the practice of using two or more languages in a single sentence, which often occurs in multilingual communities such as India where people commonly speak multiple languages. Classic NLP tools, trained on monolingual data, f…

Challenges in Adapting Multilingual LLMs to Low-Resource Languages using LoRA PEFT Tuning

Omkar Khade, Shruti Jagdale, Abhishek Phaltankar, Gauri Takalikar, Raviraj Joshi · 2024

Large Language Models (LLMs) have demonstrated remarkable multilingual capabilities, yet challenges persist in adapting these models for low-resource languages. In this study, we investigate the effects of Low-Rank Adaptation (LoRA) Parame…

On Limitations of LLM as Annotator for Low Resource Languages

Suramya Jadhav, A.G. Shanbhag, Amogh Thakurdesai, Ridhima Sinare, Raviraj Joshi · 2024

Low-resource languages face significant challenges due to the lack of sufficient linguistic data, resources, and tools for tasks such as supervised learning, annotation, and classification. This shortage hinders the development of accurate…

Non-Contextual BERT or FastText? A Comparative Analysis

A.G. Shanbhag, Suramya Jadhav, Amogh Thakurdesai, Ridhima Sinare, Raviraj Joshi · 2024

Natural Language Processing (NLP) for low-resource languages, which lack large annotated datasets, faces significant challenges due to limited high-quality data and linguistic resources. The selection of embeddings plays a critical role in…

Adapting Multilingual LLMs to Low-Resource Languages using Continued Pre-training and Synthetic Corpus

Raviraj Joshi, Kamal Singla, Anusha Kamath, Raunak Kalani, Rakesh Paul, et al. · 2024

Multilingual LLMs support a variety of languages; however, their performance is suboptimal for low-resource languages. In this work, we emphasize the importance of continued pre-training of multilingual LLMs and the use of translation-base…

Long Range Named Entity Recognition for Marathi Documents

Pranita Deshmukh, Nikita Kulkarni, Sanhita Kulkarni, Kareena Manghani, Geetanjali Kale, et al. · 2024

The demand for sophisticated natural language processing (NLP) methods, particularly Named Entity Recognition (NER), has increased due to the exponential growth of Marathi-language digital content. In particular, NER is essential for recog…

L3Cube-MahaSum: A Comprehensive Dataset and BART Models for Abstractive Text Summarization in Marathi

Pranita Deshmukh, Nikita Kulkarni, Sanhita Kulkarni, Kareena Manghani, Raviraj Joshi · 2024

We present the MahaSUM dataset, a large-scale collection of diverse news articles in Marathi, designed to facilitate the training and evaluation of models for abstractive summarization tasks in Indic languages. The dataset, containing 25k …

Automated Assessment of Multimodal Answer Sheets in the STEM domain

Rajlaxmi Patil, Aditya Kulkarni, Ruturaj Ghatage, Sharvi Endait, Geetanjali Kale, et al. · 2024

In the domain of education, the integration of technology has led to a transformative era, reshaping traditional learning paradigms. Central to this evolution is the automation of grading processes, particularly within the STEM domain enco…

On Importance of Pruning and Distillation for Efficient Low Resource NLP

Aishwarya Mirashi, Purva Lingayat, Srushti Sonavane, Tejas Padhiyar, Raviraj Joshi, et al. · 2024

The rise of large transformer models has revolutionized Natural Language Processing, leading to significant advances in tasks like text classification. However, this progress demands substantial computational resources, escalating training…

L3Cube-IndicQuest: A Benchmark Question Answering Dataset for Evaluating Knowledge of LLMs in Indic Context

Pritika Rohera, Chaitrali Ginimav, Akanksha Salunke, Gayatri Sawant, Raviraj Joshi · 2024

Large Language Models (LLMs) have made significant progress in incorporating Indic languages within multilingual models. However, it is crucial to quantitatively assess whether these languages perform comparably to globally dominant ones, …

Chain-of-Translation Prompting (CoTR): A Novel Prompting Technique for Low Resource Languages

Tejas Deshpande, Nidhi Kowtal, Raviraj Joshi · 2024

This paper introduces Chain of Translation Prompting (CoTR), a novel strategy designed to enhance the performance of language models in low-resource languages. CoTR restructures prompts to first translate the input context from a low-resou…

Compact Language Models via Pruning and Knowledge Distillation

Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, et al. · 2024

Large language models (LLMs) targeting different deployment scales and sizes are currently produced by training each variant from scratch; this is extremely compute-intensive. In this paper, we investigate if pruning an existing LLM and th…

MahaSQuAD: Bridging Linguistic Divides in Marathi Question-Answering

Ruturaj Ghatage, Aditya Kulkarni, Rajlaxmi Patil, Sharvi Endait, Raviraj Joshi · 2024

Question-answering systems have revolutionized information retrieval, but linguistic and cultural boundaries limit their widespread accessibility. This research endeavors to bridge the gap of the absence of efficient QnA datasets in low-re…

A Data Selection Approach for Enhancing Low Resource Machine Translation Using Cross Lingual Sentence Representations

Nidhi Kowtal, Tejas Deshpande, Raviraj Joshi · 2024

Machine translation in low-resource language pairs faces significant challenges due to the scarcity of parallel corpora and linguistic resources. This study focuses on the case of English-Marathi language pairs, where existing datasets …

Curating Stopwords in Marathi - A TF-IDF Approach

Rohan Chavan, Vishal Madle, Gaurav Patil, Raviraj Joshi · 2024

Stopwords are commonly used words in a language that are often considered to be of little value in determining the meaning or significance of a document. These words occur frequently in most texts and don't provide much useful information …

Leveraging Parameter Efficient Training Methods for Low Resource Text Classification: A case study in Marathi

Pranita Deshmukh, Nikita Kulkarni, Sanhita Kulkarni, Kareena Manghani, Raviraj Joshi · 2024

With the surge in digital content in low-resource languages, there is an escalating demand for advanced Natural Language Processing (NLP) techniques tailored to these languages. BERT (Bidirectional Encoder Representations from Transform…

L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic Languages

Aishwarya Mirashi, Srushti Sonavane, Purva Lingayat, Tejas Padhiyar, Raviraj Joshi · 2024

In this work, we introduce L3Cube-IndicNews, a multilingual text classification corpus aimed at curating a high-quality dataset for Indian regional languages, with a specific focus on news headlines and articles. We have centered our work …

TextGram: Towards a Better Domain-Adaptive Pretraining

Sharayu Hiwarkhedkar, Saloni Mittal, Vidula Magdum, Omkar Dhekane, Raviraj Joshi, et al. · 2024

L3Cube-MahaNews: News-Based Short Text and Long Document Classification Datasets in Marathi

Saloni Mittal, Vidula Magdum, Sharayu Hiwarkhedkar, Omkar Dhekane, Raviraj Joshi · 2024

Rapid Speaker Adaptation in Low Resource Text to Speech Systems using Synthetic Data and Transfer learning

Raviraj Joshi, Nikesh Garera · 2023

Text-to-speech (TTS) systems are being built using end-to-end deep learning approaches. However, these systems require huge amounts of training data. We present our approach to build production-quality TTS and perform speaker adaptation in…

SenTest: Evaluating Robustness of Sentence Encoders

Tanmay Chavan, Shantanu Patankar, Aditya Kane, Omkar Gokhale, Geetanjali Kale, et al. · 2023

Contrastive learning has proven to be an effective method for pre-training models using weakly labeled data in the vision domain. Sentence transformers are the NLP counterparts to this architecture, and have been growing in popularity due …

Language augmentation approach for code-mixed text classification

Gauri Takawane, Abhishek Phaltankar, Varad Patwardhan, Aryan Patil, Raviraj Joshi, et al. · 2023

The usage of more than one language in the same text is referred to as Code Mixed. It is evident that there is a growing degree of adaption of the use of code-mixed data, especially English with a regional language, on social media platfor…

mahaNLP: A Marathi Natural Language Processing Library

Vidula Magdum, Omkar Dhekane, Sharayu Hiwarkhedkar, Saloni Mittal, Raviraj Joshi · 2023

We present mahaNLP, an open-source natural language processing (NLP) library specifically built for the Marathi language. It aims to enhance the support for the low-resource Indian language Marathi in the field of NLP. It is an easy-to-use…

Harnessing Pre-Trained Sentence Transformers for Offensive Language Detection in Indian Languages

Ananya Joshi, Raviraj Joshi · 2023

In our increasingly interconnected digital world, social media platforms have emerged as powerful channels for the dissemination of hate speech and offensive content. This work delves into the domain of hate speech detection, placing speci…

Robust Sentiment Analysis for Low Resource languages Using Data Augmentation Approaches: A Case Study in Marathi

Aabha Pingle, Aditya Vyawahare, Isha Joshi, Rahul Tangsali, Geetanjali Kale, et al. · 2023

Sentiment analysis plays a crucial role in understanding the sentiment expressed in text data. While sentiment analysis research has been extensively conducted in English and other Western languages, there exists a significant gap in resea…

Raviraj Joshi