State-of-the-art natural language processing (NLP) models are trained on massive corpora and report superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a …
Despite excellent results on benchmarks over a small subset of languages, large language models struggle to process text from languages situated in 'lower-resource' scenarios such as dialects/sociolects (national or social varieties of a language), …
This paper investigates data sampling strategies to create a benchmark for dialectal sentiment classification of Google Places reviews written in English. Based on location-based filtering, we collect a self-supervised dataset of reviews in …