Assessing Variations in Open Datasets for Training Large Language Models: Biases and Benchmarking


Vincent Koc

Abstract

Open datasets are critical to the development and training of large language models (LLMs). However, variations in dataset composition often introduce biases that can impact model performance and reliability. This article investigates the nature and extent of these variations, categorizes the biases inherent in open datasets, and examines their implications for LLM training. We also evaluate the benchmarking standards currently employed to measure LLM performance and propose enhancements for a fairer and more inclusive evaluation framework. Through extensive experiments and analyses, we reveal the consequences of dataset heterogeneity and demonstrate practical strategies for mitigating biases. Our findings emphasize the importance of transparent dataset curation and robust benchmarking practices for the ethical development of LLMs.


How to Cite
Vincent Koc. (2024). Assessing Variations in Open Datasets for Training Large Language Models: Biases and Benchmarking. Pioneer Research Journal of Computing Science, 1(1), 83–92. Retrieved from https://prjcs.com/index.php/prjcs/article/view/20