Malaysian dataset aggregator for NLP research
This repository provides a curated collection of Malaysian datasets for researchers and developers working with Malaysian-language text and data. It serves as a centralized resource covering several data types, making them easier to access and use in AI and NLP projects focused on Malaysia.
How It Works
The project gathers data by crawling Malaysian websites and social media platforms such as Twitter, Facebook, and Instagram. For text data, it uses a semi-supervised approach in which teacher-student models and LLMs (ChatGPT, Mixtral, Llama3) handle translation and data generation. This lets data collection and annotation scale efficiently.
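The teacher-student idea can be illustrated with a minimal pseudo-labeling sketch. This is not the project's actual pipeline: the `pseudo_label` function, the `toy_teacher` keyword rule, the sample sentences, and the 0.9 confidence threshold are all hypothetical, chosen only to show how a teacher's confident predictions on unlabeled text might become training pairs for a student model.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical type: a teacher maps text to (label, confidence in [0, 1]).
Teacher = Callable[[str], Tuple[str, float]]


def pseudo_label(
    texts: Iterable[str], teacher: Teacher, min_confidence: float = 0.9
) -> List[Tuple[str, str]]:
    """Keep only the teacher's confident predictions as student training pairs."""
    kept = []
    for text in texts:
        label, confidence = teacher(text)
        # Low-confidence predictions are dropped rather than trusted.
        if confidence >= min_confidence:
            kept.append((text, label))
    return kept


def toy_teacher(text: str) -> Tuple[str, float]:
    """Stand-in for a real model: a keyword rule with made-up confidences."""
    lowered = text.lower()
    if "bagus" in lowered:
        return "positive", 0.95
    if "teruk" in lowered:
        return "negative", 0.92
    return "neutral", 0.40


# Assumed sample of crawled, unlabeled sentences.
crawled = ["Makanan ini bagus", "Servis teruk", "Saya pergi ke pasar"]
student_training_set = pseudo_label(crawled, toy_teacher)
```

In this sketch the third sentence is discarded because the toy teacher is not confident about it. In a real setup the teacher would be a trained model or an LLM, and the threshold would be tuned against a held-out labeled set.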
Quick Start & Requirements
Highlighted Details
Maintenance & Community
The project is associated with Mesolitica and acknowledges Im Big, LigBlou, and KeyReply for sponsoring infrastructure.
Licensing & Compatibility
The datasets are intended for non-commercial and research use because parts of the pipeline rely on third-party software such as Google Translate. Commercial use is restricted to avoid potential complications arising from that reliance.
Limitations & Caveats
The datasets focus primarily on Malaysian content and may not suit general-purpose NLP tasks. Because some data processing relies on third-party tools, the data should be reviewed carefully before any commercial application.