malaysian-dataset  by mesolitica

Malaysian dataset aggregator for NLP research

created 7 years ago
321 stars

Top 85.7% on sourcepulse

Project Summary

This repository aggregates Malaysian datasets for researchers and developers working on Malaysian-language NLP and AI. It gives them a single place to find and use several types of data for projects focused on Malaysia.

How It Works

The project collects data by crawling Malaysian websites and social media platforms such as Twitter, Facebook, and Instagram. For text data, it takes a semi-supervised approach: teacher-student models and LLMs (ChatGPT, Mixtral, Llama3) are used for translation and data generation. This lets collection and annotation scale with less manual labeling.
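The teacher-student idea can be illustrated with a minimal pseudo-labeling sketch. This is not the project's pipeline. The keyword-based `keyword_teacher`, the confidence threshold, and the token-counting student are all hypothetical stand-ins chosen so the example runs without any model or network access. In practice the teacher would be a trained model or an LLM.

```python
from collections import Counter
from typing import Callable, Iterable

# A teacher maps a text to (label, confidence). Assumption: confidence is in [0, 1].
Teacher = Callable[[str], tuple[str, float]]


def keyword_teacher(text: str) -> tuple[str, float]:
    """Toy stand-in for a teacher model (hypothetical, not the project's model)."""
    positive = {"bagus", "suka", "terbaik"}
    negative = {"teruk", "benci", "lambat"}
    tokens = text.lower().split()
    pos = sum(t in positive for t in tokens)
    neg = sum(t in negative for t in tokens)
    if pos == neg:
        # No evidence, or evidence that cancels out: low confidence.
        return "neutral", 0.5
    label = "positive" if pos > neg else "negative"
    return label, max(pos, neg) / (pos + neg)


def pseudo_label(
    texts: Iterable[str], teacher: Teacher, threshold: float = 0.9
) -> list[tuple[str, str]]:
    """Keep only the texts the teacher labels with confidence >= threshold."""
    kept = []
    for text in texts:
        label, confidence = teacher(text)
        if confidence >= threshold:
            kept.append((text, label))
    return kept


def train_student(examples: Iterable[tuple[str, str]]) -> dict[str, Counter]:
    """Toy student: count which labels each token co-occurs with."""
    counts: dict[str, Counter] = {}
    for text, label in examples:
        for token in text.lower().split():
            counts.setdefault(token, Counter())[label] += 1
    return counts


def predict(student: dict[str, Counter], text: str, default: str = "neutral") -> str:
    """Sum per-token label votes and return the most common label."""
    votes: Counter = Counter()
    for token in text.lower().split():
        votes.update(student.get(token, {}))
    return votes.most_common(1)[0][0] if votes else default


unlabeled = [
    "makanan ini bagus",
    "servis teruk dan lambat",
    "saya pergi ke pasar",
    "bagus tapi lambat",
]
pseudo = pseudo_label(unlabeled, keyword_teacher)
student = train_student(pseudo)
```

Only the two confidently labeled sentences reach the student. The ambiguous and neutral ones are dropped instead of being passed on as noisy labels, which is the usual reason for thresholding in this kind of setup.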

Quick Start & Requirements

Highlighted Details

  • Extensive data collection from Malaysian websites and social media.
  • Utilizes LLMs (ChatGPT, Mixtral, Llama3) and semi-supervised learning for data processing.
  • Includes translated datasets using Google Translate and custom models.
  • Data is hosted on Hugging Face for easy access.

Maintenance & Community

The project is associated with Mesolitica and acknowledges Im Big, LigBlou, and KeyReply for sponsoring infrastructure.

Licensing & Compatibility

The datasets are intended for non-commercial and research use because parts of them were produced with third-party software such as Google Translate. Commercial use is restricted to avoid potential complications with the terms of those tools.

Limitations & Caveats

The datasets are primarily focused on Malaysian content and may not be suitable for general-purpose NLP tasks. Some data processing relies on third-party tools, necessitating careful review for commercial applications.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History
7 stars in the last 90 days
