malaysian-dataset by malaysia-ai

Malaysian dataset aggregator for NLP research

Created 8 years ago

339 stars

Top 81.4% on SourcePulse

1 Expert Loves This Project

Elvis Saravia

Founder of DAIR.AI

Project Summary

This repository provides a curated collection of Malaysian datasets, aimed primarily at researchers and developers working with Malaysian language and data. It serves as a central resource for several data types, making them easier to find and use in AI and NLP projects focused on Malaysia.

How It Works

The project gathers data by crawling Malaysian websites and social media platforms such as Twitter, Facebook, and Instagram. For text data, it takes a semi-supervised approach, using teacher-student models and LLMs (ChatGPT, Mixtral, Llama3) for translation and data generation. This lets data collection and annotation scale without labeling every example by hand.
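The teacher-student idea can be illustrated with a minimal pseudo-labeling sketch. This is not the project's actual pipeline: the `Teacher` signature, the `pseudo_label` helper, the `toy_teacher` keyword rule, and the 0.9 confidence threshold are all assumptions made for illustration. In practice the teacher would be a trained model or an LLM, and the retained pairs would be used to train a student model.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical teacher signature: text -> (label, confidence in [0, 1]).
Teacher = Callable[[str], Tuple[str, float]]


def pseudo_label(
    texts: Iterable[str],
    teacher: Teacher,
    min_confidence: float = 0.9,  # assumed threshold, tune per task
) -> List[Tuple[str, str]]:
    """Keep only teacher predictions that clear the confidence threshold."""
    kept = []
    for text in texts:
        label, confidence = teacher(text)
        if confidence >= min_confidence:
            kept.append((text, label))
    return kept


def toy_teacher(text: str) -> Tuple[str, float]:
    """Stand-in teacher: a keyword rule, not a real model."""
    lowered = text.lower()
    if "terbaik" in lowered:
        return "positive", 0.95
    if "teruk" in lowered:
        return "negative", 0.93
    return "neutral", 0.40


unlabeled = [
    "Makanan ini terbaik",
    "Servis teruk sangat",
    "Saya pergi ke pasar",
]

# Low-confidence predictions are dropped; the rest become student training data.
silver = pseudo_label(unlabeled, toy_teacher)
print(silver)
```

Filtering on confidence trades coverage for label quality: a lower threshold keeps more examples but passes more teacher mistakes on to the student.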

Quick Start & Requirements

Data is available via Hugging Face: https://huggingface.co/mesolitica and https://huggingface.co/malaysia-ai.
Detailed documentation: https://malaysian-dataset.readthedocs.io/
Crawled website list: https://github.com/users/huseinzol05/projects/1
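To see which datasets the two Hugging Face organizations publish, one option is to query the Hub's public dataset listing endpoint. The sketch below uses only the Python standard library; the helper names `build_listing_url` and `list_dataset_ids` are hypothetical, and the network call is left commented out because it requires internet access. No specific dataset names are assumed here.

```python
import json
from typing import List
from urllib.parse import urlencode
from urllib.request import urlopen

HUB_API = "https://huggingface.co/api/datasets"


def build_listing_url(author: str, limit: int = 5) -> str:
    """Build the Hub listing URL for datasets owned by one organization."""
    return f"{HUB_API}?{urlencode({'author': author, 'limit': limit})}"


def list_dataset_ids(author: str, limit: int = 5) -> List[str]:
    """Fetch dataset ids for an organization (requires network access)."""
    with urlopen(build_listing_url(author, limit), timeout=30) as response:
        return [entry["id"] for entry in json.load(response)]


print(build_listing_url("mesolitica"))

# Example, run only with network access:
# for org in ("mesolitica", "malaysia-ai"):
#     print(org, list_dataset_ids(org))
```

Once a dataset id is known, it can be loaded with whichever tooling the dataset card recommends.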

Highlighted Details

Extensive data collection from Malaysian websites and social media.
Utilizes LLMs (ChatGPT, Mixtral, Llama3) and semi-supervised learning for data processing.
Includes translated datasets using Google Translate and custom models.
Data is hosted on Hugging Face for easy access.

Maintenance & Community

The project is associated with Mesolitica and acknowledges Im Big, LigBlou, and KeyReply for sponsoring its infrastructure.

Licensing & Compatibility

The datasets are intended for non-commercial and research purposes because some of them were produced with third-party software such as Google Translate. Commercial use is restricted to avoid potential complications arising from those tools.

Limitations & Caveats

The datasets focus on Malaysian content and may not suit general-purpose NLP tasks. Some data processing relies on third-party tools, so any commercial application needs careful review first.

Health Check

Last Commit: 5 months ago

Responsiveness: 1 day

Star History

2 stars in the last 30 days