ngram by EurekaLabsAI

N-gram language model for character-level name generation

Created 1 year ago

1,462 stars

Top 27.8% on SourcePulse


1 Expert Loves This Project

Jiayi Pan

Author of SWE-Gym; MTS at xAI

Project Summary

This repository implements a character-level n-gram language model that demonstrates fundamental machine learning concepts such as training, evaluation, and hyperparameter tuning. It is educational in purpose, aimed at readers learning about language modeling and autoregressive generation, and provides a foundation before moving on to neural network models such as GPT.

How It Works

The model uses a count-based approach to predict the next character from the preceding n-1 characters (an n-gram of order n spans that context plus the predicted character). It estimates probabilities from character co-occurrence counts in a provided dataset of names. The implementation runs a grid search over the n-gram order and the smoothing parameter, then samples from the model to generate new names. The result is a clear, interpretable baseline for language modeling.
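The counting and sampling steps described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the repository's code: the class name `CharNgram`, the use of a newline character as both start and end marker, and the add-k form of smoothing are all assumptions made for the example.

```python
import random
from collections import defaultdict

class CharNgram:
    """Illustrative count-based character n-gram model (hypothetical names)."""

    def __init__(self, n, vocab, smoothing=0.1):
        self.n = n                  # order n: the context is n-1 characters
        self.vocab = sorted(vocab)  # must include the "\n" marker
        self.smoothing = smoothing  # add-k smoothing; keep it > 0
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, names):
        for name in names:
            # Assumption: "\n" pads the start and terminates each name.
            seq = ["\n"] * (self.n - 1) + list(name) + ["\n"]
            for i in range(self.n - 1, len(seq)):
                context = tuple(seq[i - self.n + 1:i])
                self.counts[context][seq[i]] += 1

    def probs(self, context):
        # Smoothed next-character distribution for one context.
        ctx = self.counts.get(tuple(context), {})
        total = sum(ctx.values()) + self.smoothing * len(self.vocab)
        return {ch: (ctx.get(ch, 0) + self.smoothing) / total
                for ch in self.vocab}

    def sample(self, rng, max_len=30):
        # Draw one character at a time until the end marker appears.
        context = ["\n"] * (self.n - 1)
        out = []
        for _ in range(max_len):
            p = self.probs(context)
            ch = rng.choices(list(p.keys()), weights=list(p.values()))[0]
            if ch == "\n":
                break
            out.append(ch)
            context = (context + [ch])[1:]  # slide the window by one
        return "".join(out)

# Toy usage with a made-up list of names.
names = ["emma", "olivia", "ava", "mia", "amelia"]
model = CharNgram(3, set("".join(names)) | {"\n"}, smoothing=0.1)
model.train(names)
print(model.sample(random.Random(0)))
```

A dictionary of counts keeps the sketch short; a dense array indexed by token ids is the more common layout when the vocabulary and order are small.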

Quick Start & Requirements

Python: pip install numpy then python ngram.py
C: clang -O3 -o ngram ngram.c -lm then ./ngram
Prerequisites: Python 3, NumPy, C compiler (clang/gcc).
Resources: Minimal; runs quickly on standard hardware.
Links: Jupyter notebook for visualization

Highlighted Details

Demonstrates character-level tokenization and next-token prediction.
Includes hyperparameter tuning for n-gram order and smoothing.
Achieves a test perplexity of ~8.2 on the provided name dataset.
Offers a Python implementation and a significantly faster C implementation.
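To make the tuning and perplexity figures above concrete, here is a hedged sketch of how a grid search over n-gram order and smoothing might be scored. Perplexity is the exponential of the mean negative log-likelihood per character. The function names (`ngrams`, `fit`, `avg_nll`, `grid_search`), the newline marker, the add-k smoothing, and the toy data are assumptions for illustration and are not taken from the repository.

```python
import math
from collections import Counter
from itertools import product

def ngrams(name, n):
    """Yield (context, next_char) pairs; "\n" marks start and end (assumed)."""
    seq = ["\n"] * (n - 1) + list(name) + ["\n"]
    for i in range(n - 1, len(seq)):
        yield tuple(seq[i - n + 1:i]), seq[i]

def fit(names, n):
    """Count each n-gram and each context over the training names."""
    joint, ctx = Counter(), Counter()
    for name in names:
        for c, ch in ngrams(name, n):
            joint[(c, ch)] += 1
            ctx[c] += 1
    return joint, ctx

def avg_nll(names, n, joint, ctx, vocab_size, k):
    """Mean negative log-likelihood per character with add-k smoothing (k > 0)."""
    total, count = 0.0, 0
    for name in names:
        for c, ch in ngrams(name, n):
            p = (joint[(c, ch)] + k) / (ctx[c] + k * vocab_size)
            total -= math.log(p)
            count += 1
    return total / count

def grid_search(train, val, vocab_size, orders, smoothings):
    """Return (loss, n, k) with the lowest loss on the validation names."""
    best = None
    for n, k in product(orders, smoothings):
        joint, ctx = fit(train, n)
        loss = avg_nll(val, n, joint, ctx, vocab_size, k)
        if best is None or loss < best[0]:
            best = (loss, n, k)
    return best

# Toy usage with made-up splits; real runs would use the names dataset.
train = ["emma", "olivia", "ava", "mia", "amelia"]
val = ["ella", "lia"]
vocab_size = len(set("".join(train + val)) | {"\n"})
loss, n, k = grid_search(train, val, vocab_size, [1, 2, 3], [0.03, 0.1, 1.0])
print(f"best n={n} smoothing={k} val perplexity={math.exp(loss):.3f}")
```

Selecting hyperparameters on a validation split and reporting perplexity once on a held-out test split keeps the final number from being tuned against.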

Maintenance & Community

The README does not detail specific contributors, community links, or a roadmap. It does list "TODOs", including visualization enhancements, and asks the community for help with them.

Licensing & Compatibility

The README does not explicitly state a license. The code is provided for educational purposes, and commercial use or closed-source linking compatibility is not specified.

Limitations & Caveats

The model is a basic n-gram implementation, producing some nonsensical outputs alongside reasonable names. It is primarily educational and lacks advanced features or robust error handling. The project has outstanding "TODOs" for further development.

Health Check

Last Commit: 1 year ago

Responsiveness: Inactive

Star History

0 stars in the last 30 days