ngram by EurekaLabsAI

N-gram language model for character-level name generation

created 1 year ago
1,437 stars

Top 29.0% on sourcepulse

Project Summary

This repository implements a character-level n-gram language model that demonstrates fundamental machine learning concepts: training, evaluation, and hyperparameter tuning. It is intended for education, aimed at readers learning about language modeling and autoregressive generation, and it gives a foundation before moving on to neural network models like GPT.

How It Works

The model uses a count-based approach to predict the next character from the preceding n-1 characters (an n-gram spans the context plus the predicted character). It estimates probabilities from counts of character sequences in a provided dataset of names. A grid search selects the n-gram order and smoothing parameter, and the tuned model is then sampled to generate new names. The result is a clear, interpretable baseline for language modeling.
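The count-then-sample idea described above can be illustrated with a minimal sketch. This is not the repository's code: the function names (`train_counts`, `next_char_probs`, `sample_name`), the use of a newline as the start/end delimiter, and add-k smoothing as the smoothing scheme are all assumptions made for illustration, and it uses only the Python standard library rather than NumPy.

```python
import random
from collections import defaultdict

EOT = "\n"  # assumed delimiter marking the start and end of each name

def train_counts(names, n):
    """Count how often each character follows each (n-1)-character context."""
    counts = defaultdict(lambda: defaultdict(int))
    for name in names:
        # Pad with delimiters so the first character has a full context.
        seq = EOT * (n - 1) + name + EOT
        for i in range(n - 1, len(seq)):
            context = seq[i - (n - 1):i]
            counts[context][seq[i]] += 1
    return counts

def next_char_probs(counts, context, vocab, smoothing):
    """Add-k smoothed distribution over the next character.

    smoothing must be > 0 if the context may be unseen in training.
    """
    row = counts.get(context, {})
    total = sum(row.values()) + smoothing * len(vocab)
    return {ch: (row.get(ch, 0) + smoothing) / total for ch in vocab}

def sample_name(counts, n, vocab, smoothing, rng, max_len=20):
    """Sample one character at a time until the delimiter or max_len."""
    context = EOT * (n - 1)
    out = []
    while len(out) < max_len:
        probs = next_char_probs(counts, context, vocab, smoothing)
        ch = rng.choices(list(probs), weights=list(probs.values()))[0]
        if ch == EOT:
            break
        out.append(ch)
        # Slide the context window; a unigram model (n == 1) has no context.
        context = (context + ch)[-(n - 1):] if n > 1 else ""
    return "".join(out)

if __name__ == "__main__":
    names = ["emma", "ava", "mia", "noah"]  # toy data, not the repo's dataset
    vocab = sorted(set("".join(names)) | {EOT})
    counts = train_counts(names, n=3)
    print(sample_name(counts, 3, vocab, smoothing=0.1, rng=random.Random(0)))
```

A nested dictionary keeps the sketch readable; a dense count array indexed by character ids would be the more natural layout when the vocabulary and order are small.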

Quick Start & Requirements

  • Python: pip install numpy then python ngram.py
  • C: clang -O3 -o ngram ngram.c -lm then ./ngram
  • Prerequisites: Python 3, NumPy, C compiler (clang/gcc).
  • Resources: Minimal; runs quickly on standard hardware.
  • Links: Jupyter notebook for visualization

Highlighted Details

  • Demonstrates character-level tokenization and next-token prediction.
  • Includes hyperparameter tuning for n-gram order and smoothing.
  • Achieves a test perplexity of ~8.2 on the provided name dataset.
  • Offers a Python implementation and a significantly faster C implementation.
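To make the tuning and perplexity items above concrete, here is a hedged sketch of how perplexity might be computed for a count-based model and used to choose the n-gram order and smoothing value. The names `fit`, `perplexity`, and `grid_search`, the add-k smoothing formula, the newline delimiter, and the use of a held-out validation split for selection are assumptions for illustration, not details confirmed by the project summary.

```python
import math
from collections import Counter
from itertools import product

EOT = "\n"  # assumed delimiter marking the start and end of each name

def fit(names, n):
    """Count every n-character window (context plus next character)."""
    counts = Counter()
    for name in names:
        seq = EOT * (n - 1) + name + EOT
        for i in range(n - 1, len(seq)):
            counts[seq[i - (n - 1):i + 1]] += 1
    return counts

def perplexity(counts, names, n, vocab, smoothing):
    """exp of the mean negative log-likelihood per predicted character."""
    # Total count of each context, needed for the smoothed denominator.
    context_totals = Counter()
    for gram, c in counts.items():
        context_totals[gram[:-1]] += c
    nll, num = 0.0, 0
    for name in names:
        seq = EOT * (n - 1) + name + EOT
        for i in range(n - 1, len(seq)):
            gram = seq[i - (n - 1):i + 1]
            # Add-k smoothing: unseen n-grams still get nonzero probability.
            p = (counts[gram] + smoothing) / (
                context_totals[gram[:-1]] + smoothing * len(vocab)
            )
            nll -= math.log(p)
            num += 1
    return math.exp(nll / num)

def grid_search(train, val, vocab, orders, smoothings):
    """Return (perplexity, n, smoothing) with the lowest validation perplexity."""
    best = None
    for n, k in product(orders, smoothings):
        ppl = perplexity(fit(train, n), val, n, vocab, k)
        if best is None or ppl < best[0]:
            best = (ppl, n, k)
    return best

if __name__ == "__main__":
    train = ["emma", "ava", "mia", "noah", "liam"]  # toy data only
    val = ["ema", "nia"]
    vocab = sorted(set("".join(train + val)) | {EOT})
    print(grid_search(train, val, vocab, orders=[1, 2, 3], smoothings=[0.03, 0.1, 1.0]))
```

A useful sanity check on any such implementation: with no training data and positive smoothing, every character is equally likely, so the perplexity equals the vocabulary size.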

Maintenance & Community

The README does not detail specific contributors, community links, or a roadmap. It does list "TODOs" for visualization enhancements and asks the community for help with them.

Licensing & Compatibility

The README does not explicitly state a license. The code is provided for educational purposes, and commercial use or closed-source linking compatibility is not specified.

Limitations & Caveats

The model is a basic n-gram implementation, producing some nonsensical outputs alongside reasonable names. It is primarily educational and lacks advanced features or robust error handling. The project has outstanding "TODOs" for further development.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 24 stars in the last 90 days
