Language modeling and sentiment classification in PyTorch (deprecated, see Megatron-LM)
This repository provides code for unsupervised language modeling and sentiment classification, targeting researchers and practitioners in NLP. It enables training state-of-the-art classification models on custom datasets and reproducing results from NVIDIA's large-scale pretraining and transfer learning papers.
How It Works
The project uses PyTorch to implement Transformer and mLSTM language models. It supports unsupervised pretraining on large text corpora, followed by transfer learning or end-to-end finetuning for classification tasks. Key advantages include mixed-precision (FP16) training and distributed, multi-GPU, multi-node training, building on NVIDIA's Apex extension for scalability and efficiency.
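The mLSTM (multiplicative LSTM, Krause et al.) used here differs from a standard LSTM in that an intermediate multiplicative state m, computed from both the input and the previous hidden state, replaces the previous hidden state in the gate equations, making the recurrent transition input-dependent. A minimal pure-Python sketch of one cell step, with hypothetical tiny dimensions and random weights (not the repository's actual implementation):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def mul(a, b):
    return [x * y for x, y in zip(a, b)]

def mlstm_cell(x, h_prev, c_prev, params):
    """One step of a multiplicative LSTM.

    The multiplicative state m_t = (Wmx x_t) * (Wmh h_{t-1}) stands in for
    h_{t-1} in every gate, so the recurrent dynamics depend on the input.
    """
    Wmx, Wmh, Wix, Wim, Wfx, Wfm, Wox, Wom, Whx, Whm = params
    m = mul(matvec(Wmx, x), matvec(Wmh, h_prev))
    i = [sigmoid(v) for v in add(matvec(Wix, x), matvec(Wim, m))]   # input gate
    f = [sigmoid(v) for v in add(matvec(Wfx, x), matvec(Wfm, m))]   # forget gate
    o = [sigmoid(v) for v in add(matvec(Wox, x), matvec(Wom, m))]   # output gate
    g = [math.tanh(v) for v in add(matvec(Whx, x), matvec(Whm, m))] # candidate
    c = add(mul(f, c_prev), mul(i, g))                              # new cell state
    h = mul(o, [math.tanh(v) for v in c])                           # new hidden state
    return h, c

# Toy demo: input size 3, hidden size 2 (hypothetical dimensions).
random.seed(0)
def rand_mat(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

n_in, n_h = 3, 2
params = (rand_mat(n_h, n_in), rand_mat(n_h, n_h),  # Wmx, Wmh
          rand_mat(n_h, n_in), rand_mat(n_h, n_h),  # Wix, Wim
          rand_mat(n_h, n_in), rand_mat(n_h, n_h),  # Wfx, Wfm
          rand_mat(n_h, n_in), rand_mat(n_h, n_h),  # Wox, Wom
          rand_mat(n_h, n_in), rand_mat(n_h, n_h))  # Whx, Whm

h, c = [0.0] * n_h, [0.0] * n_h
for x in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]):  # a two-step toy sequence
    h, c = mlstm_cell(x, h, c, params)
print(len(h), len(c))  # → 2 2
```

In practice the repository trains such a cell over byte-level text and, following the "sentiment neuron" line of work, transfers the learned hidden features to classification.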
Quick Start & Requirements
python3 setup.py install
Highlighted Details
Maintenance & Community
The project acknowledges contributions from Neel Kant, @csarofeen, and Michael Carilli, and references the Apex GitHub page for mixed-precision and distributed-training utilities. It is built on the Amazon review dataset collected by J. McAuley.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
This repository is DEPRECATED; users are directed to Megatron-LM for up-to-date code. To use this codebase, rely on tagged releases and pin dependencies to the versions that were current at the corresponding release date.