InsTag by OFA-Sys

Tool for analyzing supervised fine-tuning (SFT) data in LLMs

created 2 years ago
267 stars

Top 95.9% on SourcePulse

View on GitHub
Project Summary

InsTag is a tool designed to analyze and improve supervised fine-tuning (SFT) datasets for large language models (LLMs). It addresses the need for quantitative analysis of instruction diversity and complexity in SFT data, offering a method to create more effective datasets and fine-tuned models. The project targets LLM researchers and developers working with SFT data.

How It Works

InsTag introduces a fine-grained tagging system that categorizes SFT data samples by semantics and intention, yielding approximately 6.6K unique tags. These tags give quantitative definitions of the two qualities the project studies: diversity, measured as the number of unique tags a dataset covers, and complexity, measured as the average number of tags per query. Building on these metrics, the project provides a data selector that samples diverse and complex instructions, which are then used to fine-tune LLMs. The approach aims to match or exceed models trained on much larger SFT sets using a smaller, curated subset.
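
To make the metrics concrete, here is a minimal Python sketch of how the two measures and a greedy, complexity-first selector could look when applied to already-tagged samples. The sample format, function names, and exact selection rule are illustrative assumptions, not the repository's actual API; the tags themselves would come from InsTagger or the InsTag annotation pipeline.

```python
# Minimal sketch of InsTag-style metrics and a greedy selector. The sample
# format and selection rule below are assumptions for illustration only.

def diversity(dataset: list[dict]) -> int:
    """Diversity: number of unique tags covered by the whole dataset."""
    return len({tag for sample in dataset for tag in sample["tags"]})

def complexity(dataset: list[dict]) -> float:
    """Complexity: average number of tags per query."""
    return sum(len(sample["tags"]) for sample in dataset) / len(dataset)

def select_diverse_complex(dataset: list[dict], budget: int) -> list[dict]:
    """Greedy, complexity-first pass: visit samples in descending tag count
    and keep one only if it contributes at least one previously unseen tag."""
    selected, covered = [], set()
    for sample in sorted(dataset, key=lambda s: len(s["tags"]), reverse=True):
        if len(selected) >= budget:
            break
        if set(sample["tags"]) - covered:  # adds new tags -> raises diversity
            selected.append(sample)
            covered |= set(sample["tags"])
    return selected

# Example with hypothetical tagged samples:
data = [
    {"query": "Prove sqrt(2) is irrational", "tags": ["math", "proof"]},
    {"query": "Write a haiku about rain", "tags": ["poetry"]},
    {"query": "Solve x^2 = 4", "tags": ["math"]},
]
subset = select_diverse_complex(data, budget=2)
print(diversity(subset), complexity(subset))  # -> 3 1.5
```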

Quick Start & Requirements

  • InsTagger (Local Tagging): Download the weights from the Hugging Face Hub and use FastChat for serving or inference (see the sketch after this list).
  • TagLM Models: Download weights from Hugging Face Hub. Fine-tuning was performed using the FastChat codebase with the Vicuna V1.1 system template.
  • Prerequisites: LLaMA or LLaMA-2 models, FastChat.
  • Demo: An online demo of InsTagger is available via ModelScope.
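
As a concrete starting point, the following sketch downloads the InsTagger weights and hands them to FastChat's CLI. The Hugging Face repo id is an assumption based on the project's organization name; check the README for the exact id and for multi-worker serving setups.

```python
# Sketch: fetch InsTagger weights, then tag instructions via FastChat's CLI.
# The repo id "OFA-Sys/InsTagger" is assumed; verify it against the README.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="OFA-Sys/InsTagger")

# FastChat can then run interactive inference on the downloaded checkpoint:
#   python3 -m fastchat.serve.cli --model-path <local_dir>
# TagLM fine-tuning used FastChat's Vicuna V1.1 conversation template
# ("USER: ... ASSISTANT: ..."), so prompts should follow that format.
print(f"InsTagger weights downloaded to: {local_dir}")
```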

Highlighted Details

  • TagLM-13B-v2.0, fine-tuned on a 6K InsTag-selected subset, outperforms many open-source LLMs on MT-Bench.
  • InsTagger is a LLaMA-2-based model fine-tuned on InsTag's tagging results.
  • The project provides checkpoints for both the InsTagger and TagLM models.

Maintenance & Community

The project released its paper and model checkpoints in August 2023. Further community engagement details (e.g., Discord/Slack) are not explicitly mentioned in the README.

Licensing & Compatibility

  • InsTagger is released under the LLaMA 2 License.
  • TagLM-13B-v1.0 uses the LLaMA License, and TagLM-13B-v2.0 uses the LLaMA 2 License.
  • All models are based on LLaMA or LLaMA-2 and must adhere to their respective licenses. Compatibility with commercial or closed-source projects depends on the terms of the LLaMA/LLaMA-2 licenses.

Limitations & Caveats

The project's scope is limited to analyzing and curating SFT data; it is not a general-purpose toolkit for LLM development. Its performance claims rest on MT-Bench evaluations with GPT-4 as the judge, so they inherit the usual caveats of LLM-as-judge benchmarks.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 30 days

Explore Similar Projects

Starred by Junyang Lin (Core Maintainer of Alibaba Qwen), Vincent Weisser (Cofounder of Prime Intellect), and 16 more.

alpaca-lora by tloen

LoRA fine-tuning for LLaMA

0.1% · 19k stars
created 2 years ago, updated 1 year ago