Tabular-LLM by SpursGoZmy

Tabular LLM: LLM fine-tuning for table understanding

created 2 years ago
605 stars

Top 54.9% on sourcepulse

Project Summary

This project collects and formats open-source datasets for instruction fine-tuning, with the goal of building large language models specialized for tabular intelligence tasks. It targets researchers and practitioners who want to strengthen LLMs' ability to understand and process tabular data, offering a unified platform and curated datasets for tasks such as table question answering and table-to-text generation.

How It Works

The project leverages the Alpaca-CoT framework for instruction fine-tuning LLMs. It standardizes diverse tabular datasets into an instruction-following format, incorporating Chain-of-Thought (CoT) reasoning where available. The approach focuses on enhancing LLMs' comprehension of various table structures and tasks, with a commitment to open-sourcing processed data and fine-tuned models.
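The standardization step described above can be sketched as a small conversion function. This is a minimal illustration, not the project's actual code: the prompt wording and field names are assumptions, and only the simple-table (Markdown) case is shown.

```python
def to_instruction_sample(table_rows, question, answer, cot=None):
    """Standardize one table-QA example into an instruction-following
    record, in the spirit of the Alpaca-CoT format.

    Illustrative sketch: the field names and prompt wording are
    assumptions, not the project's exact schema.
    """
    # Render the table as Markdown (the simple-table case; tables with
    # merged cells would need HTML or cell splitting instead).
    header, *body = table_rows
    repr_lines = ["| " + " | ".join(map(str, header)) + " |",
                  "|" + "---|" * len(header)]
    repr_lines += ["| " + " | ".join(map(str, r)) + " |" for r in body]
    table_repr = "\n".join(repr_lines)

    # Prepend the chain-of-thought rationale when the source dataset has one.
    output = f"{cot} So the answer is: {answer}" if cot else answer

    return {
        "input": ("Answer the question based on the table.\n"
                  f"{table_repr}\nQuestion: {question}"),
        "output": output,
    }
```

For example, `to_instruction_sample([["Team", "Wins"], ["Lakers", 45]], "Which team is listed?", "Lakers")` yields a record whose `input` embeds the Markdown table and whose `output` is the bare answer; passing a `cot` string instead yields a reasoning-then-answer output.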


Highlighted Details

  • Data Updates (2024-04-22): Includes expanded instruction templates, new datasets for table-text generation (RotoWire, WikiBIO) and basic structure understanding (TSR, TCE, TCR, RCE), and JSON output requirements for specific tasks.
  • Table Representation: Explores and utilizes Markdown for simple tables and HTML for complex tables with merged cells, with a fallback to splitting merged cells for Markdown.
  • Sample Format: Adopts a comprehensive JSON format including input, output, table_rows, table_repr, table_repr_type, table_type, and task_type.
  • Fine-tuned Models: Offers LoRA weights for fine-tuned models on Table Question Answering (TQA) datasets.
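A concrete sample under the schema listed above might look like the following. The field names come from the project's described format; the values and the exact type strings (`"markdown"`, `"simple"`, `"TQA"`) are illustrative assumptions.

```python
import json

# Field names follow the sample format described above; the concrete
# values and type strings are made-up assumptions for illustration.
sample = {
    "input": "Answer the question based on the table: which team has the most wins?",
    "output": "Lakers",
    "table_rows": [["Team", "Wins"], ["Lakers", 45], ["Bulls", 38]],
    "table_repr": "| Team | Wins |\n|---|---|\n| Lakers | 45 |\n| Bulls | 38 |",
    "table_repr_type": "markdown",
    "table_type": "simple",
    "task_type": "TQA",
}
print(json.dumps(sample, indent=2))
```

Keeping both the raw `table_rows` and the serialized `table_repr` lets downstream code re-render the table in another format (e.g. HTML for merged cells) without re-parsing the prompt.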

Maintenance & Community

  • Activity: Project publicly released on 2023-05-05, with significant data updates in April 2024.
  • Community: WeChat group available (invite required). Encourages issues and feedback.

Licensing & Compatibility

  • License: Not explicitly stated in the README. The base project Alpaca-CoT uses the Apache 2.0 license.
  • Compatibility: Data and models are intended for research and open-source LLM enhancement.

Limitations & Caveats

Some datasets are still in the older format (2023-05-08), and not all task/dataset combinations have updated sample counts. The project primarily focuses on text-based table representation, acknowledging document intelligence as an alternative for tables embedded within broader document contexts.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 39 stars in the last 90 days
