Tabular-LLM by SpursGoZmy

Tabular LLM: LLM fine-tuning for table understanding

Created 2 years ago
616 stars

Top 53.4% on SourcePulse

View on GitHub
Project Summary

This project aims to build large language models specifically for tabular intelligence tasks by collecting and formatting open-source datasets for instruction fine-tuning. It targets researchers and practitioners looking to enhance LLMs' capabilities in understanding and processing tabular data, offering a unified platform and curated datasets for tasks like question answering and text generation.

How It Works

The project leverages the Alpaca-CoT framework for instruction fine-tuning LLMs. It standardizes diverse tabular datasets into an instruction-following format, incorporating Chain-of-Thought (CoT) reasoning where available. The approach focuses on enhancing LLMs' comprehension of various table structures and tasks, with a commitment to open-sourcing processed data and fine-tuned models.
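The conversion step can be sketched as follows. This is a minimal sketch, not the project's actual code: the `instruction`/`input`/`output` field names follow the Alpaca convention used by Alpaca-CoT, while the helper function and the example table are hypothetical.

```python
# Sketch: wrap a raw table-QA example into an Alpaca-style
# instruction-tuning sample, with an optional Chain-of-Thought rationale.
# Helper name and example data are illustrative, not from the project.

def to_instruction_sample(table_md, question, answer, cot=None):
    """Build an instruction-following record from a table + question."""
    output = answer if cot is None else f"{cot} So the answer is: {answer}."
    return {
        "instruction": "Answer the question based on the given table.",
        "input": f"Table:\n{table_md}\nQuestion: {question}",
        "output": output,
    }

table = "| Player | Points |\n| --- | --- |\n| Ann | 12 |\n| Bob | 9 |"
sample = to_instruction_sample(
    table,
    question="Who scored more points?",
    answer="Ann",
    cot="Ann scored 12 points and Bob scored 9; 12 > 9.",
)
```

When a dataset ships with rationales, they slot into the `cot` argument; otherwise the output is just the bare answer.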

Quick Start & Requirements

Highlighted Details

  • Data Updates (2024-04-22): Includes expanded instruction templates, new datasets for table-text generation (RotoWire, WikiBIO) and basic structure understanding (TSR, TCE, TCR, RCE), and JSON output requirements for specific tasks.
  • Table Representation: Uses Markdown for simple tables and HTML for complex tables with merged cells; alternatively, merged cells can be split so the table still fits Markdown.
  • Sample Format: Adopts a comprehensive JSON format including input, output, table_rows, table_repr, table_repr_type, table_type, and task_type.
  • Fine-tuned Models: Offers LoRA weights for fine-tuned models on Table Question Answering (TQA) datasets.
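Under these conventions, a single sample might be assembled as below. The JSON keys are the ones listed above; the rendering helper, the example values, and the `table_type` label are illustrative assumptions, not taken from the project.

```python
# Sketch: assemble one sample in the project's described JSON format.
# The rows-to-Markdown helper and the concrete values are hypothetical.
import json

def rows_to_markdown(rows):
    """Render a list-of-lists table as Markdown (first row = header)."""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)

rows = [["City", "Population"], ["Oslo", "0.7M"], ["Bergen", "0.3M"]]
sample = {
    "input": "Which city has the larger population?",
    "output": "Oslo",
    "table_rows": rows,
    "table_repr": rows_to_markdown(rows),
    "table_repr_type": "markdown",  # "html" for tables with merged cells
    "table_type": "simple",         # hypothetical label
    "task_type": "TQA",
}
print(json.dumps(sample, ensure_ascii=False, indent=2))
```

Keeping both the raw `table_rows` and the serialized `table_repr` lets downstream code re-render the same table in a different representation without re-parsing.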

Maintenance & Community

  • Activity: Project publicly released on 2023-05-05, with significant data updates in April 2024.
  • Community: WeChat group available (invite required). Encourages issues and feedback.

Licensing & Compatibility

  • License: Not explicitly stated in the README. The base project Alpaca-CoT uses the Apache 2.0 license.
  • Compatibility: Data and models are intended for research and open-source LLM enhancement.

Limitations & Caveats

Some datasets remain in the older format (2023-05-08), and updated sample counts are not available for every task/dataset combination. The project focuses on text-based table representations, noting document intelligence as an alternative for tables embedded in broader document contexts.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Andreas Jansson (cofounder of Replicate).

natural-sql by cfahlgren1

  • 866 stars
  • Text-to-SQL LLMs with strong performance
  • Created 1 year ago; updated 1 year ago
  • Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Jeff Hammerbacher (cofounder of Cloudera), and 1 more.

tapas by google-research

  • 1k stars
  • Table QA models for end-to-end neural table-text understanding
  • Created 5 years ago; updated 1 year ago