Tabular-LLM by SpursGoZmy

Tabular LLM: LLM fine-tuning for table understanding

created 2 years ago
605 stars

Top 54.9% on sourcepulse

Project Summary

This project collects and formats open-source datasets for instruction fine-tuning, with the goal of building large language models specialized for tabular intelligence tasks. It targets researchers and practitioners who want to strengthen LLMs' ability to understand and process tabular data, offering a unified platform and curated datasets for tasks such as table question answering and table-to-text generation.

How It Works

The project leverages the Alpaca-CoT framework for instruction fine-tuning LLMs. It standardizes diverse tabular datasets into an instruction-following format, incorporating Chain-of-Thought (CoT) reasoning where available. The approach focuses on enhancing LLMs' comprehension of various table structures and tasks, with a commitment to open-sourcing processed data and fine-tuned models.
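The standardization step described above can be sketched as a small conversion function. This is a minimal illustration, not the project's actual code: the prompt wording and field names are assumptions, and only the simple-table (Markdown) case is shown.

```python
def to_instruction_sample(table_rows, question, answer, cot=None):
    """Standardize one table-QA example into an instruction-following
    record, in the spirit of the Alpaca-CoT format.

    Illustrative sketch: the field names and prompt wording are
    assumptions, not the project's exact schema.
    """
    # Render the table as Markdown (the simple-table case; tables with
    # merged cells would need HTML or cell splitting instead).
    header, *body = table_rows
    repr_lines = ["| " + " | ".join(map(str, header)) + " |",
                  "|" + "---|" * len(header)]
    repr_lines += ["| " + " | ".join(map(str, r)) + " |" for r in body]
    table_repr = "\n".join(repr_lines)

    # Prepend the chain-of-thought rationale when the source dataset has one.
    output = f"{cot} So the answer is: {answer}" if cot else answer

    return {
        "input": ("Answer the question based on the table.\n"
                  f"{table_repr}\nQuestion: {question}"),
        "output": output,
    }
```

For example, `to_instruction_sample([["Team", "Wins"], ["Lakers", 45]], "Which team is listed?", "Lakers")` yields a record whose `input` embeds the Markdown table and whose `output` is the bare answer; passing a `cot` string instead yields a reasoning-then-answer output.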


Highlighted Details

  • Data Updates (2024-04-22): Includes expanded instruction templates, new datasets for table-text generation (RotoWire, WikiBIO) and basic structure understanding (TSR, TCE, TCR, RCE), and JSON output requirements for specific tasks.
  • Table Representation: Explores and utilizes Markdown for simple tables and HTML for complex tables with merged cells, with a fallback to splitting merged cells for Markdown.
  • Sample Format: Adopts a comprehensive JSON format including input, output, table_rows, table_repr, table_repr_type, table_type, and task_type.
  • Fine-tuned Models: Offers LoRA weights for fine-tuned models on Table Question Answering (TQA) datasets.
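A concrete sample under the schema listed above might look like the following. The field names come from the project's described format; the values and the exact type strings (`"markdown"`, `"simple"`, `"TQA"`) are illustrative assumptions.

```python
import json

# Field names follow the sample format described above; the concrete
# values and type strings are made-up assumptions for illustration.
sample = {
    "input": "Answer the question based on the table: which team has the most wins?",
    "output": "Lakers",
    "table_rows": [["Team", "Wins"], ["Lakers", 45], ["Bulls", 38]],
    "table_repr": "| Team | Wins |\n|---|---|\n| Lakers | 45 |\n| Bulls | 38 |",
    "table_repr_type": "markdown",
    "table_type": "simple",
    "task_type": "TQA",
}
print(json.dumps(sample, indent=2))
```

Keeping both the raw `table_rows` and the serialized `table_repr` lets downstream code re-render the table in another format (e.g. HTML for merged cells) without re-parsing the prompt.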

Maintenance & Community

  • Activity: Project publicly released on 2023-05-05, with significant data updates in April 2024.
  • Community: WeChat group available (invite required). Encourages issues and feedback.

Licensing & Compatibility

  • License: Not explicitly stated in the README. The base project Alpaca-CoT uses the Apache 2.0 license.
  • Compatibility: Data and models are intended for research and open-source LLM enhancement.

Limitations & Caveats

Some datasets are still in the older format (2023-05-08), and not all task/dataset combinations have updated sample counts. The project primarily focuses on text-based table representation, acknowledging document intelligence as an alternative for tables embedded within broader document contexts.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 39 stars in the last 90 days
