DeepAnalyze  by ruc-datalab

Autonomous data science powered by an agentic LLM

Created 3 months ago
3,448 stars

Top 14.0% on SourcePulse

GitHubView on GitHub
Project Summary

Summary DeepAnalyze presents itself as the first agentic Large Language Model (LLM) designed for autonomous data science. It aims to automate the entire data science pipeline, from data preparation and analysis to modeling, visualization, and report generation, enabling open-ended data research across diverse data formats without human intervention. This project targets users seeking an automated data analysis assistant capable of producing analyst-grade research reports.

How It Works The core of DeepAnalyze is its agentic LLM architecture, which autonomously executes complex data science tasks. It supports a broad spectrum of data sources, including structured (Databases, CSV, Excel), semi-structured (JSON, XML, YAML), and unstructured (TXT, Markdown) data. This approach allows for end-to-end data processing and deep research, culminating in comprehensive reports, thereby streamlining the data science workflow.

Quick Start & Requirements To deploy locally, users must first create a Python 3.12 environment (e.g., using conda create -n deepanalyze python=3.12 -y). After activating the environment (conda activate deepanalyze), install core dependencies via pip install -r requirements.txt, ensuring torch==2.6.0, transformers==4.53.2, and vllm==0.8.5 are met. For training custom models, additional pip install -e . commands are required within specific subdirectories (deepanalyze/ms-swift/ and deepanalyze/SkyRL/). The demo interface can be launched by navigating to demo/chat, running npm install, and then executing bash start.sh. Interaction is available via a web browser at http://localhost:4000. An OpenAI-style API can be started using python demo/backend.py.

Highlighted Details

  • End-to-End Automation: Capable of autonomously handling the entire data science lifecycle, including preparation, analysis, modeling, visualization, and report generation.
  • Versatile Data Handling: Supports deep research and analysis across structured, semi-structured, and unstructured data formats.
  • Fully Open-Source: The project provides open access to its model, code, training data, and demo, facilitating deployment and extension.
  • API Access: Offers an OpenAI-style API for programmatic integration.

Maintenance & Community The project welcomes contributions, with useful issues and pull requests being incorporated into the contributor list. For inquiries, users can contact zhangshaolei98@ruc.edu.cn. No specific community channels (e.g., Discord, Slack) or roadmap links are provided in the README.

Licensing & Compatibility The provided README does not specify a software license. This absence of explicit licensing information is a significant blocker for determining commercial use, derivative works, and overall compatibility with other projects.

Limitations & Caveats The user interface for the demo is noted as an initial version, with an invitation for further development. A critical limitation for adoption is the absence of a stated software license, preventing clear understanding of usage rights and restrictions.

Health Check
Last Commit

15 hours ago

Responsiveness

Inactive

Pull Requests (30d)
8
Issues (30d)
8
Star History
389 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen Chip Huyen(Author of "AI Engineering", "Designing Machine Learning Systems"), Nicolas Camara Nicolas Camara(Cofounder of Firecrawl), and
1 more.

fire-enrich by firecrawl

0.9%
1k
AI-powered data enrichment from email lists
Created 7 months ago
Updated 3 months ago
Feedback? Help us improve.