seatunnel by apache

High-performance multimodal data integration

Created 8 years ago
9,285 stars

Top 5.6% on SourcePulse

Project Summary

Apache SeaTunnel is a high-performance, distributed, multimodal data integration platform built to synchronize very large volumes of data every day. It targets data engineers and developers who work with diverse data sources and complex synchronization scenarios, and it emphasizes efficient resource utilization and data quality monitoring.

How It Works

SeaTunnel employs a distributed snapshot algorithm for data consistency and supports multiple execution engines including its native Zeta Engine, Apache Spark, and Apache Flink. It features JDBC multiplexing and log parsing for efficient multi-table and database synchronization, enabling high throughput and low latency. The platform supports batch-stream integration and offers over 100 connectors for various data sources, sinks, and transformations.
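
As a rough illustration of the snapshot idea described above, the following Python sketch runs a toy source, transform, and sink loop that periodically records a checkpoint (the next source offset plus the rows the sink has committed) and resumes from it after a simulated failure. This is not SeaTunnel's API or its actual algorithm: every name here (`Checkpoint`, `run_pipeline`, `checkpoint_every`, `fail_at`) is hypothetical, and a real distributed snapshot coordinates many parallel tasks rather than a single loop.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Checkpoint:
    """Hypothetical snapshot: next source offset plus rows the sink has committed."""
    offset: int = 0
    committed: List[str] = field(default_factory=list)


def run_pipeline(
    source: List[str],
    transform: Callable[[str], str],
    checkpoint: Checkpoint,
    checkpoint_every: int = 2,
    fail_at: Optional[int] = None,
) -> None:
    """Process rows starting at the checkpointed offset, committing in batches.

    Rows transformed after the last snapshot are only held in `pending`, so a
    failure discards them and a restart replays them from the saved offset.
    """
    pending: List[str] = []
    offset = checkpoint.offset
    while offset < len(source):
        if fail_at is not None and offset == fail_at:
            # Simulated crash: everything in `pending` is lost.
            raise RuntimeError(f"simulated failure at offset {offset}")
        pending.append(transform(source[offset]))
        offset += 1
        if len(pending) >= checkpoint_every:
            # Snapshot: commit the batch and advance the offset together.
            checkpoint.committed.extend(pending)
            checkpoint.offset = offset
            pending = []
    # Final snapshot covers any trailing partial batch.
    checkpoint.committed.extend(pending)
    checkpoint.offset = offset


rows = ["a", "b", "c", "d", "e"]
state = Checkpoint()
try:
    run_pipeline(rows, str.upper, state, fail_at=3)
except RuntimeError:
    pass  # state still holds the last snapshot (offset 2, two committed rows)

run_pipeline(rows, str.upper, state)  # resume from the snapshot
print(state.committed)  # ['A', 'B', 'C', 'D', 'E']
```

The row at offset 2 is processed twice (once before the crash, once after the restart) but committed only once, which is the property a snapshot-based design aims for: no lost rows and no duplicates in the sink.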

Quick Start & Requirements

  • Download SeaTunnel from the Official Website.
  • Choose an execution engine (Zeta Engine, Spark, or Flink) before running jobs.
  • Refer to the Installation Guide for detailed setup.

Highlighted Details

  • Supports integration of video, images, and binary files alongside structured and unstructured text data.
  • Offers over 100 connectors and is actively expanding its ecosystem.
  • Provides two job development methods: coding and visual management via the SeaTunnel Web Project.
  • Used by companies like Weibo, Tencent Cloud, and Sina.

Maintenance & Community

  • Active community with a Slack channel available: SeaTunnel Slack.
  • Contributions are welcome via the GitHub repository.
  • Contact via mailing list: dev@seatunnel.apache.org.

Licensing & Compatibility

  • Licensed under the Apache 2.0 License, permitting commercial use.

Limitations & Caveats

  • Although the platform supports multimodal data, detailed instructions for video, image, and binary file integration are kept in separate documentation.

Health Check

  • Last commit: 19 hours ago
  • Responsiveness: Inactive
  • Pull requests (30d): 125
  • Issues (30d): 79

Star History

  • 88 stars in the last 30 days
