tsv-utils by eBay

CLI tools for large tabular data files: filtering, statistics, sampling, joins, and more

created 9 years ago
1,447 stars

Top 28.9% on sourcepulse

Project Summary

This toolkit is a suite of command-line utilities for manipulating large tabular data files of the kind common in data mining and machine learning workflows. It targets data scientists and engineers who need to filter, sample, summarize, and join datasets that are too large for comfortable in-memory processing but do not yet require a distributed system. The primary benefit is speed: the project reports significantly faster execution than traditional Unix tools and other specialized toolkits.

How It Works

The utilities are implemented in the D programming language, leveraging its performance characteristics and metaprogramming capabilities. Each tool is a standalone executable that follows Unix pipeline conventions: it reads from standard input or from files and writes to standard output. Key design choices include UTF-8 and Unicode support across all operations, field identification by name or by number, and customizable delimiters. Builds can optionally use Link Time Optimization (LTO) and Profile Guided Optimization (PGO) for additional performance.
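The pipeline and field-addressing conventions described above can be illustrated with a minimal Python sketch. It is not the project's D implementation; the helper names `resolve_field` and `filter_rows` are hypothetical, and the sketch assumes 1-based field numbers and a header line whenever a field is given by name.

```python
import io

def resolve_field(spec, header):
    """Map a field spec (1-based number or header name) to a 0-based index."""
    if spec.isdigit():
        index = int(spec) - 1
        if index < 0:
            raise ValueError("field numbers start at 1")
        return index
    if header is None:
        raise ValueError("field names require a header line")
    return header.index(spec)  # raises ValueError if the name is unknown

def filter_rows(lines, field, predicate, delimiter="\t", has_header=True):
    """Stream rows one at a time, keeping those whose field passes predicate.

    `lines` can be any iterable of text lines, including sys.stdin, so memory
    use does not grow with the size of the input.
    """
    lines = iter(lines)
    header = None
    if has_header:
        first = next(lines, None)
        if first is None:
            return
        header = first.rstrip("\n").split(delimiter)
        yield first.rstrip("\n")  # pass the header through unchanged
    index = resolve_field(field, header)
    for line in lines:
        row = line.rstrip("\n")
        if predicate(row.split(delimiter)[index]):
            yield row

if __name__ == "__main__":
    # Stand-in for standard input; replace with sys.stdin in a real pipeline.
    sample = io.StringIO("name\tscore\nada\t91\nbob\t47\ncy\t78\n")
    for row in filter_rows(sample, "score", lambda value: float(value) > 50):
        print(row)
```

Passing `"2"` instead of `"score"` selects the same column by number, and `delimiter=","` with `has_header=False` handles a headerless comma-separated stream.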

Quick Start & Requirements

Highlighted Details

  • Runs significantly faster than comparable tools, according to the project's published benchmarks.
  • Identifies fields by name (when the file has a header line) or by number, with customizable delimiters.
  • Includes advanced features like Bernoulli and distinct sampling, and hash semi-joins.
  • tsv-pretty provides aligned, human-readable output for terminal display.
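The two sampling modes named in the list above differ in what gets sampled: Bernoulli sampling decides independently for each row, while distinct sampling decides once per key value and keeps or drops every row sharing that key. The following Python sketch shows one common way to realize both. It is an illustration under stated assumptions, not the toolkit's code: the function names are hypothetical, and hashing the key with SHA-256 is a choice made here for the example.

```python
import hashlib
import random

def bernoulli_sample(rows, prob, seed=None):
    """Keep each row independently with probability `prob`."""
    rng = random.Random(seed)
    for row in rows:
        if rng.random() < prob:
            yield row

def key_selected(key, prob, salt=""):
    """Decide deterministically whether a key falls inside the sampled fraction.

    The key is hashed to a 64-bit integer; the key is selected when that
    integer lies in the lowest `prob` share of the 64-bit range.
    """
    digest = hashlib.sha256((salt + key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") < prob * 2**64

def distinct_sample(rows, key_index, prob, delimiter="\t", salt=""):
    """Keep all rows for roughly a `prob` share of the distinct key values."""
    for row in rows:
        key = row.split(delimiter)[key_index]
        if key_selected(key, prob, salt):
            yield row

if __name__ == "__main__":
    rows = ["u1\tclick", "u2\tview", "u1\tbuy", "u3\tview", "u2\tclick"]
    print(list(bernoulli_sample(rows, 0.5, seed=42)))
    print(list(distinct_sample(rows, 0, 0.5)))
```

Because the decision in `distinct_sample` depends only on the key, no per-key state is stored, and a user's rows are never split between the kept and dropped sets. Changing `salt` selects a different subset of keys.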

Maintenance & Community

The project is maintained by eBay. Release notes and version history are available on the GitHub releases page.

Licensing & Compatibility

The project is licensed under the Apache License 2.0, which permits commercial use and linking with closed-source projects.

Limitations & Caveats

tsv-uniq and tsv-join are memory-bound: performance may degrade once the set of unique entries (tsv-uniq) or the filter file (tsv-join) exceeds approximately 10 million lines. There is no native Windows support; Windows users are directed to WSL or Docker to run the Linux tools.
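The memory limitation follows from how hash-based deduplication and semi-joins generally work: the keys seen so far, or all keys from the filter file, are held in an in-memory set while the data is streamed past. The Python sketch below illustrates the general technique under that assumption; it is not the toolkit's implementation, and the names `semi_join` and `unique_rows` are hypothetical.

```python
def semi_join(data_rows, filter_rows, data_key, filter_key, delimiter="\t"):
    """Keep data rows whose key appears in the filter rows (hash semi-join).

    Every filter key is loaded into a set up front, so memory grows with the
    number of distinct keys in the filter input, not with the data size.
    """
    keys = {row.split(delimiter)[filter_key] for row in filter_rows}
    for row in data_rows:
        if row.split(delimiter)[data_key] in keys:
            yield row

def unique_rows(rows, key_index, delimiter="\t"):
    """Keep the first row seen for each key.

    The `seen` set gains one entry per distinct key, which is why very large
    numbers of unique entries eventually strain memory.
    """
    seen = set()
    for row in rows:
        key = row.split(delimiter)[key_index]
        if key not in seen:
            seen.add(key)
            yield row

if __name__ == "__main__":
    data = ["1\tapple", "2\tpear", "3\tplum", "2\tpear again"]
    wanted = ["2", "3"]
    print(list(semi_join(data, wanted, data_key=0, filter_key=0)))
    print(list(unique_rows(data, key_index=0)))
```

In both functions the data rows themselves are streamed and never stored; only the keys are retained.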

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 5 stars in the last 90 days
