tsv-utils by eBay

CLI tools for large tabular data files: filtering, statistics, sampling, joins, and more

Created 9 years ago

1,467 stars

Top 27.7% on SourcePulse

5 Experts Love This Project

Jeff Hammerbacher

Cofounder of Cloudera

Joe Walnes

Head of Experimental Projects at Stripe

Project Summary

This toolkit provides a suite of command-line utilities for manipulating large tabular data files of the kind common in data mining and machine learning workflows. It targets data scientists and engineers who need to filter, sample, summarize, and join datasets that are too large for in-memory processing but not large enough to require a distributed system. The main benefit claimed is significantly faster execution than traditional Unix tools and other specialized toolkits.

How It Works

The utilities are implemented in the D programming language, leveraging its performance characteristics and metaprogramming capabilities. Each tool is a standalone executable that follows Unix pipeline conventions: it reads from standard input or from files and writes to standard output. Key design choices include UTF-8 support in all operations, field identification by name or number, and customizable delimiters. Performance can be improved further by compiling with Link Time Optimization (LTO) and Profile Guided Optimization (PGO), both of which are optional.
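
The pipeline and field-selection conventions described above can be illustrated with a short Python sketch. This is not how tsv-utils is implemented (the tools are written in D); the function names `select_field` and `run` are hypothetical and exist only to show the idea of reading delimited lines from a stream, picking a field by number or by header name, and writing the result to another stream.

```python
import sys
from typing import Iterable, Iterator, TextIO, Union


def select_field(lines: Iterable[str], field: Union[int, str],
                 delimiter: str = "\t", header: bool = False) -> Iterator[str]:
    """Yield one field from each delimited line.

    `field` is a 1-based field number, or a column name when `header` is True.
    Hypothetical helper for illustration; not part of tsv-utils.
    """
    rows = (line.rstrip("\n").split(delimiter) for line in lines)
    if header:
        names = next(rows, None)
        if names is None:
            return  # empty input
        # Resolve a column name to its position using the header line.
        index = names.index(field) if isinstance(field, str) else field - 1
        yield names[index]
    else:
        if isinstance(field, str):
            raise ValueError("field names require a header line")
        index = field - 1
    for row in rows:
        yield row[index]


def run(field: Union[int, str], source: TextIO = sys.stdin,
        sink: TextIO = sys.stdout, **options) -> None:
    """Read from `source` and write the selected field to `sink`, pipeline style."""
    for value in select_field(source, field, **options):
        sink.write(value + "\n")
```

Because `run` defaults to standard input and standard output, a script that calls it can sit in the middle of a shell pipeline, which is the convention the tools themselves follow.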

Quick Start & Requirements

Installation: Prebuilt binaries are available for Linux and macOS. Alternatively, build from source with make using a D compiler (DMD 2.088.1+ or LDC 1.18.0+), or install via DUB (dub fetch tsv-utils).
Prerequisites: A D compiler is required for building from source.
Resources: Building with LTO/PGO is recommended for optimal performance.
Documentation: Tools Reference: https://github.com/eBay/tsv-utils/blob/main/ToolsReference.md, Tips and Tricks: https://github.com/eBay/tsv-utils/blob/main/TipsAndTricks.md

Highlighted Details

Offers significantly faster performance than comparable tools, with benchmarks available.
Supports field identification by name (with header support) and number, with customizable delimiters.
Includes advanced features like Bernoulli and distinct sampling, and hash semi-joins.
tsv-pretty provides aligned, human-readable output for terminal display.
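
For readers unfamiliar with the sampling and join terms in the list above, the following Python sketch shows one common way to realize each idea. It is an illustration under stated assumptions, not the algorithms tsv-utils actually uses: the function names are hypothetical, and the use of CRC32 to map keys to a bucket is a choice made here for simplicity.

```python
import random
import zlib
from typing import Iterable, Iterator, Optional


def key_of(line: str, key_field: int, delimiter: str = "\t") -> str:
    """Return the 1-based `key_field` of a delimited line."""
    return line.rstrip("\n").split(delimiter)[key_field - 1]


def bernoulli_sample(lines: Iterable[str], prob: float,
                     seed: Optional[int] = None) -> Iterator[str]:
    """Keep each line independently with probability `prob`."""
    rng = random.Random(seed)
    for line in lines:
        if rng.random() < prob:
            yield line


def distinct_sample(lines: Iterable[str], prob: float, key_field: int,
                    delimiter: str = "\t") -> Iterator[str]:
    """Keep or drop all lines sharing a key together.

    Each key is hashed to a value in [0, 1); lines whose key falls below
    `prob` are kept. CRC32 is an assumption made for this sketch.
    """
    for line in lines:
        key = key_of(line, key_field, delimiter)
        bucket = zlib.crc32(key.encode("utf-8")) / 2**32
        if bucket < prob:
            yield line


def semi_join(filter_lines: Iterable[str], data_lines: Iterable[str],
              key_field: int, delimiter: str = "\t") -> Iterator[str]:
    """Yield data lines whose key appears in the filter input.

    The filter keys are loaded into an in-memory hash set first, then the
    data is streamed past it. No fields from the filter are appended.
    """
    keys = {key_of(line, key_field, delimiter) for line in filter_lines}
    for line in data_lines:
        if key_of(line, key_field, delimiter) in keys:
            yield line
```

The semi-join keeps the whole filter key set in memory while the data side is streamed, so memory use in this sketch depends on the size of the filter input rather than the size of the data.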

Maintenance & Community

The project is maintained by eBay, although the most recent commit is about three years old and the repository is currently rated inactive. Release notes and version history are available on the GitHub releases page.

Licensing & Compatibility

The project is licensed under the Boost Software License 1.0, which permits commercial use and linking with closed-source projects.

Limitations & Caveats

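tsv-uniq and tsv-join are memory-bound: tsv-uniq holds the unique entries it has seen in memory, and tsv-join holds its filter file in memory. Performance may degrade once the unique entries or the filter file exceed approximately 10 million lines. Windows is not supported natively; Windows users are directed to run the Linux builds under WSL or Docker.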
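
The memory caveat follows from how streaming deduplication generally works. The Python sketch below is a minimal illustration of that general technique, not the tsv-uniq implementation; `unique_lines` is a hypothetical name. It shows why memory use scales with the number of distinct entries rather than with the total number of input lines.

```python
from typing import Iterable, Iterator, Optional


def unique_lines(lines: Iterable[str], key_field: Optional[int] = None,
                 delimiter: str = "\t") -> Iterator[str]:
    """Yield the first line seen for each key, preserving input order.

    With `key_field=None` the whole line is the key. The `seen` set grows
    by one entry per distinct key, which is what bounds this approach.
    """
    seen = set()
    for line in lines:
        text = line.rstrip("\n")
        key = text if key_field is None else text.split(delimiter)[key_field - 1]
        if key not in seen:
            seen.add(key)
            yield line
```

An input with a billion lines but only a few thousand distinct keys stays cheap here, while an input whose lines are almost all distinct forces the set to hold nearly every key.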

Health Check

Last Commit

3 years ago

Responsiveness

Inactive

Star History

5 stars in the last 30 days