CLI tools for large tabular data files: filtering, statistics, sampling, joins, and more
This toolkit provides a suite of command-line utilities for efficient manipulation of large, tabular data files, commonly found in data mining and machine learning workflows. It targets data scientists and engineers who need to perform operations like filtering, sampling, statistics, and joins on datasets that are too large for in-memory processing but not yet requiring distributed systems. The primary benefit is significantly faster execution compared to traditional Unix tools or other specialized libraries.
How It Works
The utilities are implemented in the D programming language, leveraging its performance characteristics and metaprogramming capabilities. They are designed as standalone executables that follow Unix pipeline conventions, reading from standard input or files and writing to standard output. Key design choices include UTF-8 support for all operations, Unicode readiness, and the ability to identify fields by name or number, with customizable delimiters. Performance is further enhanced through optional Link Time Optimization (LTO) and Profile Guided Optimization (PGO) during compilation.
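The pipeline conventions and named-field addressing described above can be sketched as follows. This is a minimal example, assuming the tsv-utils binaries (tsv-filter, tsv-select) are on PATH; the sample file and its field names are made up for illustration.

```shell
# Create a small made-up TSV file with a header row.
printf 'name\tweight\ncat\t4\ndog\t20\nhorse\t500\n' > animals.tsv

# Guarded so the sketch is safe to run even where tsv-utils is not installed.
if command -v tsv-filter >/dev/null 2>&1; then
  # -H treats line 1 as a header, letting fields be referenced by name
  # rather than by number; output follows Unix pipeline conventions.
  tsv-filter -H --ge weight:10 animals.tsv | tsv-select -H -f name
fi
```

The same commands read from standard input when no file argument is given, so they compose with other Unix tools in a pipeline.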
Quick Start & Requirements
Build from source with make (a D compiler is required), or install via DUB (dub fetch tsv-utils).
Highlighted Details
tsv-pretty provides aligned, human-readable output for terminal display.
Maintenance & Community
The project is actively maintained by eBay. Release notes and version history are available on the GitHub releases page.
Licensing & Compatibility
The project is licensed under the Apache License 2.0, which permits commercial use and linking with closed-source projects.
Limitations & Caveats
tsv-uniq and tsv-join have memory limitations: filter files or unique-entry sets exceeding approximately 10 million lines may degrade performance. Windows users are directed to WSL or Docker for Linux compatibility.