qsv by dathere

CLI tool for blazing-fast CSV data-wrangling

Created 4 years ago
3,156 stars

Top 15.3% on SourcePulse

Project Summary

qsv is a command-line data wrangling toolkit designed for blazing-fast processing of tabular data. It offers a comprehensive suite of commands for querying, transforming, analyzing, and validating CSV and other file formats, targeting data analysts and engineers who need efficient data manipulation capabilities.

How It Works

qsv is built in Rust, prioritizing speed and memory efficiency. It leverages multithreading extensively, especially when an index is available, and employs streaming algorithms for most operations to handle arbitrarily large files. Key features include an optional indexing mechanism for constant-time random access, support for various data formats beyond CSV (like Parquet, JSON, Excel), and integration with Luau and Python for complex data pipelines.
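The streaming idea mentioned above can be illustrated with a small, generic sketch. This is not qsv's implementation (qsv is written in Rust); it is a Python illustration, using Welford's one-pass algorithm, of how a statistic can be computed over rows one at a time so that memory use does not grow with file size. The function name streaming_stats and the sample data are hypothetical.

```python
import csv
import io
import math


def streaming_stats(rows, column):
    """Compute count, mean, and sample standard deviation in a single pass.

    Only three running numbers are kept in memory (Welford's algorithm),
    so the input can be arbitrarily long.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for row in rows:
        value = float(row[column])
        count += 1
        delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)
    stddev = math.sqrt(m2 / (count - 1)) if count > 1 else 0.0
    return count, mean, stddev


# Hypothetical sample data; a real file would be opened with open(path, newline="").
sample = io.StringIO("city,population\nA,10\nB,20\nC,30\n")
count, mean, stddev = streaming_stats(csv.DictReader(sample), "population")
print(count, mean, stddev)
```

Because csv.DictReader yields one row at a time, the whole file is never held in memory; only the running totals are.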

Quick Start & Requirements

  • Installation: Prebuilt binaries are available for Linux, macOS, and Windows. qsv can also be installed through package managers (Homebrew, Scoop, Nixpkgs, etc.) or built from source with: cargo install qsv --locked --features all_features
  • Prerequisites: Rust toolchain for source builds. Optional Python 3.8+ for the py command.
  • CPU Compatibility: Prebuilt binaries are compiled with CPU-specific optimizations (SSE4.2, AVX2, AVX512, ARM64 NEON), so they may not run on older CPUs. Portable variants are available for those machines.
  • Documentation: qsv.dathere.com

Highlighted Details

  • Performance: Claims up to 360,000 geocodes/sec and 780,000 rows/sec for validation. Benchmarks are available at qsv.dathere.com/benchmarks.
  • Indexing: Creates a persistent index for constant-time random access and accelerated operations.
  • Extensibility: Supports Luau and Python scripting for custom data transformations and pipelines.
  • Format Support: Handles CSV, TSV, SSV, JSON, JSONL, Parquet, Arrow IPC, Avro, Excel (.xls, .xlsx) and OpenDocument (.ods) spreadsheets, PostgreSQL, and SQLite.
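To make the indexing point concrete, here is a minimal Python sketch of the general technique: record the byte offset where each record starts, then seek directly to any record. It is an illustration only and does not reflect qsv's actual index file format. It assumes one record per line (no newlines inside quoted fields), and the helper names build_offset_index and read_record are hypothetical.

```python
import csv
import io


def build_offset_index(f):
    """Return the byte offset at which each data record starts.

    Assumes one record per line (no newlines inside quoted fields);
    this simplification is for illustration only.
    """
    offsets = []
    f.seek(0)
    f.readline()  # skip the header row
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        offsets.append(pos)
    return offsets


def read_record(f, offsets, i):
    """Jump straight to record i using the index, without scanning earlier rows."""
    f.seek(offsets[i])
    line = f.readline().decode("utf-8")
    return next(csv.reader([line]))


# Hypothetical in-memory file; a real one would be opened with open(path, "rb").
data = io.BytesIO(b"id,name\n1,ada\n2,grace\n3,linus\n")
index = build_offset_index(data)
print(len(index))                   # the record count falls out of the index
print(read_record(data, index, 2))  # fetch the third record directly
```

Building the index costs one pass over the file, after which each lookup is a single seek. An index like this also lets a file be split into independent ranges for parallel workers.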

Maintenance & Community

The project is sponsored by datHere. Community interaction is facilitated through GitHub discussions.

Licensing & Compatibility

Dual-licensed under either the MIT License or the Unlicense, which permits commercial use and integration with closed-source projects.

Limitations & Caveats

Some commands (marked with 🤯) load the entire CSV into memory, though external-memory variants exist for some of them (for example, extsort and extdedup). The luau feature may not be available in the musl prebuilt binaries; using it on a musl-based distro requires compiling from source. The CPU optimizations in the prebuilt binaries may cause problems on older CPUs.
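The difference between an in-memory command and an external variant can be sketched with a generic external merge sort. The Python below illustrates the general technique under stated assumptions, not how qsv implements its commands: input lines are newline-terminated, at most chunk_size lines are held in memory at once, and the function name external_sort is hypothetical.

```python
import heapq
import itertools
import tempfile


def external_sort(lines, chunk_size):
    """Sort an iterable of newline-terminated text lines using bounded memory.

    At most chunk_size lines are held in memory at a time; sorted chunks
    are spilled to temporary files and then merged lazily.
    """
    chunk_files = []
    it = iter(lines)
    while True:
        chunk = sorted(itertools.islice(it, chunk_size))
        if not chunk:
            break
        tmp = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
        tmp.writelines(chunk)
        tmp.seek(0)
        chunk_files.append(tmp)
    try:
        # heapq.merge pulls one line at a time from each sorted chunk file.
        yield from heapq.merge(*chunk_files)
    finally:
        for tmp in chunk_files:
            tmp.close()


# Hypothetical sample input; a real caller would pass an open text file.
rows = ["pear\n", "apple\n", "fig\n", "kiwi\n", "date\n"]
result = list(external_sort(rows, chunk_size=2))
print(result)
```

The trade-off is extra disk I/O in exchange for a memory ceiling set by chunk_size rather than by the size of the input.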

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 67
  • Issues (30d): 31
  • Star History: 99 stars in the last 30 days
