qsv  by dathere

CLI tool for blazing-fast CSV data-wrangling

created 4 years ago
3,034 stars

Top 16.1% on sourcepulse

Project Summary

qsv is a command-line data wrangling toolkit designed for blazing-fast processing of tabular data. It offers a comprehensive suite of commands for querying, transforming, analyzing, and validating CSV and other file formats, targeting data analysts and engineers who need efficient data manipulation capabilities.

How It Works

qsv is built in Rust, prioritizing speed and memory efficiency. It leverages multithreading extensively, especially when an index is available, and employs streaming algorithms for most operations to handle arbitrarily large files. Key features include an optional indexing mechanism for constant-time random access, support for various data formats beyond CSV (like Parquet, JSON, Excel), and integration with Luau and Python for complex data pipelines.
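
The idea behind an index for constant-time random access can be illustrated with a minimal Python sketch. This is not qsv's implementation or its index format. It is a generic byte-offset index, and it assumes a UTF-8 file in which no quoted field contains an embedded newline. The function names build_index and read_record are hypothetical.

```python
import csv


def build_index(path):
    """Return the byte offset of every data record (header excluded).

    Assumption: no quoted field contains a newline, so one physical
    line equals one CSV record.
    """
    offsets = []
    with open(path, "rb") as f:
        f.readline()  # skip the header line
        while True:
            pos = f.tell()
            if not f.readline():
                break  # end of file
            offsets.append(pos)
    return offsets


def read_record(path, offsets, i):
    """Fetch record i by seeking straight to its offset, without scanning."""
    with open(path, "rb") as f:
        f.seek(offsets[i])
        line = f.readline().decode("utf-8")
    # Parse the single line with the csv module so quoting is respected.
    return next(csv.reader([line]))
```

Building the index costs one pass over the file. After that, fetching any record is a single seek plus one line read, regardless of how far into the file the record sits. Known record boundaries are also what would let a file be split into chunks for parallel workers.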

Quick Start & Requirements

  • Installation: Prebuilt binaries are available for Linux, macOS, and Windows. qsv can also be installed through package managers (Homebrew, Scoop, Nixpkgs, etc.) or built from source with cargo install qsv --locked --features all_features.
  • Prerequisites: A Rust toolchain for source builds. Python 3.8+ is optional and only needed for the py command.
  • Resources: Prebuilt binaries are compiled with CPU optimizations (SSE4.2, AVX2, AVX512, ARM64 NEON), so they may not run on older CPUs. Portable variants are available.
  • Documentation: qsv.dathere.com

Highlighted Details

  • Performance: The project claims throughput of up to 360,000 geocodes/sec and up to 780,000 rows/sec for validation. Benchmarks are available at qsv.dathere.com/benchmarks.
  • Indexing: Creates a persistent index for constant-time random access and accelerated operations.
  • Extensibility: Supports Luau and Python scripting for custom data transformations and pipelines.
  • Format Support: Handles CSV, TSV, SSV, JSON, JSONL, Parquet, Arrow IPC, Avro, Excel (.xls, .xlsx, .ods), PostgreSQL, and SQLite.

Maintenance & Community

The project is sponsored by datHere. Community interaction is facilitated through GitHub discussions.

Licensing & Compatibility

Dual-licensed under the MIT License or the Unlicense, which allows commercial use and integration with closed-source projects.

Limitations & Caveats

Some commands (marked with 🤯) load the entire CSV into memory, though external variants exist. The luau feature may not be available in the musl prebuilt binaries; using it on a musl-based distro requires compiling from source. The CPU optimizations in prebuilt binaries may cause issues on older CPUs.
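
The difference between a streaming command and one that loads the whole file can be sketched in Python. This is an illustration of the general trade-off only, not qsv code. The function names streaming_sum and sorted_rows are hypothetical, and the input is assumed to be a CSV file with a header row.

```python
import csv


def streaming_sum(path, column):
    """Sum a numeric column while holding only one row in memory at a time."""
    total = 0.0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total += float(row[column])
    return total


def sorted_rows(path, column):
    """Sort rows by a column (string comparison).

    No row can be emitted until every row has been read, so the
    whole file ends up in memory.
    """
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    return sorted(rows, key=lambda r: r[column])
```

The first function uses the same amount of memory for a ten-row file as for a ten-billion-row file. The second grows with the input, which is why operations of this kind typically need an external (disk-backed) variant for very large files.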

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull requests (30d): 38
  • Issues (30d): 8
  • Star history: 268 stars in the last 90 days