balance by facebookresearch

Balance biased data samples for accurate inference

Created 3 years ago
741 stars

Top 46.5% on SourcePulse

Summary

The balance Python package offers a straightforward workflow and methods for addressing biased data samples, particularly relevant for survey statistics and observational studies. It enables users to infer from non-representative samples to a target population by mitigating non-response and sampling biases using auxiliary information. This benefits researchers and data scientists who need to correct for selection bias in their data, improving the reliability of their inferences.

How It Works

balance operates by fitting and evaluating weights for each sample unit, where a weight signifies the number of target population individuals a sample respondent represents. The core workflow involves loading sample and population data, diagnosing covariate distributions, adjusting the sample to match population characteristics using methods like Inverse Probability Weighting (IPW) under the Missing At Random (MAR) assumption, and evaluating the effectiveness of the adjustment through various diagnostics.
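
To make the meaning of a weight concrete, here is a minimal sketch of inverse probability weighting for a single categorical covariate. It is not the package's implementation: the propensity is estimated from simple group counts rather than a fitted regression model, and the function name `ipw_weights` and the toy data are hypothetical.

```python
from collections import Counter


def ipw_weights(sample_groups, target_groups):
    """Weight each sample unit by the odds of belonging to the target.

    For one categorical covariate, the propensity of being in the sample
    given group g is p(g) = n_sample(g) / (n_sample(g) + n_target(g)),
    so the odds (1 - p) / p reduce to n_target(g) / n_sample(g).
    """
    n_sample = Counter(sample_groups)
    n_target = Counter(target_groups)
    weights = []
    for g in sample_groups:
        p = n_sample[g] / (n_sample[g] + n_target[g])
        weights.append((1 - p) / p)
    return weights


# Hypothetical toy data: the sample over-represents the "young" group.
sample_groups = ["young"] * 8 + ["old"] * 2
target_groups = ["young"] * 500 + ["old"] * 500

weights = ipw_weights(sample_groups, target_groups)

print(weights[0])    # weight of a "young" respondent
print(weights[-1])   # weight of an "old" respondent
print(sum(weights))  # total weight across the sample
```

In this toy case each "young" respondent stands for roughly 62.5 target individuals and each "old" respondent for roughly 250, so the weights sum to the target size of 1000. Under MAR, the same logic extends to several covariates once the propensity comes from a fitted model.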

Quick Start & Requirements

  • Installation: Recommended via PyPI: python -m pip install balance. Install from source with python -m pip install git+https://github.com/facebookresearch/balance.git.
  • Python Version: 3.9 - 3.14.
  • Operating System: Linux, macOS, Windows.
  • Key Dependencies: NumPy, pandas, SciPy, scikit-learn (version depends on the Python version), IPython, Patsy, Seaborn, Plotly, Matplotlib, Statsmodels, session-info.
  • Documentation: Guides covering the General Framework, Pre-Adjustment Diagnostics, Adjusting Sample to Population, and Evaluating and using the adjustment weights, plus tutorials.

Highlighted Details

  • Adjustment Methods: Supports Inverse Probability Weighting via logistic regression with L1 (LASSO) regularization, Covariate Balancing Propensity Score (CBPS), Post-stratification, and Raking.
  • Diagnostic Tools: Comprehensive evaluation includes plots (density, QQ), statistical summaries, weight distributions, Kish's design effect, and Absolute Standardized Mean Difference (ASMD) for continuous and categorical variables.
  • Status: Currently in beta, actively supported.
  • Research Link: DOI: 10.48550/arXiv.2307.06024.
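
Two of the diagnostics listed above can be computed by hand, as in the sketch below. It rests on stated assumptions rather than the package's own code: Kish's design effect is taken as n * sum(w^2) / (sum(w))^2, and ASMD divides the absolute difference in means by the target's population standard deviation, which is one common convention. All function names and data are hypothetical.

```python
from math import sqrt


def kish_design_effect(weights):
    """Kish's design effect: n * sum(w^2) / (sum(w))^2."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def asmd(sample_values, sample_weights, target_values):
    """Absolute standardized mean difference for one numeric covariate.

    Assumption: standardize by the (unweighted) target standard deviation.
    """
    target_mean = sum(target_values) / len(target_values)
    target_var = sum((v - target_mean) ** 2 for v in target_values) / len(target_values)
    sample_mean = weighted_mean(sample_values, sample_weights)
    return abs(sample_mean - target_mean) / sqrt(target_var)


# Hypothetical toy data: the sample skews young relative to the target.
sample_ages = [20, 22, 25, 30, 60]
target_ages = [20, 30, 40, 50, 60]
unit_weights = [1.0] * 5                       # before adjustment
adjusted_weights = [1.0, 1.0, 1.0, 1.0, 4.0]   # up-weights the oldest unit

asmd_before = asmd(sample_ages, unit_weights, target_ages)
asmd_after = asmd(sample_ages, adjusted_weights, target_ages)
deff = kish_design_effect(adjusted_weights)

print(asmd_before, asmd_after)         # imbalance before and after weighting
print(deff, len(sample_ages) / deff)   # design effect and effective sample size
```

The two numbers pull in opposite directions: the adjusted weights bring the sample mean closer to the target, lowering ASMD, while their unevenness raises the design effect above 1 and shrinks the effective sample size.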

Maintenance & Community

The package is actively maintained by Facebook Research's Central Applied Science team and key contributors like Tal Sarig and Tal Galili. Support, bug reports, and feature suggestions are handled via GitHub issues.

Licensing & Compatibility

Licensed under the permissive MIT license, which allows use in commercial and closed-source projects. Documentation is under CC-BY.

Limitations & Caveats

The package is currently in beta, so its interface and behavior may still change. Its bias correction relies on the Missing At Random (MAR) assumption: the adjustment can only remove bias that is explained by the auxiliary covariates supplied.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 37
  • Issues (30d): 3
  • Star History: 4 stars in the last 30 days
