balance by facebookresearch

Balance biased data samples for accurate inference

Created 3 years ago

750 stars

Top 45.5% on SourcePulse

2 Experts Love This Project

David Cournapeau

Author of scikit-learn

Luis Capelo

Cofounder of Lightning AI

Project Summary

The balance Python package provides a simple workflow and a set of methods for dealing with biased data samples, with a focus on survey statistics and observational studies. Using auxiliary information, it corrects for non-response and sampling bias so that conclusions drawn from a non-representative sample can be generalized to a target population. It is aimed at researchers and data scientists who need to correct for selection bias and make their inferences more reliable.

How It Works

balance fits and evaluates a weight for each sample unit; a weight is the number of target-population individuals that the respondent represents. The core workflow has four steps: load the sample and target population data, diagnose differences in covariate distributions between the two, adjust the sample to match the population (for example with Inverse Probability Weighting (IPW) under the Missing At Random (MAR) assumption), and evaluate the adjustment with a set of diagnostics.
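
To make the meaning of a weight concrete, here is a minimal sketch that uses simple cell-based post-stratification rather than IPW. It is not the balance API: the function name `poststratify` and the toy data are hypothetical, and only the Python standard library is used. Each respondent's weight is the target count of their cell divided by the sample count of that cell.

```python
from collections import Counter


def poststratify(sample_cells, target_cells):
    """Return one weight per sample unit: target count / sample count of its cell.

    Hypothetical helper for illustration only; not part of the balance package.
    """
    n_sample = Counter(sample_cells)
    n_target = Counter(target_cells)
    # A target cell with no respondents cannot be represented by any weight.
    missing = set(n_target) - set(n_sample)
    if missing:
        raise ValueError(f"cells absent from the sample: {sorted(missing)}")
    return [n_target[cell] / n_sample[cell] for cell in sample_cells]


# Toy data (assumed): the sample over-represents the "young" cell.
sample_cells = ["young"] * 8 + ["old"] * 2
target_cells = ["young"] * 500 + ["old"] * 500

weights = poststratify(sample_cells, target_cells)
print(weights[0], weights[-1])  # 62.5 250.0
print(sum(weights))             # 1000.0
```

Each "young" respondent stands for 62.5 people in the target and each "old" respondent for 250, so the weights sum to the target size of 1000. Model-based methods such as IPW pursue the same goal when there are too many covariates to cross into cells.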

Quick Start & Requirements

Installation: Recommended via PyPI: python -m pip install balance. Install from source with python -m pip install git+https://github.com/facebookresearch/balance.git.
Python Version: 3.9 - 3.14.
Operating System: Linux, macOS, Windows.
Key Dependencies: NumPy, pandas, SciPy, scikit-learn (the required version depends on the Python version), IPython, Patsy, Seaborn, Plotly, Matplotlib, Statsmodels, session-info.
Documentation: General Framework, Pre-Adjustment Diagnostics, Adjusting Sample to Population, Evaluating and using the adjustment weights, tutorials.

Highlighted Details

Adjustment Methods: Logistic regression with L1 (LASSO) penalization, Covariate Balancing Propensity Score (CBPS), post-stratification, and raking.
Diagnostic Tools: Comprehensive evaluation includes plots (density, QQ), statistical summaries, weight distributions, Kish's design effect, and Absolute Standardized Mean Difference (ASMD) for continuous and categorical variables.
Status: Currently in beta, actively supported.
Research Link: DOI: 10.48550/arXiv.2307.06024.
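
Two of the numeric diagnostics listed above, Kish's design effect and ASMD for a continuous covariate, can be sketched in a few lines. This is an illustration under stated assumptions, not the package's implementation: the function names and toy data are hypothetical, and standardizing by the target population's standard deviation is a choice made for this sketch.

```python
import math


def kish_design_effect(weights):
    """Kish's design effect: n * sum(w^2) / (sum(w))^2; equals 1 for equal weights."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def asmd(sample_values, sample_weights, target_values):
    """Absolute difference between the weighted sample mean and the target mean,
    divided by the target standard deviation (an assumption of this sketch)."""
    target_mean = sum(target_values) / len(target_values)
    target_var = sum((v - target_mean) ** 2 for v in target_values) / len(target_values)
    diff = weighted_mean(sample_values, sample_weights) - target_mean
    return abs(diff) / math.sqrt(target_var)


# Toy data (assumed): the sample has too few 60-year-olds relative to the target.
sample_age = [20, 20, 20, 60]
target_age = [20, 20, 60, 60]
unit_weights = [1.0, 1.0, 1.0, 1.0]
adjusted_weights = [1.0, 1.0, 1.0, 3.0]  # up-weight the under-represented unit

print(asmd(sample_age, unit_weights, target_age))      # 0.5
print(asmd(sample_age, adjusted_weights, target_age))  # 0.0
print(kish_design_effect(unit_weights))                # 1.0
print(round(kish_design_effect(adjusted_weights), 3))  # 1.333
```

The adjusted weights remove the imbalance in age (ASMD drops from 0.5 to 0) at the cost of a design effect of about 1.33, which corresponds to an effective sample size of 3 instead of 4. Reading the two diagnostics together shows the trade-off between bias reduction and variance inflation.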

Maintenance & Community

The package is actively maintained by Facebook Research's Central Applied Science team, with key contributors including Tal Sarig and Tal Galili. Support requests, bug reports, and feature suggestions are handled through GitHub issues.

Licensing & Compatibility

Licensed under the permissive MIT license, which allows commercial use and inclusion in closed-source projects. The documentation is licensed under CC-BY.

Limitations & Caveats

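The package is in beta, so its interface may still change. Bias correction relies on the Missing At Random (MAR) assumption: if participation in the sample depends on factors that the available covariates do not capture, the fitted weights may not remove the bias.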

Health Check

Last Commit

4 days ago

Responsiveness

Inactive

Star History

5 stars in the last 30 days