Tencent2020_Rank1st by guoday

Algorithm code for predicting user age/gender from ad click history

created 5 years ago
1,067 stars


Project Summary

This repository contains the 1st-place solution to the 2020 Tencent Advertising Algorithm Competition. It addresses the problem of predicting user demographics (age and gender) from historical ad click data, and may serve as a reference for participants in similar data science competitions.

How It Works

The approach combines Word2Vec embeddings with a pre-trained BERT model. Each user's click history is processed into a sequence of ad interactions; Word2Vec supplies static embeddings for that sequence, while BERT provides contextual representations of the same interactions. Both are fed into a single model, with the aim of capturing user behavior patterns rich enough for accurate demographic prediction.
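As a rough sketch of the fusion idea (all names and dimensions below are illustrative, not taken from the repository), each ad ID in a click sequence is looked up in a pre-trained Word2Vec table, a contextual encoder produces token-level hidden states (stood in for here by random values in place of real BERT output), and the two are concatenated and mean-pooled into one user-level feature vector:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; the real dimensions come from the repo's configs.
vocab_size, w2v_dim, bert_dim = 1000, 64, 128

# Stand-in for a pre-trained Word2Vec lookup table (one row per ad ID).
w2v_table = rng.normal(size=(vocab_size, w2v_dim))

def encode_user(click_seq, bert_hidden):
    """Fuse static Word2Vec vectors with contextual hidden states.

    click_seq   : list of ad-ID ints, length L
    bert_hidden : (L, bert_dim) array standing in for BERT's last hidden layer
    returns     : (w2v_dim + bert_dim,) mean-pooled user feature vector
    """
    w2v_vecs = w2v_table[np.asarray(click_seq)]        # (L, w2v_dim)
    fused = np.concatenate([w2v_vecs, bert_hidden], 1)  # (L, w2v_dim + bert_dim)
    return fused.mean(axis=0)                           # pool over the sequence

clicks = [12, 7, 401, 7]                                # toy click history
hidden = rng.normal(size=(len(clicks), bert_dim))       # placeholder BERT output
features = encode_user(clicks, hidden)
print(features.shape)                                   # (192,)
```

The pooled vector would then feed a classification head for age and gender; the actual solution's architecture and pooling are defined by its training scripts.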

Quick Start & Requirements

  • Install: pip install transformers==2.8.0 pandas gensim scikit-learn filelock gdown
  • Prerequisites: Linux Ubuntu 16.04, 256GB RAM, 4x P100 GPUs.
  • Data Download: Requires downloading a dataset from a provided Google Drive link and unzipping it into a data directory.
  • Pre-trained Models: Requires downloading pre-trained Word2Vec and BERT models (available in small, base, large, and XL variants) and placing them in the data and BERT directories, respectively.
  • Full Process: Execute bash run.sh to run the entire pipeline.

Highlighted Details

  • Achieved 1st place in the 2020 Tencent Advertising Algorithm Competition.
  • Supports multiple pre-trained BERT model sizes (small, base, large, XL) for experimentation.
  • Includes scripts for data preprocessing, feature extraction, Word2Vec pre-training, BERT pre-training, model training with k-fold cross-validation, and submission merging.
  • Offers suggestions for running on lower-resource configurations by using only initial competition data or smaller BERT models.
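The k-fold training and submission merging mentioned above follow the standard out-of-fold pattern: each fold's model predicts both its held-out slice (for validation) and the test set, and the test predictions are averaged into one submission. A minimal sketch, where the fold count, data, and trivial "model" are placeholders rather than the repository's actual trainer:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, k = 100, 20, 5

X = rng.normal(size=(n_train, 8))
y = rng.integers(0, 2, size=n_train).astype(float)
X_test = rng.normal(size=(n_test, 8))

def fit_predict(X_tr, y_tr, X_te):
    # Placeholder "model": predict the training-fold mean label.
    return np.full(len(X_te), y_tr.mean())

folds = np.array_split(rng.permutation(n_train), k)
oof = np.zeros(n_train)        # out-of-fold predictions for validation
test_pred = np.zeros(n_test)   # averaged test predictions ("submission merging")

for val_idx in folds:
    tr_mask = np.ones(n_train, dtype=bool)
    tr_mask[val_idx] = False
    oof[val_idx] = fit_predict(X[tr_mask], y[tr_mask], X[val_idx])
    test_pred += fit_predict(X[tr_mask], y[tr_mask], X_test) / k

print(test_pred.shape)  # (20,)
```

Averaging the per-fold test predictions is what "merging" submissions amounts to here; the repository's scripts do this with real fold models rather than a mean-label baseline.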

Maintenance & Community

The repository is maintained by guoday. No specific community channels or roadmap are indicated in the README.

Licensing & Compatibility

The README does not state a license. The code was released for the Tencent competition; without an explicit license, commercial use or linking with closed-source projects should be clarified with the author first.

Limitations & Caveats

The setup targets specific hardware (Ubuntu 16.04, 256GB RAM, 4x P100 GPUs) and an older library stack (transformers==2.8.0), which may pose challenges in modern environments. The BERT pre-training steps are resource-intensive.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 24 stars in the last 90 days
