web-scraping  by je-suis-tm

Python scripts for web scraping various websites

created 7 years ago
804 stars

Top 44.8% on sourcepulse

Project Summary

This repository provides a guide and practical Python scripts for web scraping, targeting both financial data from exchanges and alternative data from news outlets. It is aimed at anyone learning or implementing web scraping, from beginners to advanced users, and offers ready-to-use scrapers alongside explanations of the core methods.

How It Works

The project covers fundamental web scraping techniques including parsing HTML structures with BeautifulSoup, extracting data from JSON responses, and utilizing regular expressions for pattern matching. It progresses to more advanced topics like handling website sign-ins (including CSRF tokens), integrating with databases (SQLite), and building automated newsletters. The approach emphasizes practical application and problem-solving, such as dealing with dynamic websites and proxy authentication.
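Two of the techniques named above, JSON extraction and regular-expression matching, can be illustrated with a minimal sketch. It is not taken from the repository: the sample payload, the HTML fragment, the field names (`quotes`, `symbol`, `last`), and the `/news/<date>/<slug>` URL layout are all hypothetical, and the inputs are inline strings rather than live responses so the example runs offline.

```python
import json
import re

# Hypothetical payload standing in for the body of an exchange's JSON endpoint.
SAMPLE_JSON = (
    '{"quotes": ['
    '{"symbol": "CL", "last": "78.42"}, '
    '{"symbol": "GC", "last": "-"}'
    ']}'
)

# Hypothetical HTML fragment standing in for a news index page.
SAMPLE_HTML = (
    '<a href="/news/2024-01-05/oil-rises">Oil rises</a>'
    '<a href="/about">About</a>'
    '<a href="/news/2024-01-06/gold-falls">Gold falls</a>'
)

# Assumed URL layout: /news/<YYYY-MM-DD>/<slug>. Real sites will differ.
ARTICLE_LINK = re.compile(r'href="(/news/(\d{4}-\d{2}-\d{2})/[^"]+)"')


def parse_quotes(raw):
    """Return (symbol, last_price) pairs, skipping rows without a numeric price."""
    rows = []
    for item in json.loads(raw)["quotes"]:
        try:
            rows.append((item["symbol"], float(item["last"])))
        except ValueError:
            # Placeholder values such as "-" are common when no trade occurred.
            continue
    return rows


def find_article_links(html):
    """Return (path, date) tuples for every link matching the assumed layout."""
    return ARTICLE_LINK.findall(html)


if __name__ == "__main__":
    print(parse_quotes(SAMPLE_JSON))
    print(find_article_links(SAMPLE_HTML))
```

In a real scraper the two sample strings would be replaced by the text of an HTTP response, for example one fetched with `requests`. A regular expression is reasonable for a narrow, stable pattern like the one assumed here, while an HTML parser such as BeautifulSoup is usually the safer choice for nested markup.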

Quick Start & Requirements

  • Install the dependencies with: pip install requests beautifulsoup4 pandas
  • Requires Python 3.x.
  • Official documentation and examples are available within the repository.

Highlighted Details

  • Covers scraping from diverse sources: Reddit WallStreetBets, CME, US Treasury, MacroTrends, BBC, WSJ, Bloomberg, and more.
  • Explains techniques for handling dynamic content, authentication, and data storage.
  • Includes practical notes on proxy authentication and legal considerations for web scraping.
  • Offers a structured learning path from beginner to advanced scraping methods.
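The data-storage step mentioned in the list above might look like the following sketch, which uses Python's built-in `sqlite3` module. The table name `headlines`, its columns, and the helper `save_headlines` are hypothetical and not taken from the repository. Using the URL as the primary key is one assumed way to keep repeated scraper runs from storing the same article twice.

```python
import sqlite3


def save_headlines(conn, rows):
    """Insert (url, title, source) rows, ignoring URLs that are already stored.

    Returns the number of rows that were actually inserted.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS headlines ("
        "url TEXT PRIMARY KEY, title TEXT NOT NULL, source TEXT NOT NULL)"
    )
    before = conn.total_changes
    # Parameter placeholders (?) keep scraped text from being read as SQL.
    conn.executemany(
        "INSERT OR IGNORE INTO headlines (url, title, source) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn.total_changes - before


if __name__ == "__main__":
    # An in-memory database keeps the sketch self-contained.
    connection = sqlite3.connect(":memory:")
    first_run = [
        ("https://example.com/a", "Oil rises", "BBC"),
        ("https://example.com/b", "Gold falls", "WSJ"),
    ]
    second_run = [
        ("https://example.com/b", "Gold falls", "WSJ"),
        ("https://example.com/c", "Rates hold", "BBC"),
    ]
    print(save_headlines(connection, first_run))
    print(save_headlines(connection, second_run))
```

Passing a file path to `sqlite3.connect` instead of `":memory:"` persists the table between runs, which is what a scheduled scraper or a newsletter job would typically need.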

Maintenance & Community

The repository has 804 stars, which indicates continued interest, but the health check reports the last commit as 3 years ago and responsiveness as inactive. Contributor or community links (such as Discord or Slack) are not mentioned in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the provided README. Users should verify licensing for commercial use or integration into closed-source projects.

Limitations & Caveats

The README notes that some older scripts (like CME1) may no longer work due to website changes, highlighting the dynamic nature of web scraping targets. It also mentions that handling CAPTCHAs is outside the scope of the provided examples.
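One way to make such breakage visible, rather than letting a scraper silently return empty or misaligned data, is to validate the parsed structure and raise a descriptive error. The sketch below is an assumption about how that could be done, not the repository's approach. It uses the standard library's `html.parser` in place of BeautifulSoup so that it runs without third-party packages, and the names `CellCollector`, `LayoutChangedError`, and `parse_price_table` are hypothetical.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


class LayoutChangedError(RuntimeError):
    """Raised when the page no longer matches the structure the scraper expects."""


def parse_price_table(html, expected_columns=2):
    """Return data rows from the page, or raise if the layout looks different."""
    parser = CellCollector()
    parser.feed(html)
    # Header rows contain only <th> cells, so they end up empty and are dropped.
    rows = [row for row in parser.rows if row]
    if not rows or any(len(row) != expected_columns for row in rows):
        raise LayoutChangedError(
            f"expected rows of {expected_columns} cells, got {rows!r}"
        )
    return rows


if __name__ == "__main__":
    page = (
        "<table><tr><th>Contract</th><th>Settle</th></tr>"
        "<tr><td>CLF4</td><td>78.42</td></tr></table>"
    )
    print(parse_price_table(page))
```

A check like this turns a redesigned page into an immediate, explainable failure, which is easier to diagnose than a downstream error caused by missing columns.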

Health Check

  • Last commit: 3 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 26 stars in the last 90 days
