Buscalicor Scraper Network

Buscalicor is a modular, extensible network of Python web scrapers that extracts product data from multiple Spanish-language e-commerce sites using headless browsers.

Features

  • Multi-site, multi-product scraping
  • JavaScript rendering with Playwright (Puppeteer/context7 MCP support planned)
  • Configurable extraction via YAML files per site
  • Modular design: core engine, per-site logic, config loader, browser manager
  • Schema validation (Pydantic)
  • Deduplication and rate limiting
  • Extensible for new sites and selectors
  • Robust error handling and logging
  • Unit and integration tests
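
As a rough illustration of the deduplication and rate-limiting features listed above, the following sketch uses only the standard library. The names `RateLimiter`, `product_key`, and `deduplicate`, as well as the choice of `site`, `url`, and `name` as identifying fields, are assumptions made for this example and may not match the project's actual code.

```python
import hashlib
import time


class RateLimiter:
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock   # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._last = None     # time at which the previous request was released

    def wait(self):
        """Sleep if the previous request was too recent; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay


def product_key(product):
    """Fingerprint a product from the fields assumed to identify it."""
    raw = "|".join(
        str(product.get(field, "")).strip().lower()
        for field in ("site", "url", "name")  # assumed identifying fields
    )
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def deduplicate(products):
    """Keep the first occurrence of each product, preserving input order."""
    seen = set()
    unique = []
    for product in products:
        key = product_key(product)
        if key not in seen:
            seen.add(key)
            unique.append(product)
    return unique
```

A scraper loop would call `limiter.wait()` before each page request and pass the collected items through `deduplicate()` before saving them.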

Directory Structure

buscalicor/
├── main.py
├── base_scraper.py
├── scrapers/
│   ├── laeuropea_scraper.py
│   ├── ribasmith_scraper.py
│   └── thewinery_scraper.py
├── core/
│   ├── engine.py
│   ├── browser.py
│   └── config_loader.py
├── configs/
│   ├── laeuropea.yaml
│   ├── ribasmith.yaml
│   └── thewinery.yaml
├── tests/
│   ├── test_schema_validation.py
│   └── ...
├── data/
├── logs/
├── requirements.txt
└── README.md

Usage

  1. Install dependencies:
    pip install -r requirements.txt
    playwright install
    
  2. Run a scraper (example):
    python main.py laeuropea --max-pages 3
    
  3. Scraped data is saved to data/, and logs are written to logs/.
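
The command in step 2 suggests a command-line interface with a positional site name and a `--max-pages` option. A minimal sketch of such a parser, assuming `argparse` is used, is shown below. The site names come from the configs/ directory; the default of 1 page and the validation rule are assumptions for illustration, not the project's documented behavior.

```python
import argparse

# Site names taken from the YAML files under configs/.
KNOWN_SITES = ("laeuropea", "ribasmith", "thewinery")


def build_parser():
    """Build a parser for: python main.py <site> [--max-pages N]."""
    parser = argparse.ArgumentParser(
        prog="main.py", description="Run the scraper for one site."
    )
    parser.add_argument("site", choices=KNOWN_SITES, help="site to scrape")
    parser.add_argument(
        "--max-pages",
        type=int,
        default=1,  # assumed default; the real CLI may differ
        help="maximum number of listing pages to visit",
    )
    return parser


def parse_args(argv=None):
    """Parse arguments and reject non-positive page limits."""
    args = build_parser().parse_args(argv)
    if args.max_pages < 1:
        raise SystemExit("--max-pages must be at least 1")
    return args
```

With this sketch, `parse_args(["laeuropea", "--max-pages", "3"])` yields a namespace with `site` set to `laeuropea` and `max_pages` set to 3.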

Adding a New Site

  • Create a YAML config in configs/ for the new site.
  • Optionally, add a new scraper class in scrapers/ if custom logic is needed.
  • Update the CLI (main.py) or the engine (core/engine.py) if the new site has to be registered there.
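
The steps above can be sketched as follows. This is an illustrative outline only: the config keys (`name`, `base_url`, `selectors`, `max_pages`), the `SiteConfig` dataclass, and the `register`/`scraper_for` registry are hypothetical and are not taken from the project's code. The `raw` dictionary stands in for whatever the config loader returns after parsing a YAML file.

```python
from dataclasses import dataclass, field

REQUIRED_KEYS = ("name", "base_url", "selectors")  # assumed config schema


@dataclass
class SiteConfig:
    """Typed view of one per-site config (hypothetical field names)."""
    name: str
    base_url: str
    selectors: dict = field(default_factory=dict)
    max_pages: int = 1


def build_config(raw):
    """Validate a parsed config mapping and convert it to a SiteConfig."""
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise ValueError("config is missing required keys: " + ", ".join(missing))
    return SiteConfig(
        name=raw["name"],
        base_url=raw["base_url"],
        selectors=dict(raw["selectors"]),
        max_pages=int(raw.get("max_pages", 1)),
    )


class BaseScraper:
    """Stand-in for the generic, config-driven scraper."""

    def __init__(self, config):
        self.config = config


SCRAPERS = {}  # site name -> custom scraper class


def register(site_name):
    """Class decorator that registers custom logic for one site."""
    def decorator(cls):
        SCRAPERS[site_name] = cls
        return cls
    return decorator


def scraper_for(config):
    """Use the custom class if one is registered, else the generic scraper."""
    return SCRAPERS.get(config.name, BaseScraper)(config)
```

Under this design, a site that only needs selectors requires no Python code at all, while a site with unusual pagination can subclass `BaseScraper` and decorate the subclass with `@register("sitename")`.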

Testing

Run all tests with:

pytest tests/

Security & Best Practices

  • No credentials are hardcoded; use environment variables for secrets.
  • All extracted data is validated and sanitized.
  • Always respect robots.txt and site terms of service.
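
Two of the practices above lend themselves to a short sketch: reading secrets from environment variables and checking robots.txt before fetching a URL. The example uses the standard library's `urllib.robotparser`. The user agent string and the helper names are assumptions for illustration. In live use, the parser would typically be pointed at the site's robots.txt with `set_url()` and `read()` instead of being fed text directly.

```python
import os
from urllib.robotparser import RobotFileParser

USER_AGENT = "buscalicor-bot"  # hypothetical user agent string


def robots_from_text(text):
    """Build a robots.txt parser from already-downloaded file contents."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def is_allowed(parser, url, user_agent=USER_AGENT):
    """Return True if robots.txt permits this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


def get_secret(name):
    """Read a secret from the environment and fail loudly if it is absent."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError("environment variable " + name + " is not set")
    return value
```

A scraper could call `is_allowed()` for each candidate URL and skip any that the site disallows.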

Roadmap

  • Support for Puppeteer/context7 MCP
  • Database backend (MongoDB/PostgreSQL)
  • Monitoring and alerting
  • Advanced anti-bot and CAPTCHA handling