This project is a modular, extensible Python web scraping framework that extracts product data from multiple Spanish-language e-commerce sites using headless browsers.

Features

Multi-site, multi-product scraping
JavaScript rendering with Playwright (Puppeteer/context7 MCP support planned)
Configurable extraction via YAML files per site
Modular design: core engine, per-site logic, config loader, browser manager
Schema validation (Pydantic)
Deduplication and rate limiting
Extensible for new sites and selectors
Robust error handling and logging
Unit and integration tests
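
As an illustration of the deduplication and rate limiting features listed above, here is a minimal sketch using only the standard library. The names `dedupe_by_url` and `RateLimiter`, and the choice of the product URL as the deduplication key, are assumptions made for this example and are not taken from the project's code.

```python
import time
from typing import Callable, Dict, Iterable, Iterator, Optional


def dedupe_by_url(products: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Yield each product once, keyed on its URL (assumed to be the unique key)."""
    seen = set()
    for product in products:
        # Ignore surrounding whitespace and a trailing slash when comparing URLs.
        key = product["url"].strip().rstrip("/")
        if key not in seen:
            seen.add(key)
            yield product


class RateLimiter:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous request, if any

    def wait(self) -> float:
        """Sleep if the previous request was too recent; return seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay
```

A scraper loop would call `wait()` before each page request and pass the collected items through `dedupe_by_url` before saving them. The clock and sleep functions are injectable so the limiter can be tested without real delays.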

Directory Structure

```
buscalicor/
├── main.py
├── base_scraper.py
├── scrapers/
│   ├── laeuropea_scraper.py
│   ├── ribasmith_scraper.py
│   └── thewinery_scraper.py
├── core/
│   ├── engine.py
│   ├── browser.py
│   └── config_loader.py
├── configs/
│   ├── laeuropea.yaml
│   ├── ribasmith.yaml
│   └── thewinery.yaml
├── tests/
│   ├── test_schema_validation.py
│   └── ...
├── data/
├── logs/
├── requirements.txt
└── README.md
```

Usage

Install dependencies:

```
pip install -r requirements.txt
playwright install
```

Run a scraper (example):
```
python main.py laeuropea --max-pages 3
```
Scraped output is saved to data/ and logs are written to logs/.
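
For orientation, the command above could be parsed along these lines. This is a hedged sketch of an `argparse` setup, not the actual contents of main.py: the site names come from the configs/ directory listing, while `KNOWN_SITES`, `build_parser`, `parse_args`, and the default of no page limit are assumptions.

```python
import argparse
from typing import List, Optional

# Site names taken from the configs/ directory; the parser itself is hypothetical.
KNOWN_SITES = ["laeuropea", "ribasmith", "thewinery"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="main.py", description="Run one site scraper.")
    parser.add_argument("site", choices=KNOWN_SITES, help="name of the site config to use")
    parser.add_argument(
        "--max-pages",
        type=int,
        default=None,  # assumed: no page limit unless one is given
        help="stop after this many listing pages",
    )
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    # argv=None makes argparse read the real command line (sys.argv[1:]).
    return build_parser().parse_args(argv)


# Equivalent to: python main.py laeuropea --max-pages 3
args = parse_args(["laeuropea", "--max-pages", "3"])
print(args.site, args.max_pages)  # prints: laeuropea 3
```

Restricting `site` with `choices` means a mistyped site name fails immediately with a usage message instead of starting a browser.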

Adding a New Site

Create a YAML config in configs/ for the new site.
Optionally, add a new scraper class in scrapers/ if custom logic is needed.
Update the CLI or engine as needed so the new site can be selected by name (for example, as the site argument to main.py).
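
The steps above can be sketched as follows. Everything here is hypothetical: the `BaseScraper` class stands in for whatever base_scraper.py defines, the `SCRAPERS` registry and `get_scraper` helper illustrate one way the engine could look up per-site logic, and the config keys (`base_url`, `selectors`) only show the kind of data a YAML file in configs/ might load into.

```python
from typing import Any, Dict, Type

# Hypothetical shape of a loaded YAML config; the real keys may differ.
NEWSITE_CONFIG: Dict[str, Any] = {
    "base_url": "https://example.com",
    "selectors": {"name": "h1.product-title", "price": "span.price"},
}


class BaseScraper:
    """Hypothetical stand-in for the class defined in base_scraper.py."""

    def __init__(self, config: Dict[str, Any]) -> None:
        self.config = config

    def clean_price(self, raw: str) -> float:
        """Default parsing for prices such as '$1,299.50'."""
        return float(raw.replace("$", "").replace(",", "").strip())


class NewSiteScraper(BaseScraper):
    """Custom logic for a site that formats prices as '1.299,50 €' (an assumed quirk)."""

    def clean_price(self, raw: str) -> float:
        # Drop the currency sign and thousands dots, then turn the decimal comma into a dot.
        return float(raw.replace("€", "").replace(".", "").replace(",", ".").strip())


# Only sites that need custom logic are registered here.
SCRAPERS: Dict[str, Type[BaseScraper]] = {"newsite": NewSiteScraper}


def get_scraper(site: str, config: Dict[str, Any]) -> BaseScraper:
    # Fall back to the generic, config-driven scraper when no custom class exists.
    return SCRAPERS.get(site, BaseScraper)(config)
```

With a layout like this, a site that only needs different selectors requires just a new config, and a custom class is added only when parsing rules differ.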

Testing

Run all tests with:

```
pytest tests/
```

Security & Best Practices

No credentials are hardcoded; use environment variables for secrets.
All extracted data is validated against the Pydantic schema and sanitized.
Always respect robots.txt and site terms of service.
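
Two of these practices are easy to show in code. The sketch below reads a secret from an environment variable and checks a URL against robots.txt rules with the standard library's `urllib.robotparser`. The helper names `load_secret` and `is_allowed`, the user agent string, and the sample rules are assumptions for illustration only.

```python
import os
from urllib.robotparser import RobotFileParser

USER_AGENT = "buscalicor-bot"  # hypothetical user agent string


def load_secret(name: str) -> str:
    """Read a secret from the environment and fail loudly if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"environment variable {name} is not set")
    return value


def is_allowed(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules for the example; a real run would use the site's own robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /checkout/
"""

print(is_allowed(SAMPLE_ROBOTS, "https://example.com/vinos/tinto"))    # True
print(is_allowed(SAMPLE_ROBOTS, "https://example.com/checkout/cart"))  # False
```

Against a live site, `RobotFileParser.set_url()` followed by `read()` downloads the rules instead of parsing a string.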

Roadmap

Support for Puppeteer/context7 MCP
Database backend (MongoDB/PostgreSQL)
Monitoring and alerting
Advanced anti-bot and CAPTCHA handling