Buscalicor Scraper Network
This project is a modular, extensible Python web scraper network for extracting product data from multiple Spanish-language e-commerce sites using headless browsers.
Features
- Multi-site, multi-product scraping
- JavaScript rendering with Playwright (Puppeteer/context7 MCP support planned)
- Configurable extraction via YAML files per site
- Modular design: core engine, per-site logic, config loader, browser manager
- Schema validation (Pydantic)
- Deduplication and rate limiting
- Extensible for new sites and selectors
- Robust error handling and logging
- Unit and integration tests
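As an illustration of the deduplication and rate-limiting features listed above, here is a minimal standard-library sketch. The names `RateLimiter` and `deduplicate`, and the choice of `url` as the deduplication key, are assumptions made for this example and are not taken from the project's code.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


class RateLimiter:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock      # injectable so the limiter can be tested without real delays
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return the seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay


def deduplicate(products: Iterable[dict], key: str = "url") -> Iterator[dict]:
    """Yield each product once, keyed on one field; the first occurrence wins."""
    seen = set()
    for product in products:
        value = product.get(key)
        if value in seen:
            continue
        seen.add(value)
        yield product
```

In a scraping loop, `wait()` would be called before each page request, and `deduplicate` would wrap the stream of extracted products before they are written out.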
Directory Structure
buscalicor/
├── main.py
├── base_scraper.py
├── scrapers/
│   ├── laeuropea_scraper.py
│   ├── ribasmith_scraper.py
│   └── thewinery_scraper.py
├── core/
│   ├── engine.py
│   ├── browser.py
│   └── config_loader.py
├── configs/
│   ├── laeuropea.yaml
│   ├── ribasmith.yaml
│   └── thewinery.yaml
├── tests/
│   ├── test_schema_validation.py
│   └── ...
├── data/
├── logs/
├── requirements.txt
└── README.md
Usage
- Install dependencies:
  pip install -r requirements.txt
  playwright install
- Run a scraper (example):
  python main.py laeuropea --max-pages 3
- Output will be saved in data/ and logs in logs/.
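For readers curious how a command line such as `python main.py laeuropea --max-pages 3` might be parsed, here is a rough sketch using `argparse`. The site names come from the configs/ listing. The default for `--max-pages` and the validation rule are assumptions, and the real main.py may be organized differently.

```python
import argparse
from typing import Optional, Sequence

# Site names taken from the configs/ directory listing.
KNOWN_SITES = ("laeuropea", "ribasmith", "thewinery")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for: main.py <site> [--max-pages N]."""
    parser = argparse.ArgumentParser(prog="main.py", description="Run a site scraper.")
    parser.add_argument("site", choices=KNOWN_SITES, help="site to scrape")
    parser.add_argument(
        "--max-pages",
        type=int,
        default=None,  # assumption: no limit unless one is given
        help="maximum number of listing pages to visit",
    )
    return parser


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse argv (or sys.argv when None) and reject non-positive page limits."""
    args = build_parser().parse_args(argv)
    if args.max_pages is not None and args.max_pages < 1:
        raise SystemExit("--max-pages must be a positive integer")
    return args


# Equivalent to: python main.py laeuropea --max-pages 3
args = parse_args(["laeuropea", "--max-pages", "3"])
print(args.site, args.max_pages)
```

Restricting `site` with `choices` makes a mistyped site name fail immediately with a usage message, before any browser is launched.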
Adding a New Site
- Create a YAML config in configs/ for the new site.
- Optionally, add a new scraper class in scrapers/ if custom logic is needed.
- Update CLI or engine as needed.
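The optional custom scraper class could look roughly like the sketch below. Everything here is hypothetical: `BaseScraper` is a stand-in for whatever base_scraper.py actually defines, and `REGISTRY`, `register`, `get_scraper`, `parse_price`, and the site name `newsite` are names invented for illustration only.

```python
from typing import Dict, Type


class BaseScraper:
    """Hypothetical stand-in for the base class in base_scraper.py."""

    site_name = ""

    def __init__(self, config: dict) -> None:
        self.config = config  # settings loaded from the site's YAML file

    def parse_price(self, raw: str) -> float:
        """Default behaviour: the text is already a plain decimal number."""
        return float(raw.strip())


# Hypothetical registry mapping site names to scraper classes.
REGISTRY: Dict[str, Type[BaseScraper]] = {}


def register(cls: Type[BaseScraper]) -> Type[BaseScraper]:
    """Class decorator that records a scraper under its site_name."""
    REGISTRY[cls.site_name] = cls
    return cls


@register
class NewSiteScraper(BaseScraper):
    """Custom logic for a site that formats prices like "$1.234,50"."""

    site_name = "newsite"

    def parse_price(self, raw: str) -> float:
        # Drop the currency symbol and thousands dots, then turn the decimal comma into a dot.
        cleaned = raw.replace("$", "").replace(".", "").replace(",", ".").strip()
        return float(cleaned)


def get_scraper(site: str, config: dict) -> BaseScraper:
    """Use the custom class if one is registered, else the generic base class."""
    return REGISTRY.get(site, BaseScraper)(config)
```

With a pattern like this, a site that needs only selectors is covered by its YAML config and the generic class, while a site with unusual formatting overrides a single method.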
Testing
Run all tests with:
pytest tests/
Security & Best Practices
- No credentials are hardcoded; use environment variables for secrets.
- All extracted data is validated (Pydantic schemas) and sanitized.
- Always respect robots.txt and site terms of service.
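Two of the practices above can be sketched with the standard library alone: checking a URL against robots.txt rules and reading a secret from an environment variable. The user agent string `buscalicor-bot` and the helper names are assumptions for this example. The rules are parsed from a string here so the snippet runs offline, whereas a real run would fetch the site's own robots.txt.

```python
import os
from urllib.robotparser import RobotFileParser

USER_AGENT = "buscalicor-bot"  # hypothetical user agent string


def build_robots_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


def get_secret(name: str) -> str:
    """Read a secret from the environment instead of hardcoding it."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Example rules; the paths and domain are made up for illustration.
rules = build_robots_parser("User-agent: *\nDisallow: /checkout/\n")
print(is_allowed(rules, "https://example.com/vinos/tinto"))
print(is_allowed(rules, "https://example.com/checkout/cart"))
```

Failing loudly when a secret is missing surfaces misconfiguration at startup rather than partway through a scrape.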
Roadmap
- Support for Puppeteer/context7 MCP
- Database backend (MongoDB/PostgreSQL)
- Monitoring and alerting
- Advanced anti-bot and CAPTCHA handling