A distributed, scalable web scraping platform for a Fortune 15 renewable energy company's market research. Automated data collection across 41,000+ ZIP codes with intelligent job scheduling and real-time processing.
During the 2025 Winter Cohort (January–May 2025), CodeLab UC Davis developed a distributed geospatial data scraper for a Fortune 15 renewable energy company. This 8-week project created a scalable platform that enables market researchers to identify potential locations for expanding RNG (renewable natural gas) station portfolios across the continental United States.
Traditional market research required manual data collection across fragmented sources, leading to long turnaround times, inaccurate data, and reliance on hybrid SaaS solutions that were both costly and inefficient.
Built a distributed scraping infrastructure with Puppeteer-based web scrapers, RabbitMQ message queuing, FastAPI backend, and Next.js frontend. Features advanced anti-bot evasion techniques and intelligent job scheduling across 41,000+ ZIP codes.
Research that previously took hours is now completed in seconds. Custom validation logic eliminates errors at the source, and data is delivered in clean, Excel-ready formats tailored to existing workflows with significantly reduced operational costs.
Home dashboard providing users with access to recent activity, favorite jobs, and real-time status monitoring for currently running scraping instances.
Maps Scraper configuration page allowing complete control of scraper scope. Users can target specific companies or run queries on all business types within selected geographic regions.
Data preview interface displaying scraped business information with name, latitude, longitude, address, and category data in Excel-formatted tables for easy verification before download.
Complete system architecture diagram illustrating the distributed scraping infrastructure from job scheduling through RabbitMQ message broker to Puppeteer web scrapers and data management services.
Built a Next.js web application with TanStack Table for advanced data presentation and Tailwind CSS for responsive styling. The app communicates with the FastAPI backend through RESTful HTTP/HTTPS requests to drive dynamic, data-driven interfaces.
Designed for efficient, scalable data collection: a scheduling service generates job tasks and passes them to the RabbitMQ message broker, while Puppeteer-based web scrapers consume tasks in parallel across multiple nodes, dramatically reducing scraping time compared to sequential methods.
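A minimal sketch of this fan-out, assuming pika as the client library, a durable `scrape_jobs` queue, and illustrative task fields; `enqueue_zip_batch` is a hypothetical name, not the project's actual scheduler API:

```python
# Sketch of the scheduler's publish path; queue and field names are illustrative.
import json

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# Durable queue so pending ZIP-code tasks survive a broker restart.
channel.queue_declare(queue="scrape_jobs", durable=True)

def enqueue_zip_batch(zip_codes: list[str], query: str) -> None:
    """Publish one scrape task per ZIP code for the worker pool to consume."""
    for zip_code in zip_codes:
        channel.basic_publish(
            exchange="",
            routing_key="scrape_jobs",
            body=json.dumps({"zip": zip_code, "query": query}),
            # delivery_mode=2 marks each message persistent.
            properties=pika.BasicProperties(delivery_mode=2),
        )

enqueue_zip_batch(["95616", "95618"], "rng station")
connection.close()
```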
The FastAPI backend service handles all HTTP/HTTPS requests, processes user inputs, queries the PostgreSQL database via an ORM, and triggers new scraping jobs. The ZIP Data Management Service processes incoming data with pandas-based validation and Excel export optimization.
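A minimal sketch of the job-trigger path, assuming a hypothetical `/jobs` route and request model; the ORM write is noted only in a comment, and the publisher from the scheduler sketch above is stubbed:

```python
# Illustrative FastAPI job-trigger endpoint; route and model names are assumptions.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class JobRequest(BaseModel):
    query: str
    zip_codes: list[str]

def enqueue_zip_batch(zip_codes: list[str], query: str) -> None:
    """Stub for the RabbitMQ publisher shown in the scheduler sketch above."""

@app.post("/jobs")
def create_job(req: JobRequest) -> dict:
    # The real service would first persist the job row via the ORM
    # (e.g. a SQLAlchemy session) before fanning tasks out to the broker.
    enqueue_zip_batch(req.zip_codes, req.query)
    return {"status": "queued", "zip_count": len(req.zip_codes)}
```

Served with uvicorn, a single POST to `/jobs` queues one task per requested ZIP code.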
RabbitMQ-powered job distribution across multiple scraper nodes with intelligent load balancing, automatic failover, and real-time job redistribution to maintain efficiency and prevent slowdowns.
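A sketch of one worker's consume loop against the same assumed `scrape_jobs` queue: `prefetch_count=1` gives fair dispatch (each node holds at most one unacknowledged task, so faster nodes naturally take more work), and a negative acknowledgement requeues a failed task so another node picks it up; `scrape_zip` is a placeholder name:

```python
# Sketch of a scraper node's consume loop; scrape_zip is a placeholder.
import json

import pika

def scrape_zip(task: dict) -> None:
    """Placeholder for the Puppeteer scrape of one ZIP code."""

def on_task(ch, method, properties, body):
    task = json.loads(body)
    try:
        scrape_zip(task)
        ch.basic_ack(delivery_tag=method.delivery_tag)
    except Exception:
        # Failover: push the task back onto the queue for another node.
        ch.basic_nack(delivery_tag=method.delivery_tag, requeue=True)

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="scrape_jobs", durable=True)

# Fair dispatch: each worker holds at most one unacknowledged task at a
# time, which spreads the 41,000+ ZIP-code workload across the pool.
channel.basic_qos(prefetch_count=1)
channel.basic_consume(queue="scrape_jobs", on_message_callback=on_task)
channel.start_consuming()
```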
Sophisticated measures including headless detection evasion, randomized behavior patterns, user-agent spoofing, and randomized input timing to significantly increase scrape success rates across protected platforms.
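The production scrapers run Node Puppeteer; to keep every example here in one language, this sketch uses pyppeteer, a Python port with a near-identical API. The user-agent strings, target URL, and input selector are placeholders:

```python
# Sketch of the evasion techniques; UA strings, URL, and selector are placeholders.
import asyncio
import random

from pyppeteer import launch

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]

async def scrape(url: str, query: str) -> None:
    browser = await launch(
        headless=True,
        args=["--disable-blink-features=AutomationControlled"],
    )
    page = await browser.newPage()

    # User-agent spoofing: rotate a realistic UA per session.
    await page.setUserAgent(random.choice(USER_AGENTS))

    # Headless detection evasion: hide the navigator.webdriver flag.
    await page.evaluateOnNewDocument(
        "() => Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
    )

    await page.goto(url, {"waitUntil": "networkidle2"})

    # Randomized behavior: human-like pause, then per-keystroke typing delay.
    await asyncio.sleep(random.uniform(0.5, 2.0))
    await page.type("input[name='q']", query, {"delay": random.randint(60, 180)})

    await browser.close()

# Example invocation against a hypothetical search page:
# asyncio.run(scrape("https://example.com/search", "rng station"))
```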
Cloud-native design prepared for Azure Kubernetes Service deployment with containerized Puppeteer scrapers, auto-scaling capabilities, and seamless integration with existing infrastructure workflows.
Pandas-based processing with business rule validation, duplicate resolution, and structured Excel export formatting. Users receive accurate, analysis-ready files with no additional cleanup required.
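A minimal sketch of that cleanup pass, assuming rows shaped like the data preview (name, latitude, longitude, address, category); the validation bounds and dedup keys are illustrative:

```python
# Sketch of the pandas cleanup pass; bounds and dedup keys are illustrative.
import pandas as pd

def clean_and_export(rows: list[dict], path: str = "results.xlsx") -> pd.DataFrame:
    df = pd.DataFrame(rows)

    # Business-rule validation: require a name and coordinates, and keep
    # only points inside a rough continental-US bounding box.
    df = df.dropna(subset=["name", "latitude", "longitude"])
    df = df[df["latitude"].between(24.5, 49.5) & df["longitude"].between(-125.0, -66.9)]

    # Duplicate resolution: the same business surfaces in adjacent ZIP-code
    # queries, so keep one row per name + address pair.
    df = df.drop_duplicates(subset=["name", "address"])

    # Structured Excel export: analysis-ready, no manual cleanup required.
    df.to_excel(path, index=False, sheet_name="results")
    return df

clean_and_export([
    {"name": "Acme RNG", "latitude": 38.54, "longitude": -121.74,
     "address": "1 Shields Ave, Davis, CA", "category": "gas_station"},
])
```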
This 8-week distributed systems project showcased advanced full-stack development, enterprise integration, and scalable cloud architecture capabilities for a Fortune 15 client.