Geospatial Data Scraper dashboard interface showing distributed job management

Geospatial Data Scraper

A distributed, scalable web scraping platform built for a Fortune 15 renewable energy company's market research, automating data collection across 41,000+ ZIP codes with intelligent job scheduling and real-time processing.

Tech Stack

Next.js
Tailwind CSS
FastAPI
PostgreSQL
Puppeteer
RabbitMQ
Docker
Azure Kubernetes
TanStack Table
Python

Project Overview

During the 2025 Winter Cohort (January–May 2025), CodeLab UC Davis developed a distributed geospatial data scraper for a Fortune 15 renewable energy company. This 8-week project created a scalable platform that enables market researchers to identify potential locations for expanding RNG (renewable natural gas) station portfolios across the continental United States.

The Challenge

Traditional market research required manual data collection across fragmented sources, leading to long turnaround times, inaccurate data, and reliance on hybrid SaaS solutions that were both costly and inefficient.

The Solution

Built a distributed scraping infrastructure with Puppeteer-based web scrapers, RabbitMQ message queuing, a FastAPI backend, and a Next.js frontend, featuring advanced anti-bot evasion techniques and intelligent job scheduling across 41,000+ ZIP codes.

The Outcome

Research that previously took hours is now completed in seconds. Custom validation logic eliminates errors at the source, and data is delivered in clean, Excel-ready formats tailored to existing workflows with significantly reduced operational costs.

Project Visuals

Geospatial Data Scraper home dashboard showing current status, favorites, and recent activity

Home dashboard providing users with access to recent activity, favorite jobs, and real-time status monitoring for currently running scraping instances.

Maps Scraper configuration interface with company targeting and geographic scope selection

Maps Scraper configuration page allowing complete control of scraper scope. Users can target specific companies or run queries on all business types within selected geographic regions.

Data preview interface showing scraped business information in Excel-formatted tables

Data preview interface displaying scraped business information with name, latitude, longitude, address, and category data in Excel-formatted tables for easy verification before download.

System architecture diagram showing distributed scraping infrastructure flow

Complete system architecture diagram illustrating the distributed scraping infrastructure from job scheduling through RabbitMQ message broker to Puppeteer web scrapers and data management services.

Development Process

Frontend Architecture

Built a Next.js web application with TanStack Table for advanced data presentation and Tailwind CSS for responsive styling. The frontend communicates with the FastAPI backend over RESTful HTTP/HTTPS requests to drive dynamic, data-driven interfaces.

Distributed Scraping Infrastructure

Designed for efficient, scalable data collection: a scheduling service generates job tasks and passes them to a RabbitMQ message broker, and Puppeteer-based web scrapers consume those tasks in parallel across multiple nodes, dramatically reducing total scraping time compared to sequential approaches.
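
As a minimal sketch of that scheduler-to-broker handoff (queue name, payload shape, and broker host are illustrative assumptions, not the project's configuration), a Python publisher using pika might look like this:

```python
# Hypothetical scheduler-side publisher: one durable task per ZIP code.
import json

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="scrape_jobs", durable=True)  # survives broker restarts

for zip_code in ["95616", "95618"]:  # in production: all 41,000+ ZIP codes
    task = {"zip_code": zip_code, "query": "rng station"}
    channel.basic_publish(
        exchange="",
        routing_key="scrape_jobs",
        body=json.dumps(task),
        properties=pika.BasicProperties(delivery_mode=2),  # persist message
    )

connection.close()
```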

API and Application Tier

The FastAPI backend service handles all HTTP/HTTPS requests, processes user inputs, queries the PostgreSQL database via an ORM, and triggers new scraping jobs. A ZIP Data Management Service processes incoming data with pandas-based validation and Excel export optimization.
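
As a hedged illustration of this tier (the route, model, and field names are hypothetical, not the project's API), a FastAPI endpoint that accepts a job request and reports how many tasks it would enqueue might look like:

```python
# Hypothetical job-submission endpoint for the API tier.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class JobRequest(BaseModel):
    zip_codes: list[str]  # geographic scope of the scrape
    query: str            # company or business type to target

@app.post("/jobs")
def create_job(request: JobRequest) -> dict:
    # In the real service: persist the job to PostgreSQL via the ORM,
    # then publish one task per ZIP code to RabbitMQ.
    return {"status": "queued", "task_count": len(request.zip_codes)}
```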

Key Features

Distributed Processing

RabbitMQ-powered job distribution across multiple scraper nodes, with intelligent load balancing, automatic failover, and real-time job redistribution so that no single node becomes a bottleneck.
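
The sketch below shows how a scraper node might consume tasks with pika, reusing the hypothetical queue from the publisher sketch above. Fair dispatch comes from prefetch_count=1 (a busy node takes no new work), and failover from acknowledging only after success, so the broker redelivers tasks from a crashed node:

```python
# Hypothetical scraper-node consumer with fair dispatch and at-least-once
# redelivery; queue and payload names match the publisher sketch above.
import json

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="scrape_jobs", durable=True)
channel.basic_qos(prefetch_count=1)  # one in-flight task per node

def handle_task(ch, method, properties, body):
    task = json.loads(body)
    # run_scraper(task)  # hypothetical hook that drives the Puppeteer scraper
    ch.basic_ack(delivery_tag=method.delivery_tag)  # ack only on success

channel.basic_consume(queue="scrape_jobs", on_message_callback=handle_task)
channel.start_consuming()
```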

Anti-Bot Evasion

Sophisticated countermeasures, including headless-detection evasion, randomized behavior patterns, user-agent spoofing, and jittered input timing, significantly increase scrape success rates on bot-protected platforms.
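
Puppeteer runs in Node, so the project's actual evasion code is JavaScript; purely as a sketch of the same ideas in Python, the snippet below uses the community pyppeteer port. The user-agent string, target URL, and selector are placeholder assumptions:

```python
# Hedged Python analogue of the Node-based Puppeteer scrapers, using the
# community pyppeteer port; values below are placeholders.
import asyncio
import random

from pyppeteer import launch

async def scrape(url: str) -> None:
    browser = await launch(headless=True)
    page = await browser.newPage()

    # User-agent spoofing: present a normal desktop browser signature
    await page.setUserAgent(
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    )

    # Headless-detection evasion: hide the navigator.webdriver flag
    # before any page script has a chance to read it
    await page.evaluateOnNewDocument(
        "() => Object.defineProperty(navigator, 'webdriver', "
        "{get: () => undefined})"
    )

    await page.goto(url)

    # Jittered input timing: randomize the per-keystroke delay so typing
    # looks human rather than scripted
    search_box = await page.querySelector("input[name='q']")  # hypothetical
    if search_box:
        await page.type("input[name='q']", "rng station",
                        {"delay": random.randint(50, 150)})

    await browser.close()

asyncio.run(scrape("https://example.com"))
```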

Scalable Architecture

Cloud-native design prepared for Azure Kubernetes Service deployment with containerized Puppeteer scrapers, auto-scaling capabilities, and seamless integration with existing infrastructure workflows.

Data Validation & Export

Pandas-based processing with business rule validation, duplicate resolution, and structured Excel export formatting. Users receive accurate, analysis-ready files with no additional cleanup required.
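
A minimal pandas sketch of this kind of pipeline, with hypothetical column names standing in for the project's schema:

```python
# Illustrative validation/export pipeline; column names are assumptions.
import pandas as pd

rows = [
    {"name": "Example Fuels", "latitude": 38.54, "longitude": -121.74,
     "address": "1 Main St, Davis, CA", "category": "rng_station"},
    {"name": "Example Fuels", "latitude": 38.54, "longitude": -121.74,
     "address": "1 Main St, Davis, CA", "category": "rng_station"},  # duplicate
    {"name": None, "latitude": 95.0, "longitude": -121.74,
     "address": "2 Oak Ave", "category": "rng_station"},  # fails validation
]
df = pd.DataFrame(rows)

# Business-rule validation: plausible coordinates and a non-null name
valid = df[
    df["latitude"].between(-90, 90)
    & df["longitude"].between(-180, 180)
    & df["name"].notna()
]

# Duplicate resolution: keep the first record per (name, address) pair
deduped = valid.drop_duplicates(subset=["name", "address"])

# Excel-ready export (requires openpyxl); no manual cleanup needed
deduped.to_excel("scrape_results.xlsx", index=False)
```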

Interested in working together?

This 8-week distributed systems project showcased advanced full-stack development, enterprise integration, and scalable cloud architecture capabilities for a Fortune 15 client.