
DupZero

Project Motivation & Problem Statement

Emergency and public safety alert systems generate massive volumes of notifications from many sources: news feeds, government agencies, social media, and IoT sensors. A significant challenge is that many of these alerts describe the same underlying event, producing duplicate and near-duplicate entries that overwhelm operators and delay critical responses. DupZero was built to solve this problem: a production-grade alert processing system that intelligently deduplicates incoming alerts using modern NLP and ML techniques.

The core question this project addresses is: How do we accurately identify and eliminate duplicate alerts when they arrive from diverse sources with different wording, formatting, and metadata?

Technical Approach

1. Web Scraping & Alert Ingestion

  • Built automated web scrapers to continuously ingest alerts from multiple public safety and emergency notification sources.
  • Implemented robust parsers to handle varied HTML structures, pagination, and rate limiting across different source websites.
  • Stored raw alert data with full metadata (timestamp, source, location, category) for downstream processing.
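The ingestion layer can be sketched roughly as follows. This is a minimal illustration using Python's stdlib `html.parser`; the `RawAlert` fields mirror the metadata listed above, while the `li.alert` markup, the class names, and the `ingest` signature are hypothetical stand-ins for the real source-specific parsers.

```python
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone
from html.parser import HTMLParser

@dataclass
class RawAlert:
    """Raw alert plus the metadata stored for downstream processing."""
    text: str
    source: str
    category: str
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

class AlertListParser(HTMLParser):
    """Collects the text of <li class="alert"> items from a source page."""
    def __init__(self):
        super().__init__()
        self._in_alert = False
        self.alerts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "alert") in attrs:
            self._in_alert = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_alert = False

    def handle_data(self, data):
        if self._in_alert and data.strip():
            self.alerts.append(data.strip())

def ingest(html: str, source: str, category: str, delay: float = 0.0) -> list[RawAlert]:
    """Parse one page into alerts with metadata; `delay` throttles repeated
    calls as a crude form of rate limiting."""
    if delay:
        time.sleep(delay)
    parser = AlertListParser()
    parser.feed(html)
    return [RawAlert(text=t, source=source, category=category) for t in parser.alerts]
```

In the real system each source gets its own parser and pagination logic; the shared part is attaching uniform metadata before anything hits the pipeline.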

2. Alert Classification Pipeline

  • Developed a multi-label classification system using Hugging Face Transformers to categorize alerts by type (e.g., weather, fire, security, infrastructure).
  • Implemented FastAPI route handlers exposing classification endpoints for real-time and batch processing.
  • Used scikit-learn for baseline classifiers and ensemble methods during model development and evaluation.
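The scikit-learn baseline stage might look roughly like this (the production classifier uses Hugging Face Transformers instead). The training snippets, label set, and the 0.3 probability threshold are purely illustrative; real labels would come from annotated alert data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical training examples; each alert can carry multiple labels.
texts = [
    "Wildfire spreading near the highway, evacuate immediately",
    "Severe thunderstorm warning with damaging winds",
    "Water main break floods intersection, road closed",
    "Suspicious package reported at the station",
    "Storm damage has downed power lines across town",
]
labels = [["fire"], ["weather"], ["infrastructure"], ["security"], ["weather", "infrastructure"]]

# Binarize the label lists into a multi-label indicator matrix.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)

# TF-IDF features feeding one-vs-rest logistic regression: a standard
# multi-label baseline to compare the transformer model against.
clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression(max_iter=1000)))
clf.fit(texts, y)

def classify(alert_text: str) -> list[str]:
    """Return every label whose per-class probability clears a threshold."""
    probs = clf.predict_proba([alert_text])[0]
    return [label for label, p in zip(mlb.classes_, probs) if p >= 0.3]
```

A baseline like this is cheap to train and evaluate, which makes it useful as a sanity check on whether the transformer model is actually earning its inference cost.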

3. Embedding Generation & Semantic Similarity

  • Built an embedding pipeline using pre-trained transformer models to convert alert text into dense vector representations.
  • Computed cosine similarity and distance metrics between alert embeddings to identify semantically similar entries.
  • Leveraged PyTorch for efficient batch embedding computation on GPU when available.
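The embedding-plus-similarity step can be sketched in plain Python. Note the stand-in: `_token_vec` below produces deterministic hashed pseudo-embeddings so the sketch is self-contained, whereas the real pipeline mean-pools vectors from a pre-trained transformer (via PyTorch). The cosine computation is the same either way.

```python
import hashlib
import math

DIM = 64  # illustrative embedding dimension

def _token_vec(token: str) -> list[float]:
    """Deterministic pseudo-embedding for one token: a stand-in for a
    transformer's learned token representations."""
    digest = hashlib.sha256(token.encode()).digest()
    raw = (digest * (DIM // len(digest) + 1))[:DIM]  # tile bytes to DIM
    return [b / 255.0 - 0.5 for b in raw]            # center around zero

def embed(text: str) -> list[float]:
    """Mean-pool token vectors into one dense vector per alert."""
    vecs = [_token_vec(t) for t in text.lower().split()] or [[0.0] * DIM]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, with a zero-vector guard."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0
```

Because mean pooling is order-invariant, two alerts with the same tokens in a different order embed identically here; real transformer embeddings additionally capture paraphrases with no token overlap at all, which is what the semantic stage relies on.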

4. Multi-Stage Deduplication

  • Stage 1 - Exact Match: Hash-based deduplication to instantly eliminate identical alerts.
  • Stage 2 - Fuzzy Match: Text similarity scoring using token overlap and edit distance for near-duplicates.
  • Stage 3 - Semantic Match: Embedding-based similarity to catch paraphrased or differently worded duplicates.
  • Configurable threshold parameters at each stage to tune precision vs. recall based on operational requirements.
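The three stages above can be cascaded as a single pass over incoming alerts. This is a minimal sketch: the threshold defaults are illustrative, stage 2 uses `difflib`'s edit-distance ratio as one concrete fuzzy scorer, and stage 3 takes the embedder and similarity function as caller-supplied callbacks rather than fixing a model.

```python
import hashlib
from difflib import SequenceMatcher

def dedupe(alerts: list[str], embed=None, cosine=None,
           fuzzy_threshold: float = 0.85, semantic_threshold: float = 0.9) -> list[str]:
    """Keep the first occurrence of each alert; drop later duplicates.
    Thresholds trade precision against recall, as described above."""
    kept: list[str] = []
    hashes: set[str] = set()
    for text in alerts:
        # Stage 1 - exact match: hash of whitespace/case-normalized text.
        h = hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()
        if h in hashes:
            continue
        # Stage 2 - fuzzy match: edit-distance ratio against kept alerts.
        if any(SequenceMatcher(None, text.lower(), k.lower()).ratio() >= fuzzy_threshold
               for k in kept):
            continue
        # Stage 3 - semantic match: embedding similarity, if an embedder is given.
        if embed and cosine and any(cosine(embed(text), embed(k)) >= semantic_threshold
                                    for k in kept):
            continue
        hashes.add(h)
        kept.append(text)
    return kept
```

Ordering the stages from cheapest to most expensive matters: the hash check is O(1) per alert, so the costly embedding comparisons only run on alerts that survive the first two filters.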

5. API Architecture & Deployment

  • Structured the backend with FastAPI, exposing endpoints for classification (/classify), deduplication (/deduplicate), and scraper management (/scrape).
  • Containerized the entire system with Docker for reproducible deployments across environments.
  • Set up GitHub Actions CI/CD pipeline for automated testing, building, and deployment on every push.
  • Integrated observability tooling for monitoring pipeline health, throughput, and deduplication metrics.

Results

  • Achieved end-to-end alert ingestion, classification, and deduplication in a single automated pipeline.
  • Multi-stage deduplication significantly reduced duplicate alerts compared to single-method approaches.
  • Semantic embedding stage caught paraphrased duplicates that exact and fuzzy matching missed entirely.
  • Containerized deployment enabled consistent operation across development, staging, and production environments.

Limitations

  • Embedding model performance depends on the domain specificity of the pre-trained model; fine-tuning on alert-specific corpora could improve accuracy.
  • Web scraping is inherently fragile: source website changes can break parsers and require ongoing maintenance.
  • Real-time deduplication at very high throughput may require distributed processing beyond a single container.

Skills and Technologies Demonstrated

  • Production API design with FastAPI
  • NLP and text embedding with Hugging Face Transformers and PyTorch
  • Multi-stage deduplication algorithm design
  • Web scraping and data ingestion automation
  • Docker containerization and GitHub Actions CI/CD
  • Observability and metrics instrumentation
  • End-to-end ML pipeline development

Resources