
DupZero

Project Motivation & Problem Statement

Emergency and public safety alert systems generate massive volumes of notifications from many sources: news feeds, government agencies, social media, and IoT sensors. A significant challenge is that many of these alerts describe the same underlying event, producing duplicate and near-duplicate entries that overwhelm operators and delay critical responses. DupZero was built to solve this problem: a production-grade alert processing system that intelligently deduplicates incoming alerts using modern NLP and ML techniques.

The core question this project addresses is: How do we accurately identify and eliminate duplicate alerts when they arrive from diverse sources with different wording, formatting, and metadata?

Technical Approach

1. Web Scraping & Alert Ingestion

  • Built automated web scrapers to continuously ingest alerts from multiple public safety and emergency notification sources.
  • Implemented robust parsers to handle varied HTML structures, pagination, and rate limiting across different source websites.
  • Stored raw alert data with full metadata (timestamp, source, location, category) for downstream processing.
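The ingestion layer can be sketched roughly as follows. This is a minimal illustration using Python's stdlib `html.parser`; the `RawAlert` fields mirror the metadata listed above, while the `li.alert` markup, the class names, and the `ingest` signature are hypothetical stand-ins for the real source-specific parsers.

```python
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone
from html.parser import HTMLParser

@dataclass
class RawAlert:
    """Raw alert plus the metadata stored for downstream processing."""
    text: str
    source: str
    category: str
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

class AlertListParser(HTMLParser):
    """Collects the text of <li class="alert"> items from a source page."""
    def __init__(self):
        super().__init__()
        self._in_alert = False
        self.alerts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "alert") in attrs:
            self._in_alert = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_alert = False

    def handle_data(self, data):
        if self._in_alert and data.strip():
            self.alerts.append(data.strip())

def ingest(html: str, source: str, category: str, delay: float = 0.0) -> list[RawAlert]:
    """Parse one page into alerts with metadata; `delay` throttles repeated
    calls as a crude form of rate limiting."""
    if delay:
        time.sleep(delay)
    parser = AlertListParser()
    parser.feed(html)
    return [RawAlert(text=t, source=source, category=category) for t in parser.alerts]
```

In the real system each source gets its own parser and pagination logic; the shared part is attaching uniform metadata before anything hits the pipeline.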

2. Alert Classification Pipeline

  • Developed a multi-label classification system using Hugging Face Transformers to categorize alerts by type (e.g., weather, fire, security, infrastructure).
  • Implemented FastAPI route handlers exposing classification endpoints for real-time and batch processing.
  • Used scikit-learn for baseline classifiers and ensemble methods during model development and evaluation.
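The scikit-learn baseline stage might look roughly like this (the production classifier uses Hugging Face Transformers instead). The training snippets, label set, and the 0.3 probability threshold are purely illustrative; real labels would come from annotated alert data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical training examples; each alert can carry multiple labels.
texts = [
    "Wildfire spreading near the highway, evacuate immediately",
    "Severe thunderstorm warning with damaging winds",
    "Water main break floods intersection, road closed",
    "Suspicious package reported at the station",
    "Storm damage has downed power lines across town",
]
labels = [["fire"], ["weather"], ["infrastructure"], ["security"], ["weather", "infrastructure"]]

# Binarize the label lists into a multi-label indicator matrix.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)

# TF-IDF features feeding one-vs-rest logistic regression: a standard
# multi-label baseline to compare the transformer model against.
clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression(max_iter=1000)))
clf.fit(texts, y)

def classify(alert_text: str) -> list[str]:
    """Return every label whose per-class probability clears a threshold."""
    probs = clf.predict_proba([alert_text])[0]
    return [label for label, p in zip(mlb.classes_, probs) if p >= 0.3]
```

A baseline like this is cheap to train and evaluate, which makes it useful as a sanity check on whether the transformer model is actually earning its inference cost.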

3. Embedding Generation & Semantic Similarity

  • Built an embedding pipeline using pre-trained transformer models to convert alert text into dense vector representations.
  • Computed cosine similarity and distance metrics between alert embeddings to identify semantically similar entries.
  • Leveraged PyTorch for efficient batch embedding computation on GPU when available.
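The embedding-plus-similarity step can be sketched in plain Python. Note the stand-in: `_token_vec` below produces deterministic hashed pseudo-embeddings so the sketch is self-contained, whereas the real pipeline mean-pools vectors from a pre-trained transformer (via PyTorch). The cosine computation is the same either way.

```python
import hashlib
import math

DIM = 64  # illustrative embedding dimension

def _token_vec(token: str) -> list[float]:
    """Deterministic pseudo-embedding for one token: a stand-in for a
    transformer's learned token representations."""
    digest = hashlib.sha256(token.encode()).digest()
    raw = (digest * (DIM // len(digest) + 1))[:DIM]  # tile bytes to DIM
    return [b / 255.0 - 0.5 for b in raw]            # center around zero

def embed(text: str) -> list[float]:
    """Mean-pool token vectors into one dense vector per alert."""
    vecs = [_token_vec(t) for t in text.lower().split()] or [[0.0] * DIM]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, with a zero-vector guard."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0
```

Because mean pooling is order-invariant, two alerts with the same tokens in a different order embed identically here; real transformer embeddings additionally capture paraphrases with no token overlap at all, which is what the semantic stage relies on.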

4. Multi-Stage Deduplication

  • Stage 1 - Exact Match: Hash-based deduplication to instantly eliminate identical alerts.
  • Stage 2 - Fuzzy Match: Text similarity scoring using token overlap and edit distance for near-duplicates.
  • Stage 3 - Semantic Match: Embedding-based similarity to catch paraphrased or differently worded duplicates.
  • Configurable threshold parameters at each stage to tune precision vs. recall based on operational requirements.
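The three stages above can be cascaded as a single pass over incoming alerts. This is a minimal sketch: the threshold defaults are illustrative, stage 2 uses `difflib`'s edit-distance ratio as one concrete fuzzy scorer, and stage 3 takes the embedder and similarity function as caller-supplied callbacks rather than fixing a model.

```python
import hashlib
from difflib import SequenceMatcher

def dedupe(alerts: list[str], embed=None, cosine=None,
           fuzzy_threshold: float = 0.85, semantic_threshold: float = 0.9) -> list[str]:
    """Keep the first occurrence of each alert; drop later duplicates.
    Thresholds trade precision against recall, as described above."""
    kept: list[str] = []
    hashes: set[str] = set()
    for text in alerts:
        # Stage 1 - exact match: hash of whitespace/case-normalized text.
        h = hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()
        if h in hashes:
            continue
        # Stage 2 - fuzzy match: edit-distance ratio against kept alerts.
        if any(SequenceMatcher(None, text.lower(), k.lower()).ratio() >= fuzzy_threshold
               for k in kept):
            continue
        # Stage 3 - semantic match: embedding similarity, if an embedder is given.
        if embed and cosine and any(cosine(embed(text), embed(k)) >= semantic_threshold
                                    for k in kept):
            continue
        hashes.add(h)
        kept.append(text)
    return kept
```

Ordering the stages from cheapest to most expensive matters: the hash check is O(1) per alert, so the costly embedding comparisons only run on alerts that survive the first two filters.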

5. API Architecture & Deployment

  • Structured the backend with FastAPI, exposing endpoints for classification (/classify), deduplication (/deduplicate), and scraper management (/scrape).
  • Containerized the entire system with Docker for reproducible deployments across environments.
  • Set up GitHub Actions CI/CD pipeline for automated testing, building, and deployment on every push.
  • Integrated observability tooling for monitoring pipeline health, throughput, and deduplication metrics.

Results

  • Achieved end-to-end alert ingestion, classification, and deduplication in a single automated pipeline.
  • Multi-stage deduplication significantly reduced duplicate alerts compared to single-method approaches.
  • Semantic embedding stage caught paraphrased duplicates that exact and fuzzy matching missed entirely.
  • Containerized deployment enabled consistent operation across development, staging, and production environments.

Limitations

  • Embedding model performance depends on the domain specificity of the pre-trained model; fine-tuning on alert-specific corpora could improve accuracy.
  • Web scraping is inherently fragile: source website changes can break parsers and require ongoing maintenance.
  • Real-time deduplication at very high throughput may require distributed processing beyond a single container.

Skills and Technologies Demonstrated

  • Production API design with FastAPI
  • NLP and text embedding with Hugging Face Transformers and PyTorch
  • Multi-stage deduplication algorithm design
  • Web scraping and data ingestion automation
  • Docker containerization and GitHub Actions CI/CD
  • Observability and metrics instrumentation
  • End-to-end ML pipeline development

Resources