Project Motivation & Problem Statement
Emergency and public safety alert systems generate massive volumes of notifications from multiple sources: news feeds, government agencies, social media, and IoT sensors. A significant challenge is that many of these alerts describe the same underlying event, leading to duplicate and near-duplicate entries that overwhelm operators and delay critical response times. DupZero was built to solve this problem: a production-grade alert processing system that intelligently deduplicates incoming alerts using modern NLP and ML techniques.
The core question this project addresses is: How do we accurately identify and eliminate duplicate alerts when they arrive from diverse sources with different wording, formatting, and metadata?
Technical Approach
1. Web Scraping & Alert Ingestion
- Built automated web scrapers to continuously ingest alerts from multiple public safety and emergency notification sources.
- Implemented robust parsers to handle varied HTML structures, pagination, and rate limiting across different source websites.
- Stored raw alert data with full metadata (timestamp, source, location, category) for downstream processing.
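The ingestion step above can be sketched with the standard library alone. This is a minimal, illustrative parser, not the project's actual scraper: the `RawAlert` fields mirror the metadata listed above, while the `li.alert` markup, the `ingest` helper, and the source/category names are assumptions for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from html.parser import HTMLParser

@dataclass
class RawAlert:
    """Raw alert record carrying the metadata kept for downstream stages."""
    text: str
    source: str
    category: str
    timestamp: str

class AlertFeedParser(HTMLParser):
    """Collects text from <li class="alert"> items in a source page (assumed markup)."""
    def __init__(self) -> None:
        super().__init__()
        self._in_alert = False
        self.alerts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "alert") in attrs:
            self._in_alert = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_alert = False

    def handle_data(self, data):
        if self._in_alert and data.strip():
            self.alerts.append(data.strip())

def ingest(html: str, source: str, category: str) -> list[RawAlert]:
    """Parse one fetched page into raw alert records with full metadata."""
    parser = AlertFeedParser()
    parser.feed(html)
    now = datetime.now(timezone.utc).isoformat()
    return [RawAlert(text=t, source=source, category=category, timestamp=now)
            for t in parser.alerts]

page = ('<ul><li class="alert">Flood warning for River Rd</li>'
        '<li class="alert">Road closure on Main St</li></ul>')
records = ingest(page, source="city-feed", category="infrastructure")
```

In the real pipeline the HTML would come from an HTTP client with retries and rate limiting; here the page is inlined so the parsing logic stands alone.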
2. Alert Classification Pipeline
- Developed a multi-label classification system using Hugging Face Transformers to categorize alerts by type (e.g., weather, fire, security, infrastructure).
- Implemented FastAPI route handlers exposing classification endpoints for real-time and batch processing.
- Used scikit-learn for baseline classifiers and ensemble methods during model development and evaluation.
3. Embedding Generation & Semantic Similarity
- Built an embedding pipeline using pre-trained transformer models to convert alert text into dense vector representations.
- Computed cosine similarity and distance metrics between alert embeddings to identify semantically similar entries.
- Leveraged PyTorch for efficient batch embedding computation on GPU when available.
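The similarity computation described above reduces to cosine similarity over dense vectors. A minimal pure-Python version (the real pipeline would batch this in PyTorch; the toy vectors are illustrative):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: v1 and v2 stand for two paraphrases of the same alert,
# v3 for an unrelated alert.
v1 = [0.2, 0.8, 0.1]
v2 = [0.25, 0.75, 0.05]
v3 = [-0.9, 0.1, 0.4]
```

Because cosine similarity depends only on direction, not magnitude, two alerts with similar meaning but different lengths can still score close to 1.0.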
4. Multi-Stage Deduplication
- Stage 1 - Exact Match: Hash-based deduplication to instantly eliminate identical alerts.
- Stage 2 - Fuzzy Match: Text similarity scoring using token overlap and edit distance for near-duplicates.
- Stage 3 - Semantic Match: Embedding-based similarity to catch paraphrased or reworded duplicates.
- Configurable threshold parameters at each stage to tune precision vs. recall based on operational requirements.
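The three stages above can be sketched as a single cascade. This is a simplified illustration with stdlib stand-ins: SHA-256 over normalized text for Stage 1, `difflib`'s edit-distance ratio for Stage 2, and a pluggable `semantic_fn` callback (e.g. embedding cosine similarity) for Stage 3. The threshold values are placeholders for the configurable parameters mentioned above.

```python
import hashlib
from difflib import SequenceMatcher

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparison."""
    return " ".join(text.lower().split())

def exact_key(text: str) -> str:
    """Stage 1: content hash for instant exact-match dedup."""
    return hashlib.sha256(normalize(text).encode()).hexdigest()

def fuzzy_score(a: str, b: str) -> float:
    """Stage 2: edit-distance-based similarity ratio in [0, 1]."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def is_duplicate(a: str, b: str, semantic_fn=None,
                 fuzzy_threshold: float = 0.85,
                 semantic_threshold: float = 0.90) -> bool:
    """Run the stages cheapest-first; stop at the first match."""
    if exact_key(a) == exact_key(b):            # Stage 1: exact
        return True
    if fuzzy_score(a, b) >= fuzzy_threshold:    # Stage 2: fuzzy
        return True
    if semantic_fn is not None:                 # Stage 3: semantic
        return semantic_fn(a, b) >= semantic_threshold
    return False
```

Ordering the stages from cheapest to most expensive means most duplicates never reach the embedding comparison, which keeps throughput high.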
5. API Architecture & Deployment
- Structured the backend with FastAPI, exposing endpoints for classification (/classify), deduplication (/deduplicate), and scraper management (/scrape).
- Containerized the entire system with Docker for reproducible deployments across environments.
- Set up GitHub Actions CI/CD pipeline for automated testing, building, and deployment on every push.
- Integrated observability tooling for monitoring pipeline health, throughput, and deduplication metrics.
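A CI/CD workflow of the kind described above might look like the following sketch. The job names, file paths, and Docker tag are assumptions for illustration, not the project's actual configuration:

```yaml
name: ci

on: [push]

jobs:
  test-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt   # assumed dependency file
      - run: pytest                            # automated tests on every push
      - run: docker build -t dupzero:${{ github.sha }} .
```

Tagging the image with the commit SHA ties every deployed container back to the exact code that produced it.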
Results
- Achieved end-to-end alert ingestion, classification, and deduplication in a single automated pipeline.
- Multi-stage deduplication significantly reduced duplicate alerts compared to single-method approaches.
- Semantic embedding stage caught paraphrased duplicates that exact and fuzzy matching missed entirely.
- Containerized deployment enabled consistent operation across development, staging, and production environments.
Limitations
- Embedding model performance depends on the domain specificity of the pre-trained model; fine-tuning on alert-specific corpora could improve accuracy.
- Web scraping is inherently fragile: changes to source websites can break parsers and require ongoing maintenance.
- Real-time deduplication at very high throughput may require distributed processing beyond a single container.
Skills and Technologies Demonstrated
- Production API design with FastAPI
- NLP and text embedding with Hugging Face Transformers and PyTorch
- Multi-stage deduplication algorithm design
- Web scraping and data ingestion automation
- Docker containerization and GitHub Actions CI/CD
- Observability and metrics instrumentation
- End-to-end ML pipeline development