Project Motivation & Problem Statement
Querying and analyzing large datasets is a core skill in data engineering and analytics. While traditional SQL databases like PostgreSQL excel at transactional workloads, they struggle with large-scale analytical queries. Apache Spark SQL bridges this gap by enabling SQL-like queries over distributed datasets. SparkQuery explores both sides of this spectrum: it ingests real-time RSS feeds into PostgreSQL via Spark and leverages Spark SQL for distributed analytical processing, building practical fluency in the query engines that power modern data platforms.
Technical Approach
1. Real-Time RSS Feed Ingestion
Built a data pipeline that fetches live news from Google News RSS feeds and persists it to PostgreSQL.
- RSS Parsing: Used both the xml.etree.ElementTree and feedparser libraries to parse Google News RSS feeds, extracting the title, link, pubDate, description, and source fields.
- Multi-Category Feeds: Configured feeds for three news categories (technology, business, and sports) using Google News search API endpoints.
- Spark DataFrame Creation: Converted parsed RSS data into Spark DataFrames using spark.createDataFrame() with Row objects for schema flexibility.
- Category Tagging: Added category metadata to each record, enabling filtered analytics downstream.
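The parsing and tagging steps above can be sketched with the standard library alone; the sample XML below is a hypothetical item in the RSS 2.0 shape Google News returns (feedparser yields the same fields with less code):

```python
import xml.etree.ElementTree as ET

# Hypothetical single-item feed in RSS 2.0 shape; real feeds carry many items.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <item>
    <title>Chip makers rally</title>
    <link>https://news.example.com/a1</link>
    <pubDate>Tue, 04 Jun 2024 09:15:00 GMT</pubDate>
    <description>Semiconductor stocks climbed in early trading.</description>
    <source url="https://example.com">Example Wire</source>
  </item>
</channel></rss>"""

def parse_items(rss_text, category):
    """Extract the persisted fields from each <item>, tagging with a category."""
    root = ET.fromstring(rss_text)
    rows = []
    for item in root.iter("item"):
        rows.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "pubDate": item.findtext("pubDate"),
            "description": item.findtext("description"),
            "source": item.findtext("source"),
            "category": category,  # enables filtered analytics downstream
        })
    return rows
```

Each dict can then be wrapped in a pyspark.sql.Row and handed to spark.createDataFrame() as described above.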
2. PostgreSQL Integration via JDBC
- JDBC Connection: Configured Spark-PostgreSQL connectivity using org.postgresql.Driver with connection URL jdbc:postgresql://localhost:5432/postgres.
- Batch Writes: Used DataFrame.write.jdbc() with mode='append' to incrementally add new records to the news.q1 table.
- Direct SQL Execution: Implemented custom delete operations using java.sql.DriverManager for record filtering (e.g., removing NFL-related entries).
- Read-Back Validation: Verified data persistence by reading back from PostgreSQL using spark.read.jdbc().
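A minimal sketch of the write and read-back paths, assuming the URL and table above; the credentials are placeholders (in production they should come from the environment, as noted in the limitations):

```python
# JDBC settings matching the setup described above.
JDBC_URL = "jdbc:postgresql://localhost:5432/postgres"
JDBC_PROPS = {
    "user": "postgres",       # placeholder credential
    "password": "postgres",   # placeholder credential
    "driver": "org.postgresql.Driver",
}
TABLE = "news.q1"

def append_news(df):
    """Incrementally append new records to the news table."""
    df.write.jdbc(url=JDBC_URL, table=TABLE, mode="append",
                  properties=JDBC_PROPS)

def read_news(spark):
    """Read the table back through Spark to validate persistence."""
    return spark.read.jdbc(url=JDBC_URL, table=TABLE, properties=JDBC_PROPS)
```

Both helpers take a live DataFrame/SparkSession, so they need a running Spark install and PostgreSQL instance plus the PostgreSQL JDBC jar on the driver classpath.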
3. Time-Based Data Filtering
- Timestamp Parsing: Converted RSS pubDate strings to Spark timestamps using to_timestamp() with format pattern "E, dd MMM yyyy HH:mm:ss Z".
- Recency Filtering: Implemented 24-hour filtering using F.expr('INTERVAL 24 HOURS') to extract only recent news articles.
- Legacy Time Parser: Configured spark.sql.legacy.timeParserPolicy=LEGACY to handle RFC 2822 date formats from RSS feeds.
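The filtering step can be sketched as below. The pyspark import is deferred so the helper only needs Spark when called; the stdlib check at the end uses a hypothetical pubDate value to confirm the RFC 2822 shape the legacy parser expects:

```python
from email.utils import parsedate_to_datetime

# Legacy (SimpleDateFormat-style) pattern passed to to_timestamp(), as above.
RFC2822_FMT = "E, dd MMM yyyy HH:mm:ss Z"

def recent_news(df, hours=24):
    """Keep only rows whose pubDate falls within the last `hours` hours.
    Assumes a session configured with spark.sql.legacy.timeParserPolicy=LEGACY."""
    from pyspark.sql import functions as F  # deferred: needs a Spark install
    ts = F.to_timestamp(F.col("pubDate"), RFC2822_FMT)
    cutoff = F.current_timestamp() - F.expr(f"INTERVAL {int(hours)} HOURS")
    return df.where(ts >= cutoff)

# Sanity-check the RFC 2822 format with the stdlib (sample value, not real data):
dt = parsedate_to_datetime("Tue, 04 Jun 2024 09:15:00 +0000")
```

Parameterizing the hour count keeps the 24-hour window configurable, matching the results section below.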
4. Spark SQL Analytics
- Distinct Queries: Used Spark DataFrame operations like .select('category').distinct() for unique value extraction.
- Session Configuration: Configured SparkSession with local[*] master for multi-core parallel processing and extended network timeout (600s).
- SQL Functions: Leveraged pyspark.sql.functions (aliased as F) for column operations including current_timestamp(), col(), and to_timestamp().
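A sketch of the session setup and the distinct-category query, assuming the settings listed above (the app name is a placeholder):

```python
# Session settings mirroring the configuration described above.
SPARK_CONF = {
    "spark.network.timeout": "600s",
    "spark.sql.legacy.timeParserPolicy": "LEGACY",
    "spark.driver.host": "127.0.0.1",
}

def build_session(app_name="SparkQuery"):
    """Create a local multi-core session; requires a Spark installation."""
    from pyspark.sql import SparkSession  # deferred: needs a Spark install
    builder = SparkSession.builder.appName(app_name).master("local[*]")
    for key, value in SPARK_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()

def distinct_categories(df):
    """Unique category values via the DataFrame API."""
    return sorted(r["category"] for r in
                  df.select("category").distinct().collect())
```

With the three feeds above, distinct_categories() would return the business, sports, and technology tags.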
Implementation Details
- Spark Initialization: Used findspark.init() for Spark discovery and configured the driver host as 127.0.0.1 for local development.
- Dual Context Setup: Created both SparkContext and SQLContext for RDD and DataFrame operations.
- Error Handling: Implemented try/finally blocks for JDBC connections ensuring proper resource cleanup.
- Schema Design: Designed the news table with columns title, link, pubDate, description, source, and category.
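The try/finally cleanup around the DriverManager-based deletes can be sketched as follows; the predicate example and credentials are placeholders, and the helper reaches java.sql.DriverManager through Spark's py4j JVM gateway:

```python
def delete_where(spark, predicate, table="news.q1",
                 url="jdbc:postgresql://localhost:5432/postgres",
                 user="postgres", password="postgres"):  # placeholder credentials
    """Run a DELETE via java.sql.DriverManager on Spark's JVM gateway.
    try/finally guarantees the JDBC connection is closed even on failure."""
    conn = spark._jvm.java.sql.DriverManager.getConnection(url, user, password)
    try:
        stmt = conn.createStatement()
        # e.g. predicate = "title ILIKE '%NFL%'" to drop NFL-related entries
        return stmt.executeUpdate(f"DELETE FROM {table} WHERE {predicate}")
    finally:
        conn.close()
```

Running arbitrary DELETEs this way sidesteps the DataFrame writer, which only supports append/overwrite modes, at the cost of hand-built SQL.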
Results
- Successfully ingested real-time news from multiple Google News RSS feeds into PostgreSQL.
- Implemented reliable JDBC-based data persistence with Spark DataFrame API.
- Time-based filtering accurately isolates recent news within configurable time windows.
- Category-based analytics enable news segmentation across technology, business, and sports domains.
- Comprehensive documentation with numbered screenshots for each problem's solution.
Limitations
- RSS feed structure varies across sources; parser robustness required for production use.
- Single-node Spark mode used; distributed cluster would improve throughput for high-volume feeds.
- PostgreSQL credentials stored in code; environment-based configuration recommended for production.
Skills and Technologies Demonstrated
- Apache Spark SQL and PySpark DataFrames
- PostgreSQL JDBC integration
- RSS feed parsing (feedparser, xml.etree.ElementTree)
- Timestamp parsing and time-based filtering
- Spark session configuration and context management
- Real-time data ingestion pipelines
- SQL operations via Spark DataFrame API