Project Motivation & Problem Statement
Querying and analyzing large datasets is a core skill in data engineering and analytics. While traditional SQL databases like PostgreSQL excel at transactional workloads, they struggle with large-scale analytical queries. Apache Spark SQL bridges this gap by enabling SQL-like queries over distributed datasets. SparkQuery explores both sides of this spectrum: it ingests real-time RSS feeds into PostgreSQL via Spark and leverages Spark SQL for distributed analytical processing, building practical fluency in the query engines that power modern data platforms.
Technical Approach
1. Real-Time RSS Feed Ingestion
Built a data pipeline that fetches live news from Google News RSS feeds and persists it to PostgreSQL.
- RSS Parsing: Used both the xml.etree.ElementTree and feedparser libraries to parse Google News RSS feeds, extracting the title, link, pubDate, description, and source fields.
- Multi-Category Feeds: Configured feeds for three news categories (technology, business, and sports) using Google News search API endpoints.
- Spark DataFrame Creation: Converted parsed RSS data into Spark DataFrames using spark.createDataFrame() with Row objects for schema flexibility.
- Category Tagging: Added category metadata to each record, enabling filtered analytics downstream.
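The parsing and tagging steps above can be sketched with the standard library alone; the sample XML below is a hypothetical item in the RSS 2.0 shape Google News returns (feedparser yields the same fields with less code):

```python
import xml.etree.ElementTree as ET

# Hypothetical single-item feed in RSS 2.0 shape; real feeds carry many items.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <item>
    <title>Chip makers rally</title>
    <link>https://news.example.com/a1</link>
    <pubDate>Tue, 04 Jun 2024 09:15:00 GMT</pubDate>
    <description>Semiconductor stocks climbed in early trading.</description>
    <source url="https://example.com">Example Wire</source>
  </item>
</channel></rss>"""

def parse_items(rss_text, category):
    """Extract the persisted fields from each <item>, tagging with a category."""
    root = ET.fromstring(rss_text)
    rows = []
    for item in root.iter("item"):
        rows.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "pubDate": item.findtext("pubDate"),
            "description": item.findtext("description"),
            "source": item.findtext("source"),
            "category": category,  # enables filtered analytics downstream
        })
    return rows
```

Each dict can then be wrapped in a pyspark.sql.Row and handed to spark.createDataFrame() as described above.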
2. PostgreSQL Integration via JDBC
- JDBC Connection: Configured Spark-PostgreSQL connectivity using org.postgresql.Driver with connection URL jdbc:postgresql://localhost:5432/postgres.
- Batch Writes: Used DataFrame.write.jdbc() with mode='append' to incrementally add new records to the news.q1 table.
- Direct SQL Execution: Implemented custom delete operations using java.sql.DriverManager for record filtering (e.g., removing NFL-related entries).
- Read-Back Validation: Verified data persistence by reading back from PostgreSQL using spark.read.jdbc().
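A minimal sketch of the write and read-back paths, assuming the URL and table above; the credentials are placeholders (in production they should come from the environment, as noted in the limitations):

```python
# JDBC settings matching the setup described above.
JDBC_URL = "jdbc:postgresql://localhost:5432/postgres"
JDBC_PROPS = {
    "user": "postgres",       # placeholder credential
    "password": "postgres",   # placeholder credential
    "driver": "org.postgresql.Driver",
}
TABLE = "news.q1"

def append_news(df):
    """Incrementally append new records to the news table."""
    df.write.jdbc(url=JDBC_URL, table=TABLE, mode="append",
                  properties=JDBC_PROPS)

def read_news(spark):
    """Read the table back through Spark to validate persistence."""
    return spark.read.jdbc(url=JDBC_URL, table=TABLE, properties=JDBC_PROPS)
```

Both helpers take a live DataFrame/SparkSession, so they need a running Spark install and PostgreSQL instance plus the PostgreSQL JDBC jar on the driver classpath.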
3. Time-Based Data Filtering
- Timestamp Parsing: Converted RSS pubDate strings to Spark timestamps using to_timestamp() with format pattern "E, dd MMM yyyy HH:mm:ss Z".
- Recency Filtering: Implemented 24-hour filtering using F.expr('INTERVAL 24 HOURS') to extract only recent news articles.
- Legacy Time Parser: Configured spark.sql.legacy.timeParserPolicy=LEGACY to handle RFC 2822 date formats from RSS feeds.
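The filtering step can be sketched as below. The pyspark import is deferred so the helper only needs Spark when called; the stdlib check at the end uses a hypothetical pubDate value to confirm the RFC 2822 shape the legacy parser expects:

```python
from email.utils import parsedate_to_datetime

# Legacy (SimpleDateFormat-style) pattern passed to to_timestamp(), as above.
RFC2822_FMT = "E, dd MMM yyyy HH:mm:ss Z"

def recent_news(df, hours=24):
    """Keep only rows whose pubDate falls within the last `hours` hours.
    Assumes a session configured with spark.sql.legacy.timeParserPolicy=LEGACY."""
    from pyspark.sql import functions as F  # deferred: needs a Spark install
    ts = F.to_timestamp(F.col("pubDate"), RFC2822_FMT)
    cutoff = F.current_timestamp() - F.expr(f"INTERVAL {int(hours)} HOURS")
    return df.where(ts >= cutoff)

# Sanity-check the RFC 2822 format with the stdlib (sample value, not real data):
dt = parsedate_to_datetime("Tue, 04 Jun 2024 09:15:00 +0000")
```

Parameterizing the hour count keeps the 24-hour window configurable, matching the results section below.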
4. Spark SQL Analytics
- Distinct Queries: Used Spark DataFrame operations like .select('category').distinct() for unique value extraction.
- Session Configuration: Configured SparkSession with local[*] master for multi-core parallel processing and extended network timeout (600s).
- SQL Functions: Leveraged pyspark.sql.functions (aliased as F) for column operations including current_timestamp(), col(), and to_timestamp().
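A sketch of the session setup and the distinct-category query, assuming the settings listed above (the app name is a placeholder):

```python
# Session settings mirroring the configuration described above.
SPARK_CONF = {
    "spark.network.timeout": "600s",
    "spark.sql.legacy.timeParserPolicy": "LEGACY",
    "spark.driver.host": "127.0.0.1",
}

def build_session(app_name="SparkQuery"):
    """Create a local multi-core session; requires a Spark installation."""
    from pyspark.sql import SparkSession  # deferred: needs a Spark install
    builder = SparkSession.builder.appName(app_name).master("local[*]")
    for key, value in SPARK_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()

def distinct_categories(df):
    """Unique category values via the DataFrame API."""
    return sorted(r["category"] for r in
                  df.select("category").distinct().collect())
```

With the three feeds above, distinct_categories() would return the business, sports, and technology tags.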
Implementation Details
- Spark Initialization: Used findspark.init() for Spark discovery and configured the driver host as 127.0.0.1 for local development.
- Dual Context Setup: Created both SparkContext and SQLContext for RDD and DataFrame operations.
- Error Handling: Implemented try/finally blocks for JDBC connections ensuring proper resource cleanup.
- Schema Design: Designed the news table with columns title, link, pubDate, description, source, and category.
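The try/finally cleanup around the DriverManager-based deletes can be sketched as follows; the predicate example and credentials are placeholders, and the helper reaches java.sql.DriverManager through Spark's py4j JVM gateway:

```python
def delete_where(spark, predicate, table="news.q1",
                 url="jdbc:postgresql://localhost:5432/postgres",
                 user="postgres", password="postgres"):  # placeholder credentials
    """Run a DELETE via java.sql.DriverManager on Spark's JVM gateway.
    try/finally guarantees the JDBC connection is closed even on failure."""
    conn = spark._jvm.java.sql.DriverManager.getConnection(url, user, password)
    try:
        stmt = conn.createStatement()
        # e.g. predicate = "title ILIKE '%NFL%'" to drop NFL-related entries
        return stmt.executeUpdate(f"DELETE FROM {table} WHERE {predicate}")
    finally:
        conn.close()
```

Running arbitrary DELETEs this way sidesteps the DataFrame writer, which only supports append/overwrite modes, at the cost of hand-built SQL.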
Results
- Successfully ingested real-time news from multiple Google News RSS feeds into PostgreSQL.
- Implemented reliable JDBC-based data persistence with Spark DataFrame API.
- Time-based filtering accurately isolates recent news within configurable time windows.
- Category-based analytics enable news segmentation across technology, business, and sports domains.
- Comprehensive documentation with numbered screenshots for each problem's solution.
Limitations
- RSS feed structure varies across sources; parser robustness required for production use.
- Single-node Spark mode used; distributed cluster would improve throughput for high-volume feeds.
- PostgreSQL credentials stored in code; environment-based configuration recommended for production.
Skills and Technologies Demonstrated
- Apache Spark SQL and PySpark DataFrames
- PostgreSQL JDBC integration
- RSS feed parsing (feedparser, xml.etree.ElementTree)
- Timestamp parsing and time-based filtering
- Spark session configuration and context management
- Real-time data ingestion pipelines
- SQL operations via Spark DataFrame API