SparkQuery

Project Motivation & Problem Statement

Querying and analyzing large datasets is a core skill in data engineering and analytics. While traditional SQL databases like PostgreSQL excel at transactional workloads, they struggle with large-scale analytical queries. Apache Spark SQL bridges this gap by enabling SQL-like queries over distributed datasets. SparkQuery explores both sides of this spectrum: ingesting real-time RSS feeds into PostgreSQL via Spark, and leveraging Spark SQL for distributed analytical processing. The goal is practical fluency in the query engines that power modern data platforms.

Technical Approach

1. Real-Time RSS Feed Ingestion

Built a data pipeline that fetches live news from Google News RSS feeds and persists it to PostgreSQL.

  • RSS Parsing: Used both xml.etree.ElementTree and feedparser libraries to parse Google News RSS feeds, extracting title, link, pubDate, description, and source fields.
  • Multi-Category Feeds: Configured feeds for three news categories (technology, business, and sports) using Google News search API endpoints.
  • Spark DataFrame Creation: Converted parsed RSS data into Spark DataFrames using spark.createDataFrame() with Row objects for schema flexibility.
  • Category Tagging: Added category metadata to each record, enabling filtered analytics downstream.
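The parsing and tagging steps above can be sketched as follows. This is a minimal illustration using the stdlib xml.etree.ElementTree path; the field names (title, link, pubDate, description, source) come from the write-up, while the function name and sample structure are assumptions, not the project's exact code.

```python
# Sketch of the RSS-parsing step: one dict per <item>, tagged with its category.
# Field names follow the write-up; parse_rss is an illustrative helper name.
import xml.etree.ElementTree as ET

def parse_rss(xml_text, category):
    """Extract title/link/pubDate/description/source from each <item>."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter("item"):
        rows.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "pubDate": item.findtext("pubDate"),
            "description": item.findtext("description"),
            "source": item.findtext("source"),
            "category": category,  # tag for downstream filtered analytics
        })
    return rows

# The resulting dicts can then become Row objects and a Spark DataFrame:
#   df = spark.createDataFrame([Row(**r) for r in rows])
```

Using dicts first keeps the parsing testable without a running Spark session; the Row conversion happens only at DataFrame-creation time.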

2. PostgreSQL Integration via JDBC

  • JDBC Connection: Configured Spark-PostgreSQL connectivity using org.postgresql.Driver with connection URL jdbc:postgresql://localhost:5432/postgres.
  • Batch Writes: Used DataFrame.write.jdbc() with mode='append' to incrementally add new records to the news.q1 table.
  • Direct SQL Execution: Implemented custom delete operations using java.sql.DriverManager for record filtering (e.g., removing NFL-related entries).
  • Read-Back Validation: Verified data persistence by reading back from PostgreSQL using spark.read.jdbc().
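A hedged sketch of the JDBC round trip described above: the URL, driver class, append mode, and news.q1 table come from the write-up, while the helper names and the properties dict are illustrative assumptions (credentials should come from the environment in production, per the Limitations section).

```python
# JDBC write-then-read-back sketch. URL and table name are from the write-up.
JDBC_URL = "jdbc:postgresql://localhost:5432/postgres"

def jdbc_properties(user, password):
    """Connection properties Spark passes through to the PostgreSQL driver."""
    return {
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }

def append_and_read_back(spark, df, user, password, table="news.q1"):
    """Append new records, then read the table back to validate persistence."""
    props = jdbc_properties(user, password)
    df.write.jdbc(url=JDBC_URL, table=table, mode="append", properties=props)
    return spark.read.jdbc(url=JDBC_URL, table=table, properties=props)
```

Reading the table back through spark.read.jdbc() is the cheapest end-to-end check that the append actually reached PostgreSQL.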

3. Time-Based Data Filtering

  • Timestamp Parsing: Converted RSS pubDate strings to Spark timestamps using to_timestamp() with format pattern "E, dd MMM yyyy HH:mm:ss Z".
  • Recency Filtering: Implemented 24-hour filtering using F.expr('INTERVAL 24 HOURS') to extract only recent news articles.
  • Legacy Time Parser: Configured spark.sql.legacy.timeParserPolicy=LEGACY to handle RFC 2822 date formats from RSS feeds.
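The recency filter above can be sketched like this. The Java-style timestamp pattern and the INTERVAL expression are from the write-up; the function and column names are assumptions, and the pure-Python cutoff helper is included only to mirror the Spark logic in a testable way.

```python
# 24-hour recency filter sketch. RSS_TS_FORMAT is the Spark/Java pattern for
# RFC 2822-style pubDate strings; it requires the LEGACY time parser policy.
from datetime import timedelta

RSS_TS_FORMAT = "E, dd MMM yyyy HH:mm:ss Z"

def cutoff_24h(now):
    """Python-side equivalent of current_timestamp() - INTERVAL 24 HOURS."""
    return now - timedelta(hours=24)

def filter_last_24h(df):
    """Parse pubDate strings and keep only articles from the last 24 hours."""
    from pyspark.sql import functions as F  # deferred so the module imports without Spark
    return (
        df.withColumn("ts", F.to_timestamp(F.col("pubDate"), RSS_TS_FORMAT))
          .filter(F.col("ts") >= F.current_timestamp() - F.expr("INTERVAL 24 HOURS"))
    )
```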

4. Spark SQL Analytics

  • Distinct Queries: Used Spark DataFrame operations like .select('category').distinct() for unique value extraction.
  • Session Configuration: Configured SparkSession with local[*] master for multi-core parallel processing and extended network timeout (600s).
  • SQL Functions: Leveraged pyspark.sql.functions (aliased as F) for column operations including current_timestamp(), col(), and to_timestamp().
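A possible SparkSession setup matching the configuration listed above (local[*] master, 600-second network timeout, legacy time parser). The application name is an assumption; the config keys are standard Spark properties.

```python
# SparkSession configuration sketch; keys mirror the settings in the write-up.
SESSION_CONF = {
    "spark.network.timeout": "600s",                 # extended network timeout
    "spark.sql.legacy.timeParserPolicy": "LEGACY",   # RFC 2822 pubDate parsing
}

def build_session(app_name="SparkQuery"):
    """Create a local multi-core session with the project's settings applied."""
    from pyspark.sql import SparkSession  # deferred import
    builder = SparkSession.builder.master("local[*]").appName(app_name)
    for key, value in SESSION_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```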

Implementation Details

  • Spark Initialization: Used findspark.init() for Spark discovery and configured driver host as 127.0.0.1 for local development.
  • Dual Context Setup: Created both SparkContext and SQLContext for RDD and DataFrame operations.
  • Error Handling: Implemented try/finally blocks for JDBC connections ensuring proper resource cleanup.
  • Schema Design: Designed the news table with columns title, link, pubDate, description, source, and category.
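The try/finally cleanup pattern for direct SQL execution might look like the sketch below. Reaching java.sql.DriverManager through Spark's JVM gateway (spark._jvm) is one way to issue the custom deletes mentioned above; the function name, the LIKE-based delete statement, and the keyword parameter are all illustrative assumptions.

```python
# Direct SQL execution with guaranteed resource cleanup, via Spark's JVM gateway.
def delete_matching(spark, url, user, password, keyword):
    """Delete rows (e.g. NFL-related entries) whose title contains keyword."""
    DriverManager = spark._jvm.java.sql.DriverManager
    conn = DriverManager.getConnection(url, user, password)
    try:
        ps = conn.prepareStatement("DELETE FROM news.q1 WHERE title LIKE ?")
        try:
            ps.setString(1, "%" + keyword + "%")
            return ps.executeUpdate()  # number of rows removed
        finally:
            ps.close()
    finally:
        conn.close()  # release the JDBC connection even if the delete fails
```

A parameterized PreparedStatement avoids string-splicing the keyword into the SQL, and the nested try/finally guarantees both the statement and the connection are closed on any error path.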

Results

  • Successfully ingested real-time news from multiple Google News RSS feeds into PostgreSQL.
  • Implemented reliable JDBC-based data persistence with Spark DataFrame API.
  • Time-based filtering accurately isolates recent news within configurable time windows.
  • Category-based analytics enable news segmentation across technology, business, and sports domains.
  • Comprehensive documentation with numbered screenshots accompanying each problem's solution.

Limitations

  • RSS feed structure varies across sources; a production deployment would require a more robust parser.
  • Single-node Spark mode used; distributed cluster would improve throughput for high-volume feeds.
  • PostgreSQL credentials stored in code; environment-based configuration recommended for production.

Skills and Technologies Demonstrated

  • Apache Spark SQL and PySpark DataFrames
  • PostgreSQL JDBC integration
  • RSS feed parsing (feedparser, xml.etree.ElementTree)
  • Timestamp parsing and time-based filtering
  • Spark session configuration and context management
  • Real-time data ingestion pipelines
  • SQL operations via Spark DataFrame API

Resources