StreamFlow

Project Motivation & Problem Statement

Real-time data processing is essential in domains like fraud detection, recommendation engines, and IoT monitoring, where insights must be generated within seconds of data arrival. Combining deep learning frameworks like TensorFlow with streaming platforms like Apache Kafka creates powerful architectures for real-time ML inference. StreamFlow explores this intersection: building TensorFlow neural networks for multi-class classification and integrating a Kafka-based streaming pipeline for real-time YouTube analytics, demonstrating the techniques behind modern streaming ML systems.

Technical Approach

1. TensorFlow Neural Network for Intrusion Detection (Q1)

Built a multi-class classifier using TensorFlow/Keras on the NSL-KDD network intrusion dataset.

  • Data Preprocessing: Reused the Spark-based feature-engineering pipeline from SparkDrive, then converted the resulting features to TensorFlow tensors with tf.constant().
  • Model Architecture:
    Dense(64, activation='relu', input_shape=(n_features,))
    Dense(64, activation='relu')
    Dense(32, activation='relu')
    Dense(5, activation='softmax')  # 5 attack categories
  • Training Configuration:
    • Optimizer: Adam(learning_rate=0.001)
    • Loss: SparseCategoricalCrossentropy(from_logits=False)
    • Metrics: SparseCategoricalAccuracy
    • Epochs: 30 with batch_size=64
  • Data Split: Training set from KDDTrain+.txt; validation and test sets from 50/50 random split of KDDTest+.txt.
  • TensorBoard Logging: Configured callbacks for training visualization with timestamped log directories.
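The layer stack and training configuration above can be sketched as a single Keras model builder. This is a minimal sketch, not the project's exact code; names like `build_model` and the placeholder tensors in the commented training call are illustrative:

```python
import tensorflow as tf

def build_model(n_features: int) -> tf.keras.Model:
    """Dense softmax classifier matching the layer stack listed above."""
    model = tf.keras.Sequential([
        # Equivalent to passing input_shape=(n_features,) to the first Dense layer.
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dense(5, activation='softmax'),  # 5 attack categories
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
        metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
    )
    return model

# Training as configured above (X_train, y_train, X_val, y_val assumed to come
# from the KDDTrain+/KDDTest+ split described earlier):
# model.fit(X_train, y_train, epochs=30, batch_size=64,
#           validation_data=(X_val, y_val),
#           callbacks=[tf.keras.callbacks.TensorBoard(log_dir=log_dir)])
```

Using an explicit `tf.keras.Input` keeps the sketch compatible with newer Keras versions, where passing `input_shape` to a layer is deprecated.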

2. Model Evaluation

  • Test Metrics: Evaluated with model.evaluate(), reporting test loss and accuracy.
  • Predictions: Generated class predictions using tf.argmax(y_pred, axis=1) on softmax outputs.
  • Classification Report: Used sklearn.metrics.classification_report for per-class precision, recall, and F1-score.
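The evaluation steps above fit into a short helper. A sketch under the assumption that `model`, `X_test`, and `y_test` come from the training step; the argmax is factored out so it reads independently of the model:

```python
import numpy as np
import tensorflow as tf
from sklearn.metrics import classification_report

def to_class_indices(probs) -> np.ndarray:
    """Collapse softmax outputs to predicted class labels via tf.argmax."""
    return tf.argmax(probs, axis=1).numpy()

def evaluate(model, X_test, y_test):
    """Test loss/accuracy plus a per-class precision/recall/F1 report."""
    loss, acc = model.evaluate(X_test, y_test, verbose=0)
    y_pred = to_class_indices(model.predict(X_test, verbose=0))
    print(classification_report(y_test, y_pred))
    return loss, acc
```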

3. Kafka Streaming for YouTube Analytics (Q2)

Built a real-time streaming pipeline to analyze YouTube video engagement using Kafka.

  • Producer (producer.py):
    • Fetches video metadata using YouTube Data API v3 via googleapiclient.discovery.build().
    • Retrieves comment like counts using commentThreads().list() API endpoint.
    • Publishes video title and total likes to Kafka topic topic_yt.
    • Uses confluent_kafka.Producer with bootstrap.servers='localhost:9092'.
  • Consumer (consumer.py):
    • Subscribes to the Kafka topic and aggregates likes per video in real time.
    • Configured with group.id='analytics' and auto.offset.reset='earliest'.
    • Decodes message keys (video titles) and values (like counts) from UTF-8.
    • Reports the most-liked video after consumer termination.
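The producer/consumer wiring above can be sketched as follows, assuming a local broker at localhost:9092 and the topic name topic_yt from above. The running-total logic is factored into plain functions so it can be read independently of the Kafka client; this is a sketch, not the project's exact producer.py/consumer.py:

```python
from collections import defaultdict

def update_totals(totals: dict, title: str, likes: int) -> dict:
    """Consumer-side aggregation: running like totals per video title."""
    totals[title] += likes
    return totals

def most_liked(totals: dict) -> str:
    """Title of the video with the highest aggregated like count."""
    return max(totals, key=totals.get)

def publish(title: str, likes: int, topic: str = "topic_yt"):
    """Producer side: publish title (key) and like count (value)."""
    from confluent_kafka import Producer  # local import: broker-dependent code
    p = Producer({"bootstrap.servers": "localhost:9092"})
    p.produce(topic, key=title.encode("utf-8"), value=str(likes).encode("utf-8"))
    p.flush()

def run_consumer(topic: str = "topic_yt"):
    """Consumer side: aggregate until interrupted, then report the leader."""
    from confluent_kafka import Consumer  # local import: broker-dependent code
    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "analytics",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe([topic])
    totals = defaultdict(int)
    try:
        while True:
            msg = consumer.poll(1.0)
            if msg is None or msg.error():
                continue
            title = msg.key().decode("utf-8")
            likes = int(msg.value().decode("utf-8"))
            update_totals(totals, title, likes)
            print(f"{title}: {totals[title]} likes so far")
    except KeyboardInterrupt:
        if totals:
            print("Most-liked video:", most_liked(totals))
    finally:
        consumer.close()
```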

4. End-to-End System Integration

  • Input: User provides up to 5 YouTube video IDs via command-line input.
  • Processing: Producer fetches data from the YouTube API and streams it to Kafka; the consumer aggregates totals in real time.
  • Output: Running totals displayed per video; final summary shows highest engagement video.
  • Error Handling: Graceful handling of disabled comments and private videos with informative messages.
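The input cap and error handling described above might look like the following sketch. The helper names (`parse_video_ids`, `fetch_like_counts`) are illustrative, and the `youtube` client is assumed to come from `googleapiclient.discovery.build()` as in the producer section:

```python
MAX_VIDEOS = 5

def parse_video_ids(raw: str, limit: int = MAX_VIDEOS) -> list:
    """Split comma/space-separated user input and cap at the 5-video limit."""
    ids = [v for v in raw.replace(",", " ").split() if v]
    return ids[:limit]

def fetch_like_counts(youtube, video_id: str) -> list:
    """Fetch comment like counts, handling disabled comments gracefully."""
    from googleapiclient.errors import HttpError  # local import: API-dependent
    try:
        resp = youtube.commentThreads().list(
            part="snippet", videoId=video_id, maxResults=100
        ).execute()
    except HttpError as err:
        # Disabled comments and private videos surface as HTTP errors;
        # report them instead of crashing the producer.
        print(f"Skipping {video_id}: YouTube API returned {err.resp.status}")
        return []
    return [
        item["snippet"]["topLevelComment"]["snippet"]["likeCount"]
        for item in resp.get("items", [])
    ]
```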

Implementation Details

  • GCP Credentials: Service account JSON for YouTube API authentication.
  • Kafka Broker: Local Kafka instance at localhost:9092.
  • Spark-TensorFlow Bridge: Converted Spark DataFrames to pandas, then to TensorFlow tensors for model training.
  • Feature Conversion: Used custom UDF to convert Spark ML vectors to arrays for TensorFlow compatibility.
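The Spark-to-TensorFlow bridge and the vector-conversion UDF can be sketched together. A sketch assuming a Spark DataFrame with a Spark ML vector column named "features" and a numeric "label" column (both names illustrative):

```python
def vector_to_list(v):
    """Pure conversion used by the UDF: Spark ML vector -> Python list."""
    return v.toArray().tolist()

def spark_to_tensors(spark_df, feature_col="features", label_col="label"):
    """Spark DataFrame -> pandas -> TensorFlow tensors, as described above."""
    from pyspark.sql.functions import udf          # local import: Spark-dependent
    from pyspark.sql.types import ArrayType, DoubleType
    import tensorflow as tf

    as_array = udf(vector_to_list, ArrayType(DoubleType()))
    pdf = (spark_df
           .withColumn("f", as_array(feature_col))
           .select("f", label_col)
           .toPandas())
    X = tf.constant(pdf["f"].tolist(), dtype=tf.float32)
    y = tf.constant(pdf[label_col].tolist(), dtype=tf.int32)
    return X, y
```

Collecting to pandas is fine at NSL-KDD scale but would not suit datasets that exceed driver memory.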

Results

  • TensorFlow model successfully classifies network intrusion types across 5 categories.
  • Kafka pipeline demonstrates real-time message passing with producer/consumer pattern.
  • YouTube engagement analytics aggregated across multiple videos in streaming fashion.
  • Comprehensive documentation with numbered screenshots (1.2.png, 1.3.png, 2.2.png, 2.3.png).

Limitations

  • Kafka runs in standalone mode; production requires multi-broker cluster for fault tolerance.
  • YouTube API has quota limits; high-volume usage requires quota management.
  • Model trained on NSL-KDD; real network data may have different feature distributions.
  • Consumer requires manual termination; production systems need graceful shutdown mechanisms.

Skills and Technologies Demonstrated

  • TensorFlow/Keras neural network design
  • Multi-class classification with softmax output
  • Apache Kafka producer/consumer architecture
  • confluent-kafka Python client
  • YouTube Data API v3 integration
  • Real-time streaming data pipelines
  • TensorBoard training visualization
  • Spark-to-TensorFlow data conversion

Resources