VectorLens

Project Motivation & Problem Statement

As LLMs become central to intelligent applications, the ability to augment them with external knowledge through vector stores, and to handle multimodal inputs (text, images, audio), becomes critical. VectorLens explores advanced LLM application patterns: building Retrieval-Augmented Generation (RAG) systems powered by vector databases, and extending LLM capabilities to process multimodal inputs. These are the techniques that underpin modern AI products such as knowledge assistants, visual Q&A systems, and multimedia search engines.

Technical Approach

1. Lecture-Based RAG System (Q1)

Built a Streamlit-based Q&A application that ingests lecture PDFs and enables semantic search over academic content.

  • PDF Ingestion Pipeline: Used PyPDF2.PdfReader to extract and normalize text from 6 lecture PDF files, converting to lowercase and removing newlines for consistent processing.
  • Document Chunking: Implemented RecursiveCharacterTextSplitter with chunk_size=1000 and chunk_overlap=200 to create semantically meaningful segments with context preservation at boundaries.
  • Vector Store: Integrated ChromaDB with persistent storage (./lecture_chroma_db) and lecture-number metadata tagging for filtered retrieval.
  • Embedding Model: Used Google Vertex AI's text-embedding-004 model for generating dense vector representations of document chunks.
  • Dual Retrieval: Implemented both MMR (Maximal Marginal Relevance) with lambda_param=0.9 for diversity-aware retrieval and standard Similarity Search, with side-by-side comparison in the UI.
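The MMR retrieval above trades query relevance against redundancy among already-selected chunks. A minimal pure-Python sketch of the greedy scoring loop (the function names and toy vectors are illustrative, not the project's actual code; `lambda_param` mirrors the 0.9 setting above):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, doc_vecs, k=2, lambda_param=0.9):
    """Greedy Maximal Marginal Relevance: each round picks the candidate
    maximizing lambda * relevance - (1 - lambda) * redundancy, where
    redundancy is the max similarity to any already-selected document."""
    selected, candidates = [], list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_param * relevance - (1 - lambda_param) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy corpus: docs 0 and 1 are near-duplicates, doc 2 is orthogonal.
docs = [[1.0, 0.0], [0.98, 0.2], [0.0, 1.0]]
print(mmr_select([1.0, 0.0], docs, k=2, lambda_param=0.9))  # [0, 1] relevance-heavy
print(mmr_select([1.0, 0.0], docs, k=2, lambda_param=0.1))  # [0, 2] diversity-heavy
```

With a high lambda (as in the project's 0.9 setting) MMR behaves close to pure similarity search; lowering it penalizes near-duplicate chunks, which is the effect compared side-by-side in the UI.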

2. Multilingual Syllabus Analyzer (Q2)

Extended the RAG pipeline with Google Cloud Translation for cross-language document understanding.

  • Translation Integration: Integrated Google Cloud Translation API v2 for real-time document translation across 4 languages: English, Spanish, French, and German.
  • Chunk-Level Translation: Processed translations per chunk to handle API limits and maintain document structure across language conversion.
  • Embedding Generation: Generated embeddings for translated chunks enabling cross-lingual semantic search capabilities.
  • Streamlit UI: Built interactive interface with language selection dropdown and real-time translation display with truncated previews.
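The per-chunk translation step can be sketched as a small helper that maps a translation call over each chunk, keeping every request under the API's character limit. The helper name, limit value, and pluggable `translate_fn` are illustrative; in the real pipeline the call would wrap `google.cloud.translate_v2.Client().translate(...)`:

```python
def translate_chunks(chunks, translate_fn, target_lang, max_chars=5000):
    """Translate each chunk independently so no single request exceeds
    the per-request character limit, preserving chunk order."""
    translated = []
    for chunk in chunks:
        if len(chunk) > max_chars:
            raise ValueError("chunk exceeds per-request limit; re-split first")
        translated.append(translate_fn(chunk, target_lang))
    return translated

# Stand-in translate_fn to show the flow; the real one would return
# translate_v2.Client().translate(text, target_language=lang)["translatedText"].
fake_translate = lambda text, lang: f"[{lang}] {text}"
print(translate_chunks(["hello world", "course syllabus"], fake_translate, "es"))
# → ['[es] hello world', '[es] course syllabus']
```

Translating at chunk granularity also means each translated chunk maps one-to-one onto an embedding, which is what makes the cross-lingual semantic search above possible.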

3. Multimodal Vision & Audio Chatbot (Q3)

Built a containerized multimodal chatbot supporting image and audio inputs alongside text queries.

  • Vision Model: Integrated Gemini 1.0 Pro Vision via LangChain's ChatVertexAI with temperature=0.7 and max_output_tokens=1000 for image understanding.
  • Image Processing: Uploaded images to Google Cloud Storage bucket and generated signed URLs (15-min expiration) for secure model access.
  • Speech-to-Text: Used Google Cloud Speech-to-Text API with MP3 encoding at 44.1kHz sample rate for audio transcription.
  • Combined Inference: Built a context-aware pipeline that combines uploaded images with voice queries: audio transcriptions are used to query the currently loaded image for multimodal Q&A.
  • Docker Deployment: Containerized the application with Dockerfile for consistent deployment across environments.
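The combined-inference step pairs the signed image URL with the transcribed voice question in one multimodal message. A sketch of that assembly (the dict layout follows the common LangChain multimodal content convention of mixed `text` and `image_url` parts; the helper name, default instruction, and URL are illustrative):

```python
def build_multimodal_query(signed_image_url, transcript,
                           instruction="Answer the question about the image."):
    """Combine the currently loaded image (via its signed GCS URL) and the
    transcribed audio question into a single multimodal message payload."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": f"{instruction}\n\nQuestion: {transcript}"},
            {"type": "image_url", "image_url": {"url": signed_image_url}},
        ],
    }

msg = build_multimodal_query(
    "https://storage.googleapis.com/demo-bucket/photo.png",  # illustrative signed URL
    "What breed is this dog?",
)
print(msg["content"][0]["text"])
```

Because the URL is signed with a short expiration, the model can fetch the image without the bucket being public, while the transcript supplies the question text.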

Implementation Details

  • GCP Project: Deployed on Google Cloud Platform with Vertex AI initialization for us-central1 region.
  • Authentication: Service account credentials via JSON key file for secure API access.
  • Storage: GCS bucket for image/audio uploads with automatic signed URL generation.
  • UI Framework: Streamlit for all three applications with session state management for cross-modal context sharing.
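The cross-modal context sharing works by stashing the last uploaded image in session state so later voice queries can reference it. A minimal sketch of that pattern, with a plain dict standing in for Streamlit's `st.session_state` (the key and function names are illustrative, not the project's actual code):

```python
# A plain dict stands in for st.session_state in this sketch.
session_state = {}

def remember_image(state, signed_url):
    """Record the most recently uploaded image for later voice queries."""
    state["current_image_url"] = signed_url

def answer_voice_query(state, transcript):
    """Route a transcribed voice question at the currently loaded image."""
    image = state.get("current_image_url")
    if image is None:
        return "Please upload an image first."
    return f"Querying {image} with: {transcript}"

remember_image(session_state, "https://storage.googleapis.com/demo-bucket/photo.jpg")
print(answer_voice_query(session_state, "What is in this picture?"))
```

Keeping this context in session state is what lets an audio question reuse an image uploaded on an earlier interaction within the same Streamlit session.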

Results

  • RAG system successfully retrieves contextually relevant lecture content with MMR providing more diverse results than pure similarity search.
  • Translation pipeline handles full syllabus documents with per-chunk processing for API compliance.
  • Multimodal chatbot accurately describes image content and answers voice queries about uploaded images.
  • All three applications deployed with clean Streamlit interfaces and modular Python architecture.

Limitations

  • ChromaDB persistence requires local storage; cloud-hosted vector stores would improve scalability.
  • Translation quality depends on Google Translate; domain-specific terminology may require post-processing.
  • Gemini Vision model has input size limits; large images are resized before processing.
  • Audio transcription accuracy varies with recording quality and background noise.

Skills and Technologies Demonstrated

  • Google Vertex AI (text-embedding-004, Gemini 1.0 Pro Vision)
  • ChromaDB vector store with persistent storage
  • LangChain for RAG pipeline composition
  • Google Cloud APIs (Speech-to-Text, Translation, Storage)
  • Streamlit application development
  • Docker containerization for ML applications
  • MMR and similarity-based retrieval strategies
  • PDF processing with PyPDF2

Resources