CSE - Computer Science & Engineering

Course Code: CS5393
Course Name: GenAI for Computer Vision
Syllabus: This course explores how generative AI is applied to modern computer vision tasks. Unlike existing domain elective courses, it specifically emphasizes vision-based generative AI models. It begins with mathematical foundations and classical vision techniques, followed by deep learning architectures. The course then introduces generative learning paradigms, including GANs, VAEs, diffusion models, and transformers, along with evaluation metrics and training challenges such as mode collapse and diffusion noise scheduling. It also covers large language models for vision applications, such as GPT-4V, LLaMA, PaLM-E, and Flamingo. The course focuses primarily on deep generative learning for computer vision tasks such as image captioning, visual question answering (VQA), and scene understanding. It further discusses multimodal generative models and agentic AI systems for automatic image synthesis and reasoning.

Content:
Mathematical Preliminaries: Statistics, basic optimization. Computer Vision Basics: Image representation, filtering, edge detection, transformations; Interest point detection, Scale-Invariant Feature Transform (SIFT), Histogram of Oriented Gradients (HoG). Deep Learning Fundamentals: Convolutional Neural Network (CNN), basic architecture; Popular CNN architectures: Visual Geometry Group (VGG), GoogLeNet, Inception Net, ResNet. Sequential Modelling: Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), context vector and attention. Introduction to Generative AI: Generative and discriminative learning; Generative Models: Auto-regressive models, Generative Adversarial Network, Variational Auto-Encoder, diffusion models, Stable Diffusion, transformers, evaluation and training of generative models. Large Language Models for Vision: Vision Transformer, Generative Pre-trained Transformer (GPT)-4V, Flamingo, LLaVA, PaLM-E; Vision Applications: Image captioning, Visual Question Answering (VQA), scene understanding. Multimodal Generative Models: Multimodal fusion, multimodal transformers, Contrastive Language-Image Pre-training (CLIP). Agentic AI: AI-agent-based generative workflows for computer vision tasks.
Texts: 1. Jay Alammar & Maarten Grootendorst (2024). Hands-On Large Language Models: Language Understanding and Generation. O'Reilly Media. ISBN: 1098150961
2. Rajalingappaa Shanmugamani (2018). Deep Learning for Computer Vision. Packt Publishing Limited. ISBN: 1788295625
