GPU-Accelerated Computing for AI and Computer Vision
Course at a Glance
- Instructor: Fabio Tosi — Junior Assistant Professor (RTDA), DISI, University of Bologna
- Total hours: 14 hours (4 lectures × 3.5 hours)
- Format: Blended — in person + Microsoft Teams streaming
- Venue: Viale Risorgimento 2, Bologna — Room 1.4
- Dates: 3, 6, 8, 10 July 2026
- PhD Cycle: 41st
Overview
This PhD course provides a hands-on, end-to-end understanding of how modern AI and computer vision workloads execute on NVIDIA GPUs. It bridges the gap between low-level CUDA programming and high-level deep learning frameworks (PyTorch), with a strong emphasis on profiling and systematic optimization.
The course is aimed at PhD students in Computer Science, Engineering, and related disciplines who want to understand and optimize the GPU performance of their research code.
Schedule
All lectures will be held at Viale Risorgimento 2, Bologna — Room 1.4, and streamed live on Microsoft Teams:
| Date | Time | Location |
|---|---|---|
| Friday, 3 July 2026 | 9:00 – 12:30 | Room 1.4 + Teams |
| Monday, 6 July 2026 | 9:00 – 12:30 | Room 1.4 + Teams |
| Wednesday, 8 July 2026 | 9:00 – 12:30 | Room 1.4 + Teams |
| Friday, 10 July 2026 | 9:00 – 12:30 | Room 1.4 + Teams |
The Microsoft Teams link will be shared with registered students before the first lecture.
Prerequisites
The course assumes:
- C programming language. Students will read (and write) CUDA kernels, which are written in C/C++. Familiarity with pointers, memory allocation, and basic C syntax is required.
- Python programming language. Students will read (and modify) PyTorch code throughout the course.
- Computer architecture fundamentals (recommended): notions of memory hierarchy, caches, and pipelining at the level of an introductory undergraduate course will help in understanding how a GPU executes work.
- Linear algebra basics (vectors, matrices, matrix multiplication) at the level of a first-year university course.
The course does not require:
- Prior experience with CUDA, GPU programming, or parallel computing.
- Prior experience with PyTorch, TensorFlow, or any deep learning framework.
- Knowledge of computer vision, neural networks, or deep learning theory.
Note: students who have already seen neural networks (what a layer is, what training and inference are) may find the PyTorch examples more immediately concrete, but this familiarity is not required.
Concepts from CUDA programming and the basics of how a GPU executes work will be introduced from scratch. PyTorch will be presented as a “kernel orchestrator” sitting on top of CUDA libraries — no prior PyTorch exposure is assumed.
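To give a flavor of the CUDA material introduced from scratch, the sketch below emulates in plain Python how a 1D kernel launch maps each thread to a global element index, using the standard formula `blockIdx.x * blockDim.x + threadIdx.x`. It is a sequential emulation for illustration only, not CUDA code and not taken from the course material; the function names `global_indices` and `vector_add` are hypothetical.

```python
from typing import Iterator, List


def global_indices(grid_dim: int, block_dim: int) -> Iterator[int]:
    """Yield the global index of every thread in a 1D launch.

    Mirrors the CUDA expression blockIdx.x * blockDim.x + threadIdx.x,
    visiting threads one after another instead of in parallel.
    """
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            yield block_idx * block_dim + thread_idx


def vector_add(a: List[float], b: List[float], block_dim: int = 256) -> List[float]:
    """Emulate a 1D vector-addition kernel launch over len(a) elements."""
    n = len(a)
    # Round the number of blocks up so that every element is covered.
    grid_dim = (n + block_dim - 1) // block_dim
    c = [0.0] * n
    for i in global_indices(grid_dim, block_dim):
        # Bounds guard: the last block may contain threads past the end.
        if i < n:
            c[i] = a[i] + b[i]
    return c


if __name__ == "__main__":
    print(vector_add([1.0, 2.0, 3.0], [10.0, 20.0, 30.0], block_dim=2))
```

On a GPU the loop iterations would run concurrently as threads. The `i < n` guard is needed because the grid is rounded up to a whole number of blocks, so some threads in the last block may have no element to process.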
Topics Covered
The course is organized in four 3.5-hour lectures, covering the following topics:
- CUDA Programming Model: Threads, blocks, grids; index computation in 1D, 2D, 3D. Writing CUDA kernels: vector addition, matrix operations. Practical examples in computer vision: image rotation, flipping, 2D convolution.
- CUDA Execution Model: Streaming Multiprocessors (SM), warps, SIMT execution. Tensor cores: low-precision matrix-multiply units. Occupancy and resource utilization.
- CUDA Memory Model: Memory hierarchy: global, shared, registers, L1/L2 caches, constant memory. Memory coalescing and alignment. Pinned memory, Unified Virtual Addressing (UVA), Unified Memory.
- CUDA in Python: PyCUDA: writing and launching CUDA kernels from Python. PyTorch tensors and the mapping between PyTorch operations and CUDA libraries (cuBLAS, cuDNN, ATen). The training loop as a sequence of CUDA kernel launches.
- Profiling and Optimization: Profiling PyTorch with torch.profiler, Nsight Systems, Nsight Compute. Reading kernel timelines and identifying bottlenecks. Theoretical and measured FLOPs; the roofline model applied to PyTorch. Mixed precision (FP16, BF16, TF32) and tensor cores. Graph capture and kernel fusion with torch.compile.
- Applications: Examples and case studies from real computer vision research code. Integrating custom CUDA kernels into PyTorch.
Learning Outcomes
By the end of the course, students will be able to:
- Read and understand CUDA kernel code, and explain how a GPU executes it.
- Write basic CUDA kernels for parallel data processing tasks.
- Profile a PyTorch model with torch.profiler and interpret the results.
- Identify whether a kernel is compute-bound or memory-bound and choose the appropriate optimization.
- Apply mixed precision and torch.compile to accelerate inference and training, and measure the resulting speedup.
- Integrate a custom CUDA kernel into a PyTorch model when no built-in operator is sufficient.
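As a rough illustration of the compute-bound versus memory-bound distinction named in the outcomes above, the following sketch applies the roofline model in plain Python. The `Device` class, the function names, and the peak figures (20 TFLOP/s, 1 TB/s) are hypothetical placeholders, not course material and not the specification of any real GPU. The byte counts assume ideal traffic in which each FP32 operand is read or written exactly once.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Device:
    peak_flops: float      # peak compute throughput, FLOP/s
    peak_bandwidth: float  # peak memory bandwidth, bytes/s

    @property
    def ridge_point(self) -> float:
        """Arithmetic intensity (FLOP/byte) where the two roofs meet."""
        return self.peak_flops / self.peak_bandwidth


def attainable_flops(dev: Device, flops: float, bytes_moved: float) -> float:
    """Roofline bound: min(peak compute, intensity * peak bandwidth)."""
    intensity = flops / bytes_moved
    return min(dev.peak_flops, intensity * dev.peak_bandwidth)


def classify(dev: Device, flops: float, bytes_moved: float) -> str:
    """Label a kernel by comparing its intensity with the ridge point."""
    intensity = flops / bytes_moved
    return "compute-bound" if intensity >= dev.ridge_point else "memory-bound"


def vector_add_cost(n: int) -> Tuple[float, float]:
    """FP32 c = a + b: n additions; two reads and one write of 4 bytes each."""
    return float(n), float(3 * 4 * n)


def matmul_cost(n: int) -> Tuple[float, float]:
    """FP32 n x n matmul: 2n^3 FLOPs; three n x n matrices touched once."""
    return float(2 * n**3), float(3 * 4 * n**2)


# Hypothetical device, chosen only to make the arithmetic easy to follow.
gpu = Device(peak_flops=20e12, peak_bandwidth=1e12)  # ridge point: 20 FLOP/byte

if __name__ == "__main__":
    print(classify(gpu, *vector_add_cost(1 << 20)))  # memory-bound
    print(classify(gpu, *matmul_cost(4096)))         # compute-bound
```

Under these assumptions, vector addition has an intensity of 1/12 FLOP/byte, so faster arithmetic (for example, tensor cores) would not help it, while reducing memory traffic would. Measured intensities from a profiler usually differ from these ideal counts because of caching and redundant loads.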
Final Verification
The course concludes with a short hands-on verification, designed to confirm that the main concepts can be applied in practice. The exact format is currently being defined and will be announced before the first lecture.
Materials
Slides and code examples will be distributed to enrolled students before each lecture. All examples will be reproducible on a standard NVIDIA GPU (Ampere or newer recommended).
Registration
Registration is required, both for organizational purposes and to receive the Microsoft Teams link for remote attendance.
→ Open the Registration Form
Contact
For questions about the course, please contact: fabio.tosi5@unibo.it
