GPU-Accelerated Computing for AI and Computer Vision
Course at a Glance
- Instructor: Fabio Tosi — Junior Assistant Professor (RTDA), DISI, University of Bologna
- Total hours: 14 hours (4 lectures × 3.5 hours)
- Format: Blended — in person + Microsoft Teams streaming
- Venue: Viale Risorgimento 2, Bologna (room varies by date, see schedule below)
- Dates: 3, 6, 8, 10 July 2026
- Teams link (all sessions): Join Microsoft Teams Meeting
- PhD Cycle: 41st
Overview
This PhD course provides a hands-on, end-to-end understanding of how modern AI and computer vision workloads execute on NVIDIA GPUs. It bridges the gap between low-level CUDA programming and high-level deep learning frameworks (PyTorch), with a strong emphasis on profiling and systematic optimization.
The course is aimed at PhD students in Computer Science, Engineering, and related disciplines who want to understand and optimize the GPU performance of their research code.
Schedule
All lectures will be held at Viale Risorgimento 2, Bologna (room varies by date — see table below), and streamed live on Microsoft Teams using the same link for all four sessions: Join Microsoft Teams Meeting
Update (1 July 2026): room assignments have been revised due to lack of air conditioning in the originally assigned room. The schedule below reflects the current, correct rooms.
| Date | Time | Location |
|---|---|---|
| Friday, 3 July 2026 | 9:00 – 12:30 | Room 5.5 + Teams |
| Monday, 6 July 2026 | 9:00 – 12:30 | Room 5.2 + Teams |
| Wednesday, 8 July 2026 | 9:00 – 12:30 | Room 1.5 + Teams |
| Friday, 10 July 2026 | 9:00 – 12:30 | Room 0.8 + Teams |
Prerequisites
The course assumes:
- C programming language. Students will read (and write) CUDA kernels, which are written in C/C++. Familiarity with pointers, memory allocation, and basic C syntax is required.
- Python programming language. Students will read (and modify) PyTorch code throughout the course.
- Computer architecture fundamentals (recommended): notions of memory hierarchy, caches, and pipelining at the level of an introductory undergraduate course will help in understanding how a GPU executes work.
- Linear algebra basics (vectors, matrices, matrix multiplication) at the level of a first-year university course.
The course does not require:
- Prior experience with CUDA, GPU programming, or parallel computing.
- Prior experience with PyTorch, TensorFlow, or any deep learning framework.
- Knowledge of computer vision, neural networks, or deep learning theory.
Note: basic familiarity with neural networks (what a layer is, what training and inference are) is not required, but students who have seen them before may find the PyTorch examples more immediately concrete.
Concepts from CUDA programming and the basics of how a GPU executes work will be introduced from scratch. PyTorch will be presented as a “kernel orchestrator” sitting on top of CUDA libraries — no prior PyTorch exposure is assumed.
Topics Covered
The course is organized in four 3.5-hour lectures, covering the following topics:
CUDA Programming Model
Threads, blocks, grids; index computation in 1D, 2D, 3D. Writing CUDA kernels: vector addition, matrix operations. Practical examples in computer vision: image rotation, flipping, 2D convolution.
CUDA Execution Model
Streaming Multiprocessors (SM), warps, SIMT execution. Tensor cores: low-precision matrix-multiply units. Occupancy and resource utilization.
CUDA Memory Model
Memory hierarchy: global, shared, registers, L1/L2 caches, constant memory. Memory coalescing and alignment. Pinned memory, Unified Virtual Addressing (UVA), Unified Memory.
CUDA in Python
PyCUDA: writing and launching CUDA kernels from Python. PyTorch tensors and the mapping between PyTorch operations and CUDA libraries (cuBLAS, cuDNN, ATen). The training loop as a sequence of CUDA kernel launches.
Profiling and Optimization
Profiling PyTorch with torch.profiler, Nsight Systems, Nsight Compute. Reading kernel timelines and identifying bottlenecks. Theoretical and measured FLOPs; the roofline model applied to PyTorch. Mixed precision (FP16, BF16, TF32) and tensor cores. Graph capture and kernel fusion with torch.compile.
Applications
Examples and case studies from real computer vision research code. Integrating custom CUDA kernels into PyTorch.
Learning Outcomes
By the end of the course, students will be able to:
- Read and understand CUDA kernel code, and explain how a GPU executes it.
- Write basic CUDA kernels for parallel data processing tasks.
- Profile a PyTorch model with torch.profiler and interpret the results.
- Identify whether a kernel is compute-bound or memory-bound and choose the appropriate optimization.
- Apply mixed precision and torch.compile to accelerate inference and training, and measure the resulting speedup.
- Integrate a custom CUDA kernel into a PyTorch model when no built-in operator is sufficient.
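As a rough illustration of the compute-bound versus memory-bound distinction in the outcomes above, the following sketch applies the roofline model in plain Python. It is not course material: the function name classify_kernel is illustrative, the peak throughput and bandwidth figures are hypothetical placeholders rather than the specification of any real GPU, and the byte counts assume ideal data reuse (each operand is moved to or from memory exactly once).

```python
def classify_kernel(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Classify a kernel with the roofline model.

    flops          -- floating-point operations the kernel performs
    bytes_moved    -- bytes read from and written to global memory
    peak_flops     -- peak compute throughput in FLOP/s (assumed value)
    peak_bandwidth -- peak memory bandwidth in bytes/s (assumed value)
    """
    # Arithmetic intensity: how much work is done per byte of traffic.
    intensity = flops / bytes_moved
    # Ridge point: the intensity at which the two limits meet.
    ridge = peak_flops / peak_bandwidth
    # Attainable throughput is capped by whichever limit is lower.
    attainable = min(peak_flops, intensity * peak_bandwidth)
    bound = "compute-bound" if intensity >= ridge else "memory-bound"
    return intensity, ridge, attainable, bound


# Hypothetical device: 20 TFLOP/s FP32 and 1 TB/s memory bandwidth.
PEAK_FLOPS = 20e12
PEAK_BANDWIDTH = 1e12

# FP32 vector addition of n elements: n additions,
# two reads and one write of 4 bytes each per element.
n = 1 << 20
vector_add = classify_kernel(n, 3 * 4 * n, PEAK_FLOPS, PEAK_BANDWIDTH)

# FP32 matrix multiplication of two m x m matrices: 2 * m^3 operations,
# three m x m matrices moved once each (ideal reuse assumption).
m = 4096
matmul = classify_kernel(2 * m**3, 3 * 4 * m**2, PEAK_FLOPS, PEAK_BANDWIDTH)

for name, result in (("vector_add", vector_add), ("matmul", matmul)):
    intensity, ridge, attainable, bound = result
    print(f"{name}: {intensity:.3f} FLOP/byte (ridge {ridge:.1f}) -> {bound}")
```

Under these assumptions the vector addition sits far below the ridge point, so reducing memory traffic would matter more than faster arithmetic, while the matrix multiplication sits above it. Measured byte counts from a profiler will generally be higher than the ideal ones used here.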
Final Verification
A Hands-On Optimization Project
The course concludes with a small hands-on optimization project. The objective is to apply the complete optimization workflow presented throughout the course to a parallel workload of your choice.
This can be a neural network, an image-processing pipeline, a handwritten CUDA C kernel, or—ideally—a workload from your own research. You will profile it, identify the performance bottleneck, implement an optimization, verify correctness, and submit a short report describing the process and the performance improvement achieved.
The Workflow
- Measure – Run the baseline implementation with a profiler.
- Diagnose – Determine whether the workload is compute- or memory-bound, and justify your conclusion with quantitative evidence.
- Optimize – Apply the optimization suggested by your analysis.
- Verify – Confirm that the optimized implementation still produces correct results.
- Measure Again – Profile the optimized version and report the actual speedup.
The goal is not simply to optimize a single program, but to demonstrate that you can apply a systematic performance-engineering workflow to any GPU workload—including your own research code.
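The measure, verify, and measure-again steps of the workflow can be sketched as a small timing harness. This is a minimal CPU-only illustration, not the project template: the helper names time_it and run_workflow and the toy sum-of-squares workload are hypothetical, and the diagnose and optimize steps are assumed to have happened between the two implementations.

```python
import math
import statistics
import time


def time_it(fn, arg, warmup=2, repeats=5):
    """Return the median wall-clock time of fn(arg) in seconds."""
    for _ in range(warmup):          # warm-up runs are not timed
        fn(arg)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def run_workflow(baseline_fn, optimized_fn, data, rel_tol=1e-9):
    """Measure, verify, measure again; return the observed speedup."""
    t_base = time_it(baseline_fn, data)                  # Measure
    expected = baseline_fn(data)
    actual = optimized_fn(data)
    # Verify with a tolerance: reordering floating-point operations
    # can change the last bits of the result.
    if not math.isclose(expected, actual, rel_tol=rel_tol):
        raise ValueError("optimized result differs from baseline")
    t_opt = time_it(optimized_fn, data)                  # Measure again
    return t_base / t_opt


def baseline(values):
    """Toy workload: sum of squares with an explicit loop."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def optimized(values):
    """Same computation expressed with the built-in sum."""
    return sum(v * v for v in values)


data = [float(i) for i in range(10_000)]
speedup = run_workflow(baseline, optimized, data)
print(f"observed speedup: {speedup:.2f}x")
```

On a GPU, kernel launches are asynchronous, so a real measurement would need to synchronize with the device (for example with torch.cuda.synchronize()) before reading the clock, or rely on a profiler instead of wall-clock timing. The observed speedup of the toy workload may well be close to 1; the point is the structure of the check, not the number.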
Materials
Slides and code examples will be distributed to enrolled students after each lecture. All examples will be reproducible on a standard NVIDIA GPU.
Registration
Registration is required — for organizational purposes and to receive the Microsoft Teams link for remote attendance.
→ Open the Registration Form
Contact
For questions about the course, please contact: fabio.tosi5@unibo.it
