GPU-Accelerated Computing for AI and Computer Vision

Course at a Glance

  • Instructor: Fabio Tosi — Junior Assistant Professor (RTDA), DISI, University of Bologna
  • Total hours: 14 (4 lectures × 3.5 hours)
  • Format: Blended — in person + Microsoft Teams streaming
  • Venue: Viale Risorgimento 2, Bologna (room varies by date, see schedule below)
  • Dates: 3, 6, 8, 10 July 2026
  • Teams link (all sessions): Join Microsoft Teams Meeting
  • PhD Cycle: 41st

Overview

This PhD course provides a hands-on, end-to-end understanding of how modern AI and computer vision workloads execute on NVIDIA GPUs. It bridges the gap between low-level CUDA programming and high-level deep learning frameworks (PyTorch), with a strong emphasis on profiling and systematic optimization.

The course is aimed at PhD students in Computer Science, Engineering, and related disciplines who want to understand and optimize the GPU performance of their research code.

Schedule

All lectures will be held at Viale Risorgimento 2, Bologna (room varies by date — see table below), and streamed live on Microsoft Teams using the same link for all four sessions: Join Microsoft Teams Meeting

Update (1 July 2026): room assignments have been revised because the originally assigned room lacks air conditioning. The schedule below shows the current rooms.

Date                   | Time         | Location
Friday, 3 July 2026    | 9:00 – 12:30 | Room 5.5 + Teams
Monday, 6 July 2026    | 9:00 – 12:30 | Room 5.2 + Teams
Wednesday, 8 July 2026 | 9:00 – 12:30 | Room 1.5 + Teams
Friday, 10 July 2026   | 9:00 – 12:30 | Room 0.8 + Teams

Prerequisites

The course assumes:

  • C programming language. Students will read (and write) CUDA kernels, which are written in C/C++. Familiarity with pointers, memory allocation, and basic C syntax is required.
  • Python programming language. Students will read (and modify) PyTorch code throughout the course.
  • Computer architecture fundamentals (recommended): notions of memory hierarchy, caches, and pipelining at the level of an introductory undergraduate course will help in understanding how a GPU executes work.
  • Linear algebra basics (vectors, matrices, matrix multiplication) at the level of a first-year university course.

The course does not require:

  • Prior experience with CUDA, GPU programming, or parallel computing.
  • Prior experience with PyTorch, TensorFlow, or any deep learning framework.
  • Knowledge of computer vision, neural networks, or deep learning theory.

Note: basic familiarity with neural networks (what a layer is, what training and inference are) is not required, although students who have seen them before may find the PyTorch examples easier to follow.

Concepts from CUDA programming and the basics of how a GPU executes work will be introduced from scratch. PyTorch will be presented as a “kernel orchestrator” sitting on top of CUDA libraries — no prior PyTorch exposure is assumed.

Topics Covered

The course is organized into four 3.5-hour lectures that together cover the following topics:

CUDA Programming Model

Threads, blocks, grids; index computation in 1D, 2D, 3D. Writing CUDA kernels: vector addition, matrix operations. Practical examples in computer vision: image rotation, flipping, 2D convolution.
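
To make the index computation above concrete, here is a minimal sketch in plain Python (not CUDA) that emulates how a thread derives its global index from its block and thread coordinates. The function names are hypothetical and chosen only for illustration; the arithmetic follows the usual blockIdx * blockDim + threadIdx convention.

```python
def global_index_1d(block_idx, block_dim, thread_idx):
    """Global index in a 1D launch: blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx


def global_index_2d(block_idx, block_dim, thread_idx, width):
    """Row-major offset of a thread in a 2D launch over an image `width` pixels wide.

    block_idx, block_dim and thread_idx are (x, y) tuples, mirroring CUDA's dim3 fields.
    """
    col = block_idx[0] * block_dim[0] + thread_idx[0]
    row = block_idx[1] * block_dim[1] + thread_idx[1]
    return row * width + col


def blocks_needed(n, block_dim):
    """Ceiling division: number of blocks required to cover n elements."""
    return (n + block_dim - 1) // block_dim
```

Because the grid size is rounded up, the last block usually contains threads whose index falls past the end of the data, which is why kernels typically guard their body with a bounds check on the global index.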

CUDA Execution Model

Streaming Multiprocessors (SM), warps, SIMT execution. Tensor cores: low-precision matrix-multiply units. Occupancy and resource utilization.
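
As a rough illustration of occupancy, the sketch below estimates theoretical occupancy as the ratio of resident warps to the maximum number of warps an SM can hold. It is a simplification under stated assumptions: the per-SM limits used as defaults are illustrative values, not the specification of any particular GPU, and the calculation ignores register and shared-memory allocation granularity. The function name is hypothetical.

```python
def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                          max_warps_per_sm=64, regs_per_sm=65536,
                          smem_per_sm=49152, max_blocks_per_sm=32, warp_size=32):
    """Fraction of an SM's warp slots that can be occupied (simplified model).

    All per-SM limits are illustrative defaults (assumptions), not real device data.
    """
    warps_per_block = (threads_per_block + warp_size - 1) // warp_size

    # How many blocks fit on one SM under each resource limit.
    limit_warps = max_warps_per_sm // warps_per_block
    regs_per_block = regs_per_thread * warps_per_block * warp_size
    limit_regs = regs_per_sm // regs_per_block if regs_per_block else max_blocks_per_sm
    limit_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks_per_sm

    # The tightest limit decides how many blocks are resident.
    blocks = min(limit_warps, limit_regs, limit_smem, max_blocks_per_sm)
    return blocks * warps_per_block / max_warps_per_sm
```

With these illustrative limits, doubling the registers per thread from 32 to 64 for a 256-thread block halves the estimate, which shows how a single resource can cap the number of resident warps. For real kernels, the figures reported by a profiler should be preferred over this model.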

CUDA Memory Model

Memory hierarchy: global, shared, registers, L1/L2 caches, constant memory. Memory coalescing and alignment. Pinned memory, Unified Virtual Addressing (UVA), Unified Memory.
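
Memory coalescing can be illustrated with a small counting exercise. The sketch below, in plain Python, counts how many memory sectors a single warp touches when lane i reads element base + i * stride. It assumes 32 threads per warp, 4-byte elements, and a 32-byte sector granularity; these values and the function names are assumptions made for illustration.

```python
def sectors_touched(stride, base=0, elem_size=4, warp_size=32, sector_bytes=32):
    """Number of distinct sectors accessed by one warp (assumed 32-byte sectors)."""
    addresses = [(base + lane * stride) * elem_size for lane in range(warp_size)]
    return len({addr // sector_bytes for addr in addresses})


def load_efficiency(stride, base=0, elem_size=4, warp_size=32, sector_bytes=32):
    """Useful bytes requested divided by bytes actually transferred."""
    useful = warp_size * elem_size
    moved = sectors_touched(stride, base, elem_size, warp_size, sector_bytes) * sector_bytes
    return useful / moved
```

Under these assumptions, a unit-stride aligned access touches 4 sectors and every transferred byte is used, while a stride of 8 elements touches 32 sectors and uses only one eighth of the transferred data.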

CUDA in Python

PyCUDA: writing and launching CUDA kernels from Python. PyTorch tensors and the mapping from PyTorch operations, through the ATen operator library, to CUDA libraries (cuBLAS, cuDNN). The training loop as a sequence of CUDA kernel launches.

Profiling and Optimization

Profiling PyTorch with torch.profiler, Nsight Systems, Nsight Compute. Reading kernel timelines and identifying bottlenecks. Theoretical and measured FLOPs; the roofline model applied to PyTorch. Mixed precision (FP16, BF16, TF32) and tensor cores. Graph capture and kernel fusion with torch.compile.
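
The roofline model mentioned above reduces to a short calculation: attainable performance is the minimum of peak compute and arithmetic intensity times peak memory bandwidth. The sketch below shows that arithmetic in plain Python. The device numbers (10 TFLOP/s and 500 GB/s) are hypothetical, and the function names are chosen for illustration only.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte moved to or from memory."""
    return flops / bytes_moved


def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline bound in FLOP/s: min(peak compute, intensity * peak bandwidth)."""
    return min(peak_flops, intensity * peak_bandwidth)


def classify(intensity, peak_flops, peak_bandwidth):
    """Compare intensity with the ridge point peak_flops / peak_bandwidth."""
    ridge = peak_flops / peak_bandwidth
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Hypothetical device: 10 TFLOP/s peak compute, 500 GB/s peak memory bandwidth.
PEAK_FLOPS = 10e12
PEAK_BW = 500e9

# FP32 vector addition c = a + b: 1 FLOP per element, 12 bytes moved (2 loads + 1 store).
vec_add_ai = arithmetic_intensity(flops=1, bytes_moved=12)

# FP32 N x N matrix multiply: 2*N^3 FLOPs over at least 3 matrices of 4*N^2 bytes each.
N = 1024
matmul_ai = arithmetic_intensity(flops=2 * N**3, bytes_moved=3 * 4 * N**2)
```

On this hypothetical device the ridge point is 20 FLOP/byte, so the vector addition falls on the memory-bound side and the matrix multiply, with its ideal data reuse, on the compute-bound side.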

Applications

Examples and case studies from real computer vision research code. Integrating custom CUDA kernels into PyTorch.

Learning Outcomes

By the end of the course, students will be able to:

  • Read and understand CUDA kernel code, and explain how a GPU executes it.
  • Write basic CUDA kernels for parallel data processing tasks.
  • Profile a PyTorch model with torch.profiler and interpret the results.
  • Identify whether a kernel is compute-bound or memory-bound and choose the appropriate optimization.
  • Apply mixed precision and torch.compile to accelerate inference and training, and measure the resulting speedup.
  • Integrate a custom CUDA kernel into a PyTorch model when no built-in operator is sufficient.

Final Verification

A Hands-On Optimization Project

The course concludes with a small hands-on optimization project. The objective is to apply the complete optimization workflow presented throughout the course to a parallel workload of your choice.

This can be a neural network, an image-processing pipeline, a handwritten CUDA C kernel, or—ideally—a workload from your own research. You will profile it, identify the performance bottleneck, implement an optimization, verify correctness, and submit a short report describing the process and the performance improvement achieved.

The Workflow

  1. Measure – Run the baseline implementation with a profiler.
  2. Diagnose – Determine whether the workload is compute- or memory-bound, and justify your conclusion with quantitative evidence.
  3. Optimize – Apply the optimization suggested by your analysis.
  4. Verify – Confirm that the optimized implementation still produces correct results.
  5. Measure Again – Profile the optimized version and report the actual speedup.

The goal is not simply to optimize a single program, but to demonstrate that you can apply a systematic performance-engineering workflow to any GPU workload—including your own research code.
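
The Verify and Measure Again steps of the workflow can be sketched as a small harness. This is a minimal illustration in plain Python with hypothetical helper names, using wall-clock timing of ordinary functions as a stand-in for a GPU workload. For real GPU code, kernel launches are asynchronous, so the timed region must include a device synchronization (or use CUDA events) before the clock is read.

```python
import math
import time


def time_ms(fn, *args, repeats=5):
    """Best-of-N wall-clock time in milliseconds, after one warm-up call."""
    fn(*args)  # warm-up run, not timed
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, (time.perf_counter() - start) * 1e3)
    return best


def outputs_match(reference, candidate, rel_tol=1e-5, abs_tol=1e-8):
    """Element-wise comparison with tolerances, since floating-point results may differ slightly."""
    return len(reference) == len(candidate) and all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )


def speedup(baseline_ms, optimized_ms):
    """Speedup factor of the optimized version over the baseline."""
    return baseline_ms / optimized_ms
```

A report would then state the baseline and optimized timings from time_ms, the result of outputs_match against the baseline output, and the resulting speedup. The tolerances shown are placeholders and should be chosen to suit the precision of the workload, for example looser when mixed precision is used.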

Materials

Slides and code examples will be distributed to enrolled students after each lecture. All examples will be reproducible on a standard NVIDIA GPU.

Registration


Registration is required, both for organizational purposes and to receive the Microsoft Teams link for remote attendance.

→ Open the Registration Form

Contact

For questions about the course, please contact: fabio.tosi5@unibo.it