AI Profiling

Javad Kasravi

March 12, 2026

Who’s Stealing My Speed?

Agenda

  • Node Communications
  • NVIDIA AI Profiling tools
  • Single GPU training
  • Distributed Data Parallel (DDP)
    • Single node training
    • Multi node training
  • DDP scaling

Single Node Communications

Base System

Naive communication

Data Path:

GPU 0  →  PCI Bus  →  System Memory →  PCI Bus →  GPU 1

PCI Bus Peer-to-Peer (P2P) communication

Data Path:

GPU 0  →  PCI Bus →  GPU 1

Data Path

GPU 0  →  NVLink  →  GPU 2

Throughput Comparison

Communication Type            Throughput
Naive communication           ~16 GB/s per GPU 🐢
PCIe Bus P2P communication    ~32 GB/s per GPU 🚗
GPUDirect P2P communication   ~300 GB/s per GPU (all NVLink lanes combined) 🏎️
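To get a feel for what these numbers mean in practice, here is a back-of-the-envelope sketch: the bandwidths are the rounded figures from the table, and the 1 GiB buffer size is an arbitrary assumption for illustration.

```python
# Back-of-the-envelope transfer times for a 1 GiB buffer over each link,
# using the (rounded) throughput figures from the table above.
GIB = 1024**3

throughput_gbs = {                  # GB/s per GPU, from the table
    "naive (via system memory)": 16,
    "PCIe P2P": 32,
    "NVLink (GPUDirect P2P)": 300,
}

buffer_bytes = 1 * GIB              # hypothetical 1 GiB activation/gradient buffer

for name, gbs in throughput_gbs.items():
    t_ms = buffer_bytes / (gbs * 1e9) * 1e3
    print(f"{name:>26}: {t_ms:6.2f} ms")
```

The same copy that takes ~67 ms through system memory takes under 4 ms over NVLink, which is why the interconnect dominates multi-GPU step time.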

Multinode Communications

GPUDirect Without RDMA Communication

GPUDirect With RDMA Communication

Throughput Comparison

Communication Type        Throughput
GPUDirect without RDMA    <50 GB/s 🐢
GPUDirect with RDMA       ~50 GB/s per node (2× HDR InfiniBand) 🏎️

AI Profiling?

NVIDIA profiling Tools

Nsight Systems GUI

Go to the following link and download Nsight Systems 2025.3.1:

https://developer.nvidia.com/tools-downloads

Run Profiling

    srun env -u CUDA_VISIBLE_DEVICES bash -c 'torchrun \
       --nproc-per-node=gpu \
       --nnodes="$SLURM_JOB_NUM_NODES" \
       --rdzv-id="$SLURM_JOB_ID" \
       --rdzv-endpoint="$MASTER_ADDR":"$MASTER_PORT" \
       --rdzv-backend=c10d \
       --rdzv-conf=is_host="$(if ((SLURM_NODEID)); then echo 0; else echo 1; fi)" \
   --local-addr="$(if ((SLURM_NODEID)); then hostname; else echo "$MASTER_ADDR"; fi)" \
       --no-python ./run_profile.sh train/ddp_training.py --profile'
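The command above assumes `MASTER_ADDR` and `MASTER_PORT` are already exported. One common way to derive them inside the sbatch script (an assumption for illustration; your batch script may set them differently):

```shell
# Pick the first node in the allocation as the rendezvous host.
MASTER_ADDR="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)"
MASTER_PORT=29500   # any free port works
export MASTER_ADDR MASTER_PORT
```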
Inside of run_profile.sh:

nsys profile \
    --duration=30 \
    --delay=200 \
    --gpu-metrics-device=all \
    --nic-metrics=true \
    --stop-on-exit=false \
    --trace=nvtx,cuda,osrt \
    --python-sampling=true \
    --python-sampling-frequency=1 \
    --cuda-memory-usage=true \
    --force-overwrite=true \
    --python-functions-trace=profiler/config/profiling.json \
    --output=nsys_logs/nsys_logs_rank_${RANK} \
    --python-backtrace=cuda \
    --cudabacktrace=all \
    python -u "$SCRIPT_NAME" "$@"

Run Profiling

Inside of profiler/config/profiling.json

    {
        "domain": "PyTorch",
        "color": "E8B795",
        "module": "torch.amp",
        "functions": [
            "GradScaler.scale",
            "GradScaler.unscale",
            "GradScaler.unscale_",
            "GradScaler.step",
            "GradScaler.update"
        ]
    },
    ...

We also annotate the training loop with NVTX ranges:

for step in range(num_steps):
    with ExecutionTimer("data_loading (to Sys. mem)", profile=True) as t:
        src, tgt = next(train_iter)
    with ExecutionTimer("data_movement (to GPU mem)", profile=True) as t:
        src, tgt = src.to(device, non_blocking=False), tgt.to(device, non_blocking=False)
    with ExecutionTimer("forward_step", profile=True) as t:
        output = model(src)
    ...
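`ExecutionTimer` is a course helper, not a PyTorch class. A minimal stand-in might look like the sketch below, assuming the real one pairs a wall-clock timer with an NVTX range (e.g. `torch.cuda.nvtx.range_push`/`range_pop`) so the labelled span shows up on the Nsight Systems timeline:

```python
import time

class ExecutionTimer:
    """Minimal stand-in for the course's ExecutionTimer helper.

    Assumption: the real class additionally opens an NVTX range when
    profile=True, so each labelled step appears in the nsys trace.
    """

    def __init__(self, name: str, profile: bool = False):
        self.name = name
        self.profile = profile
        self.elapsed = None

    def __enter__(self):
        # The real helper would call torch.cuda.nvtx.range_push(self.name)
        # here when self.profile is True.
        self.start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        self.elapsed = time.perf_counter() - self.start
        # ...and torch.cuda.nvtx.range_pop() here.
        return False

with ExecutionTimer("demo", profile=True) as t:
    sum(range(10_000))
print(f"demo took {t.elapsed * 1e3:.3f} ms")
```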

Nsight Systems (Deep Learning App.)

Single GPU without DataLoader Workers

train_loader = DataLoader(train_dataset, 
                        batch_size=args.batch_size, 
                        num_workers=0,
                        pin_memory=False)
sbatch --disable-dcgm single_gpu_training_profiling.sbatch --profile

Only the main process transfers data to system memory.

Single GPU without DataLoader Workers

Move the trace folder to your local machine by running:
scp -r -4 <user>@jureca.fz-juelich.de:/p/project1/training2609/AI_profiling/Nsys_trace_update_Jan_2026 .
File -> Open -> Single_GPU/Report_00_zero_woker_unpined_non_blocking_False/nsys_logs/540_WVB_725_nsys_logs_rank.nsys-rep

Use + and − keys to zoom in and out

Go to this link to answer the questions:

Single GPU without DataLoader Workers

📘 Exercise
  • Find the Python process with CUDA HW
  • Find the Python thread inside the above process
  • How long does it take until one iteration is finished (data transfers, forward, backward, …)?
  • Which part of training takes the most time: data loading, data movement, the forward pass, or the backward pass?
  • How long does it take until one data loading step is finished?
  • Explore the CUDA HW row: what is the GPU peak memory usage?

Single GPU Multiworkers

Single GPU Multiworkers

The main process sends batch indices to the workers:

  • index 0 → worker 1
  • index 1 → worker 2
  • index 2 → worker 3
  • index 3 → worker 1
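The round-robin dispatch above can be sketched in a few lines (illustrative only; the real DataLoader also prefetches ahead and numbers its workers from 0):

```python
from itertools import cycle

# Round-robin dispatch of batch indices to 3 workers, as on the slide.
workers = cycle([1, 2, 3])
assignment = {index: next(workers) for index in range(4)}
print(assignment)   # index 3 wraps back around to worker 1
```

Because each worker loads its batch in a separate process, batch N+1 can be read and decoded while the GPU is still busy with batch N.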

Single GPU Multiworkers

train_loader = DataLoader(train_dataset, 
                        batch_size=args.batch_size, 
                        num_workers=4,
                        pin_memory=False)
sbatch --disable-dcgm single_gpu_training_profiling.sbatch --profile

Single GPU Multiworkers

File -> Open -> Single_GPU/Single_GPU/Report_01_multi_woker_unpined_non_blocking_False/nsys_logs/647_ATA_566_nsys_logs_rank.nsys-rep

Go to this link to answer the questions:

Single GPU Multiworkers

📘 Exercise
  • Find pt_data_worker processes
  • How many pt_data_worker traces are created by Nsys?
  • Which part of training takes the most time: data loading, data movement, the forward pass, or the backward pass?
  • How long does it take until one data loading step is finished?

Single GPU Multiworkers

DMA (Direct Memory Access)

  • DMA allows some hardware to access memory without involving the CPU
  • Copying between CPU & GPU uses DMA
  • What is the advantage of transferring data via DMA?
  • What is the problem with transferring data via DMA?

DMA (Direct Memory Access)

  • DMA accesses memory using physical addresses.
    • Cannot detect if the OS swaps a virtual page with another virtual page at the same physical address.
  • How can we prevent data corruption during DMA transfers?
    • Pinning Memory!!

How does cudaMemcpy copy data from host to device?

cudaMemcpy

  • What happens during cudaMemcpy:
    • Host to Device:
      • The CPU copies data into a pinned memory buffer.
      • DMA transfers the data from the pinned memory buffer to GPU memory.
    • Device to Host:
      • DMA transfers data from GPU memory into a pinned memory buffer.
      • The CPU copies data from the pinned memory buffer.
  • Disadvantage:
    • Every cudaMemcpy actually involves two copies.
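A toy cost model makes the two-copy penalty concrete. Both bandwidth numbers below are assumptions for illustration, not measurements:

```python
# Toy cost model for a host-to-device copy (illustrative numbers only):
# pageable memory needs an extra CPU copy into a pinned staging buffer,
# while pinned memory lets DMA read the user buffer directly.
size_gb = 1.0
ram_copy_gbs = 20.0    # assumed CPU memcpy bandwidth into the staging buffer
pcie_gbs = 32.0        # assumed PCIe DMA bandwidth

t_pageable = size_gb / ram_copy_gbs + size_gb / pcie_gbs   # two copies
t_pinned = size_gb / pcie_gbs                              # one DMA copy

print(f"pageable: {t_pageable * 1e3:.1f} ms, pinned: {t_pinned * 1e3:.1f} ms")
```

Pinning also unlocks truly asynchronous transfers (`non_blocking=True`), since the driver no longer needs the CPU-side staging copy.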

Memory pinning

Single GPU (Async. Transfer)

train_loader = DataLoader(train_dataset, 
                        batch_size=args.batch_size, 
                        num_workers=4,
                        pin_memory=True)
    src, tgt = src.to(device, non_blocking=True), tgt.to(device, non_blocking=True)
sbatch --disable-dcgm single_gpu_training_profiling.sbatch --profile

Single GPU Multiworkers (Async. Transfer)

File -> Open -> Single_GPU/Report_02_multiwoker_pined_non_blocking_True/nsys_logs/107_RKN_402_nsys_logs_rank.nsys-rep

Go to this link to answer the questions:

Single GPU Multiworkers (Async. Transfer)

📘 Exercise
  • Check one iteration of training (PyTorch trace)
  • Which part of training takes the most time: data loading, data movement, the forward pass, or the backward pass?
  • Is there any CUDA synchronization in the trace?

DDP

DDP

sbatch --disable-dcgm --nodes 1 ddp_training.sbatch --profile
Trace Path:
    Multi_GPUs/DDP/Single_node/Report_03_DDP_one_node/nsys_logs
File -> Open -> select all .nsys-rep files --> create a multi-view report -> select all reports

Go to this link to answer the questions:

DDP (4 GPUs)

📘 Exercise
  • Which intra-node communication is active (PCIe or NVLink)?
  • Check NIC metrics (Why is there some traffic on one node?)
  • Check the NCCL trace inside of CUDA HW
  • How many times is the all-reduce operation called (check NCCL trace inside of CUDA HW)?
  • Do the compute kernels overlap with the NCCL all-reduce operation?
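To reason about how much traffic each all-reduce generates: a ring all-reduce (one of the algorithms NCCL uses) moves about 2·(N−1)/N times the gradient buffer per GPU. A sketch with a hypothetical 100M-parameter fp32 model:

```python
# Per-GPU bytes moved by a ring all-reduce of the gradients:
# each GPU sends (and receives) 2*(N-1)/N times the gradient buffer.
def ring_allreduce_bytes(n_params: int, bytes_per_grad: int, n_gpus: int) -> float:
    buffer = n_params * bytes_per_grad
    return 2 * (n_gpus - 1) / n_gpus * buffer

# Hypothetical 100M-parameter model, fp32 gradients, 4 GPUs:
gb = ring_allreduce_bytes(100_000_000, 4, 4) / 1e9
print(f"{gb:.2f} GB per GPU per step")
```

Note that DDP also buckets gradients and launches the all-reduces during the backward pass, which is why overlap with compute kernels is the thing to check in the trace.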

DDP

sbatch --disable-dcgm --nodes 2 ddp_training.sbatch --profile
Trace Path:
    Multi_GPUs/DDP/Multi_nodes/Report_03_DDP_multi_nodes/GPU_Direct_RDMA_enable/nsys_logs
File -> Open -> select all .nsys-rep files --> create a multi-view report -> select all reports

Go to this link to answer the questions:

DDP (8 GPUs)

📘 Exercise
  • Which intra-node communication is active (PCIe or NVLink)?
  • Check the NCCL trace inside of CUDA HW
  • Do the compute kernels overlap with the NCCL all-reduce operation?
  • Do you think the training will be scalable?

DDP

NCCL_P2P_LEVEL=LOC sbatch --disable-dcgm --nodes 2 ddp_training.sbatch --profile
Trace Path:
    Multi_GPUs/DDP/Multi_nodes/Report_03_DDP_multi_nodes/GPU_Direct_RDMA_disable/nsys_logs
File -> Open -> select all .nsys-rep files --> create a multi-view report -> select all reports

Go to this link to answer the questions:

DDP (8 GPUs without GPUDirect RDMA)

📘 Exercise
  • Check the NCCL trace inside of CUDA HW
  • Do the compute kernels overlap with the NCCL all-reduce operation?
  • Do you think the training will be scalable?

DDP

NCCL_IB_DISABLE=1 sbatch --disable-dcgm --nodes 2 ddp_training.sbatch --profile
Trace Path:
    Multi_GPUs/DDP/Multi_nodes/Report_03_DDP_multi_nodes/NOT_IB_USAGE
File -> Open -> select all .nsys-rep files --> create a multi-view report -> select all reports

Go to this link to answer the questions:

DDP (8 GPUs without InfiniBand usage)

📘 Exercise
  • Do the compute kernels overlap with the NCCL all-reduce operation?
  • Do you think the training will be scalable?

DDP scaling
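Whether DDP scales comes down to how much of the all-reduce hides behind the backward pass. A toy model (all times below are assumed, illustrative values, not measurements from the traces):

```python
# Toy DDP scaling model: communication that is NOT overlapped with the
# backward pass adds directly to the step time.
def step_time(compute_s: float, comm_s: float, overlap: float) -> float:
    """overlap = fraction of the all-reduce hidden behind backward."""
    return compute_s + (1.0 - overlap) * comm_s

compute = 0.100   # assumed per-step compute time (s)
comm = 0.040      # assumed all-reduce time (s)

for overlap in (0.0, 0.5, 1.0):
    t = step_time(compute, comm, overlap)
    print(f"overlap={overlap:.0%}: step={t * 1e3:.0f} ms, "
          f"efficiency={compute / t:.0%}")
```

This is why the RDMA-disabled and InfiniBand-disabled runs above scale worse: the slower all-reduce no longer fits inside the backward pass, and the leftover becomes pure overhead.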