AI Profiling

Javad Kasravi

March 12, 2026

Who’s Stealing My Speed?

Agenda

  • Node Communications
  • NVIDIA AI Profiling tools
  • Single GPU training
  • Distributed Data Parallel (DDP)
    • Single node training
    • Multi node training
  • DDP scaling

Single Node Communications

Base System

Naive communication

Data Path:

GPU 0  →  PCI Bus  →  System Memory →  PCI Bus →  GPU 1

PCI Bus Peer-to-Peer (P2P) communication

Data Path:

GPU 0  →  PCI Bus →  GPU 1

Data Path

GPU 0  →  NVLink  →  GPU 2

Throughput Comparison

Communication Type            Throughput
Naive communication           ~16 GB/s per GPU 🐢
PCIe Bus P2P communication    ~32 GB/s per GPU 🚗
GPUDirect P2P communication   ~300 GB/s per GPU (all NVLink lanes combined) 🏎️
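To get a feel for what these numbers mean in practice, here is a back-of-the-envelope sketch: the bandwidths are the rounded figures from the table, and the 1 GiB buffer size is an arbitrary assumption for illustration.

```python
# Back-of-the-envelope transfer times for a 1 GiB buffer over each link,
# using the (rounded) throughput figures from the table above.
GIB = 1024**3

throughput_gbs = {                  # GB/s per GPU, from the table
    "naive (via system memory)": 16,
    "PCIe P2P": 32,
    "NVLink (GPUDirect P2P)": 300,
}

buffer_bytes = 1 * GIB              # hypothetical 1 GiB activation/gradient buffer

for name, gbs in throughput_gbs.items():
    t_ms = buffer_bytes / (gbs * 1e9) * 1e3
    print(f"{name:>26}: {t_ms:6.2f} ms")
```

The same copy that takes ~67 ms through system memory takes under 4 ms over NVLink, which is why the interconnect dominates multi-GPU step time.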

Multinode Communications

GPUDirect Without RDMA Communication

GPUDirect With RDMA Communication

Throughput Comparison

Communication Type        Throughput
GPUDirect without RDMA    <50 GB/s 🐢
GPUDirect with RDMA       ~50 GB/s per node (2× HDR InfiniBand) 🏎️

AI Profiling?

NVIDIA profiling Tools

Nsight Systems GUI

Go to the following link and download Nsight Systems 2025.3.1:

https://developer.nvidia.com/tools-downloads

Run Profiling

    srun env -u CUDA_VISIBLE_DEVICES bash -c 'torchrun \
       --nproc-per-node=gpu \
       --nnodes="$SLURM_JOB_NUM_NODES" \
       --rdzv-id="$SLURM_JOB_ID" \
       --rdzv-endpoint="$MASTER_ADDR":"$MASTER_PORT" \
       --rdzv-backend=c10d \
       --rdzv-conf=is_host="$(if ((SLURM_NODEID)); then echo 0; else echo 1; fi)" \
   --local-addr="$(if ((SLURM_NODEID)); then hostname; else echo "$MASTER_ADDR"; fi)" \
       --no-python ./run_profile.sh train/ddp_training.py --profile'
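The command above assumes `MASTER_ADDR` and `MASTER_PORT` are already exported. One common way to derive them inside the sbatch script (an assumption for illustration; your batch script may set them differently):

```shell
# Pick the first node in the allocation as the rendezvous host.
MASTER_ADDR="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)"
MASTER_PORT=29500   # any free port works
export MASTER_ADDR MASTER_PORT
```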
Inside of run_profile.sh:

nsys profile \
    --duration=30 \
    --delay=200 \
    --gpu-metrics-device=all \
    --nic-metrics=true \
    --stop-on-exit=false \
    --trace=nvtx,cuda,osrt \
    --python-sampling=true \
    --python-sampling-frequency=1 \
    --cuda-memory-usage=true \
    --force-overwrite=true \
    --python-functions-trace=profiler/config/profiling.json \
    --output=nsys_logs/nsys_logs_rank_${RANK} \
    --python-backtrace=cuda \
    --cudabacktrace=all \
    python -u "$SCRIPT_NAME" "$@"

Run Profiling

Inside of profiler/config/profiling.json

    {
        "domain": "PyTorch",
        "color": "E8B795",
        "module": "torch.amp",
        "functions": [
            "GradScaler.scale",
            "GradScaler.unscale",
            "GradScaler.unscale_",
            "GradScaler.step",
            "GradScaler.update"
        ]
    },
    ...

We also annotate the training loop with NVTX ranges:

for step in range(num_steps):
    with ExecutionTimer("data_loading (to Sys. mem)", profile=True) as t:
        src, tgt = next(train_iter)
    with ExecutionTimer("data_movement (to GPU mem)", profile=True) as t:
        src, tgt = src.to(device, non_blocking=False), tgt.to(device, non_blocking=False)
    with ExecutionTimer("forward_step", profile=True) as t:
        output = model(src)
    ...
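`ExecutionTimer` is a course helper, not a PyTorch class. A minimal stand-in might look like the sketch below, assuming the real one pairs a wall-clock timer with an NVTX range (e.g. `torch.cuda.nvtx.range_push`/`range_pop`) so the labelled span shows up on the Nsight Systems timeline:

```python
import time

class ExecutionTimer:
    """Minimal stand-in for the course's ExecutionTimer helper.

    Assumption: the real class additionally opens an NVTX range when
    profile=True, so each labelled step appears in the nsys trace.
    """

    def __init__(self, name: str, profile: bool = False):
        self.name = name
        self.profile = profile
        self.elapsed = None

    def __enter__(self):
        # The real helper would call torch.cuda.nvtx.range_push(self.name)
        # here when self.profile is True.
        self.start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        self.elapsed = time.perf_counter() - self.start
        # ...and torch.cuda.nvtx.range_pop() here.
        return False

with ExecutionTimer("demo", profile=True) as t:
    sum(range(10_000))
print(f"demo took {t.elapsed * 1e3:.3f} ms")
```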

Nsight Systems (Deep Learning App.)

Single GPU without DataLoader Workers

train_loader = DataLoader(train_dataset, 
                        batch_size=args.batch_size, 
                        num_workers=0,
                        pin_memory=False)
sbatch --disable-dcgm single_gpu_training_profiling.sbatch --profile

Only the main process transfers data to system memory.

Single GPU without DataLoader Workers

Move the trace folder to your local machine by running:
scp -r -4 <user>@jureca.fz-juelich.de:/p/project1/training2609/AI_profiling/Nsys_trace_update_Jan_2026 .
File -> Open -> Single_GPU/Report_00_zero_woker_unpined_non_blocking_False/nsys_logs/540_WVB_725_nsys_logs_rank.nsys-rep

Use + and − keys to zoom in and out

Go to this link to answer the questions:

Single GPU without DataLoader Workers

📘 Exercise
  • Find the Python process with CUDA HW
  • Find the Python thread inside the above process
  • How long does it take until one iteration is finished (data transfers, forward, backward, …)?
  • Which part of training takes the most time: data loading, data movement, the forward pass, or the backward pass?
  • How long does it take until one data loading step is finished?
  • Explore the CUDA HW row: what is the GPU peak memory usage?

Single GPU Multiworkers

Single GPU Multiworkers

The main process sends batch indices to the workers:

  • index 0 → worker 1
  • index 1 → worker 2
  • index 2 → worker 3
  • index 3 → worker 1
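The round-robin dispatch above can be sketched in a few lines (illustrative only; the real DataLoader also prefetches ahead and numbers its workers from 0):

```python
from itertools import cycle

# Round-robin dispatch of batch indices to 3 workers, as on the slide.
workers = cycle([1, 2, 3])
assignment = {index: next(workers) for index in range(4)}
print(assignment)   # index 3 wraps back around to worker 1
```

Because each worker loads its batch in a separate process, batch N+1 can be read and decoded while the GPU is still busy with batch N.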

Single GPU Multiworkers

train_loader = DataLoader(train_dataset, 
                        batch_size=args.batch_size, 
                        num_workers=4,
                        pin_memory=False)
sbatch --disable-dcgm single_gpu_training_profiling.sbatch --profile

Single GPU Multiworkers

File -> Open -> Single_GPU/Single_GPU/Report_01_multi_woker_unpined_non_blocking_False/nsys_logs/647_ATA_566_nsys_logs_rank.nsys-rep

Go to this link to answer the questions:

Single GPU Multiworkers

📘 Exercise
  • Find pt_data_worker processes
  • How many pt_data_worker traces are created by Nsys?
  • Which part of training takes the most time: data loading, data movement, the forward pass, or the backward pass?
  • How long does it take until one data loading step is finished?

Single GPU Multiworkers

DMA (Direct Memory Access)

  • DMA allows some hardware to access memory without involving the CPU
  • Copying between CPU & GPU uses DMA
  • What is the advantage of transferring data via DMA?
  • What is the problem with transferring data via DMA?

DMA (Direct Memory Access)

  • DMA accesses memory using physical addresses.
    • Cannot detect if the OS swaps a virtual page with another virtual page at the same physical address.
  • How can we prevent data corruption during DMA transfers?
    • Pinning Memory!!

How does cudaMemcpy copy data from host to device?

cudaMemcpy

  • What happens during cudaMemcpy:
    • Host to Device:
      • The CPU copies data into a pinned memory buffer.
      • DMA transfers the data from the pinned memory buffer to GPU memory.
    • Device to Host:
      • DMA transfers data from GPU memory into a pinned memory buffer.
      • The CPU copies data from the pinned memory buffer.
  • Disadvantage:
    • Every cudaMemcpy actually involves two copies.
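A toy cost model makes the two-copy penalty concrete. Both bandwidth numbers below are assumptions for illustration, not measurements:

```python
# Toy cost model for a host-to-device copy (illustrative numbers only):
# pageable memory needs an extra CPU copy into a pinned staging buffer,
# while pinned memory lets DMA read the user buffer directly.
size_gb = 1.0
ram_copy_gbs = 20.0    # assumed CPU memcpy bandwidth into the staging buffer
pcie_gbs = 32.0        # assumed PCIe DMA bandwidth

t_pageable = size_gb / ram_copy_gbs + size_gb / pcie_gbs   # two copies
t_pinned = size_gb / pcie_gbs                              # one DMA copy

print(f"pageable: {t_pageable * 1e3:.1f} ms, pinned: {t_pinned * 1e3:.1f} ms")
```

Pinning also unlocks truly asynchronous transfers (`non_blocking=True`), since the driver no longer needs the CPU-side staging copy.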

Memory pinning

Single GPU (Async. Transfer)

train_loader = DataLoader(train_dataset, 
                        batch_size=args.batch_size, 
                        num_workers=4,
                        pin_memory=True)
    src, tgt = src.to(device, non_blocking=True), tgt.to(device, non_blocking=True)
sbatch --disable-dcgm single_gpu_training_profiling.sbatch --profile

Single GPU Multiworkers (Async. Transfer)

File -> Open -> Single_GPU/Report_02_multiwoker_pined_non_blocking_True/nsys_logs/107_RKN_402_nsys_logs_rank.nsys-rep

Go to this link to answer the questions:

Single GPU Multiworkers (Async. Transfer)

📘 Exercise
  • Check one iteration of training (PyTorch trace)
  • Which part of training takes the most time: data loading, data movement, the forward pass, or the backward pass?
  • Is there any CUDA synchronization in the trace?

DDP

DDP

sbatch --disable-dcgm --nodes 1 ddp_training.sbatch --profile
Trace Path:
    Multi_GPUs/DDP/Single_node/Report_03_DDP_one_node/nsys_logs
File -> Open -> select all .nsys-rep files --> create a multi-view report -> select all reports

Go to this link to answer the questions:

DDP (4 GPUs)

📘 Exercise
  • Which intra-node communication is active (PCIe or NVLink)?
  • Check NIC metrics (Why is there some traffic on one node?)
  • Check the NCCL trace inside of CUDA HW
  • How many times is the all-reduce operation called (check NCCL trace inside of CUDA HW)?
  • Do the compute kernels overlap with the NCCL all-reduce operation?
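To reason about how much traffic each all-reduce generates: a ring all-reduce (one of the algorithms NCCL uses) moves about 2·(N−1)/N times the gradient buffer per GPU. A sketch with a hypothetical 100M-parameter fp32 model:

```python
# Per-GPU bytes moved by a ring all-reduce of the gradients:
# each GPU sends (and receives) 2*(N-1)/N times the gradient buffer.
def ring_allreduce_bytes(n_params: int, bytes_per_grad: int, n_gpus: int) -> float:
    buffer = n_params * bytes_per_grad
    return 2 * (n_gpus - 1) / n_gpus * buffer

# Hypothetical 100M-parameter model, fp32 gradients, 4 GPUs:
gb = ring_allreduce_bytes(100_000_000, 4, 4) / 1e9
print(f"{gb:.2f} GB per GPU per step")
```

Note that DDP also buckets gradients and launches the all-reduces during the backward pass, which is why overlap with compute kernels is the thing to check in the trace.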

DDP

sbatch --disable-dcgm --nodes 2 ddp_training.sbatch --profile
Trace Path:
    Multi_GPUs/DDP/Multi_nodes/Report_03_DDP_multi_nodes/GPU_Direct_RDMA_enable/nsys_logs
File -> Open -> select all .nsys-rep files --> create a multi-view report -> select all reports

Go to this link to answer the questions:

DDP (8 GPUs)

📘 Exercise
  • Which intra-node communication is active (PCIe or NVLink)?
  • Check the NCCL trace inside of CUDA HW
  • Do the compute kernels overlap with the NCCL all-reduce operation?
  • Do you think the training will be scalable?

DDP

NCCL_P2P_LEVEL=LOC sbatch --disable-dcgm --nodes 2 ddp_training.sbatch --profile
Trace Path:
    Multi_GPUs/DDP/Multi_nodes/Report_03_DDP_multi_nodes/GPU_Direct_RDMA_disable/nsys_logs
File -> Open -> select all .nsys-rep files --> create a multi-view report -> select all reports

Go to this link to answer the questions:

DDP (8 GPUs without GPUDirect RDMA)

📘 Exercise
  • Check the NCCL trace inside of CUDA HW
  • Do the compute kernels overlap with the NCCL all-reduce operation?
  • Do you think the training will be scalable?

DDP

NCCL_IB_DISABLE=1 sbatch --disable-dcgm --nodes 2 ddp_training.sbatch --profile
Trace Path:
    Multi_GPUs/DDP/Multi_nodes/Report_03_DDP_multi_nodes/NOT_IB_USAGE
File -> Open -> select all .nsys-rep files --> create a multi-view report -> select all reports

Go to this link to answer the questions:

DDP (8 GPUs without InfiniBand usage)

📘 Exercise
  • Do the compute kernels overlap with the NCCL all-reduce operation?
  • Do you think the training will be scalable?

DDP scaling
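Whether DDP scales comes down to how much of the all-reduce hides behind the backward pass. A toy model (all times below are assumed, illustrative values, not measurements from the traces):

```python
# Toy DDP scaling model: communication that is NOT overlapped with the
# backward pass adds directly to the step time.
def step_time(compute_s: float, comm_s: float, overlap: float) -> float:
    """overlap = fraction of the all-reduce hidden behind backward."""
    return compute_s + (1.0 - overlap) * comm_s

compute = 0.100   # assumed per-step compute time (s)
comm = 0.040      # assumed all-reduce time (s)

for overlap in (0.0, 0.5, 1.0):
    t = step_time(compute, comm, overlap)
    print(f"overlap={overlap:.0%}: step={t * 1e3:.0f} ms, "
          f"efficiency={compute / t:.0%}")
```

This is why the RDMA-disabled and InfiniBand-disabled runs above scale worse: the slower all-reduce no longer fits inside the backward pass, and the leftover becomes pure overhead.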