AI Profiling
Javad Kasravi
March 12, 2026
Who’s Stealing My Speed?
Agenda
- Node Communications
- NVIDIA AI Profiling tools
- Single GPU training
- Distributed Data Parallel (DDP)
- Single node training
- Multi node training
- DDP scaling
Single Node Communications
Base System
Naive communication
Data Path:
GPU 0 → PCI Bus → System Memory → PCI Bus → GPU 1
PCI Bus Peer-to-Peer (P2P) communication
Data Path:
GPU 0 → PCI Bus → GPU 1
GPUDirect P2P communication (NVLink)
Data Path:
GPU 0 → NVLink → GPU 1
Throughput Comparison
| Communication method | Throughput |
| --- | --- |
| Naive communication | ~16 GB/s per GPU 🐢 |
| PCIe Bus P2P communication | ~32 GB/s per GPU 🚗 |
| GPUDirect P2P communication (NVLink) | ~300 GB/s per GPU 🏎️ |
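To put the numbers above in perspective, a quick back-of-the-envelope calculation of how long a 1 GiB transfer would take at each bandwidth (the payload size is an arbitrary example, not from the slides):

```python
# Time to move a 1 GiB payload at the bandwidths from the table above.
GIB = 1024**3
payload_bytes = 1 * GIB

bandwidths_gb_s = {
    "Naive (via system memory)": 16,
    "PCIe P2P": 32,
    "GPUDirect P2P (NVLink)": 300,
}

for name, bw in bandwidths_gb_s.items():
    seconds = payload_bytes / (bw * 1e9)  # GB/s uses decimal giga
    print(f"{name:27s} {seconds * 1e3:7.2f} ms")
```

The naive path takes roughly 20× longer than NVLink for the same payload, which is why intra-node topology matters so much for DDP.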
GPUDirect Without RDMA Communication
GPUDirect With RDMA Communication
Throughput Comparison
| Communication method | Throughput |
| --- | --- |
| GPUDirect without RDMA | < 50 GB/s 🐢 |
| GPUDirect with RDMA | ~50 GB/s per node (2× HDR InfiniBand) 🏎️ |
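The with-RDMA figure matches the line rate: one HDR InfiniBand link runs at 200 Gb/s, and with two links per node:

```python
# Aggregate InfiniBand line rate per node: 2 × HDR (200 Gb/s each).
hdr_gbit_s = 200
links_per_node = 2

total_gbit_s = hdr_gbit_s * links_per_node  # 400 Gb/s
total_gbyte_s = total_gbit_s / 8            # bits → bytes
print(total_gbyte_s)  # → 50.0
```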
Nsight Systems GUI
Go to the following link and download Nsight Systems 2025.3.1:
https://developer.nvidia.com/tools-downloads
Run Profiling
```bash
srun env -u CUDA_VISIBLE_DEVICES bash -c 'torchrun \
  --nproc-per-node=gpu \
  --nnodes="$SLURM_JOB_NUM_NODES" \
  --rdzv-id="$SLURM_JOB_ID" \
  --rdzv-endpoint="$MASTER_ADDR":"$MASTER_PORT" \
  --rdzv-backend=c10d \
  --rdzv-conf=is_host="$(if ((SLURM_NODEID)); then echo 0; else echo 1; fi)" \
  --local-addr="$(if ((SLURM_NODEID)); then hostname; else echo $MASTER_ADDR; fi)" \
  --no-python ./run_profile.sh train/ddp_training.py --profile'
```
Inside run_profile.sh:

```bash
nsys profile \
  --duration=30 \
  --delay=200 \
  --gpu-metrics-device=all \
  --nic-metrics=true \
  --stop-on-exit=false \
  --trace=nvtx,cuda,osrt \
  --python-sampling=true \
  --python-sampling-frequency=1 \
  --cuda-memory-usage=true \
  --force-overwrite=true \
  --python-functions-trace=profiler/config/profiling.json \
  --output=nsys_logs/nsys_logs_rank_${RANK} \
  --python-backtrace=cuda \
  --cudabacktrace=all \
  python -u "$SCRIPT_NAME" "$@"
```
Run Profiling
Inside profiler/config/profiling.json:

```json
{
  "domain": "PyTorch",
  "color": "E8B795",
  "module": "torch.amp",
  "functions": [
    "GradScaler.scale",
    "GradScaler.unscale",
    "GradScaler.unscale_",
    "GradScaler.step",
    "GradScaler.update"
  ]
},
...
```
We also instrument the training loop with NVTX ranges:

```python
for step in range(num_steps):
    with ExecutionTimer("data_loading (to Sys. mem)", profile=True) as t:
        src, tgt = next(train_iter)
    with ExecutionTimer("data_movement (to GPU mem)", profile=True) as t:
        src, tgt = src.to(device, non_blocking=False), tgt.to(device, non_blocking=False)
    with ExecutionTimer("forward_step", profile=True) as t:
        output = model(src)
    ...
```
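ExecutionTimer is a helper from the course code; a minimal sketch of what such a context manager could look like (the implementation here is an assumption: the class name and usage match the loop above, and the NVTX calls assume NVIDIA's `nvtx` Python package):

```python
import time

try:
    import nvtx  # NVIDIA's nvtx Python package (optional)
except ImportError:
    nvtx = None

class ExecutionTimer:
    """Times a block and, when profiling, wraps it in an NVTX range so it
    shows up as a named region on the Nsight Systems timeline."""

    def __init__(self, name, profile=False):
        self.name = name
        self.profile = profile and nvtx is not None
        self.elapsed = None

    def __enter__(self):
        if self.profile:
            nvtx.push_range(self.name)  # open a named NVTX range
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        self.elapsed = time.perf_counter() - self._start
        if self.profile:
            nvtx.pop_range()  # close the range
        return False

with ExecutionTimer("data_loading (to Sys. mem)", profile=True) as t:
    sum(range(10_000))
print(f"{t.name}: {t.elapsed:.6f} s")
```

With `--trace=nvtx` enabled (as in the nsys command above), each named range appears on the NVTX row of the timeline.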
Nsight Systems (Deep Learning App.)
Single GPU without Dataloader Workers

```python
train_loader = DataLoader(train_dataset,
                          batch_size=args.batch_size,
                          num_workers=0,
                          pin_memory=False)
```

```bash
sbatch --disable-dcgm single_gpu_training_profiling.sbatch --profile
```
Only the main process transfers data to system memory.
Single GPU without Dataloader Workers
Move the trace folder to your local machine by running:
```bash
scp -r -4 <user>@jureca.fz-juelich.de:/p/project1/training2609/AI_profiling/Nsys_trace_update_Jan_2026 .
```
File -> Open -> Single_GPU/Report_00_zero_woker_unpined_non_blocking_False/nsys_logs/540_WVB_725_nsys_logs_rank.nsys-rep
Use + and − keys to zoom in and out
Go to this link to answer the questions:
📘 Exercise
- Find the Python process with CUDA HW
- Find the Python thread inside the above process
- How long does it take until one iteration is finished (data transfers, forward, backward, …)?
- Which part of training takes the most time: data loading, data movement, the forward pass, or the backward pass?
- How long does it take until one data loading step is finished?
- Explore the CUDA HW thread: what is the GPU peak memory?
Single GPU Multiworkers
The main process sends indices to the workers:
- index 0 → worker 1
- index 1 → worker 2
- index 2 → worker 3
- index 3 → worker 1
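The round-robin mapping above can be sketched as follows (here num_workers=3 to match the three workers in the illustration; the DataLoader examples in these slides use 4):

```python
# Round-robin assignment of batch indices to dataloader workers,
# reproducing the mapping in the slide (workers numbered from 1).
num_workers = 3  # matches the three workers shown above

for index in range(4):
    worker = index % num_workers + 1
    print(f"index {index} → worker {worker}")
```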
Single GPU Multiworkers
```python
train_loader = DataLoader(train_dataset,
                          batch_size=args.batch_size,
                          num_workers=4,
                          pin_memory=False)
```

```bash
sbatch --disable-dcgm single_gpu_training_profiling.sbatch --profile
```
Single GPU Multiworkers
File -> Open -> Single_GPU/Single_GPU/Report_01_multi_woker_unpined_non_blocking_False/nsys_logs/647_ATA_566_nsys_logs_rank.nsys-rep
Go to this link to answer the questions:
📘 Exercise
- Find the pt_data_worker processes
- How many pt_data_worker traces are created by Nsys?
- Which part of training takes the most time: data loading, data movement, the forward pass, or the backward pass?
- How long does it take until one data loading step is finished?
Single GPU Multiworkers
DMA (Direct Memory Access)
- DMA allows some hardware to access memory without involving the CPU
- Copying between CPU & GPU memory uses DMA
- What is the advantage of data transfer by DMA?
- What is the problem of data transfer by DMA?
DMA (Direct Memory Access)
- DMA accesses memory using physical addresses.
- It cannot detect if the OS swaps a virtual page with another virtual page at the same physical address.
- How can we prevent data corruption during DMA transfers?
How does cudaMemcpy copy data from host to device?
cudaMemcpy
- What happens during cudaMemcpy:
  - Host to Device:
    - The CPU copies data to a pinned memory buffer.
    - DMA transfers the data from the pinned memory buffer to GPU memory.
  - Device to Host:
    - DMA transfers data from GPU memory to a pinned memory buffer.
    - The CPU copies data from the pinned memory buffer.
- Disadvantage:
  - Every cudaMemcpy actually involves two copies.
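The cost of those two copies can be sketched with a toy model (both bandwidth values below are illustrative assumptions, not measurements):

```python
# Pageable H2D copy = CPU memcpy into a pinned staging buffer + DMA over
# PCIe; a pinned source buffer skips the staging memcpy entirely.
size_gb = 1.0        # payload size (assumed)
cpu_copy_bw = 10.0   # GB/s for the host-side memcpy (assumed)
pcie_bw = 25.0       # GB/s for the DMA transfer (assumed)

pageable = size_gb / cpu_copy_bw + size_gb / pcie_bw  # two copies
pinned = size_gb / pcie_bw                            # one copy

print(f"pageable: {pageable * 1e3:.0f} ms, pinned: {pinned * 1e3:.0f} ms")
```

Under these assumptions the staging memcpy dominates, which motivates pin_memory=True on the next slides.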
Memory pinning
Single GPU (Async. Transfer)

```python
train_loader = DataLoader(train_dataset,
                          batch_size=args.batch_size,
                          num_workers=4,
                          pin_memory=True)
```

```python
src, tgt = src.to(device, non_blocking=True), tgt.to(device, non_blocking=True)
```

```bash
sbatch --disable-dcgm single_gpu_training_profiling.sbatch --profile
```
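Why pinned memory plus non_blocking=True helps: the host-to-device copy can run asynchronously and hide behind the previous step's compute. A toy model of the per-step time (both timings are assumptions):

```python
# Step time with blocking vs. overlapped (non_blocking=True) transfers.
copy_ms = 5.0      # H2D copy per batch (assumed)
compute_ms = 20.0  # forward + backward per batch (assumed)

sequential = copy_ms + compute_ms      # copy blocks, then compute runs
overlapped = max(copy_ms, compute_ms)  # copy hidden behind compute

print(sequential, overlapped)  # → 25.0 20.0
```

When the copy is fully hidden, the step time is set by compute alone; the trace in the next exercise shows whether that overlap actually happens.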
Single GPU Multiworkers (Async. Transfer)
File -> Open -> Single_GPU/Report_02_multiwoker_pined_non_blocking_True/nsys_logs/107_RKN_402_nsys_logs_rank.nsys-rep
Go to this link to answer the questions:
📘 Exercise
- Check one iteration of training (PyTorch trace)
- Which part of training takes the most time: data loading, data movement, the forward pass, or the backward pass?
- Is there any CUDA synchronization in the trace?
DDP
```bash
sbatch --disable-dcgm --nodes 1 ddp_training.sbatch --profile
```

Trace Path:
Multi_GPUs/DDP/Single_node/Report_03_DDP_one_node/nsys_logs
File -> Open -> select all .nsys-rep files -> create a multi-view report -> select all reports
Go to this link to answer the questions:
📘 Exercise
- Which intra-node communication is active (PCIe or NVLink)?
- Check NIC metrics (why is there some traffic on one node?)
- Check the NCCL trace inside of CUDA HW
- How many times is the all-reduce operation called (check the NCCL trace inside of CUDA HW)?
- Do the compute kernels overlap with the NCCL all-reduce operation?
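One way to sanity-check the all-reduce count: DDP fuses gradients into buckets (bucket_cap_mb defaults to 25 MiB) and issues one all-reduce per bucket per step. A rough estimate, with an assumed parameter count:

```python
import math

# One NCCL all-reduce per DDP gradient bucket.
num_params = 60e6          # example model size (assumed)
bytes_per_grad = 4         # fp32 gradients
bucket_cap = 25 * 1024**2  # PyTorch DDP default bucket_cap_mb = 25 MiB

grad_bytes = num_params * bytes_per_grad
buckets = math.ceil(grad_bytes / bucket_cap)
print(buckets)  # ≈ number of all-reduce calls per step
```

Compare this estimate against the NCCL trace; a mismatch usually means a different bucket size or unused parameters.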
DDP
```bash
sbatch --disable-dcgm --nodes 2 ddp_training.sbatch --profile
```

Trace Path:
Multi_GPUs/DDP/Multi_nodes/Report_03_DDP_multi_nodes/GPU_Direct_RDMA_enable/nsys_logs
File -> Open -> select all .nsys-rep files -> create a multi-view report -> select all reports
Go to this link to answer the questions:
📘 Exercise
- Which intra-node communication is active (PCIe or NVLink)?
- Check the NCCL trace inside of CUDA HW
- Do the compute kernels overlap with the NCCL all-reduce operation?
- Do you think the training will be scalable?
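To reason about the scalability question: in a ring all-reduce, each of N GPUs sends roughly 2·(N−1)/N times the gradient size per step, so per-GPU traffic is nearly constant in N and scalability hinges on the bandwidth of the slowest link in the ring (NVLink vs. PCIe vs. InfiniBand). A quick estimate with an assumed gradient size:

```python
# Per-GPU bytes moved by a ring all-reduce: 2 * (N - 1) / N * grad size.
grad_gb = 0.24  # example: ~60 M fp32 parameters (assumed)

for n_gpus in (2, 4, 8, 16):
    traffic = 2 * (n_gpus - 1) / n_gpus * grad_gb
    print(f"{n_gpus:2d} GPUs: {traffic:.3f} GB per GPU per step")
```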
DDP
```bash
NCCL_P2P_LEVEL=LOC sbatch --disable-dcgm --nodes 2 ddp_training.sbatch --profile
```

Trace Path:
Multi_GPUs/DDP/Multi_nodes/Report_03_DDP_multi_nodes/GPU_Direct_RDMA_disable/nsys_logs
File -> Open -> select all .nsys-rep files -> create a multi-view report -> select all reports
Go to this link to answer the questions:
📘 Exercise
- Check the NCCL trace inside of CUDA HW
- Do the compute kernels overlap with the NCCL all-reduce operation?
- Do you think the training will be scalable?
DDP
```bash
NCCL_IB_DISABLE=1 sbatch --disable-dcgm --nodes 2 ddp_training.sbatch --profile
```

Trace Path:
Multi_GPUs/DDP/Multi_nodes/Report_03_DDP_multi_nodes/NOT_IB_USAGE
File -> Open -> select all .nsys-rep files -> create a multi-view report -> select all reports
Go to this link to answer the questions:
📘 Exercise
- Do the compute kernels overlap with the NCCL all-reduce operation?
- Do you think the training will be scalable?
DDP scaling
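A toy model behind scaling curves like this: if each step consists of compute time plus non-overlapped communication time, scaling efficiency is t_comp / (t_comp + t_comm). All numbers below are illustrative assumptions, not measurements:

```python
# Scaling efficiency = compute / (compute + non-overlapped communication).
t_comp = 100.0  # ms of compute per step (assumed)

for label, t_comm in [("NVLink + GPUDirect RDMA", 5.0),
                      ("without RDMA", 25.0),
                      ("without InfiniBand", 200.0)]:
    eff = t_comp / (t_comp + t_comm)
    print(f"{label:24s} efficiency ≈ {eff:.0%}")
```

This is why the three multi-node configurations profiled above (RDMA enabled, RDMA disabled, no InfiniBand) produce such different scaling behavior.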