How to Profile Modern PC Games with NVIDIA Nsight Graphics

April 6, 2026

Performance analysis of PC games has become considerably more complex since the arrival of DirectX 12 and Vulkan. Asynchronous compute, hardware ray tracing, temporal reconstruction, frame generation, and machine learning-assisted denoising have turned rendering a frame into a pipeline of many interdependent stages. Understanding GPU performance now takes more than a basic overlay; it requires tools that can break down workload distribution, shader behavior, and the stages that limit frame time. This is where NVIDIA Nsight Graphics proves invaluable.

Our exploration will guide you through a practical workflow for profiling GPU performance in contemporary games, using CD Projekt RED’s Cyberpunk 2077 as a focal point. The emphasis here is not on debugging rendering accuracy but on dissecting performance metrics: identifying GPU workload distribution, analyzing rasterized versus ray-traced scenes, and interpreting results through the lens of key Nsight Graphics features such as the GPU Trace Profiler, Shader Pipelines, and Hotspots. Our methodology is rooted in NVIDIA’s Peak-Performance-Percentage analysis, which prioritizes GPU-level evidence over assumptions about engine behavior.

What Is NVIDIA Nsight Graphics?

NVIDIA Nsight Graphics serves as the graphics debugger and profiler for modern graphics APIs, including Direct3D 12 and Vulkan. Among its various functionalities, the most critical for performance analysis is the GPU Trace Profiler, which provides an in-depth view of GPU execution across frames. This tool is designed to analyze GPU-bound scenarios, tracing shader execution on Streaming Multiprocessors (SMs) and pinpointing opportunities for simultaneous graphics and compute tasks, also known as async compute.

The distinction between capture modes matters: Graphics Capture is meant for frame inspection and resource evaluation, whereas the GPU Trace Profiler is built for performance analysis. It shows activity on the graphics, compute, and copy queues alongside synchronization, timing, and shader-level profiling metrics. These insights help determine whether a bottleneck comes from occupancy, memory bandwidth, or insufficient overlap between GPU queues.

Recent updates to Nsight Graphics have enhanced the GPU Trace Profiler’s utility for game analysis. The Shader Profiler now features a Flame Graph, and in D3D12 applications, it can display workloads created by NVIDIA’s DLSS and related SDK features. This makes the tool particularly relevant for profiling modern PC games that leverage ray tracing and machine learning-assisted techniques.

Many GPU profiling tools exist, each with its own strengths. This guide focuses on Nsight Graphics because of its deep integration with NVIDIA hardware, which gives it access to hardware-level metrics and profiling capabilities on NVIDIA GPUs.

The Profiling Methodology: Peak-Performance-Percentage Analysis

Discussions of game performance often begin with assumptions. Observers may glance at a frame-time graph, note that ray tracing is enabled, and hastily conclude that it is the bottleneck. NVIDIA’s Peak-Performance-Percentage method instead starts from GPU utilization metrics: it identifies which hardware units are busiest and how close each one is to its maximum throughput, which NVIDIA calls Speed Of Light (SOL).

The methodology is straightforward: first, identify the costly GPU workload. Next, analyze high-level throughput data to determine which GPU unit is most likely constraining performance. If no unit approaches a high percentage of its theoretical throughput, the focus shifts to improving utilization. If a unit nears its limit, the goal becomes either reducing its workload or restructuring tasks to alleviate pressure. This disciplined approach contrasts sharply with the guesswork often associated with visual complexity.
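The decision step described above can be sketched in a few lines of Python. This is only an illustration: the unit names and percentages are hypothetical placeholders rather than values from our captures, and the 80% and 60% cut-offs (including the middle band between them) are assumptions chosen for the example, not thresholds read from Nsight Graphics.

```python
from typing import Dict, Tuple

# Illustrative cut-offs (assumptions, not values reported by Nsight Graphics).
HIGH_SOL = 80.0  # at or above this, the top unit is treated as near its limit
LOW_SOL = 60.0   # below this, the top unit is treated as under-utilized


def triage(sol_percent: Dict[str, float]) -> Tuple[str, float, str]:
    """Return (top unit, its SOL%, suggested direction) for one workload."""
    if not sol_percent:
        raise ValueError("need at least one unit throughput value")
    # The unit with the highest throughput percentage is the likely limiter.
    unit, value = max(sol_percent.items(), key=lambda kv: kv[1])
    if value >= HIGH_SOL:
        advice = "reduce or restructure the work sent to this unit"
    elif value < LOW_SOL:
        advice = "improve utilization (look at stalls and occupancy)"
    else:
        advice = "try both: trim work and remove stalls"
    return unit, value, advice


if __name__ == "__main__":
    # Hypothetical throughput percentages for a single GPU workload.
    sample = {"SM": 42.0, "L2": 83.5, "TEX": 61.0, "VRAM": 55.0}
    print(triage(sample))
```

The function deliberately returns the limiting unit together with the advice, since the same percentage means different things for different units.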

For this guide, we align this methodology with Nsight Graphics. The GPU Trace Profiler provides a comprehensive timeline and queue overview, while Top-Level Triage offers an initial assessment of the frame. The Shader Pipelines feature identifies the most demanding shader workloads, allowing us to delve into those shaders to discern whether issues stem from control-flow divergence, instruction mix, memory dependencies, or sheer workload size.

Test Setup And GPU Capture Strategy

We conducted two captures of Cyberpunk 2077 on a system equipped with the following specifications:

  • CPU: Intel Core i7-14700K;
  • RAM: 32 GB DDR5-7000 CL34;
  • Storage: 2 TB PCIe 4.0 NVMe SSD;
  • GPU: NVIDIA GeForce RTX 4090 24 GB;
  • Operating System: Windows 11 25H2;
  • All system firmware, drivers, BIOS, and OS updates were fully applied before testing.

The captures were performed in the same scene at a resolution of 2560×1440 (1440p), featuring two different scenarios:

  • High graphics preset with raster-only settings
  • High graphics preset with path tracing (RT Overdrive) plus DLSS Ray Reconstruction in Quality mode (a combination of upscaling and denoising for path-traced effects)

This setup effectively illustrates the progression in rendering complexity, contrasting traditional rasterization with the more demanding path-traced workload enhanced by DLSS Ray Reconstruction.

How To Profile A Game With Nsight Graphics

The Nsight Graphics workflow is straightforward once you are familiar with the tool.

1. Launch NVIDIA App and Authorize Access to GPU Perf Counters

Begin by opening the NVIDIA App, navigating to System > Advanced, and setting Manage GPU Performance Counters to All users. This allows Nsight to access the GPU’s performance counters without requiring Administrator privileges.

2. Launch Nsight Graphics and Create a New Project

Next, open Nsight Graphics and create a new project via File > New Project… in the top left toolbar.

3. Configure Capture Settings and Launch the Game Through Nsight

In the capture settings of the Start Activity window, select GPU Trace Profiler for performance analysis. Set Application Executable to the path of the game’s executable itself, then configure the following settings:

  • Timeline Metrics: Select Top-Level Triage
  • Enable Real-Time Shader Profiler
  • Leave advanced options (like Multi-Pass Metrics) disabled for a clean capture

These settings ensure that both high-level GPU utilization data and shader-level profiling information are collected. Click the Launch GPU Trace button to start the game through Nsight Graphics, which will monitor GPU activity in real time.

4. Navigate to a Stable Test Scene

Once the game loads, proceed to a representative scene (e.g., a dense city area), stop all camera movement, and wait a few seconds for shader compilation and GPU workload stabilization. Confirm that Nsight Graphics is ready for GPU captures by checking that the top left menu indicates Data Collection: Ready (capture hotkey).

5. Capture a Frame (F11)

Press F11 to trigger a GPU Trace Profiler capture. Nsight will record GPU timestamps and metrics, capture shader execution data, and generate a detailed timeline of the frame. You can then return to Nsight Graphics to view the captured frame, renaming it as desired.

6. Open the GPU Trace Profiler Report

After opening the GPU capture in Nsight, you will see the main Timeline view, which includes:

  • Graphics queue
  • Compute queue
  • Per-GPU unit utilization metrics
  • Shader Pipelines
  • Flame Graph
  • Hotspots

The GPU Trace Profiler utilizes timestamps to construct a detailed timeline of draw and dispatch events alongside their execution duration.
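As a rough illustration of what such a timeline encodes, the sketch below turns begin/end timestamps into busy time per queue and measures how long two queues were executing work simultaneously, which is the overlap that async compute aims for. The interval tuples and their values are hypothetical; this is not the export format or API of Nsight Graphics.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (begin_ms, end_ms) of one draw or dispatch


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping intervals so busy time is not counted twice."""
    merged: List[Interval] = []
    for begin, end in sorted(intervals):
        if merged and begin <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((begin, end))
    return merged


def busy_time(intervals: List[Interval]) -> float:
    """Total time during which the queue was executing at least one event."""
    return sum(end - begin for begin, end in merge(intervals))


def overlap_time(a: List[Interval], b: List[Interval]) -> float:
    """Total time during which both queues were executing work."""
    a, b = merge(a), merge(b)
    i = j = 0
    total = 0.0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


if __name__ == "__main__":
    # Hypothetical events on a graphics queue and a compute queue.
    graphics = [(0.0, 2.0), (1.5, 3.0), (4.0, 5.0)]
    compute = [(2.5, 4.5)]
    print("graphics busy (ms):", busy_time(graphics))
    print("overlap (ms):", overlap_time(graphics, compute))
```

A small overlap relative to the compute queue's busy time would point at the "insufficient overlap between GPU queues" case mentioned earlier.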

Case Study: Cyberpunk 2077

To apply the profiling methodology, we captured and analyzed two representative GPU traces from Cyberpunk 2077 using the GPU Trace Profiler in NVIDIA Nsight Graphics. The first capture covers a traditional rasterized workload (1440p, High preset, no ray or path tracing) and provides a baseline for understanding how CD Projekt RED’s REDengine 4 structures and renders a frame. The second capture uses the game’s most demanding configuration, path tracing (RT Overdrive) combined with DLSS Ray Reconstruction in Quality mode, and lets us examine how the workload changes with advanced ray tracing and AI-assisted techniques. Comparing the two scenarios shows how GPU workloads grow in complexity and how Nsight Graphics exposes the underlying bottlenecks.

Important note: This analysis was conducted on the retail version of Cyberpunk 2077, which does not expose the internal performance markers that would normally label individual render passes. Instead, we relied on the GPU Trace Profiler, Shader Pipelines, and Hotspots to infer the workload structure and identify performance hotspots from the available data.

Rasterized Frame Trace

Initially, we captured and analyzed a single frame from Cyberpunk 2077 running at 1440p using the High preset, with all ray-traced effects disabled. This provides a clean baseline for understanding the structure of a modern rasterized frame and identifying performance limitations.

Frame Structure and Workload Distribution

By integrating GPU Trace Profiler timeline data with queue-level exports, we can reconstruct the frame into major workload regions. Although the absence of developer markers limits exact naming of render passes, the structure remains identifiable:

[Frame setup / shadow maps / depth pre-pass]

[Main scene geometry / G-buffer (draw-heavy raster)]

[Lighting and late scene processing]

[Indirect lighting, screen-space effects]

[Temporal anti-aliasing, post-processing, tone mapping]

[UI / HUD rendering]

A key observation is that the frame is heavily back-loaded, with the largest region—occurring late in the frame—accounting for approximately 48.5% of total queued GPU time. This region is characterized by a high density of compute shader dispatches and memory management operations specific to Direct3D 12.

In essence, the most costly aspect of this raster frame is not the geometry rendering but rather the processing of lighting after the geometry has been rendered, aligning with the deferred rendering approach of modern AAA game engines.
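The share quoted above is simply a region's duration divided by the total across all regions. A minimal sketch of that arithmetic follows; the region names and durations are made up for illustration and are not taken from our capture.

```python
from typing import Dict


def region_shares(durations_ms: Dict[str, float]) -> Dict[str, float]:
    """Return each region's share of the total time, in percent."""
    total = sum(durations_ms.values())
    if total <= 0:
        raise ValueError("total duration must be positive")
    return {name: 100.0 * d / total for name, d in durations_ms.items()}


def largest_region(durations_ms: Dict[str, float]) -> str:
    """Name of the region that takes the most time."""
    return max(durations_ms, key=durations_ms.get)


if __name__ == "__main__":
    # Hypothetical per-region durations in milliseconds.
    regions = {"early": 1.0, "geometry": 2.0, "late": 5.0}
    for name, share in region_shares(regions).items():
        print(f"{name}: {share:.1f}%")
    print("largest:", largest_region(regions))
```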

Hotspot Analysis

The Shader Profiler Hotspots section provides the key evidence: samples in the dominant compute shader are concentrated on a small number of instructions, particularly:

  • sampleLevel texture fetch operations
  • Memory-dependent instructions with high latency

Primary stall reasons include:

  • Long Scoreboard (LGSB)
  • TEX Throttle

These stalls indicate that the shader’s performance is hindered not by computational complexity but by waiting on data, reinforcing our earlier frame-wide diagnosis.
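One way to make this reading explicit is to group stall reasons by what the warp is waiting on and compare sample counts. The sketch below treats only the two stall reasons named above as data-wait stalls; the "Other" bucket and all sample counts are hypothetical and are not data from our capture.

```python
from typing import Dict

# Stall reasons from the list above that indicate waiting on texture/memory data.
DATA_WAIT_STALLS = {"Long Scoreboard", "TEX Throttle"}


def data_wait_share(samples: Dict[str, int]) -> float:
    """Percentage of stall samples attributed to waiting on data."""
    total = sum(samples.values())
    if total == 0:
        return 0.0
    waiting = sum(n for reason, n in samples.items() if reason in DATA_WAIT_STALLS)
    return 100.0 * waiting / total


if __name__ == "__main__":
    # Hypothetical stall sample counts for one shader.
    samples = {"Long Scoreboard": 600, "TEX Throttle": 200, "Other": 200}
    print(f"data-wait share: {data_wait_share(samples):.1f}%")
```

A high share suggests the shader spends most of its stalled time waiting for data rather than issuing arithmetic.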

Trace Analysis

Nsight’s Trace Analysis highlights several top issues:

  • L2 Limited
  • Warp Stalled by L1TEX Long Scoreboard
  • Warp Stalled by TEX Throttle

Under NVIDIA’s Peak-Performance-Percentage method, this combination suggests a workload limited by the memory subsystem rather than by compute throughput, which underlines how much memory behavior matters for modern rendering performance.

Rethinking Raster Performance

This case study underscores a critical reality of contemporary GPU performance:

Traditional rasterization alone no longer defines frame cost.

  • The majority of GPU time is spent after geometry submission.
  • The dominant workload is compute-driven lighting processing.
  • Performance is governed by data movement and memory behavior, not solely shader arithmetic.

This insight reveals why metrics such as teraflops and raw memory bandwidth often fail to capture the nuances of real-world GPU gaming performance.

Path-Traced (with DLSS RR Quality Mode) Frame Trace

Switching to path tracing significantly alters the GPU workload, increasing total frame cost from approximately 5.9 milliseconds (ms) in the raster case to around 11 ms in this capture, even with DLSS Ray Reconstruction’s optimizations. More importantly, it reshapes the distribution of time spent across rendering tasks.
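To put the two frame costs in perspective, the short calculation below converts them into a relative increase and into a frame-rate ceiling. It assumes the game is fully GPU-bound and that no frame generation is active, so the resulting figures are upper bounds implied by GPU frame time, not measured frame rates.

```python
def fps_ceiling(gpu_frame_ms: float) -> float:
    """Frame-rate upper bound implied by GPU frame time alone."""
    if gpu_frame_ms <= 0:
        raise ValueError("frame time must be positive")
    return 1000.0 / gpu_frame_ms


RASTER_MS = 5.9        # approximate raster frame cost quoted above
PATH_TRACED_MS = 11.0  # approximate path-traced frame cost quoted above

if __name__ == "__main__":
    print(f"raster ceiling: {fps_ceiling(RASTER_MS):.1f} fps")            # 169.5
    print(f"path-traced ceiling: {fps_ceiling(PATH_TRACED_MS):.1f} fps")  # 90.9
    print(f"cost ratio: {PATH_TRACED_MS / RASTER_MS:.2f}x")               # 1.86
    print(f"added cost: {PATH_TRACED_MS - RASTER_MS:.1f} ms")             # 5.1
```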

Rather than a single dominant ray tracing pass, the frame evolves into a hybrid workload where rasterization, ray traversal, and AI-driven temporal reconstruction and denoising are intricately intertwined.

Despite the presence of explicit DispatchRays calls, the analysis reveals that the frame remains primarily compute-driven, with significant time allocated to lighting processing, temporal reconstruction, and denoising. The most demanding shaders continue to be compute shaders, underscoring the complexity of modern rendering pipelines.

Nsight Graphics’s Trace Analysis clarifies that the frame is primarily L2- and memory-path limited, with dominant stall reasons indicating frequent waits for data from the cache/memory subsystem. This suggests that the limiting factor is not merely the throughput of RT cores but the GPU’s ability to efficiently manage data flow through its memory hierarchy.

This observation carries significant implications: ray tracing performance does not scale linearly with RT-core throughput. Even with increased ray/triangle intersection rates, real-world performance remains constrained by factors such as cache locality and memory latency.

In summary, modern path-traced rendering is best understood as a compute and memory-bound pipeline, where the interplay of ray tracing, temporal reconstruction, and denoising forms a tightly coupled system, with memory efficiency playing a pivotal role in overall performance.
