NCP-AII Exam - NVIDIA Actual Exam Questions and Answers - 300 Questions

Question No : 1

You are tasked with ensuring optimal power efficiency for a GPU server running machine learning workloads. You want to dynamically adjust the GPU’s power consumption based on its utilization.
Which of the following methods is the MOST suitable for achieving this, assuming the server’s BIOS and the NVIDIA drivers support it?

A.Manually set the GPU’s power limit using ‘nvidia-smi -pl’ and create a script to monitor utilization and adjust the power limit periodically.
B.Configure the server’s BIOS/UEFI to use a power-saving profile, which will automatically reduce the GPU’s power consumption when idle.
C.Enable Dynamic Boost in the NVIDIA Control Panel (if available), which will automatically allocate power between the CPU and GPU based on their current needs.
D.Use NVIDIA’s Data Center GPU Manager (DCGM) to monitor GPU utilization and dynamically adjust the power limit based on a predefined policy.
E.Disable ECC (Error Correcting Code) on the GPU to reduce power consumption.

Answer:
Explanation:
DCGM provides the most comprehensive and automated solution for dynamic power management. It can monitor GPU utilization in real time and adjust the power limit based on predefined policies, ensuring optimal power efficiency without manual intervention. Manually adjusting the power limit is possible but requires scripting and continuous monitoring. Dynamic Boost is typically for laptops, and BIOS power profiles may not be fine-grained enough. Disabling ECC reduces power but compromises data integrity.
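
To make the utilization-driven idea concrete, here is a minimal sketch of the kind of script option A describes: it maps a utilization percentage to a power cap and builds the matching ‘nvidia-smi’ call. The function names and the 100-400 W range are hypothetical, and a DCGM policy would be configured through DCGM itself rather than through a script like this.

```python
def choose_power_limit(utilization_pct, min_watts, max_watts):
    """Linearly scale the power cap between min_watts and max_watts."""
    if not 0 <= utilization_pct <= 100:
        raise ValueError("utilization must be between 0 and 100")
    span = max_watts - min_watts
    return round(min_watts + span * utilization_pct / 100)


def power_limit_command(gpu_index, watts):
    """Build the nvidia-smi call that sets a power limit (requires root)."""
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


if __name__ == "__main__":
    # Hypothetical range; the real one can be read with `nvidia-smi -q -d POWER`.
    watts = choose_power_limit(50, min_watts=100, max_watts=400)
    print(" ".join(power_limit_command(0, watts)))
```

A linear mapping is only one possible policy; a production setup would more likely use step thresholds with hysteresis so the limit does not change on every sample.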

Question No : 2

You are setting up a virtualized environment (using VMware vSphere) to run GPU-accelerated workloads. You have multiple physical GPUs in your server and want to assign specific GPUs to different virtual machines (VMs) for dedicated access.
Which vSphere technology would BEST support this?

A.VMware vMotion
B.VMware High Availability (HA)
C.VMware DirectPath I/O (Passthrough)
D.VMware vGPU
E.VMware DRS (Distributed Resource Scheduler)

Answer:
Explanation:
VMware DirectPath I/O (Passthrough) allows a VM to have exclusive access to a physical PCIe device, such as a GPU. This provides the best performance because the VM can directly access the GPU without virtualization overhead. vGPU allows sharing of a GPU among multiple VMs, but DirectPath I/O provides dedicated access. vMotion migrates VMs. HA restarts VMs after failure. DRS balances resources across hosts.

Question No : 3

Consider a scenario where you are setting up a high-performance computing cluster with several GPU-accelerated nodes using Slurm as the resource manager. You want to ensure that jobs requesting GPUs are only scheduled on nodes with the appropriate NVIDIA drivers and CUDA toolkit installed.
How can you achieve this within Slurm?

A.Use Slurm’s ‘GresTypes’ configuration option in ‘slurm.conf’ to define a generic resource type called ‘gpu’ and then configure each node to advertise the available GPUs. Slurm will automatically ensure that jobs requesting GPUs are only scheduled on nodes with the ‘gpu’ resource.
B.Create a custom Slurm script that checks for the presence of the NVIDIA driver and CUDA toolkit before submitting a job to a node. If the requirements are not met, the job is rejected.
C.Use Slurm’s node features to tag nodes with the ‘Feature=’ keyword in ‘slurm.conf’. For example, tag nodes with GPUs as ‘Feature=gpu’. Jobs can then request nodes with the ‘gpu’ feature using the ‘--constraint’ option.
D.Install the NVIDIA Data Center GPU Manager (DCGM) on each node and configure Slurm to query DCGM for GPU availability and health. Slurm will then only schedule jobs on healthy and available GPUs.
E.Utilize Slurm’s Prolog and Epilog scripts to dynamically install the necessary NVIDIA drivers and CUDA toolkit on each node before and after a job runs. This ensures that the required software is always available.

Answer:
Explanation:
Using Slurm’s node features is the most straightforward and recommended approach for tagging nodes with specific capabilities. The ‘--constraint’ option allows jobs to request nodes with particular features. GresTypes can be used, but node features provide more flexibility and control. Installing drivers dynamically is impractical and inefficient. DCGM is primarily for monitoring, not core scheduling requirements.
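
The feature-tagging approach can be sketched as below. The helpers only format a ‘slurm.conf’ node entry and an ‘sbatch’ call; the node name, the feature name ‘cuda12’, and the script name are hypothetical, and a real NodeName line would carry further attributes (CPUs, memory, and so on).

```python
def node_line(name, gpus, features):
    """Format a slurm.conf NodeName entry with GRES and feature tags."""
    return f"NodeName={name} Gres=gpu:{gpus} Feature={','.join(features)}"


def sbatch_command(script, feature, gpus=1):
    """Build an sbatch call that requires a node feature and requests GPUs."""
    return ["sbatch", f"--constraint={feature}", f"--gres=gpu:{gpus}", script]


if __name__ == "__main__":
    # Hypothetical node and feature names, for illustration only.
    print(node_line("gpu-node01", 4, ["gpu", "cuda12"]))
    print(" ".join(sbatch_command("train.sh", "cuda12")))
```

Tagging the software stack as a feature keeps the scheduling rule declarative: the administrator updates the tag when a node is upgraded, and jobs never need to probe the node themselves.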

Question No : 4

You need to remotely monitor the GPU temperature and utilization of a server without installing any additional software on the server itself.
Assuming you have network access to the server’s BMC (Baseboard Management Controller), which protocol and standard data format would BEST facilitate this?

A.SNMP (Simple Network Management Protocol) with MIB (Management Information Base)
B.HTTP with JSON
C.SSH with plain text output from ‘nvidia-smi’
D.IPMI (Intelligent Platform Management Interface) with SDR (Sensor Data Records)
E.Syslog with CSV (Comma-separated Values)

Answer:
Explanation:
IPMI is a standard interface for out-of-band server management, commonly used for monitoring hardware sensors like temperature and utilization. BMCs typically support IPMI. SDRs are the data format used by IPMI for sensor data. SNMP is also an option, but IPMI is more directly tied to hardware monitoring. The rest are less efficient or require additional software installation.
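
As an illustration of consuming SDR data, the sketch below parses text in the three-column layout printed by ‘ipmitool sdr list’ (name | reading | status). The sensor names and values in the sample are hypothetical; which GPU sensors a BMC exposes over IPMI, if any, depends on the server vendor.

```python
SAMPLE = """\
GPU1 Temp        | 54 degrees C      | ok
GPU2 Temp        | 57 degrees C      | ok
Fan1             | 8400 RPM          | ok
"""


def parse_sdr(text):
    """Parse 'name | reading | status' lines into {name: (reading, status)}."""
    sensors = {}
    for line in text.splitlines():
        parts = [part.strip() for part in line.split("|")]
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        name, reading, status = parts
        sensors[name] = (reading, status)
    return sensors


def temperatures_c(sensors):
    """Extract integer Celsius readings from parsed sensors."""
    temps = {}
    for name, (reading, _status) in sensors.items():
        fields = reading.split()
        if reading.endswith("degrees C") and fields[0].isdigit():
            temps[name] = int(fields[0])
    return temps


if __name__ == "__main__":
    print(temperatures_c(parse_sdr(SAMPLE)))
```

In practice the text would come from a remote query against the BMC, for example ‘ipmitool -I lanplus -H <bmc-address> -U <user> sdr list’, so nothing needs to be installed on the server’s operating system.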

Question No : 5

You are configuring a server with multiple GPUs for CUDA-aware MPI.
Which environment variable is critical for ensuring proper GPU affinity, so that each MPI process uses the correct GPU?

A.CUDA_VISIBLE_DEVICES
B.CUDA_DEVICE_ORDER
C.LD_LIBRARY_PATH
D.MPI_GPU_SUPPORT
E.CUDA_LAUNCH_BLOCKING=1

Answer:
Explanation:
‘CUDA_VISIBLE_DEVICES’ is essential for GPU affinity. It allows you to specify which GPUs are visible to a particular process. Without it, all processes might try to use the same GPU, leading to performance bottlenecks. ‘CUDA_DEVICE_ORDER’ controls the order in which GPUs are enumerated. ‘LD_LIBRARY_PATH’ specifies the path to shared libraries. ‘MPI_GPU_SUPPORT’ is hypothetical. ‘CUDA_LAUNCH_BLOCKING=1’ forces synchronous CUDA calls.
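
One common way to apply this is a small wrapper that pins each MPI process to a GPU based on its node-local rank. The sketch below assumes Open MPI, which exports ‘OMPI_COMM_WORLD_LOCAL_RANK’; other launchers use different variable names, and the helper name and round-robin mapping are illustrative choices.

```python
import os


def gpu_env_for_local_rank(local_rank, gpus_per_node, base_env=None):
    """Return an environment that exposes exactly one GPU to this rank."""
    env = dict(os.environ if base_env is None else base_env)
    # Enumerate GPUs by PCI bus ID so indices match nvidia-smi ordering.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Round-robin mapping: local rank 0 -> GPU 0, rank 1 -> GPU 1, ...
    env["CUDA_VISIBLE_DEVICES"] = str(local_rank % gpus_per_node)
    return env


if __name__ == "__main__":
    # Open MPI sets this per process; default to 0 when run standalone.
    rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
    env = gpu_env_for_local_rank(rank, gpus_per_node=8)
    print(rank, env["CUDA_VISIBLE_DEVICES"])
```

The variables must be set before the CUDA runtime initializes in the process, which is why this is usually done in a launcher wrapper rather than inside the application.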

Question No : 6

You’ve installed a server with multiple NVIDIA A100 GPUs intended for use with Kubernetes and NVIDIA’s GPU Operator. After installing the GPU Operator, you notice that the GPUs are not being properly detected and managed by Kubernetes.
Which of the following are potential causes and troubleshooting steps you should take?

A.The NVIDIA drivers are not properly installed on the host operating system before installing the GPU Operator. Verify the driver installation using ‘nvidia-smi’.
B.The Kubernetes nodes are not labeled correctly to indicate the presence of NVIDIA GPUs. Use ‘kubectl label node nvidia.com/gpu.present=true’.
C.The NVIDIA Container Toolkit is not installed on the Kubernetes nodes. Install the toolkit according to NVIDIA’s documentation.
D.The GPU Operator’s configuration is incorrect, preventing it from properly discovering and managing the GPUs. Check the GPU Operator’s logs and configuration files.
E.The ‘nvidia-docker2’ runtime is not set as the default runtime in ‘/etc/docker/daemon.json’. Change the default runtime to ‘nvidia’ and restart the Docker daemon.

Answer:
Explanation:
All the options are valid reasons. The NVIDIA driver must be present on the host, the nodes need to be labeled to be recognized by Kubernetes, the Container Toolkit is required for running GPU-enabled containers, and the GPU Operator’s configuration must be correct.
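
For the Docker runtime check in option E, the sketch below shows the shape of a ‘daemon.json’ that makes ‘nvidia’ the default runtime, merged into whatever settings already exist. It applies only to Docker-based nodes (containerd clusters are configured differently), and the helper name is hypothetical.

```python
import json


def nvidia_daemon_config(existing=None):
    """Return a Docker daemon config with 'nvidia' as the default runtime."""
    config = dict(existing or {})
    runtimes = dict(config.get("runtimes", {}))
    runtimes["nvidia"] = {"path": "nvidia-container-runtime", "runtimeArgs": []}
    config["runtimes"] = runtimes
    config["default-runtime"] = "nvidia"
    return config


if __name__ == "__main__":
    # Example of preserving an unrelated, pre-existing setting.
    current = {"log-driver": "json-file"}
    print(json.dumps(nvidia_daemon_config(current), indent=2))
```

After writing the result to ‘/etc/docker/daemon.json’, the Docker daemon has to be restarted for the change to take effect.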

Question No : 7

A user reports that their GPU-accelerated application is crashing with a CUDA error related to ‘out of memory’. You have confirmed that the GPU has sufficient physical memory.
What are the likely causes and troubleshooting steps?

A.The application is leaking GPU memory. Use a memory profiling tool like ‘cuda-memcheck’ to identify the source of the leak.
B.The application is requesting a larger block of memory than is available in a single allocation. Try breaking the allocation into smaller chunks or using managed memory.
C.The CUDA driver version is incompatible with the CUDA runtime version used by the application. Update the CUDA driver to match the runtime version.
D.The process has exceeded the maximum number of GPU contexts allowed. Reduce the number of concurrent CUDA applications running on the GPU.
E.The system’s virtual memory is exhausted. Increase the swap space.

Answer:
Explanation:
Memory leaks and single-allocation limits are common causes of ‘out of memory’ errors, even when sufficient physical memory exists. ‘cuda-memcheck’ is specifically designed to find memory errors in CUDA applications. While driver incompatibility is possible, leaks and allocation size limits are more frequent occurrences.
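
The "smaller chunks" suggestion in option B amounts to planning several bounded allocations instead of one large one. The sketch below only computes the chunk sizes; the helper name and the byte counts are hypothetical, and the actual GPU allocation calls are left out.

```python
def plan_chunks(total_bytes, max_chunk_bytes):
    """Split one large request into chunk sizes no bigger than max_chunk_bytes."""
    if total_bytes < 0 or max_chunk_bytes <= 0:
        raise ValueError("sizes must be non-negative and chunk size positive")
    full_chunks, remainder = divmod(total_bytes, max_chunk_bytes)
    sizes = [max_chunk_bytes] * full_chunks
    if remainder:
        sizes.append(remainder)
    return sizes


if __name__ == "__main__":
    gib = 1024 ** 3
    # A 10 GiB request served as allocations of at most 4 GiB each.
    print([size // gib for size in plan_chunks(10 * gib, 4 * gib)])
```

Chunking helps when the failure comes from fragmentation or a per-allocation limit; it does not help if the total requested memory exceeds what is free on the device.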

Question No : 8

You are installing a GPU server in a data center with limited cooling capacity.
Which of the following server configuration choices would BEST help minimize the server’s thermal output, without significantly compromising performance? Assume all options are compatible.

A.Choose GPUs with a lower TDP (Thermal Design Power), even if it means using older generation GPUs.
B.Use a passively cooled CPU to reduce fan noise and power consumption.
C.Configure the BIOS/UEFI to aggressively throttle CPU and GPU frequencies under heavy load.
D.Implement liquid cooling for the GPUs and CPUs.
E.Increase the ambient temperature of the data center to reduce the temperature differential.

Answer:
Explanation:
Liquid cooling is the most effective way to remove heat from high-power components like GPUs and CPUs, allowing them to operate at their maximum performance without overheating. Choosing lower-TDP GPUs reduces thermal output but also significantly reduces performance. Throttling frequencies limits heat at the cost of performance, whereas liquid cooling enables optimal performance within thermal constraints. Raising the data center’s ambient temperature can lower cooling costs but works against reducing server temperatures.

Question No : 9

When installing a GPU driver on a Linux system that already has a previous driver version installed, what is the recommended procedure to ensure a clean and stable installation?

A.Simply install the new driver package using ‘apt install’ or ‘yum install’ without removing the old driver.
B.Blacklist the nouveau driver, download the CUDA toolkit, and run the installation script with default options.
C.Purge the existing NVIDIA driver packages using ‘apt purge nvidia-*’ or ‘yum remove nvidia-*’, reboot the system, and then install the new driver package.
D.Run ‘nvidia-uninstall’ if it exists, otherwise manually remove the NVIDIA kernel modules and libraries from ‘/lib/modules’ and ‘/usr/lib’.
E.Install the new driver using the ‘.run’ file from NVIDIA’s website, accepting all default options.

Answer:
Explanation:
Purging the existing drivers using the package manager ensures that all related files and configurations are removed, preventing conflicts with the new driver. Rebooting after purging allows the system to load without the old drivers. While using the .run file is an option, using the package manager (if available) is generally preferred for easier management.

Question No : 10

You need to verify the NVLink connectivity between GPUs in a DGX server.
Which command-line utility is the MOST reliable and provides detailed NVLink status?

A.nvidia-smi
B.lspci
C.nvlink_info (Hypothetical command)
D.gpustat
E.dcgmi diag -t 1004

Answer:
Explanation:
‘dcgmi diag -t 1004’ is the correct command. ‘nvidia-smi’ provides basic GPU information, but ‘dcgmi diag -t 1004’ (part of the Data Center GPU Manager) provides specific diagnostic tests for NVLink connectivity. ‘lspci’ lists PCIe devices, not specifically NVLink. ‘gpustat’ is a monitoring tool. ‘nvlink_info’ is hypothetical.

Question No : 11

You are deploying a multi-GPU server for deep learning training. After installing the GPUs, the system boots, but ‘nvidia-smi’ only detects one GPU. The motherboard has multiple PCIe slots, all of which are physically capable of supporting GPUs.
What is the most probable cause?

A.The other GPUs are not properly seated in their PCIe slots. Reseat the GPUs and ensure they are securely connected.
B.The other GPUs are faulty and need to be replaced. Test each GPU individually to confirm their functionality.
C.The system BIOS/UEFI is not configured to enable all PCIe slots or the PCIe lanes are not allocated correctly. Check the BIOS/UEFI settings to enable all slots and configure the PCIe lane allocation (e.g., x16/x8/x8).
D.The NVIDIA drivers are not installed correctly or are incompatible with the GPUs. Reinstall the drivers and ensure they are compatible with the specific GPU model and CUDA version.
E.The power supply is not providing enough power to all GPUs. Upgrade to a higher wattage power supply.

Answer:
Explanation:
Incorrect BIOS/UEFI settings are the most likely cause when GPUs are physically present but not detected. The BIOS controls PCIe lane allocation and slot enabling. Reseating GPUs is a good first step, but if the BIOS is misconfigured, it won’t resolve the issue. Insufficient power is also a possibility, but BIOS configuration is more common in initial setup.
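
A quick way to narrow this down is to compare how many GPUs the PCIe bus reports with how many the driver reports. The sketch below counts both from captured ‘lspci’ and ‘nvidia-smi -L’ text; the sample lines are illustrative rather than real captures, and the suggested layer is a heuristic, not a diagnosis.

```python
LSPCI_SAMPLE = "17:00.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)\n"
SMI_SAMPLE = "GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-00000000-0000-0000-0000-000000000000)\n"


def count_pci_gpus(lspci_text):
    """Count NVIDIA display-class devices in `lspci` output."""
    classes = ("3d controller", "vga compatible controller")
    count = 0
    for line in lspci_text.splitlines():
        lowered = line.lower()
        if "nvidia" in lowered and any(cls in lowered for cls in classes):
            count += 1
    return count


def count_driver_gpus(smi_list_text):
    """Count GPUs listed by `nvidia-smi -L` (one 'GPU n:' line per device)."""
    return sum(1 for line in smi_list_text.splitlines() if line.startswith("GPU "))


def likely_layer(expected, pci_count, driver_count):
    """Suggest where to look first, based on which count falls short."""
    if pci_count < expected:
        return "hardware/BIOS: check seating, slot enablement, lane allocation"
    if driver_count < pci_count:
        return "driver: check installation and kernel logs"
    return "ok"


if __name__ == "__main__":
    pci = count_pci_gpus(LSPCI_SAMPLE)
    drv = count_driver_gpus(SMI_SAMPLE)
    print(likely_layer(expected=4, pci_count=pci, driver_count=drv))
```

If the bus itself shows fewer devices than are installed, no driver change will help, which is consistent with checking BIOS/UEFI settings and seating before reinstalling drivers.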

Question No : 12

You are tasked with installing a DGX A100 server. After racking and connecting power and network cables, you power it on, but the BMC (Baseboard Management Controller) is not accessible via the network. You have verified the network cable is connected and the switch port is active.
What are the MOST likely causes and initial troubleshooting steps you should take?
A. The BMC IP address is not configured or is on a different subnet. Check the BMC’s network configuration using the DGX’s front panel or via serial console. Verify DHCP is enabled and functioning or manually configure a static IP address.
B. The BMC firmware is corrupted and needs to be reflashed using a USB drive. Check the DGX support site for the latest BMC firmware.
C. The BMC is not powered on because the main power supply is faulty. Verify the power supply LEDs are lit and providing power to the system.
D. The network switch port is not configured for the correct VLAN. Verify the switch port configuration to ensure it is on the same VLAN as the BMC.
E. The BMC is faulty and needs to be replaced. Contact NVIDIA support for RMA.

Answer: A,D
Explanation:
The most likely causes are network configuration issues (incorrect IP, subnet, or VLAN). The BMC requires a valid IP configuration and network connectivity to be accessible. While other options are possible, they are less common as initial causes.
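
The subnet check in option A can be done with Python’s standard ‘ipaddress’ module. The addresses below are hypothetical; the sketch only tests whether a BMC address falls inside the management host’s network.

```python
import ipaddress


def same_subnet(host_cidr, bmc_ip):
    """Return True if bmc_ip lies in the network of host_cidr (e.g. '10.0.10.5/24')."""
    network = ipaddress.ip_interface(host_cidr).network
    return ipaddress.ip_address(bmc_ip) in network


if __name__ == "__main__":
    # Hypothetical management workstation and BMC addresses.
    print(same_subnet("10.0.10.5/24", "10.0.10.20"))
    print(same_subnet("10.0.10.5/24", "10.0.20.20"))
```

A BMC on a different subnet is still reachable if routing exists between the networks, so a failed check points to either re-addressing the BMC or verifying the route and VLAN.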

Question No : 13

You are deploying a multi-tenant AI infrastructure with strict isolation requirements.
Which network technology would be most suitable for creating isolated virtual networks for each tenant?

A.VLANs (Virtual LANs)
B.VXLAN (Virtual Extensible LAN)
C.QinQ (802.1ad)
D.GRE (Generic Routing Encapsulation)
E.IPsec

Answer:
Explanation:
VXLAN is most suitable for multi-tenant environments because it provides a larger address space (24-bit VNI) compared to VLANs (12-bit VLAN ID), allowing for a greater number of isolated networks. VXLAN also supports Layer 2 connectivity across Layer 3 networks, facilitating VM mobility across different subnets. While QinQ can extend the VLAN ID space, it’s not as scalable as VXLAN. GRE provides tunneling but doesn’t inherently provide isolation. IPsec is primarily for secure communication.
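
The difference in identifier space follows directly from the field widths named above, as this short calculation shows.

```python
VLAN_ID_BITS = 12    # width of the 802.1Q VLAN ID field
VXLAN_VNI_BITS = 24  # width of the VXLAN Network Identifier


def id_space(bits):
    """Number of distinct identifiers an n-bit field can encode."""
    return 2 ** bits


if __name__ == "__main__":
    vlans = id_space(VLAN_ID_BITS)
    vnis = id_space(VXLAN_VNI_BITS)
    print(vlans, vnis, vnis // vlans)
```

A 12-bit field gives 4,096 values (4,094 usable VLANs, since IDs 0 and 4095 are reserved), while a 24-bit VNI gives 16,777,216, a factor of 4,096 more.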

Question No : 14

In a data center utilizing NVIDIA GPUs and NVLink, what is the primary advantage of using a direct-attached NVLink network topology compared to routing traffic over the network?

A.Increased network security
B.Higher bandwidth and lower latency between GPUs
C.Reduced cost of network infrastructure
D.Simplified network configuration
E.Improved power efficiency

Answer:
Explanation:
Direct-attached NVLink provides significantly higher bandwidth and lower latency compared to routing traffic over a traditional network. This is crucial for applications that require intensive GPU-to-GPU communication, such as large-scale AI training. While direct-attached NVLink can simplify configuration in some cases, its primary advantage is the performance improvement.

Question No : 15

You are tasked with designing a high-performance network for a large-scale recommendation system. The system requires low latency and high throughput for both training and inference.
Which interconnect technology is MOST suitable for connecting the nodes within the cluster?

A.Gigabit Ethernet
B.10 Gigabit Ethernet
C.InfiniBand
D.Fibre Channel
E.100 Gigabit Ethernet

Answer:
Explanation:
InfiniBand is designed for high-performance computing and offers significantly lower latency and higher bandwidth compared to Ethernet or Fibre Channel, making it the most suitable choice for demanding workloads like recommendation systems. While 100 Gigabit Ethernet provides high bandwidth, InfiniBand generally offers lower latency.
