05busid: locating the motherboard slot of a faulty GPU

  1. Run nvidia-smi to list the GPUs and their Bus-Ids

    Fri Aug 16 15:47:10 2024
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 550.76                 Driver Version: 550.76         CUDA Version: 12.4     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA GeForce RTX 4090 D      On  |   00000000:16:00.0 Off |                  Off |
    |  0%   45C    P8             25W /  425W |       2MiB /  24564MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   1  NVIDIA GeForce RTX 4090 D      On  |   00000000:34:00.0 Off |                  Off |
    |  0%   37C    P8             20W /  425W |       2MiB /  24564MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   2  NVIDIA GeForce RTX 4090 D      On  |   00000000:52:00.0 Off |                  Off |
    |  0%   40C    P8             17W /  425W |       2MiB /  24564MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   3  NVIDIA GeForce RTX 4090 D      On  |   00000000:CA:00.0 Off |                  Off |
    | 30%   44C    P2            223W /  425W |   16108MiB /  24564MiB |     99%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    
    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |    3   N/A  N/A     15863      C   PWmat                                       16100MiB |
    +-----------------------------------------------------------------------------------------+
    
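  2. Taking GPU 3 as an example, note its Bus-Id, 00000000:CA:00.0, and search the DMI slot table for it (dmidecode needs root). nvidia-smi prints an 8-digit PCI domain in upper case, while dmidecode prints a 4-digit domain in lower case, so grep case-insensitively for the bus:device.function part only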

    dmidecode -t slot | grep -i -10 CA:00.0
    Handle 0x000D, DMI type 9, 17 bytes
    System Slot Information
        Designation: CPU SLOT1 PCIe 5.0 X16
        Type: x16 <OUT OF SPEC>
        Current Usage: In Use
        Length: Long
        Characteristics:
                3.3 V is provided
                Opening is shared
                PME signal is supported
        Bus Address: 0000:ca:00.0
    

    GPU 3 is therefore installed in the motherboard PCIe slot with Designation: CPU SLOT1 PCIe 5.0 X16
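
The two steps above can be sketched as a pair of shell helpers. This is a minimal sketch, not part of any tool: the function names `normalize_bus_id` and `slot_for_bus_id` are hypothetical, and it assumes the dmidecode output has the `Designation:` / `Bus Address:` layout shown above.

```shell
#!/usr/bin/env bash
# Sketch: map an nvidia-smi Bus-Id to a dmidecode slot designation.
# Both function names are hypothetical helpers.

# nvidia-smi prints an 8-digit PCI domain in upper case (00000000:CA:00.0);
# dmidecode prints a 4-digit domain in lower case (0000:ca:00.0).
normalize_bus_id() {
    local id
    id=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    # Drop the four extra leading domain digits if the domain is 8 digits long.
    case "$id" in
        ????????:??:??.?) id=${id#????} ;;
    esac
    printf '%s\n' "$id"
}

# Read `dmidecode -t slot` output on stdin and print the Designation of the
# slot whose Bus Address equals the given (already normalized) bus ID.
slot_for_bus_id() {
    awk -v want="$1" '
        /^[[:space:]]*Designation:/ {
            sub(/^[[:space:]]*Designation:[[:space:]]*/, "")
            name = $0
        }
        /^[[:space:]]*Bus Address:/ {
            if (tolower($NF) == want) print name
        }
    '
}

# Example use on a real host (run as root; not executed here):
#   bus=$(nvidia-smi -i 3 --query-gpu=pci.bus_id --format=csv,noheader)
#   dmidecode -t slot | slot_for_bus_id "$(normalize_bus_id "$bus")"
```

Matching on the full normalized address rather than on `CA:00.0` avoids a false hit if the same bus:device.function exists in another PCI domain.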


  3. Identify the GPU model from its PCI device ID

    lspci prints a bare "Device xxxx" when the ID is missing from its local database; look such IDs up in pci.ids:
    https://github.com/pciutils/pciids/blob/master/pci.ids

    # lspci -nn | grep VGA
    # lspci | grep NVIDIA

    Sample output lines, annotated with the card model:

    02:00.0 VGA compatible controller: NVIDIA Corporation Device 2204 (rev a1)   # GeForce RTX 3090
    02:00.0 VGA compatible controller: NVIDIA Corporation Device 2208 (rev a1)   # GeForce RTX 3080 Ti
    1a:00.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 SXM2 16GB] (rev a1)   # Tesla P100
    02:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)   # GeForce GTX 1080 Ti
    31:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2684]   # GeForce RTX 4090
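
The manual lookup in pci.ids can be sketched in shell as well. This is a rough sketch under stated assumptions: `extract_pci_ids` and `lookup_pci_id` are hypothetical helper names, the input is `lspci -nn` output containing `[vendor:device]` pairs, and the pci.ids file uses its usual layout (vendor lines in column 0, device lines indented by one tab).

```shell
#!/usr/bin/env bash
# Sketch: resolve the [vendor:device] pair printed by `lspci -nn`.
# Both function names are hypothetical helpers.

# Print every vendor:device pair (e.g. 10de:2684) found on stdin.
# The class code such as [0300] has no colon and is not matched.
extract_pci_ids() {
    grep -oE '\[[0-9a-f]{4}:[0-9a-f]{4}\]' | sed 's/^\[//; s/\]$//'
}

# Look up a vendor:device pair in a pci.ids file and print the device name.
#   $1 = pair such as 10de:2684, $2 = path to a pci.ids file
lookup_pci_id() {
    local pair=$1 file=$2
    awk -v vendor="${pair%%:*}" -v device="${pair##*:}" '
        /^#/ || /^$/ { next }
        # A top-level (non-indented) line starts a new vendor or class section.
        /^[^\t]/ { in_vendor = (($1 "") == (vendor "")); next }
        # A device line is indented by exactly one tab; two tabs mean subsystem.
        in_vendor && /^\t[^\t]/ && (($1 "") == (device "")) {
            sub(/^\t[^ ]+ +/, "")
            print
            exit
        }
    ' "$file"
}

# Example use on a real host (not executed here; the pci.ids path varies by
# distribution, /usr/share/misc/pci.ids is only an assumption):
#   lspci -nn | grep -i nvidia | extract_pci_ids
#   lookup_pci_id 10de:2684 /usr/share/misc/pci.ids
```

The IDs are compared as strings on purpose, so hex values that happen to look like numbers (for example `1e01`) are not compared numerically by awk.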