Resolving an NVMe Device Power State Transition Failure

This post documents an NVMe SSD suddenly becoming inaccessible on Linux kernel 6.12. By analyzing the power state transition errors in the kernel log, it walks through a complete procedure for restoring access to the device without rebooting, including removing the PCIe device and rescanning the bus.

NVMe: Unable to change power state from D3cold to D0, device inaccessible

Problem

Suddenly, with no obvious trigger, my NVMe drive failed and the /dev/nvme0n1p1 device disappeared:

[136975.461964] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[136975.461978] nvme nvme0: Does your device have a faulty power saving mode enabled?
[136975.461983] nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug
[136975.533762] nvme 0000:03:00.0: Unable to change power state from D3cold to D0, device inaccessible
[136975.534208] nvme nvme0: Disabling device after reset failure: -19

This was followed by a flood of buffer I/O errors on the underlying device. I tried unloading and reloading the nvme and nvme-core modules, but the problem persisted:

[138185.373024] nvme 0000:03:00.0: platform quirk: setting simple suspend
[138185.375409] nvme nvme0: pci function 0000:03:00.0
[138185.376542] nvme 0000:03:00.0: Unable to change power state from D3cold to D0, device inaccessible

Is it possible to recover the NVMe drive without rebooting the machine? I am running kernel 6.12.

Solution

I found a solution. Removing the device from the PCIe tree and then rescanning the bus brings it back:

root@rtrbox:~# lsblk
NAME              MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
loop0               7:0    0   4.1G  1 loop  /rtr/primary/squashfs/root
zram0             253:0    0  16.8G  0 disk  [SWAP]
root@rtrbox:~# rmmod nvme
root@rtrbox:~# rmmod nvme-core
root@rtrbox:~# echo 1 > /sys/devices/pci0000:00/0000:00:03.0/remove
root@rtrbox:~# echo 1 > /sys/bus/pci/rescan
root@rtrbox:~# lsblk
NAME              MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
loop0               7:0    0   4.1G  1 loop  /rtr/primary/squashfs/root
zram0             253:0    0  16.8G  0 disk  [SWAP]
nvme0n1           259:0    0 931.5G  0 disk
└─nvme0n1p1       259:1    0    64G  0 part
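The sequence above can be wrapped in a small script. This is only a sketch: the address 0000:03:00.0 is taken from the logs above (note that the session actually removed the upstream port 0000:00:03.0, which detaches the endpoint below it; removing the endpoint via its own sysfs node is an alternative that also works in many cases). Find the address for your own device with `readlink /sys/block/nvme0n1/device` or `lspci -nn | grep -i nvme`.

```shell
#!/bin/sh
# Hedged sketch of the recovery sequence: unload the NVMe modules,
# remove the PCI device from the tree, then rescan the bus.
# ADDR is an assumption taken from the dmesg output above.
ADDR="${1:-0000:03:00.0}"
DRY_RUN="${DRY_RUN:-1}"   # set DRY_RUN=0 to actually touch /sys (needs root)

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"       # print the command instead of executing it
    else
        "$@"
    fi
}

run rmmod nvme
run rmmod nvme-core
run sh -c "echo 1 > /sys/bus/pci/devices/$ADDR/remove"
run sh -c "echo 1 > /sys/bus/pci/rescan"
```

After the rescan, `lsblk` should show the drive again, as in the session above.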

The kernel log confirms that the device was re-enumerated:

[138759.380378] pci 0000:03:00.0: [144d:a80c] type 00 class 0x010802 PCIe Endpoint
[138759.381686] pci 0000:03:00.0: BAR 0 [mem 0xffffffffffffc000-0xffffffffffffffff 64bit]
[138759.382215] pci 0000:03:00.0: Max Payload Size set to 256 (was 16384, max 512)
[138759.383176] pci 0000:03:00.0: Adding to iommu group 15
[138759.383824] pcieport 0000:00:02.4: ASPM: current common clock configuration is inconsistent, reconfiguring
[138759.391544] pci 0000:03:00.0: BAR 0 [mem 0xfce00000-0xfce03fff 64bit]: assigned
[138759.392408] nvme 0000:03:00.0: platform quirk: setting simple suspend
[138759.393216] nvme nvme0: pci function 0000:03:00.0
[138759.395956] nvme nvme0: D3 entry latency set to 10 seconds
[138759.401342] nvme nvme0: 16/0/0 default/read/poll queues
[138759.407779]  nvme0n1: p1
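If the failure recurs, the kernel's own hint in the first log suggests a persistent workaround: disabling NVMe autonomous power state transitions and PCIe power management via boot parameters. A sketch of how this might look in a Debian/Ubuntu-style GRUB config (the file path, variable name, and regeneration command vary by distribution, and the parameters trade power savings for stability):

```shell
# In /etc/default/grub -- append the parameters the kernel suggested:
GRUB_CMDLINE_LINUX_DEFAULT="quiet nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off"
# Then regenerate the config and reboot:
#   update-grub
```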