NVMe: Unable to change power state from D3cold to D0, device inaccessible
Problem
Suddenly, with no apparent trigger, my NVMe drive failed and the /dev/nvme0n1p1 device disappeared:
[136975.461964] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[136975.461978] nvme nvme0: Does your device have a faulty power saving mode enabled?
[136975.461983] nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug
[136975.533762] nvme 0000:03:00.0: Unable to change power state from D3cold to D0, device inaccessible
[136975.534208] nvme nvme0: Disabling device after reset failure: -19
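The kernel message above suggests three boot parameters. As a minimal sketch, the helper below reports which of them are already present in a kernel command line. The function name `check_nvme_params` is hypothetical; the parameter names are copied from the log, and the command line is passed in as an argument so the helper does not depend on any particular system.

```shell
#!/bin/sh
# Sketch: report which of the parameters suggested by the nvme driver
# appear in a given kernel command line.
# check_nvme_params is a hypothetical helper name, not an existing tool.
check_nvme_params() {
    cmdline="$1"
    for p in nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off; do
        # Pad with spaces so only whole, space-separated parameters match.
        case " $cmdline " in
            *" $p "*) echo "present: $p" ;;
            *)        echo "missing: $p" ;;
        esac
    done
}
```

On a running system it could be called as `check_nvme_params "$(cat /proc/cmdline)"`. Adding the missing parameters is done in the boot loader configuration and takes effect only after a reboot, so this does not help with recovering a drive that has already dropped off.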
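After the controller went down, a flood of buffer I/O errors on the underlying device followed. I tried unloading and reloading the nvme and nvme-core modules, but that did not fix the problem: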
[138185.373024] nvme 0000:03:00.0: platform quirk: setting simple suspend
[138185.375409] nvme nvme0: pci function 0000:03:00.0
[138185.376542] nvme 0000:03:00.0: Unable to change power state from D3cold to D0, device inaccessible
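To see which power state the kernel currently records for the device, the PCI `power_state` sysfs attribute can be read. This is a sketch under assumptions: it presumes the running kernel exposes that attribute, the function name `pci_power_state` is hypothetical, and the `SYSFS` variable exists only so the helper can be pointed at a test directory instead of the real /sys.

```shell
#!/bin/sh
# Sketch: print the PCI power state the kernel records for a device.
# SYSFS is an assumption added for dry runs; it defaults to the real /sys.
SYSFS="${SYSFS:-/sys}"

# pci_power_state is a hypothetical helper name.
pci_power_state() {
    # $1 is the full PCI address, e.g. 0000:03:00.0
    f="$SYSFS/bus/pci/devices/$1/power_state"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "unavailable"
    fi
}
```

Called as `pci_power_state 0000:03:00.0`, it prints the recorded state, or `unavailable` if the attribute cannot be read.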
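Is it possible to recover the NVMe drive without rebooting the computer? I am running kernel 6.12.
Solution
I found a solution. Removing the device from the PCIe tree and then rescanning the bus brings it back: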
root@rtrbox:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
loop0 7:0 0 4.1G 1 loop /rtr/primary/squashfs/root
zram0 253:0 0 16.8G 0 disk [SWAP]
root@rtrbox:~# rmmod nvme
root@rtrbox:~# rmmod nvme-core
root@rtrbox:~# echo 1 > /sys/devices/pci0000:00/0000:00:03.0/remove
root@rtrbox:~# echo 1 > /sys/bus/pci/rescan
root@rtrbox:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
loop0 7:0 0 4.1G 1 loop /rtr/primary/squashfs/root
zram0 253:0 0 16.8G 0 disk [SWAP]
nvme0n1 259:0 0 931.5G 0 disk
└─nvme0n1p1 259:1 0 64G 0 part
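The remove-and-rescan sequence can be wrapped in a small function. This is a sketch, not the exact commands from the transcript: it addresses the device through the `/sys/bus/pci/devices/<address>` path rather than a path under `/sys/devices/pci0000:00`. The function name `pci_remove_and_rescan` and the `SYSFS` and `RESCAN_DELAY` variables are hypothetical, and the address to pass has to be confirmed on the actual machine (for example with `lspci -D`).

```shell
#!/bin/sh
# Sketch: detach one PCI function from the kernel's device tree, then
# ask the kernel to rescan all PCI buses so the function is re-enumerated.
# SYSFS and RESCAN_DELAY are assumptions added for dry runs and tuning.
SYSFS="${SYSFS:-/sys}"

# pci_remove_and_rescan is a hypothetical helper name.
pci_remove_and_rescan() {
    bdf="$1"    # full PCI address, e.g. 0000:03:00.0
    dev="$SYSFS/bus/pci/devices/$bdf"
    if [ ! -e "$dev/remove" ]; then
        echo "no such PCI device: $bdf" >&2
        return 1
    fi
    # Writing 1 to "remove" detaches the function and unbinds its driver.
    echo 1 > "$dev/remove" || return 1
    # Short pause before rescanning; the one-second default is a guess.
    sleep "${RESCAN_DELAY:-1}"
    # Writing 1 to "rescan" makes the kernel re-enumerate the PCI buses.
    echo 1 > "$SYSFS/bus/pci/rescan"
}
```

Run as root, for example `pci_remove_and_rescan 0000:03:00.0`, and then check `lsblk` as in the transcript. Any filesystem that was mounted from the drive will still need to be unmounted and remounted by hand.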
The kernel log shows the device being re-enumerated and the nvme driver binding to it again:
[138759.380378] pci 0000:03:00.0: [144d:a80c] type 00 class 0x010802 PCIe Endpoint
[138759.381686] pci 0000:03:00.0: BAR 0 [mem 0xffffffffffffc000-0xffffffffffffffff 64bit]
[138759.382215] pci 0000:03:00.0: Max Payload Size set to 256 (was 16384, max 512)
[138759.383176] pci 0000:03:00.0: Adding to iommu group 15
[138759.383824] pcieport 0000:00:02.4: ASPM: current common clock configuration is inconsistent, reconfiguring
[138759.391544] pci 0000:03:00.0: BAR 0 [mem 0xfce00000-0xfce03fff 64bit]: assigned
[138759.392408] nvme 0000:03:00.0: platform quirk: setting simple suspend
[138759.393216] nvme nvme0: pci function 0000:03:00.0
[138759.395956] nvme nvme0: D3 entry latency set to 10 seconds
[138759.401342] nvme nvme0: 16/0/0 default/read/poll queues
[138759.407779] nvme0n1: p1