Troubleshooting NVMe disks


This document lists errors that you might encounter when using disks with the nonvolatile memory express (NVMe) interface.

You can use the NVMe interface for Local SSDs and persistent disks (Persistent Disk or Google Cloud Hyperdisk). Only the most recent machine series, such as Tau T2A, M3, C3, C3D, and H3, use the NVMe interface for Persistent Disk. Confidential VMs also use NVMe for Persistent Disk. All other Compute Engine machine series use the SCSI disk interface for persistent disks.
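
If you're not sure which interface an instance's disks use, you can inspect the instance's disk attachments. The following command is illustrative; replace VM_NAME and ZONE with your own values. Each attached disk is listed with an interface field of NVME or SCSI:

gcloud compute instances describe VM_NAME --zone=ZONE --format="yaml(disks)"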

I/O operation timeout error

If you encounter I/O timeout errors, the I/O latency might be exceeding the default timeout parameter for I/O operations submitted to NVMe devices.

Error message:

[1369407.045521] nvme nvme0: I/O 252 QID 2 timeout, aborting
[1369407.050941] nvme nvme0: I/O 253 QID 2 timeout, aborting
[1369407.056354] nvme nvme0: I/O 254 QID 2 timeout, aborting
[1369407.061766] nvme nvme0: I/O 255 QID 2 timeout, aborting
[1369407.067168] nvme nvme0: I/O 256 QID 2 timeout, aborting
[1369407.072583] nvme nvme0: I/O 257 QID 2 timeout, aborting
[1369407.077987] nvme nvme0: I/O 258 QID 2 timeout, aborting
[1369407.083395] nvme nvme0: I/O 259 QID 2 timeout, aborting
[1369407.088802] nvme nvme0: I/O 260 QID 2 timeout, aborting
...
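
You can check whether the guest kernel is logging these timeouts by searching the kernel ring buffer, for example:

sudo dmesg | grep -i "nvme.*timeout"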

Resolution:

To resolve this issue, increase the value of the timeout parameter.

  1. View the current value of the timeout parameter.

    1. Determine which NVMe controller is used by the persistent disk or Local SSD volume.
      ls -l /dev/disk/by-id
      
    2. Display the io_timeout setting, specified in seconds, for the disk.

      cat /sys/class/nvme/CONTROLLER_ID/NAMESPACE/queue/io_timeout
      
      Replace the following:
      • CONTROLLER_ID: the ID of the NVMe disk controller, for example, nvme1
      • NAMESPACE: the namespace of the NVMe disk, for example, nvme1n1

      If you have only a single disk that uses NVMe, use the following command:

      cat /sys/class/nvme/nvme0/nvme0n1/queue/io_timeout
      
  2. To increase the timeout parameter for I/O operations submitted to NVMe devices, add the following line to the /lib/udev/rules.d/65-gce-disk-naming.rules file, and then restart the VM:

    KERNEL=="nvme*n*", ENV{DEVTYPE}=="disk", ATTRS{model}=="nvme_card-pd", ATTR{queue/io_timeout}="4294967295"
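
After the VM restarts, you can confirm that the new value took effect on every NVMe namespace with a short loop over the same sysfs paths used in step 1, for example:

for ns in /sys/class/nvme/nvme*/nvme*n*/queue/io_timeout; do
  echo "$ns: $(cat "$ns")"
done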
    

Detached disks still appear in the operating system of a compute instance

On VMs that use Linux kernel versions 6.0 to 6.2, operations that use the Compute Engine API method instances.detachDisk or the gcloud compute instances detach-disk command might not work as expected: the Google Cloud console and the compute instance metadata (the gcloud compute disks describe command) show the device as removed, but the device mount point and any symlinks created by udev rules are still visible in the guest operating system.
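
You can confirm from outside the guest that the disk is detached by checking which instances still use it. The following command is illustrative; substitute your own disk name and zone. If the detach succeeded, the output is empty:

gcloud compute disks describe DISK_NAME --zone=ZONE --format="value(users)"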

Error message:

Attempting to read from the detached disk on the VM results in I/O errors:

sudo head /dev/nvme0n3

head: error reading '/dev/nvme0n3': Input/output error

Issue:

Operating system images that use a Linux 6.0-6.2 kernel but don't include a backport of an NVMe fix fail to recognize when an NVMe disk is detached.

Resolution:

Reboot the VM to complete the process of removing the disk.

To avoid this issue, use an operating system with a Linux kernel version that isn't affected:

  • 5.19 or older
  • 6.3 or newer

You can use the uname -r command in the guest OS to view the Linux kernel version.
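
As a quick check, the following sketch reports whether the running kernel falls in the affected range:

kernel=$(uname -r)
case "$kernel" in
  6.0.*|6.1.*|6.2.*) echo "Kernel $kernel is in the 6.0-6.2 range and might be affected unless it includes the backported fix." ;;
  *) echo "Kernel $kernel is outside the affected 6.0-6.2 range." ;;
esac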

Symlinks not created for Local SSD devices on C3 and C3D VMs

If you attach Local SSD disks to a C3 or C3D VM, you might need to take additional steps to create the symlinks for the Local SSD disks. These steps are required only if you use any of the following public images offered by Google Cloud:

  • SLES 15 SP4 and SP5
  • SLES 12 SP4

These additional steps only apply to Local SSD disks; you don't need to do anything for Persistent Disk volumes.

The public Linux images listed previously don't have the correct udev configuration to create symlinks for Local SSD devices attached to C3 and C3D VMs. Custom images might also lack the udev rules required to create these symlinks.
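
Before working through the steps below, you can quickly check whether your image already includes the updated rule. This check assumes the rules file is at /lib/udev/rules.d/65-gce-disk-naming.rules; if the command prints the line shown in step 3, no further action is needed:

grep 'nvme_card\[0-9\]' /lib/udev/rules.d/65-gce-disk-naming.rules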

Use these instructions to add udev rules for SUSE or custom images.

  1. Locate the udev rules directory. This is usually /lib/udev/rules.d or /usr/lib/udev/rules.d, but your image might use a different location.
  2. Locate the 65-gce-disk-naming.rules file in the udev rules directory.
  3. If the 65-gce-disk-naming.rules file contains the following line, your image supports the new rules and you can stop here:

    KERNEL=="nvme*n*", ATTRS{model}=="nvme_card[0-9]*",IMPORT{program}="google_nvme_id -d $tempnode"
    
  4. If the preceding line is not present, or if the 65-gce-disk-naming.rules file doesn't exist, replace the existing file, or create a new one, with the contents from this URL: https://raw.githubusercontent.com/GoogleCloudPlatform/guest-configs/20230630.00/src/lib/udev/rules.d/65-gce-disk-naming.rules. This file contains the updated contents of the 65-gce-disk-naming.rules file, including the line from the previous step and the other rules required for Compute Engine disk naming. For example, run the following command from the udev rules directory to download the updated file in place:

    sudo curl -o 65-gce-disk-naming.rules https://raw.githubusercontent.com/GoogleCloudPlatform/guest-configs/20230630.00/src/lib/udev/rules.d/65-gce-disk-naming.rules
    
  5. Go to the udev directory, which is usually /lib/udev (the parent of the rules directory).

  6. Locate the google_nvme_id file in the udev directory.

  7. Replace the contents of the existing google_nvme_id file, or create a new file, with the contents at this URL:

    sudo curl -o google_nvme_id https://raw.githubusercontent.com/GoogleCloudPlatform/guest-configs/20230630.00/src/lib/udev/google_nvme_id
    
  8. Ensure the google_nvme_id file is executable.

    sudo chmod 755 google_nvme_id
    
  9. Reboot the VM.

  10. Verify that the symlinks are created successfully.

    ls -l /dev/disk/by-id/google-local-nvme-ssd*
    

    The output should list the same number of links as there are Local SSDs attached to the instance, and each link should point to a different /dev/nvme device path. For example:

    lrwxrwxrwx 1 root root 13 Jul 19 22:52 /dev/disk/by-id/google-local-nvme-ssd-0 -> ../../nvme0n1
    lrwxrwxrwx 1 root root 13 Jul 19 22:52 /dev/disk/by-id/google-local-nvme-ssd-1 -> ../../nvme1n1
    

    For more information about device names, see Device naming.

    You can verify that the /dev/nvme device paths are Local SSD devices by running lsblk. NVMe devices that show 375G in size are Local SSD devices.
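
If you determined in step 3 that your image needs the update, steps 4 through 9 can be collapsed into a short shell sketch. This sketch assumes the udev rules live in /lib/udev/rules.d and the google_nvme_id helper in /lib/udev; adjust the paths if your image uses a different layout:

BASE_URL=https://raw.githubusercontent.com/GoogleCloudPlatform/guest-configs/20230630.00/src/lib/udev
# Download the updated disk-naming rules and the google_nvme_id helper.
sudo curl -o /lib/udev/rules.d/65-gce-disk-naming.rules "$BASE_URL/rules.d/65-gce-disk-naming.rules"
sudo curl -o /lib/udev/google_nvme_id "$BASE_URL/google_nvme_id"
# Make the helper executable, then reboot so the new rules create the symlinks.
sudo chmod 755 /lib/udev/google_nvme_id
sudo reboot

After the reboot, repeat the verification in step 10 to confirm that the symlinks exist.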
