
Email Announcement Archive

[Users] Remote Possibility of Data Corruption on NERSC File Systems

Author: Rebecca Hartman-Baker <rjhartmanbaker_at_lbl.gov>
Date: 2024-04-19 16:33:07

Dear NERSC Users:

NERSC has been made aware of possible data corruption due to a bug in the underlying Slingshot network driver, “kfilnd” (kfabric Lustre network driver). We anticipate minimal impact from the bug itself on the integrity of the files currently on NERSC file systems, but to avoid any possibility of future data corruption from this bug, we have taken additional corrective actions to sidestep the issue altogether.

Initially this bug was thought to affect only Perlmutter scratch, but we recently learned that it also affects file systems served by DVS (which is how the global homes, global common, and community (CFS) file systems are interfaced with Perlmutter compute nodes). Data access via the Perlmutter login nodes or the DTNs is not affected, as these do not use DVS.

To guard against the bug, we enabled checksums on Perlmutter scratch in mid-November of last year, and during the maintenance earlier this week we migrated our DVS servers to an alternative network protocol. The DVS protocol does not provide a mechanism for detecting data corruption in the network layer, but Lustre does. While the storage used by the CFS, homes, common, and DNA file systems provides end-to-end data checksums, it is possible that a file may have been corrupted via DVS after being read from or before being written to these file systems.

This corruption error appears to be quite rare; in the five months that the Lustre checksums have been enabled on Perlmutter scratch, there have been only two instances where data retransmissions occurred because of checksum mismatches. Nevertheless, we thought it was important in the spirit of transparency to let NERSC users know about the possibility, however remote, of data corruption in their files.

The dates for when the file systems were vulnerable to this data corruption are as follows:

- Perlmutter scratch: July 11, 2022 - November 16, 2023
- Global Homes, Global Common, CFS, DNA: July 11, 2022 - April 17, 2024

If you are concerned about data corruption, you can perform an md5sum data check against a known good copy of the file.

These changes come with a loss of performance. For Lustre, large scale I/O was ~17% slower after checksums were enabled. On DVS-served file systems (global homes, global common, CFS, DNA), we are still assessing the impact of the change on large-scale I/O but we expect performance to be slower.

Thanks for your understanding and please don’t hesitate to contact us via https://help.nersc.gov with any questions!

Regards,
-Rebecca

_______________________________________________
Users mailing list
Users@nersc.gov
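
Editor's sketch (not part of the original announcement): the md5sum check mentioned in the message can also be scripted. The following is a minimal Python illustration that assumes you hold a known good copy of each file somewhere unaffected by the bug; the function names md5_of_file and files_match are hypothetical helpers, not NERSC tools.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reads keep memory use flat even for very large files.
    The result is the same hex string that `md5sum` prints.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_match(candidate, reference):
    """Return True if the two files have identical MD5 digests.

    `candidate` is the copy that may have been exposed to the bug;
    `reference` is the known good copy (an assumption: you must
    already trust this copy for the comparison to mean anything).
    """
    return md5_of_file(candidate) == md5_of_file(reference)
```

For example, files_match("results.h5", "backup/results.h5") (both paths hypothetical) returns True when the digests agree. A mismatch only shows that the two copies differ; it does not by itself say which copy is the corrupted one.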
