
Email Announcement Archive

[Users] Fixes for Recent Perlmutter Network and File System Issues

Author: Rebecca Hartman-Baker <rjhartmanbaker_at_lbl.gov>
Date: 2023-11-01 11:15:00

Dear NERSC Users,

Since the September 28 Perlmutter upgrade, we have observed an increase in issues with network communication and with accessing files on the Home, Global Common, Community and Scratch file systems. Typical manifestations of these failures include an “UNDELIVERABLE” error or other MPI communication failure, and/or an I/O error or a job that produces no output.

We realize that this uptick in failures has been disruptive for users, and we’ve been working closely with the vendor on remedies for these issues.

On the evening of October 30, we deployed an update that we believe will remedy the network issues. Since we aren’t able to consistently reproduce the error, we can’t be certain that this completely resolves it. Therefore, please let us know via help.nersc.gov if any of your jobs that run after 9:00 pm (Pacific time) on October 30 encounter an “UNDELIVERABLE” or other MPI communication error.

We are actively working with the vendor to address many of the I/O errors and hangs some users have been experiencing. In the interim, we are using a procedural fix to address some (but not all) of these issues shortly after they happen, preventing further cascading of failures.

We appreciate your patience as we track down and fix these issues on Perlmutter!

Regards,
-Rebecca

--
Rebecca Hartman-Baker, Ph.D
User Engagement Group Leader
National Energy Research Scientific Computing Center | Berkeley Lab
rjhartmanbaker@lbl.gov | phone: (510) 486-4810 fax: (510) 486-6459
Pronouns: she/her/hers

_______________________________________________
Users mailing list
Users@nersc.gov
