Dear NERSC Users,

Over the last few weeks we have been upgrading Perlmutter’s GPU nodes from the Slingshot10 interconnect to the Slingshot11 interconnect. This involves updating both the hardware and the software on the nodes: each node gets its Network Interface Cards (NICs) replaced with an upgraded version, plus an update to the associated software.

When Perlmutter returns from the maintenance on Monday, all GPU queues (regular, interactive, debug, etc.) will steer jobs to GPU nodes with the Slingshot11 interconnect instead of Slingshot10. Recompilation is NOT required to use the Slingshot11 nodes, and after the maintenance your existing batch scripts will run on the Slingshot11 GPU nodes automatically, without changes.

Jobs that were already in the queue before the start of the maintenance will run on the type of node their queue targeted at the time of submission (e.g., a job submitted to the regular queue today would run on Slingshot10 nodes next week). The CPU-only nodes have always had Slingshot11; no change is required for CPU-only jobs.

If you wish to continue using the Slingshot10 GPU nodes, you will need to explicitly request them by adding "_ss10" to the queue name, e.g., "-C gpu -q regular_ss10". Jobs cannot use a mixture of Slingshot10 and Slingshot11 nodes.

The login nodes will also be upgraded to the Slingshot11 interconnect. No user impact is expected.

With Slingshot11, GPU nodes are upgraded from 2 x 12.5 GB/s NICs to 4 x 25 GB/s NICs. The additional bandwidth and NIC resources may bring an immediate benefit for communication-bound codes. Due to a known software issue, machine learning codes may initially be slower when run at scale on these nodes; we are waiting for a libfabric-optimized fix from the vendors to address this.

In addition to the default QOS changes, we will also upgrade the CUDA driver to version 515.48.07 and the default version of the CUDA SDK to 11.7 (the older versions will remain on the system).
This upgrade will make the CUDA compatibility libraries unnecessary, so if you were using workarounds to remove them, those should no longer be needed. If you encounter errors, you may need to recompile your code.

Please report any issues you encounter via a ticket at <https://help.nersc.gov>.

Thanks for your patience as we work to bring Perlmutter to its full performance.

--
Rebecca Hartman-Baker, Ph.D
User Engagement Group Leader
National Energy Research Scientific Computing Center | Berkeley Lab
rjhartmanbaker@lbl.gov | phone: (510) 486-4810 fax: (510) 486-6459
Pronouns: she/her/hers
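
For reference, a request for the Slingshot10 GPU nodes as described above might look like the following in a batch script. This is a minimal sketch, not an official NERSC template: only "-C gpu" and "-q regular_ss10" come from this announcement, while the node count, time limit, account name, and executable are hypothetical placeholders to replace with your own values.

```shell
#!/bin/bash
#SBATCH -C gpu              # request GPU nodes (from the announcement)
#SBATCH -q regular_ss10     # "_ss10" suffix keeps the job on Slingshot10 GPU nodes
#SBATCH -N 2                # hypothetical node count
#SBATCH -t 00:30:00         # hypothetical wall-time limit
#SBATCH -A myproject        # hypothetical account name; use your own project

# Hypothetical executable; all nodes in this job are Slingshot10,
# since a job cannot mix Slingshot10 and Slingshot11 nodes.
srun ./my_app
```

Dropping the "_ss10" suffix (e.g., "-q regular") leaves the same script targeting the Slingshot11 GPU nodes after the maintenance.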