ICDS engineering staff work on a scheduled outage of the high-performance computing systems (Roar). Credit: Todd Price/ICDS

Penn State’s ICDS team ensures seamless research continuity during scheduled downtime 

Posted on November 11, 2024

UNIVERSITY PARK, Pa. The Penn State Institute for Computational and Data Sciences (ICDS) technical team demonstrated close teamwork during a planned maintenance outage of the Roar supercomputer system over the weekend of Oct. 25 through 27 at the Penn State University Data Center.

What is an HPC outage?

A high-performance computing (HPC) outage refers to a temporary interruption in the services provided by Penn State’s high-performance computing systems (Roar). These systems are crucial for handling large-scale computations and data processing tasks that are beyond the capabilities of standard computers. HPC systems are used in various fields including scientific research, financial modeling and weather forecasting. 

Why do we need outages? 

Outages are necessary for several key reasons: 

  • Maintenance and upgrades: Just like any other complex system, HPC infrastructure requires regular maintenance to ensure it operates efficiently and reliably. This includes hardware repairs, software updates and system optimizations. 
  • Security enhancements: Regular outages allow for the implementation of critical security updates and patches. This helps protect the system from potential cyber threats and vulnerabilities. 
  • Performance improvements: Outages provide an opportunity to upgrade components and integrate new technologies that can enhance the overall performance of the HPC system. This might involve installing faster processors, increasing storage capacity or improving network connectivity. 

What happens during an outage? 

During an HPC outage, a team of skilled staff members and vendors works through a range of tasks, including:

  • System diagnostics: Technicians perform comprehensive diagnostics to identify any issues or areas that need improvement. This involves checking hardware components, running software tests and analyzing system performance. 
  • Repairs and replacements: Maintenance crews address any identified issues by repairing or replacing faulty hardware. This could include swapping out damaged processors, upgrading memory modules or fixing network connections. 
  • Software updates: HPC specialists install the latest software updates and security patches to ensure the system is protected against vulnerabilities and runs smoothly. 
  • Testing and validation: After maintenance and upgrades, the Roar system undergoes rigorous testing to validate that everything is functioning correctly. This step is crucial to ensure that the Penn State HPC system is ready to handle its workload once it comes back online. 
  • Communication and coordination: Throughout the outage, the ICDS communication teams keep stakeholders informed about the progress and expected completion times. They also coordinate with users to minimize disruptions and ensure a smooth transition back to normal operations. 

 

Why is this work so important? 

The work done during an HPC outage is vital for several reasons: 

  • Reliability: Regular maintenance and upgrades help prevent unexpected failures and ensure the Penn State HPC system remains reliable and available when needed. 
  • Security: Implementing security updates and patches protects the system from cyber threats, safeguarding sensitive data and maintaining the integrity of computations. 
  • Efficiency: Performance improvements and optimizations enhance the system’s ability to handle complex tasks, leading to faster and more accurate results. 
  • Innovation: By keeping the Roar HPC system up to date with the latest technologies, ICDS supports cutting-edge research and innovation across various fields.

 

This scheduled downtime is one of five outages planned each year. Publishing the schedule in advance keeps researchers across the University informed so they can plan their work accordingly. The outages are also coordinated with Penn State IT, facilities and the Office of the Senior Vice President for Research (OSVPR). This collaboration ensures all upgrades are performed efficiently, thereby minimizing disruption to the Penn State research community.

“We work with the Office of Physical Plant (OPP) and the Data Center team to make sure they have what they need on a facilities level for what we are going to be doing. Managing the process has a project-based feel of what the team will be doing on the system, what they won’t be doing on the system and what would be the ideal order of events,” said Eric Huyett, research and development engineer and lead on ICDS downtime processes. 

The downtime team includes Erik Byer, research and development engineer; Mike Gallo, systems engineer; Rob Groner, research and development software engineer; Derick Haigh, research and development engineer; Matt Hansen, research and development engineer; Huyett; Ross Mickens, design engineering team lead; Rob See, HPC consultant; and Gary Skouson, systems engineering team lead.

During these outages, team members identify specific issues such as non-functional parts, software glitches or performance bottlenecks. They collaborate to troubleshoot these problems, which can include replacing faulty processors, updating outdated software or optimizing network configurations. Additionally, they thoroughly examine hardware components like memory modules and storage devices to ensure everything works properly after the updates. 

“ICDS works well due to our domain knowledge of cluster setup, automation tasking and intervention of system commands,” Haigh said. “It is best to not pull the rug off for everyone. Sometimes it is best to tell people to get off the rug. We don’t want to be debugging a core when everyone is on the system.”

The ICDS team’s dedication and proficiency ensure that the Roar supercomputer remains a reliable resource for Penn State researchers, allowing them to continue their groundbreaking work with minimal interruption. This downtime was not just a technical necessity but a demonstration of the power of teamwork and shared expertise in advancing the university’s research mission. 

“ICDS works synchronously and in specialized formats,” said Hansen, who also works with the Open OnDemand (OOD) portal and virtual machine (VM) structures. “Once the event is over, we always look at what we can do better. We have a follow-up meeting to discuss the outage to make sure testing takes place in case a rollback needs to take place to not affect researcher workflows. ICDS is always looking for constant improvement in the process.”
