Penn State’s ICDS team ensures seamless research continuity during scheduled downtime
Posted on November 11, 2024
UNIVERSITY PARK, Pa. – The Penn State Institute for Computational and Data Sciences (ICDS) technical team showcased exceptional partnership as part of a planned maintenance outage on the Roar supercomputer system over the weekend of Oct. 25 through 27 at the Penn State University Data Center.
What is an HPC Outage?
A high-performance computing (HPC) outage refers to a temporary interruption in the services provided by Penn State’s high-performance computing systems (Roar). These systems are crucial for handling large-scale computations and data processing tasks that are beyond the capabilities of standard computers. HPC systems are used in various fields including scientific research, financial modeling and weather forecasting.
Why do we need outages?
Outages are necessary for several key reasons:
- Maintenance and upgrades: Just like any other complex system, HPC infrastructure requires regular maintenance to ensure it operates efficiently and reliably. This includes hardware repairs, software updates and system optimizations.
- Security enhancements: Regular outages allow for the implementation of critical security updates and patches. This helps protect the system from potential cyber threats and vulnerabilities.
- Performance improvements: Outages provide an opportunity to upgrade components and integrate new technologies that can enhance the overall performance of the HPC system. This might involve installing faster processors, increasing storage capacity or improving network connectivity.
What happens during an outage?
During an HPC outage, a team of skilled staff members and vendors works diligently to carry out various tasks, including:
- System diagnostics: Technicians perform comprehensive diagnostics to identify any issues or areas that need improvement. This involves checking hardware components, running software tests and analyzing system performance.
- Repairs and replacements: Maintenance crews address any identified issues by repairing or replacing faulty hardware. This could include swapping out damaged processors, upgrading memory modules or fixing network connections.
- Software updates: HPC specialists install the latest software updates and security patches to ensure the system is protected against vulnerabilities and runs smoothly.
- Testing and validation: After maintenance and upgrades, the Roar system undergoes rigorous testing to validate that everything is functioning correctly. This step is crucial to ensure that the Penn State HPC system is ready to handle its workload once it comes back online.
- Communication and coordination: Throughout the outage, the ICDS communication teams keep stakeholders informed about the progress and expected completion times. They also coordinate with users to minimize disruptions and ensure a smooth transition back to normal operations.
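For readers who think in code, the sequence above can be sketched roughly as an ordered checklist in which each phase must succeed before the next begins, and the system only returns to service once testing and validation pass. This is a minimal illustration, not ICDS's actual tooling: the `Phase` class, the `run_outage` function and the phase names are all hypothetical, and each check is a placeholder that simply reports success.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Phase:
    """One step of a maintenance window (hypothetical structure)."""
    name: str
    run: Callable[[], bool]  # returns True when the step succeeds


def run_outage(phases: List[Phase]) -> List[str]:
    """Run phases in order, stopping at the first failure.

    Returns a list of status messages, standing in for the progress
    updates that would be shared with stakeholders.
    """
    log: List[str] = []
    for phase in phases:
        ok = phase.run()
        log.append(f"{phase.name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            # A failed step keeps the system offline for follow-up.
            log.append("system stays offline pending follow-up")
            return log
    log.append("system returned to service")
    return log


# Placeholder phases mirroring the list above; real checks would
# replace each lambda.
phases = [
    Phase("system diagnostics", lambda: True),
    Phase("repairs and replacements", lambda: True),
    Phase("software updates", lambda: True),
    Phase("testing and validation", lambda: True),
]

for message in run_outage(phases):
    print(message)
```

The point of the ordering is that validation acts as a gate: if any earlier phase reports a failure, the later ones never run and users are not let back onto a partially updated system.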
Why is this work so important?
The work done during an HPC outage is vital for several reasons:
- Reliability: Regular maintenance and upgrades help prevent unexpected failures and ensure the Penn State HPC system remains reliable and available when needed.
- Security: Implementing security updates and patches protects the system from cyber threats, safeguarding sensitive data and maintaining the integrity of computations.
- Efficiency: Performance improvements and optimizations enhance the system’s ability to handle complex tasks, leading to faster and more accurate results.
- Innovation: By keeping the Roar HPC system up-to-date with the latest technologies, ICDS supports cutting-edge research and innovation across various fields.
This scheduled downtime, one of five planned each year, ensures that researchers across the University are always informed and can plan their work accordingly. The outages are also coordinated with Penn State IT, facilities and the Office of the Senior Vice President for Research (OSVPR). This collaboration ensures all upgrades are performed efficiently, thereby minimizing disruption to the Penn State research community.
“We work with the Office of Physical Plant (OPP) and the Data Center team to make sure they have what they need on a facilities level for what we are going to be doing. Managing the process has a project-based feel of what the team will be doing on the system, what they won’t be doing on the system and what would be the ideal order of events,” said Eric Huyett, research and development engineer and lead on ICDS downtime processes.
The downtime team includes Erik Byer, research and development engineer; Mike Gallo, systems engineer; Rob Groner, research and development software engineer; Derick Haigh, research and development engineer; Matt Hansen, research and development engineer; Huyett; Ross Mickens, design engineering team lead; Rob See, HPS consultant; and Gary Skouson, systems engineering team lead.
During these outages, team members identify specific issues such as non-functional parts, software glitches or performance bottlenecks. They collaborate to troubleshoot these problems, which can include replacing faulty processors, updating outdated software or optimizing network configurations. Additionally, they thoroughly examine hardware components like memory modules and storage devices to ensure everything works properly after the updates.
“ICDS works well due to our domain knowledge of cluster set up, automation tasking and intervention of system commands,” Haigh said. “It is best to not pull the rug off for everyone. Sometimes it is best to tell people to get off the rug. We don’t want to be de-bugging a core when everyone is on the system.”
The ICDS team’s dedication and proficiency ensure that the Roar supercomputer remains a reliable resource for Penn State researchers, allowing them to continue their groundbreaking work with minimal interruption. This downtime was not just a technical necessity but a demonstration of the power of teamwork and shared expertise in advancing the university’s research mission.
“ICDS works synchronously and in specialized formats,” said Hansen, who also works with the Open OnDemand (OOD) portal and virtual machine (VM) structures. “Once the event is over, we always look at what we can do better. We have a follow up meeting to discuss the outage to make sure testing takes place in case a rollback needs to take place to not affect researcher workflows. ICDS is always looking for constant improvement in the process.”