Research Computing Team Studies Supercomputer Reliability – HPCwire

May 26, 2020

Researchers running demanding computations, especially for projects like infectious disease modeling that must be re-run frequently as new data becomes available, rely on supercomputers to run efficiently with as few software failures as possible. The more jobs that fail, the less science can get done.

Understanding why some jobs fail and what can be done to make supercomputers more reliable is the focus of a recent project led by Saurabh Bagchi, a professor of electrical and computer engineering, and ITaP senior research scientist Carol Song.

The project, which began almost five years ago and was supported by three awards from the National Science Foundation (award numbers 1405906, 1513051, and 1513197) totaling over $1.1 million, analyzed data from supercomputer systems at Purdue, as well as the University of Illinois at Urbana-Champaign and the University of Texas at Austin. At Purdue, the Conte and Halstead community clusters were studied.

Among the conclusions Bagchi and Song have drawn:

Bagchi says these are practical takeaways that supercomputer systems administrators can implement to make applications run more reliably on their machines.

In addition to their own data analysis, Bagchi and Song's NSF grant funded the development of an open access repository known as FRESCO, where systems data from Purdue's clusters and UT-Austin's Stampede supercomputer is stored, as well as the team's conclusions and actionable suggestions for the people who run computer clusters. They've also included simple scripts that will let anyone run their own data analysis on the data from the three schools. A similar repository houses the data from the Blue Waters supercomputer located at the National Center for Supercomputing Applications at the University of Illinois.
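
To illustrate the kind of analysis those scripts make possible, here is a minimal, hypothetical sketch in Python. The file name fresco_jobs_sample.csv and the columns exit_status and node_hours are assumptions for illustration only, not FRESCO's actual schema or scripts.

```python
# Hypothetical sketch of a job-failure analysis on FRESCO-style accounting data.
# The CSV name and column names (exit_status, node_hours) are assumed here,
# not taken from the actual FRESCO repository.
import pandas as pd

# Load a sample export of per-job records from one cluster.
jobs = pd.read_csv("fresco_jobs_sample.csv")

# Treat any nonzero exit status as a failed job (an assumption about the data).
jobs["failed"] = jobs["exit_status"] != 0

# Overall failure rate and the compute time consumed by failed jobs.
failure_rate = jobs["failed"].mean()
lost_node_hours = jobs.loc[jobs["failed"], "node_hours"].sum()

print(f"Job failure rate: {failure_rate:.1%}")
print(f"Node-hours spent on failed jobs: {lost_node_hours:,.0f}")
```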

"We really want the computing community to benefit from this resource," says Bagchi of the open source repositories.

Rajesh Kalyanam, a software engineer on Song's team, developed the technical infrastructure to collect data from supercomputers, and Stephen Harrell, a former ITaP scientific applications analyst, helped get the data from the Purdue clusters onto the FRESCO repository.

"FRESCO not only serves the computer systems researchers designing more dependable systems, but also has the potential to help researchers develop and test new big data algorithms, as well as train students in applying data science methods on real-world datasets," says Song. "We in ITaP Research Computing are collaborating with faculty on both fronts."

The team has published its findings in a paper to be presented at the upcoming Dependable Systems and Networks conference, which will be held virtually in June. That paper's first author is Rakesh Kumar, one of Bagchi's former graduate students, who is now employed at Microsoft. Ravishankar Iyer, the George and Ann Fisher Distinguished Professor of Engineering and professor of electrical and computer engineering at the University of Illinois, is the lead investigator from Illinois. Other researchers on the team include Ashraf Mahgoub from Purdue; Saurabh Jha, Zbigniew Kalbarczyk, and William T. Kramer from the University of Illinois; and Todd Evans and Bill Barth from the University of Texas.

Source: Adrienne Miller, Information Technology at Purdue (ITaP)
