SC07


SCHEDULE: NOV 10-16, 2007






Coordinated Fault Tolerance in High-end Computing Environments

Session: Coordinated fault tolerance in high-end computing environments

Event Type: Birds of a Feather

Time: 12:15pm - 1:15pm

Session Chair: Peter Beckman

Leader(s): Pete Beckman, Rinku Gupta, Al Geist

Location: A3 / A4

Abstract:
The ability to detect and recover from faults on large HPC systems would be greatly aided by a standardized interface to exchange fault information. A standard framework where any component of the software stack can report or be notified of faults through a common interface enables coordinated fault tolerance and recovery. This BOF will present the draft design of such an interface for comment by the HPC community, both users and vendors.

The objectives of this BOF session are:

(1) To have an open discussion about the usefulness, impact, and adoption of a comprehensive fault-tolerance framework in enterprise and research environments

(2) To better understand the fault-management and fault-tolerance challenges faced in today's environments

(3) To bring together individuals who work with high-end, petascale computing infrastructures and have an interest in developing fault tolerance in high-end computing environments




Chair/Leader Details:

Peter Beckman (Chair)
Argonne National Laboratory

Pete Beckman
Argonne National Laboratory

Rinku Gupta
Argonne National Laboratory

Al Geist
Oak Ridge National Laboratory



