Classifying soft error vulnerabilities in extreme-scale scientific applications using a binary instrumentation tool

Authors: 
D. Li, J. Vetter and W. Yu
Name of Publication: 
Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 2012
Type of Publication: 
Conference Proceedings
Abstract: 
Extreme-scale scientific applications are at significant risk of being hit by soft errors on supercomputers as the scale of these systems and the component density continue to increase. In order to better understand the specific soft error vulnerabilities in scientific applications, we have built an empirical fault injection and consequence analysis tool, BIFIT, that allows us to evaluate how soft errors impact applications. In particular, BIFIT is designed with the capability to inject faults at very specific targets: an arbitrarily chosen execution point and any specific data structure. We apply BIFIT to three mission-critical scientific applications and investigate the applications' vulnerability to soft errors by performing thousands of statistical tests. We then classify each application's individual data structures based on their sensitivity to these vulnerabilities, and generalize these classifications across applications. Subsequently, these classifications can be used to apply appropriate resiliency solutions to each data structure within an application. Our study reveals that these scientific applications have a wide range of sensitivities to both the time and the location of a soft error; yet, we are able to identify intrinsic relationships between application vulnerabilities and specific types of data objects. In this regard, BIFIT enables new opportunities for future resiliency research.
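The idea of injecting a fault at a chosen execution point and in a chosen data structure can be illustrated with a small sketch. This is not BIFIT and does not use binary instrumentation; it is a toy Python model under stated assumptions. The helper names (`flip_bit`, `run_with_fault`, `classify`), the averaging kernel, the tolerance, and the three outcome labels are all hypothetical and chosen only for illustration.

```python
import math
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 binary64 encoding flipped."""
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range [0, 63]")
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def run_with_fault(data, steps, inject_step=None, index=0, bit=0):
    """Toy kernel (hypothetical): ring-neighbour averaging for `steps` iterations.

    If `inject_step` is set, a single bit flip is applied to data[index]
    at the start of that iteration, modelling a soft error at a chosen
    time (iteration) and location (array element).
    """
    data = list(data)
    n = len(data)
    for step in range(steps):
        if step == inject_step:
            data[index] = flip_bit(data[index], bit)
        data = [(data[i - 1] + data[i] + data[(i + 1) % n]) / 3.0
                for i in range(n)]
    return data


def classify(golden, faulty, tol=1e-9):
    """Compare a faulty run against a fault-free run (labels are illustrative)."""
    if any(math.isnan(x) or math.isinf(x) for x in faulty):
        return "non-finite result"
    err = max(abs(g - f) for g, f in zip(golden, faulty))
    return "benign" if err <= tol else "silent data corruption"


if __name__ == "__main__":
    initial = [1.0, 2.0, 3.0, 4.0]
    golden = run_with_fault(initial, steps=5)
    for bit in (0, 63):
        faulty = run_with_fault(initial, steps=5, inject_step=2, index=0, bit=bit)
        print(f"bit {bit}: {classify(golden, faulty)}")
```

Repeating such runs over many randomly chosen iterations, elements, and bit positions gives a rough per-array sensitivity estimate, which is the kind of statistical testing the abstract describes at application scale.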
URL of Published Paper: 
http://dl.acm.org/citation.cfm?id=2389074&dl=ACM&coll=DL&CFID=176480260&CFTOKEN=32696895
Conference Location: 
Salt Lake City, Utah
Publisher: 
IEEE
Published Date: 
November, 2012
Research Areas: