Understanding the Propagation of Silent Data Corruption in Algebraic Multigrid

By Jon Calhoun

University of Illinois at Urbana-Champaign

Published on

Abstract

Sparse linear solvers form a fundamental kernel in high performance computing (HPC). Exascale systems are expected to be more complex than today's systems, being composed of thousands of heterogeneous processing elements that operate at near-threshold voltage to meet power constraints. The combination of near-threshold voltage and the number of processing elements required to reach exascale increases the rate of silent data corruption (SDC). With the rate of SDC expected to be higher, understanding how errors propagate in HPC applications becomes vital for devising efficient detection and recovery schemes. In this talk, we investigate how SDC occurring in fixed-point and floating-point instructions propagates in the linear solver algebraic multigrid (AMG). We discover that SDC occurring on the coarsest levels has the most impact on convergence, requiring extra iterations in a higher percentage of cases than on the finest levels.

Cite this work

Researchers should cite this work as follows:

  • Jon Calhoun (2016), "Understanding the Propagation of Silent Data Corruption in Algebraic Multigrid," https://nanohub.org/resources/23483.


Submitter

NanoBio Node, Aly Taha

University of Illinois at Urbana-Champaign

Tags