Fault Tolerance Techniques for the Merrimac Streaming Supercomputer
Mattan Erez, Nuwan Jayasena, Timothy J. Knight, and William J. Dally
in Proceedings of the SC|05 Conference, November 12-18 2005, Seattle, Washington, USA
Abstract:
As device scales shrink, higher transistor counts become available, while
soft errors, even in logic, become a major concern. A new class of
architectures, such as Merrimac and the IBM Cell, takes advantage of
the higher transistor count by exposing control, communication, and a
large number of functional units at the architectural level, thus
achieving high performance and efficiency. This paper explores
soft-error fault tolerance in the context of these compute-intensive
architectures, which differ significantly from their
control-intensive CPU counterparts. The main goal of the proposed
schemes for Merrimac is to conserve the critical and costly off-chip
bandwidth and on-chip storage resources, while maintaining high peak
and sustained performance. We achieve this by allowing for
reconfigurability and relying on programmer input. The processor
runs either at full peak performance using software
fault-tolerance methods, or at reduced performance with hardware
redundancy. We present several methods, their analysis, and detailed
case studies.
Paper:
Adobe Acrobat PDF
BibTeX:
@conference{ref:sc05_faulttolerance,
author = {Mattan Erez and Nuwan Jayasena and Timothy J. Knight and William J. Dally},
title = {{Fault Tolerance Techniques for the Merrimac Streaming Supercomputer}},
booktitle = {{SC}|05},
year = {2005},
address = {Seattle, Washington, USA},
month = {November},
day = {12--18}
}
(c) ACM, 2005. This is the author's version of the work. It is
posted here by permission of ACM for your personal use. Not for
redistribution. The definitive version was published in the Proceedings of
SC|05, November 12--18, 2005, Seattle, Washington, USA.
Last modified: Mon Oct 10 13:00:58 PDT 2005