Sobiad Atıf Dizini

Article Detail

An Evaluation of Major Fault Tolerance Techniques Used on High Performance Computing (HPC) Applications

2023

Journal:

International Journal of Intelligent Systems and Applications in Engineering

Author:

Abstract:

High performance computing (HPC) systems have a large number of constituent components that facilitate data movement. Key characteristics of these systems include parallel processing, large memory, multiprocessor or multimode communication, and parallel file systems. Although they can speed up computing in scenarios that need maximum processing power, HPC systems face many challenges, key among them being fault tolerance. Today, most applications deal with faults by writing checkpoints frequently. Whenever a fault occurs, all the processes are terminated, and the task is reloaded from the last checkpoint. This paper evaluates the key fault tolerance techniques used on HPC applications, both reactive and proactive. The reactive protocols discussed include checkpointing/restarting, replication, retry, and SGuard, while the proactive techniques include preemptive migration, software rejuvenation, and self-healing. As the discussion of each approach's drawbacks shows, faults are best managed by a hybrid system that applies proactive and reactive measures simultaneously.
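The checkpoint/restart behavior described in the abstract can be illustrated with a minimal single-process sketch. This is not the paper's implementation: the function names (`save_checkpoint`, `load_checkpoint`, `run_task`), the `fail_at` fault-injection parameter, and the use of `pickle` for the state file are all assumptions made for illustration. A real HPC application would coordinate checkpoints across many processes and write them to a parallel file system.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "total": 0}


def run_task(path, n_steps, interval, fail_at=None):
    """Sum 0..n_steps-1, checkpointing every `interval` completed steps.

    `fail_at` injects a simulated fault (hypothetical, for illustration only).
    """
    state = load_checkpoint(path)
    for step in range(state["step"], n_steps):
        if fail_at is not None and step == fail_at:
            # The whole task terminates; in-memory progress since the
            # last checkpoint is lost.
            raise RuntimeError(f"simulated fault at step {step}")
        state["total"] += step
        state["step"] = step + 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run_task(path, 10, 3, fail_at=7)` raises after checkpoints were written at steps 3 and 6. A second call, `run_task(path, 10, 3)`, resumes from step 6 rather than from zero and returns 45, the same result as an uninterrupted run. The checkpoint interval sets the trade-off: frequent checkpoints cost more I/O, while sparse ones lose more work per fault.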

Keywords:

Citation Owners

Information: There are no citations to this publication.

International Journal of Intelligent Systems and Applications in Engineering

Field: Engineering

Journal Type: International

Metrics

Article : 1.632

Cite : 489

2023 Impact : 0.054
