Abstract—Middleware for parallel computing systems incorporates checkpointing to achieve fault tolerance. Most traditional checkpointing approaches tend to lack the dynamism required in large-scale parallel computing environments. Hence, there is a need for an adaptive and dynamic approach. The work reported in this paper proposes a multi-agent based approach for fault tolerance. Five resources that contribute to the infrastructure of the proposed approach are considered, namely the executed problem, the parallel computing platform, the middleware, the hardware abstraction and the agents. The approach is implemented on a computer cluster, and experimental results are presented to validate its feasibility and its contribution towards enhancing fault tolerance.
Index Terms—middleware approach, multi-agent, fault tolerance, parallel computing systems.
Gerard McKee is Senior Lecturer in Networked Robotics, School of Systems Engineering, University of Reading, Whiteknights Campus, Reading, Berkshire, United Kingdom, RG6 6AY, email: g.t.mckee@reading.ac.uk.
Blesson Varghese is a PhD candidate with the Active Robotics Laboratory, School of Systems Engineering, University of Reading, Whiteknights Campus, Reading, Berkshire, United Kingdom, RG6 6AY, email: b.varghese@student.reading.ac.uk.
Vassil Alexandrov is Professor in Computational Science, School of Systems Engineering, University of Reading, Whiteknights Campus, Reading, Berkshire, United Kingdom, RG6 6AY, email: v.n.alexandrov@reading.ac.uk.
Cite: Gerard McKee, Blesson Varghese and Vassil Alexandrov, "A Transition from Traditional Checkpointing towards Multi-Agent based Approaches," International Journal of Computer Theory and Engineering, vol. 2, no. 5, pp. 701-705, 2010.