Monitoring should allow interested parties to make intelligent, proactive decisions regarding infrastructure.
Monitoring should also provide a detailed, real-time view of usage and health.
While some may care whether a specific piece of hardware has failed, others may only care how long their job is going to run. These concerns are related at a high level, but they differ in the details. Because we design our clusters to be resilient to component failure, any one or two failures should not affect the average performance of a cluster.
Performance monitoring and data trending give engineers and management valuable insight into a broad range of system-level issues. For job performance, did changes in code have an adverse impact on performance? Are clusters sized appropriately for expectations? Have more jobs been submitted to the same resource pool?
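
As a rough illustration of the first question, here is a minimal sketch that compares job runtimes recorded before and after a code change. The function name `runtime_regression`, the 10% threshold, and the sample runtimes are all hypothetical and chosen only for illustration; they are not taken from any real cluster or from the monitoring tools used here.

```python
from statistics import median


def runtime_regression(before, after, threshold=0.10):
    """Compare median job runtimes before and after a change.

    Returns (change, flagged), where change is the fractional difference
    in median runtime and flagged is True when the slowdown exceeds the
    threshold. The 10% default is an assumed value, not a recommendation.
    """
    if not before or not after:
        raise ValueError("both runtime samples must be non-empty")
    baseline = median(before)
    if baseline <= 0:
        raise ValueError("baseline median runtime must be positive")
    change = (median(after) - baseline) / baseline
    return change, change > threshold


if __name__ == "__main__":
    # Hypothetical runtimes in seconds for the same job.
    before = [118.0, 121.5, 119.2, 120.3, 122.1]
    after = [131.0, 135.4, 129.8, 133.2, 134.0]
    change, flagged = runtime_regression(before, after)
    print(f"median runtime change: {change:+.1%}, flagged: {flagged}")
```

The median is used instead of the mean so that a single slow run, for example one caused by a failed component, does not dominate the comparison.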
- Ganglia: http://local.compbio.washington.edu/ganglia