HPC Colony Performance/Scaling Measurements
(citations available here)
Charm Adaptive Load Balancing
Figure 1: Scaling of NAMD on ORNL's Jaguar under different configurations. The
results show excellent scaling derived from our hierarchical load balancer for
a 100-million-atom test running on 224,000 cores of Jaguar.
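The hierarchical balancer avoids any single processor having to examine global load data. Below is a minimal sketch of the two-level idea (even out load within small groups, then exchange only aggregates between group leaders); the group size, loads, and the idealized within-group step are illustrative assumptions, not the actual Charm balancer.

    // Two-level (hierarchical) load balancing sketch: only aggregate loads cross
    // group boundaries, so no processor ever needs the full global picture.
    #include <algorithm>
    #include <cstdio>
    #include <numeric>
    #include <vector>

    // Idealized within-group step: every processor in the group ends up at the
    // group average (stands in for the real object-migration step).
    static void balanceGroup(std::vector<double>& load) {
        double avg = std::accumulate(load.begin(), load.end(), 0.0) / load.size();
        for (double& l : load) l = avg;
    }

    int main() {
        std::vector<double> load = {9, 1, 4, 2, 8, 8, 1, 7};  // hypothetical loads
        const std::size_t groupSize = 4;

        // Level 1: balance independently inside each group.
        for (std::size_t g = 0; g < load.size(); g += groupSize) {
            std::vector<double> grp(load.begin() + g, load.begin() + g + groupSize);
            balanceGroup(grp);
            std::copy(grp.begin(), grp.end(), load.begin() + g);
        }

        // Level 2: group leaders exchange only their aggregate loads and shift
        // the difference between groups.
        double g0 = std::accumulate(load.begin(), load.begin() + groupSize, 0.0);
        double g1 = std::accumulate(load.begin() + groupSize, load.end(), 0.0);
        double shiftPerProc = (g0 - g1) / (2.0 * groupSize);
        for (std::size_t i = 0; i < groupSize; ++i) {
            load[i] -= shiftPerProc;
            load[i + groupSize] += shiftPerProc;
        }

        for (double l : load) std::printf("%.2f ", l);
        std::printf("\n");
        return 0;
    }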
Figure 2: Measurements obtained from the BRAMS weather forecasting model. The
top two panels demonstrate that more load is present in zones experiencing storms,
and the bottom two panels illustrate how our load-balancing algorithms
redistribute the work for improved performance.
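Redistribution of this kind is measurement based: each migratable unit of work carries its measured cost (heavier in stormy zones), and overloaded processors shed units to lightly loaded ones. The sketch below shows a generic greedy reassignment under that assumption; the work-unit costs and processor count are made up, and the balancer actually used for BRAMS is more involved.

    // Greedy measurement-based redistribution sketch: assign the heaviest units
    // first, always to the currently least-loaded processor.
    #include <algorithm>
    #include <cstdio>
    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    struct WorkUnit { int id; double measuredCost; };

    int main() {
        std::vector<WorkUnit> units = {{0, 5.0}, {1, 1.0}, {2, 4.0}, {3, 1.5},
                                       {4, 6.0}, {5, 0.5}, {6, 2.0}, {7, 1.0}};
        const int numProcs = 4;

        // Heaviest units first, so they anchor the assignment.
        std::sort(units.begin(), units.end(),
                  [](const WorkUnit& a, const WorkUnit& b) {
                      return a.measuredCost > b.measuredCost;
                  });

        // Min-heap of (current load, processor id).
        using Proc = std::pair<double, int>;
        std::priority_queue<Proc, std::vector<Proc>, std::greater<Proc>> procs;
        for (int p = 0; p < numProcs; ++p) procs.push({0.0, p});

        for (const WorkUnit& u : units) {
            Proc lightest = procs.top();
            procs.pop();
            std::printf("unit %d (cost %.1f) -> proc %d\n",
                        u.id, u.measuredCost, lightest.second);
            lightest.first += u.measuredCost;
            procs.push(lightest);
        }
        return 0;
    }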
Charm Improved Checkpoint/Restart
Figure 3: Scaling of Charm's in-memory checkpoint/restart scheme in the
forward direction (i.e., the overhead when no failure occurs). Using O(n) algorithms for
computation and O(1) algorithms for communication, the scheme scales to 64K
with under 6% overhead.
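The low forward-path overhead comes from keeping checkpoints in memory rather than on a parallel file system: each process stores its own snapshot and holds a copy of a buddy's. The sketch below simulates that buddy scheme inside a single process under simplifying assumptions; it illustrates the idea, not the Charm implementation.

    // Double in-memory checkpointing sketch: each rank keeps one copy of its
    // state locally and one on a "buddy" rank, so a restart only needs the
    // buddy's copy (here the network send is just an in-process assignment).
    #include <cstdio>
    #include <vector>

    struct Rank {
        int id;
        std::vector<double> state;           // live application data
        std::vector<double> localCheckpoint; // own snapshot
        std::vector<double> buddyCheckpoint; // snapshot held for the buddy rank
    };

    // Checkpoint: one local copy plus a single message to the buddy.
    void checkpoint(Rank& me, Rank& buddy) {
        me.localCheckpoint = me.state;
        buddy.buddyCheckpoint = me.state;  // stands in for a point-to-point send
    }

    // Restart after "me" failed: recover the lost state from the buddy's memory.
    void restart(Rank& me, const Rank& buddy) {
        me.state = buddy.buddyCheckpoint;
    }

    int main() {
        Rank r0{0, {1, 2, 3}}, r1{1, {4, 5, 6}};
        checkpoint(r0, r1);
        checkpoint(r1, r0);

        r0.state.clear();  // simulate losing rank 0
        restart(r0, r1);   // pull the copy back from the buddy

        for (double v : r0.state) std::printf("%.0f ", v);
        std::printf("\n");
        return 0;
    }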
Figure 4: Causal message-logging performance in the face of a failure.
The figure plots application progress, in terms of completed iterations, as a function
of elapsed time. In the checkpoint/restart case, the work of several iterations (roughly
100 to 140) must be redone when the failure occurs; with causal message logging,
only the failing processor repeats its work, while processors that do not
depend on it can proceed.
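The key ingredient is that messages delivered to a processor are also kept in a log, so a restarted processor can re-execute from that log while the rest of the machine keeps running. The sketch below shows that logging/replay idea in miniature; determinants and the causal piggy-backing of the real protocol are omitted, and all names are illustrative.

    // Message-logging sketch: every send is recorded per receiver, so only the
    // failed receiver re-executes, consuming its logged messages in order.
    #include <cstdio>
    #include <map>
    #include <vector>

    struct Message { int from, seq, payload; };

    std::map<int, std::vector<Message>> logByReceiver;  // sender-side logs, merged here

    void send(int from, int to, int seq, int payload) {
        logByReceiver[to].push_back({from, seq, payload});  // log before delivery
        // ... normal delivery would happen here ...
    }

    // Recover a failed rank by replaying only the messages it had received.
    void recover(int failedRank) {
        std::printf("replaying rank %d:\n", failedRank);
        for (const Message& m : logByReceiver[failedRank])
            std::printf("  msg %d from rank %d, payload %d\n", m.seq, m.from, m.payload);
    }

    int main() {
        send(1, 0, 0, 42);
        send(2, 0, 1, 7);
        send(0, 1, 0, 9);   // rank 1's traffic is untouched by rank 0's failure
        recover(0);         // only rank 0 repeats its work
        return 0;
    }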
Coordinated Scheduling Kernel
Figure 5: Measurements obtained from Colony's synchronized clock algorithm
exhibit excellent scaling characteristics.
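A common building block for such a service is estimating the offset between two clocks from a timestamped round trip. The sketch below shows that classic estimate; it is a generic illustration, not necessarily the algorithm used by the Colony kernel, and the "remote" clock and delays are faked locally.

    // Round-trip clock-offset estimation sketch (Cristian-style):
    //   offset = remote_timestamp + RTT/2 - local_receive_time
    #include <chrono>
    #include <cstdio>
    #include <thread>

    using Clock = std::chrono::steady_clock;

    // Pretend "remote" clock that runs 2500 microseconds ahead of ours.
    long long remoteTimeUs() {
        auto now = std::chrono::duration_cast<std::chrono::microseconds>(
                       Clock::now().time_since_epoch()).count();
        return now + 2500;
    }

    int main() {
        auto toUs = [](Clock::time_point t) {
            return std::chrono::duration_cast<std::chrono::microseconds>(
                       t.time_since_epoch()).count();
        };

        long long t0 = toUs(Clock::now());                            // request sent
        std::this_thread::sleep_for(std::chrono::microseconds(300));  // fake network delay
        long long ts = remoteTimeUs();                                // server timestamp
        std::this_thread::sleep_for(std::chrono::microseconds(300));  // fake network delay
        long long t1 = toUs(Clock::now());                            // reply received

        long long rtt = t1 - t0;
        long long offset = ts + rtt / 2 - t1;  // estimated remote-minus-local offset
        std::printf("rtt = %lld us, estimated offset = %lld us\n", rtt, offset);
        return 0;
    }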
Figure 6: Coordinated versus uncoordinated scheduling. The figure shows
histograms of runs with and without coordinated scheduling. The lower histogram,
obtained with coordinated scheduling, exhibits much lower variability in synchronizing
collective operations.
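Once clocks agree, every node can compute the same slot boundaries from the shared time and confine noise-generating system activity to a common slot, which is what keeps collectives tight. The sketch below illustrates that slot computation with made-up period and slot lengths; it is not the kernel's actual scheduler.

    // Globally aligned "noise slot" sketch: all nodes derive the same slot from
    // the synchronized clock, so background work never interrupts a collective
    // on only some of the nodes.
    #include <cstdio>

    const long long periodUs = 10000;   // hypothetical 10 ms scheduling period
    const long long noiseSlotUs = 500;  // hypothetical 0.5 ms shared "noise" slot

    // All nodes agree on the slot because they agree on globalTimeUs.
    bool inNoiseSlot(long long globalTimeUs) {
        return (globalTimeUs % periodUs) < noiseSlotUs;
    }

    long long nextNoiseSlotStart(long long globalTimeUs) {
        return (globalTimeUs / periodUs + 1) * periodUs;
    }

    int main() {
        long long now = 123456789;  // pretend synchronized global time in microseconds
        if (inNoiseSlot(now))
            std::printf("run deferred system work now\n");
        else
            std::printf("application slot; next noise slot starts at %lld us\n",
                        nextNoiseSlotStart(now));
        return 0;
    }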
SpiderCAST Communications Infrastructure
Figure 7: We measure the time for join events to propagate through
the overlay for different view sizes and different numbers of nodes joining.
The results indicate that a 512-node zone propagates 16 concurrent joins to every member
in TJoin(512,16) ~ 0.6 s; moreover, the number of nodes joining concurrently has very
little effect on the latency.
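Propagation times of this shape are characteristic of gossip-style dissemination, where the set of informed members grows roughly exponentially per round, so full coverage takes on the order of log N rounds. The sketch below simulates that behavior for a 512-node view with an assumed fan-out of 3; it illustrates the general mechanism, not SpiderCAST's own protocol.

    // Gossip propagation sketch for one join event: each round, every informed
    // node forwards the event to a few random peers.
    #include <cstdio>
    #include <random>
    #include <vector>

    int main() {
        const int viewSize = 512;  // zone size, as in Figure 7
        const int fanout = 3;      // hypothetical gossip fan-out per round

        std::mt19937 rng(7);
        std::uniform_int_distribution<int> pick(0, viewSize - 1);

        std::vector<bool> knows(viewSize, false);
        knows[0] = true;  // the member adjacent to the joiner learns of it first

        int rounds = 0, informed = 1;
        while (informed < viewSize) {
            std::vector<int> senders;
            for (int n = 0; n < viewSize; ++n)
                if (knows[n]) senders.push_back(n);
            for (int s : senders)
                for (int f = 0; f < fanout; ++f) {
                    int peer = pick(rng);
                    if (!knows[peer]) { knows[peer] = true; ++informed; }
                }
            ++rounds;
            std::printf("after round %d: %d of %d members know the join\n",
                        rounds, informed, viewSize);
        }
        return 0;
    }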
Figure 8: Similarly, we measure the time for leave events to
propagate through the overlay for different view sizes and different numbers of nodes
leaving. The results indicate that a 512-node zone propagates 16 concurrent leaves to
every member in TLeave(512,1) ~ 0.35 s; as with joins, the number of nodes
leaving concurrently has very little effect on the latency.
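The insensitivity to the number of concurrent events is easy to see if membership changes are batched into the same gossip payload: spreading 16 leaves takes the same exchanges as spreading one. The sketch below illustrates that batching with made-up event names; it is not SpiderCAST's wire format.

    // Event batching sketch: a gossip exchange merges the peers' known sets of
    // membership events, so 1 or 16 concurrent leaves cost the same rounds.
    #include <cstdio>
    #include <set>
    #include <string>

    using EventSet = std::set<std::string>;  // e.g. "leave:node42", "join:node7"

    // One gossip exchange: both peers end up with the union of their event sets.
    void exchange(EventSet& a, EventSet& b) {
        EventSet merged = a;
        merged.insert(b.begin(), b.end());
        a = b = merged;
    }

    int main() {
        // Two members start out knowing disjoint subsets of 16 concurrent leaves.
        EventSet nodeA, nodeB;
        for (int i = 0; i < 8; ++i)  nodeA.insert("leave:node" + std::to_string(i));
        for (int i = 8; i < 16; ++i) nodeB.insert("leave:node" + std::to_string(i));

        exchange(nodeA, nodeB);  // one exchange spreads all 16 events to both
        std::printf("events known at node A after one exchange: %zu\n", nodeA.size());
        return 0;
    }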