HPC-Colony Project

Adaptive System Software For Improved Resiliency and Performance

 

 

 

 

Overview

The HPC-Colony project is a joint research effort among Oak Ridge National Laboratory, the IBM T.J. Watson Research Center and Haifa Research Center, and the University of Illinois at Urbana-Champaign. Its aim is to create scalable services and interfaces that permit both high performance and easy application porting on high-performance computing (HPC) systems with very large numbers of processors. Funding for the HPC-Colony Project is provided by a grant from the U.S. Department of Energy Office of Science.

The motivation for the HPC-Colony Project is to make portable performance a reality. Today, domain scientists must expend considerable effort to increase their application's efficiency on a particular machine architecture. Colony is developing system software that dramatically reduces the burden placed upon domain scientists by shifting much of the tuning to adaptive system software. Moreover, because the tuning lives in the system software, the effort invested to run efficiently on one leadership-class machine can carry over to new machines.

Our approach relies on addressing three critical HPC areas:

  • Parallel resource management
    • Difficulties in scheduling workloads and achieving balanced partitioning can limit scaling for complex problems on large machines. We use automatic, adaptive load balancing together with fault tolerance.
  • Communication services
    • We provide high-performance communication services for membership, publish/subscribe overlays (multicast), and convergecast.
  • Advanced Kernel work
    • We address issues with Linux to provide the familiarity and performance needed by domain scientists. Among our advances is Colony's coordinated scheduling.
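
To make the load-balancing idea in the first item concrete, here is a minimal sketch of one simple strategy: a greedy assignment that places the largest remaining task on the currently least-loaded processor. This is an illustration only and is not taken from Colony or Charm++; the function name greedy_balance and the task loads are hypothetical. In an adaptive setting, the loads would be measured at run time and the assignment recomputed periodically.

```python
import heapq

def greedy_balance(task_loads, num_procs):
    """Hypothetical helper: assign tasks to processors greedily.

    Tasks are visited from largest to smallest load, and each one is
    placed on the processor with the smallest total load so far.
    Returns (assignment, totals), where assignment maps task index to
    processor index and totals[p] is the summed load on processor p.
    """
    # Min-heap of (current_total_load, processor_id).
    heap = [(0.0, p) for p in range(num_procs)]
    heapq.heapify(heap)

    assignment = {}
    # Largest tasks first, so small tasks can fill in the gaps later.
    for task, load in sorted(enumerate(task_loads), key=lambda t: -t[1]):
        total, proc = heapq.heappop(heap)
        assignment[task] = proc
        heapq.heappush(heap, (total + load, proc))

    totals = [0.0] * num_procs
    for task, proc in assignment.items():
        totals[proc] += task_loads[task]
    return assignment, totals

if __name__ == "__main__":
    # Made-up per-task loads (for example, seconds of work per step).
    assignment, totals = greedy_balance([5, 3, 3, 2, 2, 1], num_procs=2)
    print("per-processor load:", totals)
```

In a bulk-synchronous application the time per step is set by the most heavily loaded processor, so the quantity worth minimizing is max(totals) rather than the average.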

Ever-increasing numbers of processors and the inherent restrictions found in today's system software impose artificial barriers upon the capacity of our most capable HPC machines. For developers to scale applications to these new processor counts, system software must be freed of imbalances and scaling shortcomings. Moreover, the arduous task of balancing an application is best accomplished using dynamically enforced schemes with global knowledge -- a new opportunity for system software. Indeed, system software improvements are needed to provide important benefits to users of HPC systems:

  • provide higher levels of application scalability; specifically, remove the problems associated with operating system interference (noise) as well as the problems associated with application load imbalances
  • permit application porting without syscall modifications
  • support familiar tools including a wide range of debugging and development tools on compute nodes
  • provide dynamic support for multiple management policies
  • provide support for fault tolerance
  • provide parallel awareness and optimization

The Colony project is developing a coordinated framework using Linux and the Charm++ run-time system to bring about these HPC goals for the benefit of parallel applications.

 


For further information on the Colony Project, contact Terry Jones (email trj@ornl.gov)

