

The development of modern processors exhibits two trends that complicate the optimization of modern software. The first is the increasing sensitivity of processors' throughput to irregularities in computation. With more processors built through massive integration of simple cores, future systems will increasingly favor regular data-level parallel computations and deviate from the needs of applications with complex patterns. Some evidence is already visible on Graphics Processing Units (GPUs): irregular data accesses (e.g., indirect references) and conditional branches limit many GPU applications' performance to a level an order of magnitude below the GPU's peak. The second hardware trend is the growing gap between memory bandwidth and the aggregate speed (that is, the sum of all cores' computing power) of a Chip Multiprocessor (CMP). Despite the capped growth of peak CPU speed, the aggregate speed of a CMP keeps increasing as more cores are integrated into a single chip.
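To make "irregular" concrete, the toy C loop below (written for illustration; it is not taken from any of the cited work) gathers through an index array and branches on the loaded value. Both the memory locations and the control flow become data dependent, which is exactly what defeats memory coalescing and uniform control flow on SIMD-style hardware.

```c
/* Minimal sketch of an irregular access pattern: the gather through
 * idx[] makes the memory locations data dependent, and the branch
 * makes control flow data dependent. */
#include <stdio.h>

#define N 8

int main(void) {
    float a[N]   = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[N]   = {0};
    int   idx[N] = {7, 2, 5, 0, 3, 6, 1, 4};   /* arbitrary permutation */

    for (int i = 0; i < N; i++) {
        float v = a[idx[i]];        /* indirect (gather) reference */
        if (v > 4.0f)               /* data-dependent conditional branch */
            b[i] = v * 2.0f;
        else
            b[i] = v;
    }

    for (int i = 0; i < N; i++)
        printf("%.1f ", b[i]);
    printf("\n");
    return 0;
}
```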

#Berkeley upc benchmark uts code#
GMT (Global Memory and Threading library) is a custom runtime library that enables efficient execution of irregular applications on commodity clusters. GMT integrates a PGAS data substrate with simple fork/join parallelism and provides automatic load balancing on a per-node basis. It implements multi-level aggregation and lightweight multithreading to maximize memory and network bandwidth under fine-grained data accesses and to tolerate long data access latencies. A key innovation in the GMT runtime is its thread specialization (workers, helpers, and communication threads), which together realize the overall functionality. We compare our approach with other PGAS models, such as UPC running over GASNet, and with hand-optimized MPI code on a set of typical large-scale irregular applications, demonstrating speedups of an order of magnitude.
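The aggregation idea can be sketched in a few lines of plain C. This is a conceptual illustration only, with made-up names (agg_buffer_t, remote_put, send_aggregated) rather than GMT's actual interface: fine-grained remote writes are appended to a per-destination buffer and shipped as one large message when the buffer fills, so the network sees a few big transfers instead of many tiny ones.

```c
/* Conceptual sketch of communication aggregation (illustrative names,
 * not GMT's real API): buffer fine-grained remote writes per destination
 * node and send them as one large message when the buffer fills. */
#include <stdint.h>
#include <stdio.h>

#define NODES        4
#define AGG_ENTRIES  256   /* assumed capacity of one aggregation buffer */

typedef struct {
    uint64_t addr;         /* remote address to update */
    uint64_t value;        /* payload of the fine-grained write */
} remote_put_t;

typedef struct {
    remote_put_t entries[AGG_ENTRIES];
    size_t count;
} agg_buffer_t;

static agg_buffer_t buffers[NODES];

/* Stand-in for the transport layer: a real runtime would hand the
 * filled buffer to a communication thread here. */
static void send_aggregated(int dest_node, agg_buffer_t *buf)
{
    printf("node %d: sending %zu aggregated puts\n", dest_node, buf->count);
    buf->count = 0;
}

/* Called for every fine-grained write; flushes only when the buffer is
 * full (a real runtime would also flush on a timeout to bound latency). */
static void remote_put(int dest_node, uint64_t addr, uint64_t value)
{
    agg_buffer_t *buf = &buffers[dest_node];
    buf->entries[buf->count].addr  = addr;
    buf->entries[buf->count].value = value;
    if (++buf->count == AGG_ENTRIES)
        send_aggregated(dest_node, buf);
}

int main(void)
{
    /* Scatter 1000 single-word updates across 4 destination nodes. */
    for (uint64_t i = 0; i < 1000; i++)
        remote_put((int)(i % NODES), i * 8, i);

    /* Flush whatever is left in the buffers. */
    for (int n = 0; n < NODES; n++)
        if (buffers[n].count > 0)
            send_aggregated(n, &buffers[n]);
    return 0;
}
```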
#Berkeley upc benchmark uts portable#
Unified Parallel C (UPC) is a parallel language that uses a Single Program Multiple Data (SPMD) model of parallelism within a global address space. The global address space is used to simplify programming, especially for applications with irregular data structures that lead to fine-grained sharing between threads. Recent results have shown that the performance of UPC using a commercial compiler is comparable to that of MPI. In this paper we describe a portable open-source compiler for UPC. Our goal is to achieve similar performance while enabling easy porting of the compiler and runtime, and also to provide a framework that allows for extensive optimizations. We identify some of the challenges in compiling UPC and use a combination of micro-benchmarks and application kernels to show that our compiler has low overhead for basic operations on shared data and is competitive with, and sometimes faster than, the commercial HP compiler. We also investigate several communication optimizations and show significant benefits from hand-optimizing the generated code.
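The SPMD, global-address-space style the abstract describes looks roughly like the UPC fragment below (a minimal sketch written for this summary, not code from the paper). Every thread executes main(); the shared arrays live in the partitioned global address space, and the affinity clause of upc_forall runs each iteration on the thread that owns the element, so the fine-grained accesses stay local.

```c
/* Minimal UPC sketch: SPMD execution over a partitioned global
 * address space, with owner-computes iteration via upc_forall. */
#include <upc.h>
#include <stdio.h>

#define PER_THREAD 4

/* Shared arrays distributed cyclically; THREADS is the thread count. */
shared int a[PER_THREAD * THREADS];
shared int b[PER_THREAD * THREADS];
shared int c[PER_THREAD * THREADS];

int main(void) {
    int i;

    /* Each thread initializes only the elements it owns. */
    upc_forall (i = 0; i < PER_THREAD * THREADS; i++; &a[i]) {
        a[i] = i;
        b[i] = 2 * i;
    }
    upc_barrier;                    /* wait until all data is ready */

    /* Element-wise add; the affinity expression &c[i] places each
     * iteration on the thread with affinity to c[i]. */
    upc_forall (i = 0; i < PER_THREAD * THREADS; i++; &c[i])
        c[i] = a[i] + b[i];
    upc_barrier;

    if (MYTHREAD == 0)
        printf("c[1] = %d (ran with %d UPC threads)\n", c[1], THREADS);
    return 0;
}
```

With the Berkeley UPC toolchain, a fragment like this would typically be compiled with upcc and launched with upcrun -n followed by the desired thread count.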
