  • Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures
    with M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, and K. A. Yelick
  • Productivity and performance using partitioned global address space languages
    with K. Yelick, D. Bonachea, W. Y. Chen, P. Colella, J. Duell, S. L. Graham, P. Hargrove, P. Hilfinger, P. Husbands, C. Iancu, A. Kamil, R. Nishtala, J. Su, M. Welcome, and T. Wen
    Partitioned Global Address Space (PGAS) languages combine the programming convenience of shared memory with the locality and performance control of message passing. One such language, Unified Parallel C, is an extension of ISO C defined by a consortium that boasts multiple proprietary and open-source compilers. Another PGAS language, Titanium, is a dialect of Java™ designed for high-performance scientific computation. In this paper we describe some of the highlights of two related projects, the Titaniu…
  • Parallel languages and compilers: Perspective from the Titanium experience
    with K. Yelick, P. Hilfinger, S. Graham, D. Bonachea, J. Su, A. Kamil, P. Colella, and T. Wen
    We describe the rationale behind the design of key features of Titanium (an explicitly parallel dialect of Java for high-performance scientific programming) and our experiences in building applications with the language. Specifically, we address Titanium's partitioned global address space model, single program multiple data parallelism support, multi-dimensional arrays and array-index calculus, memory management, immutable classes, operator overloading, and generic programming. We provide an overv…
  • Auto-Tuning the 27-point Stencil for Multicore
    with S. W. Williams, V. Volkov, J. Carter, L. Oliker, J. Shalf, and K. Yelick
    This study focuses on the key numerical technique of stencil computations, used in many different scientific disciplines, and illustrates how auto-tuning can be used to produce very efficient implementations across a diverse set of current multicore architectures.
  • Auto-tuning Stencil Computations on Multicore and Accelerators
    with S. W. Williams, V. Volkov, J. Carter, L. Oliker, J. Shalf, and K. Yelick
  • Auto-Tuning Memory-Intensive Kernels for Multicore
    with S. W. Williams, L. Oliker, J. Carter, J. Shalf, and K. Yelick
  • Implicit and explicit optimizations for stencil computations
    with S. Kamil, S. Williams, L. Oliker, J. Shalf, and K. Yelick
    Stencil-based kernels constitute the core of many scientific applications on block-structured grids. Unfortunately, these codes achieve a low fraction of peak performance, due primarily to the disparity between processor and main memory speeds. We examine several optimizations on both the conventional cache-based memory systems of the Itanium 2, Opteron, and Power5 and the heterogeneous multicore design of the Cell processor. The optimizations target cache reuse across stencil sweeps, in…