Person:

Brownell, Kevin Matthew


Last Name

Brownell

First Name

Kevin Matthew

Search Results

  • Publication

    Architectural Implications of Automatic Parallelization With HELIX-RC

    (2015-09-23) Brownell, Kevin Matthew; Brooks, D.; Wei, G.; Yang, W.

    As classic Dennard process scaling fades into the past, power density concerns have driven modern CPU designs to de-emphasize the pursuit of single-thread performance, focusing instead on increasing the number of cores in a chip. Computing throughput on a modern chip continues to improve, since multiple programs can run in parallel, but the performance of single programs improves only incrementally. Many compilers have been designed to automatically parallelize sequentially written programs by leveraging multiple cores for the same task, thereby enabling continued single-thread performance gains. One such compiler is HELIX, which can increase the performance of a mixture of SPECfp and SPECint benchmarks by 2X on a 6-core Nehalem CPU.

    Previous approaches to automatically parallelizing irregular programs have focused on removing apparent dependences through thread-level speculation, which limits the type of code that can be targeted. In contrast, this dissertation increases the amount of code that can be parallelized by addressing the specific communication demands of that code. The dissertation proposes a special-purpose extension of the cache hierarchy, called ring cache, to greatly reduce the perceived communication latency between cores running an automatically parallelized program. This co-design of ring cache and the HELIX compiler, called HELIX-RC, increases the speedup of 10 SPEC benchmarks running on 16 simulated in-order cores from an average of 2X to an average of over 8X. Speedups are slightly reduced to 7X on out-of-order cores, which extract instruction-level parallelism on their own. A fully synthesized Verilog implementation of ring cache is evaluated and is shown to consume less than 25 mW of power with an area of less than 0.275 square millimeters.

    This dissertation includes a study comparing single-program-per-core multiprogramming with HELIX-RC. Counterintuitively, some HELIX-RC parallelized benchmarks not only surpass simple multiprogramming in terms of single-program performance, but can also beat multiprogramming in terms of total multicore throughput by reducing the effective per-core working set of a program.

    With communication bottlenecks removed by ring cache, automatic parallelization with HELIX-RC restores a decade of lost single-thread performance improvements.