Lib4U

‎"Behind every stack of books there is a flood of knowledge."

Parallel Computer Architecture


Course Lecture Plan


Permission is granted to copy and distribute this material for educational purposes only, provided that the complete bibliographic citation and the following credit line are included: “Copyright (C) 2002 UCB.” Permission is granted to alter and distribute this material provided that the following credit line is included: “Adapted from (complete bibliographic citation). Copyright (C) 2000 UCB.”

This material may not be copied or distributed for commercial purposes without express written permission of the copyright holder.


 

Wk | Date | Lec | Lecture Topic | Notes | Reading Assignment
1 | W 1/23 | 1 | Introduction to Parallel Architecture | [ppt,ps,pdf] |
2 | M 1/28 | 2 | Convergence of Parallel Architectures | [ppt,ps,pdf] |
2 | W 1/30 | 3 | Interconnection Networks I | [ppt,ps,pdf] |
3 | M 2/4 | 4 | Interconnection Networks II: Topology and Routing | [ppt,ps,pdf] | Reading #1 due
3 | W 2/6 | 5 | Interconnection Networks III: Routing | [ppt,ps,pdf] | Reading #2 due
4 | M 2/11 | 6 | Interconnection Networks IV: Routing and Logical Effort | [ppt,ps,pdf] | Reading #3 due
4 | W 2/13 | 7 | More Interconnection Networks (Fault Tolerance) | <No slides> | Reading #4 due
5 | M 2/18 | - | President's Holiday: No Classes | |
5 | W 2/20 | 8 | Network Interface Designs | [ppt,ps,pdf] | Reading #5 due
6 | M 2/25 | 9 | Discussion of Active Messages / iWarp | [AM: ppt,ps,pdf] [iWarp: ppt,ps,pdf] | Reading #6 due
6 | W 2/27 | 10 | Discussion of LogP Model / RAW Operand Network | [LogP: ps,pdf] [RAW: ppt,ps,pdf] | Reading #7 due
7 | M 3/3 | 11 | Discussion of WaveScalar and TRIPS | [WaveScalar: ps,pdf] [TRIPS: ppt,ps,pdf] | Reading #8 due
7 | W 3/5 | 12 | Discussion of Merrimac and Cell Processors | [Merrimac: ppt,ps,pdf] [Cell: pptx,ps,pdf] | Reading #9 due
8 | M 3/10 | 13 | Shared Memory Multiprocessors | [ppt,ps,pdf] | Reading #10 due
8 | W 3/12 | 14 | Shared Memory Multiprocessors (cont.); Sequential Consistency via Reordering | [ppt,ps,pdf] [SCILP: ppt,ps,pdf] | Reading #11 due
9 | M 3/17 | 15 | Snoopy Caches I | [ppt,ps,pdf] [DASH: ppt,ps,pdf] | Reading #12 due
9 | W 3/19 | 16 | Snoopy Caches II; Limited Directory Machines; LimitLESS Coherence; Zoo of Bus-Based Protocols | [ppt,ps,pdf] [LimitLESS: ppt,ps,pdf] [Snoopy: ppt,ps,pdf] | Reading #13 due
10 | M 3/24, W 3/26 | - | Spring Break: no classes this week | |
11 | M 3/31 | 17 | DDM Machine; NUMA vs. COMA | [DDM: ppt,ps,pdf] [NumaComa: ppt,ps,pdf] | Reading #14 due
11 | W 4/2 | 18 | FLASH Multiprocessor; Protocols for Migratory Data | [FLASH: ppt,ps,pdf] [Migrate: ppt,ps,pdf] | Reading #15 due
12 | M 4/7 | 19 | Checkpoint Recovery in Shared Memory | [ReVive: ppt,ps,pdf] [SafetyNet: ppt,ps,pdf] | Reading #16 due
12 | W 4/9 | 20 | No class: individual project meetings | <No slides> |
13 | M 4/14 | 21 | Coherence Protocol Architecture Wrap-up | [ppt,ps,pdf] |
13 | W 4/16 | 22 | Special guest lecture: Anant Agarwal talks about Tilera (location: 3108 Etcheverry) | <No slides>; supplementary reading | Reading #17 due
14 | M 4/21 | 23 | Synchronization | [ppt,ps,pdf] | Reading #18 due
14 | W 4/23 | 24 | MCS Locks; Reactive Synchronization | [MCS: pptx,ps,pdf] [Reactive: ppt,ps,pdf] | Reading #19 due
15 | M 4/28 | 25 | Software Transactional Memory (STM); Transactional Coherence and Consistency (TCC) | [STM: ppt,ps,pdf] [TCC: ppt,ps,pdf] | Reading #20 due
15 | W 4/30 | 26 | Unbounded Transactional Memory (UTM); Virtual Transactional Memory (VTM) | [UTM: ppt,ps,pdf] [VTM: pptx,ps,pdf] | Reading #21 due
16 | M 5/5 | 27 | LogTM; Hybrid Transactional Memory | [LogTM: ppt,ps,pdf] [Hybrid: ppt,ps,pdf] | Reading #22 due
16 | W 5/7 | 28 | MultiScalar Processor; Hydra CMP | [MultiScalar: ppt,ps,pdf] [Hydra: ppt,ps,pdf] | Reading #23 due
17 | M 5/12 | 29 | Memory Systems for CMPs; Messages vs. Shared Memory | [CMPMem: ppt,ps,pdf] [MVM: ppt,ps,pdf] | Reading #24 due
17 | W 5/14 | 30 | Project Presentations (1:00-4:00pm, 310 Soda) | |

General:

  • ISCA 25-year retrospective (only within Berkeley): [ps,pdf]

Reading Assignments: (Final List of Papers: [html])

Homework Assignments:

  • Nothing yet

Quiz:

  • Nothing yet

Useful Links

Textbook:

Parallel Computer Architecture:
a Hardware/Software Approach
David E. Culler, University of California, Berkeley; 
Jaswinder Pal Singh, Princeton University; 
with Anoop Gupta, Stanford University.
Morgan Kaufmann Publishers

Additional Reading:

There will be no formal reader for this class. Required and recommended papers will be distributed in class; extras will be available at a place TBA.

One interesting resource that is now available (only to local Berkeley hosts) is the ISCA 25-year retrospective in postscript and pdf. Several of the papers that we will be reading this term have retrospectives written by the original authors that appear in this volume.

Search Utilities

Tools

    • Tools: a short story, by Remzi Arpaci, briefly describes pixie, pixstats, prof, dinero, qpt, CPROF, spim, Cacti, shade, and spixtools. Most of the specific directories and files mentioned are on the instructional machines.
    • Wisconsin Architectural Research Tool Set (WARTS) – including QPT, QPT2, CPROF, Tycho, dineroIII, and EEL (a toy sketch of what a dinero-style trace-driven cache simulator computes appears at the end of this Tools list)
    • ATOM – A toolkit that can be used for tracing, and much more. Only runs on Alphas. Log in to either saidin.eecs (an instructional machine) or speeding.cs (on the same file system as the NOW machines)
    • EEL – A toolkit that can be used for tracing, and much more. Only runs on SPARCs.

 

      See the following class project for some more details.

  • Etch – A tracing tool for Intel x86 platforms running either Win95, WinNT, or Linux. Courtesy of Harvard University and the University of Washington.
  • Instruction-Level Simulation And Tracing – This contains a huge list of simulators, emulators, and tracing tools, including an extensive bibliography as well as a list of people to contact
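To make concrete what the trace-driven cache simulators above (dineroIII and the tools in Remzi Arpaci's note) compute, here is a minimal sketch of the idea: replay a stream of memory addresses against a modeled cache and report hit and miss counts. The trace format, cache parameters, and names below are assumptions for illustration, not the interface of any of the listed tools.

```c
/* Minimal trace-driven, direct-mapped cache simulator (illustrative sketch).
 * Reads one hex address per line from stdin and reports hits and misses.
 * Parameters are arbitrary; real tools such as dinero are configurable. */
#include <stdio.h>
#include <inttypes.h>

#define LINE_BYTES 32u     /* cache line size */
#define NUM_LINES  1024u   /* direct-mapped lines (32 KB total) */

int main(void) {
    uint64_t tags[NUM_LINES];
    int      valid[NUM_LINES] = {0};
    unsigned long hits = 0, misses = 0;
    uint64_t addr;

    while (scanf("%" SCNx64, &addr) == 1) {   /* one address per trace record */
        uint64_t line  = addr / LINE_BYTES;
        uint64_t index = line % NUM_LINES;    /* which cache line it maps to */
        uint64_t tag   = line / NUM_LINES;
        if (valid[index] && tags[index] == tag) {
            hits++;
        } else {                              /* miss: fill the line */
            misses++;
            valid[index] = 1;
            tags[index]  = tag;
        }
    }
    unsigned long total = hits + misses;
    printf("hits=%lu misses=%lu miss_rate=%.2f%%\n",
           hits, misses, total ? 100.0 * misses / total : 0.0);
    return 0;
}
```

Fed an address trace of the kind produced by pixie- or QPT-style instrumentation (one hex address per line), this reports the miss rate for the modeled cache; the real tools add associativity, multiple levels, and replacement policies on top of the same core loop.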

Benchmarks

  • The Performance Database Server
  • Benchmarks FAQ – from the newsgroup comp.benchmarks
  • Benchmark Warehouse Center
  • SPEC benchmarks – benchmark suite from the Standard Performance Evaluation Corporation
  • bench – network benchmark program for IP networks that measures bulk transfer throughput and round trip delays using either TCP/IP or UDP/IP
  • lmbench – micro-benchmark suite that measures system latency and bandwidth of data movement among the processor and memory, network, file system, and disk (a minimal pointer-chasing sketch of this kind of latency measurement appears after this list)
  • HBench-OS – an improved version of lmbench that fixes bugs, generates more statistically-accurate results, and includes support for the Pentium and Pentium Pro performance monitoring counters
  • ttcp – times transmission and reception of data between two systems using UDP and TCP protocols
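As referenced in the lmbench entry above, memory-latency micro-benchmarks typically walk a randomly permuted linked list so that each load depends on the previous one, exposing average load latency at a given working-set size. The sketch below illustrates the technique only; it is not code from lmbench or HBench-OS, and the array size, load count, and timing granularity are arbitrary choices.

```c
/* Minimal pointer-chasing latency microbenchmark (illustrative sketch).
 * Builds a single random cycle so each load depends on the previous load's
 * result, then times a long run of dependent loads. Requires POSIX clocks. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N_ELEMS (1 << 22)   /* working set: 4M pointers (~32 MB on 64-bit) */
#define N_LOADS (1 << 26)   /* number of dependent loads to time */

int main(void) {
    size_t *next = malloc(N_ELEMS * sizeof *next);
    size_t i, j, tmp, idx = 0;
    if (!next) return 1;

    for (i = 0; i < N_ELEMS; i++) next[i] = i;
    srand(42);
    /* Sattolo's algorithm: produces one cycle through all elements
     * (modulo bias from rand() is ignored in this sketch). */
    for (i = N_ELEMS - 1; i > 0; i--) {
        j = (size_t)rand() % i;
        tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < N_LOADS; i++)
        idx = next[idx];                 /* serialized, dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg latency: %.1f ns/load (idx=%zu)\n", ns / N_LOADS, idx);
    return 0;
}
```

Varying N_ELEMS from a few kilobytes up to many megabytes traces out the latency of each level of the memory hierarchy, which is essentially the curve lmbench's memory-latency test reports.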

Traces

    • Etch Traces – user level traces of Windows NT applications collected using Etch.
    • Etch Traces – another link.
    • BYU Traces
    • Monster Traces – Traces of 8 applications (run under Ultrix and Mach) collected with a hardware monitor on a DEC-MIPS workstation. Includes user and kernel activity. First presented in a paper which appeared in ISCA95. Courtesy of Richard Uhlig (thanks to Christoforos for getting them). More details forthcoming when these are actually available.
    • Patchwrx – Single-processor portions of traces, containing user and kernel references, from 2 SPEC benchmarks and a database, running under Windows NT on an Alpha. Courtesy of Dick Sites, from DEC. A few projects from last spring’s IRAM class, as well as some Spring 1996 CS 252 projects, detail caveats concerning the accuracy of these traces (and of Dick Sites’ conclusions regarding “Caches Don’t Work”).
    • New Mexico State Univ Trace Database
    • Internet Traffic Archive – contains 5 traces of TCP/ethernet traffic, 6 traces of requests received by specific web servers, 2 traces of web client requests (from some pool of clients), and a set of traceroute measurements. Traces can span hours or weeks and may occupy multiple megabytes uncompressed.

 

    More info from Westley Weimer: “The web requests (the only ones I have examined in detail) tend to be long text files of tuples: (client-ip, time, document requested, size, etc) with the exact fields and format varying by trace. One trace even comes with a C interface for reading their records. ”
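    Because the exact fields and format vary by trace, any reader has to be written against one specific layout. The snippet below is a hypothetical example only: it assumes a whitespace-separated record of client IP, timestamp, requested document, and size in bytes, which is one plausible layout rather than the format of any particular archive trace.

```c
/* Hypothetical web-trace reader (illustrative sketch).
 * Assumes whitespace-separated records: <client-ip> <time> <document> <bytes>
 * Real traces vary in both fields and format. */
#include <stdio.h>

int main(void) {
    char ip[64], doc[1024];
    double when;
    long bytes;
    long requests = 0;
    long long total_bytes = 0;

    while (scanf("%63s %lf %1023s %ld", ip, &when, doc, &bytes) == 4) {
        requests++;             /* count each request record */
        total_bytes += bytes;   /* accumulate transferred bytes */
    }
    printf("%ld requests, %lld bytes transferred\n", requests, total_bytes);
    return 0;
}
```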

    Simulators

      • The SimpleScalar simulator is a very flexible simulation package that comes with a working version of GCC and precompiled versions of the SPEC 95 benchmark suite. SimpleScalar comes with several different simulators, each of which supports a different level of simulation: everything from timing-free but fast simulation (for debugging) to a full out-of-order simulator. The simulated processor is a variant of the MIPS architecture.

     

      For a precompiled version that works on Linux X86, see Kubiatowicz’s home page: ~kubitron/simplescalar.

      CPU info

      • The CPU Info Center has a good summary of high-level info
      • MIPS does a very good job of providing on-line documentation
      • Intel’s web page is generally full of marketing crap, but there’s a neat Intel Secrets page, which is subtitled “What Intel Doesn’t Want You To Know”
      • I have some hardcopy documentation on DEC Alpha microprocessors. I am willing to briefly loan them for making selective copies, or I can provide you with information as to how you can order (for free) your own copy.

      Related Course Pages


      Other Useful Links

      CS258: Spring 2008 Final Projects

      This page contains pointers to the final CS258 project pages for Spring of 2008. These projects are done in groups of two or three and span a wide range of topics. To see the original list of suggested projects, look HERE.

      For comments and information about requirements for the final paper, look HERE.


      1:   Hybrid Electric/Photonic Networks for Scientific Applications on Tiled CMPs
      Ankit Jain and Shoaib Kamil and Marghoob Mohiyuddin
      As multiprocessors scale to unprecedented numbers of cores in order to sustain performance growth, it is vital that the gains in speed not come with increasingly high energy consumption. Recent advances in 3D Integration (3DI) CMOS technology have made possible hybrid photonic networks-on-chip (NoCs), which have the potential to deliver high performance while consuming much less power than an equivalent electrical network. However, it remains to be seen whether the benefits of hybrid NoCs will carry over to real applications. Our work is the first attempt at a comparison of hybrid NoCs with electrical networks using both synthetic benchmarks and real scientific applications. We describe analytical models for the two networks as well as insights from simulation studies. Results show that hybrid NoCs outperform electrical NoCs in both performance and energy consumption, as long as the communications are sufficiently large to amortize the increased latency costs. Lastly, this work demonstrates the importance of finding good process-to-processor mappings in order to obtain high performance while reducing energy consumption. Overall, the results illustrate the potential benefits of hybrid photonic networks for future manycore chips.
      Supporting Documentation: Final Report (pdf) Slides (ppt)
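      A toy illustration of the amortization argument in the abstract above: if the hybrid network pays a higher fixed cost per message (for example, to set up a photonic path) but moves bytes faster once the path is established, there is a break-even message size above which it wins. The linear latency model and every parameter value below are invented for illustration; they are not the analytical models or numbers from this project.

```c
/* Toy break-even model: per-message overhead + size / bandwidth.
 * All parameter values are invented for illustration only. */
#include <stdio.h>

int main(void) {
    /* Hypothetical electrical link: low setup cost, modest bandwidth. */
    double e_overhead_ns = 10.0,  e_gbps = 16.0;
    /* Hypothetical hybrid photonic link: higher setup cost, higher bandwidth. */
    double p_overhead_ns = 100.0, p_gbps = 128.0;

    for (int bytes = 64; bytes <= 65536; bytes *= 2) {
        /* bits / (Gb/s) gives nanoseconds directly, since 1 Gb/s = 1 bit/ns */
        double e_ns = e_overhead_ns + 8.0 * bytes / e_gbps;
        double p_ns = p_overhead_ns + 8.0 * bytes / p_gbps;
        printf("%6d B  electrical %8.1f ns  photonic %8.1f ns  -> %s wins\n",
               bytes, e_ns, p_ns, p_ns < e_ns ? "photonic" : "electrical");
    }
    return 0;
}
```

With these made-up numbers, small messages favor the electrical network and messages beyond a few hundred bytes favor the photonic one, which is the qualitative shape of the trade-off the abstract describes.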
      2:   L2 to Off-Chip Memory Interconnects for CMPs
      Allen Lee and Daniel Killebrew
      Chip multiprocessors (CMPs) are at the heart of the oncoming sea change in computing. Placing many simple cores on a single die is predicted to be the answer to the power and frequency walls that the dominant design trends of the past couple of decades have hit. CMPs will make new workloads possible: workloads that place a greatly increased strain on off-chip memory bandwidth. Some of these have working sets that simply cannot fit inside a conventional cache. Most of these have bandwidth requirements that scale exponentially with the number of cores, as the performance of these algorithms scales linearly with the number of cores. Providing sufficient off-chip bandwidth is an engineering challenge that must be met if the full potential of CMPs is to be harnessed.

      We propose to examine several modern multicore chips, analyzing design trends in the off-chip memory interconnect. We will provide a framework for analysis and evaluation of current methodologies, and then present an alternative to the current interconnect solutions. Our proposal takes advantage of previous work done on network interconnects, leveraging key ideas formulated in our case studies.

      Supporting Documentation: Final Report (pdf) Slides (ppt)
      3:   Multicore Memory Protection With Hardware Labels
      Sarah Bird and David McGrogan
      We plan to create an alternative to Mondriaan memory protection, which is a single-core interprocess memory sharing system, using the labeling tenets of HiStar, an information flow-controlling security OS, in a multicore architecture. We will be offloading some aspects of the HiStar OS to hardware in order to increase speed and capacity.
      Supporting Documentation: Final Report (pdf) Slides (ppt)
      4:   SHIFT+M Software Hardware Information Flow Tracking on Multi-core
      Colleen Lewis and Cynthia Sturton
      We simulated message passing information flow tracking in Simics, a multicore full system simulator. Our implementation provides both protection from unauthorized communication and a bound on the impact an adversarial application can have on the system. We present the performance results of our micro-benchmarks, which simulate the expected behavior of a web server and a number of potential adversarial workloads.
      Supporting Documentation: Final Report (pdf) Slides (ppt)
      5:   Code generation from PN models to Parallel code
      Isaac Liu and Jia Zou
      A process network is a distributed model of computation (MoC) in which a group of processing units is connected by communication channels to form a network of processes. Nodes can be scheduled statically (at compile time) onto single or parallel programmable processors, so the run-time overhead usually associated with dataflow evaporates. The process network model of computation has properties that we believe make it easy to parallelize. Thus, a main focus of our project is to analyze how to effectively parallelize a process network model.

      In the Ptolemy project, a code generation framework has been developed for process networks. Ptolemy is a graphical design interface similar to Simulink or LabVIEW, where processes are expressed as actors, and the firing of actors depends on the model of computation designated by the user.

       

      Currently, the code generation framework allows any model expressed using the synchronous dataflow MoC to be generated into C code. However, this framework assumes that the target machine has only a single processor, an assumption that is becoming increasingly untrue. We will use this framework for testing and modify it to produce parallel code for shared memory and/or message passing architectures (OpenMP or MPI). We will evaluate the performance of the different architectures and compare them with the sequential single-core implementation to see the speedup and performance gain. We have accounts on machines at the National Energy Research Scientific Computing Center (NERSC), which will provide the execution platform for our generated parallel code. We will use timers to measure the run time and compare the performance.

      Supporting Documentation: Final Report (pdf) Slides (ppt)
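      As a rough illustration of what "generating parallel code from a process network" can mean in the abstract above, the sketch below hand-writes what a generator might emit for a trivial two-actor network: a producer and a consumer connected by one bounded FIFO channel, mapped onto two threads in shared memory. Everything here (channel depth, actor bodies, names) is hypothetical and is not output of the Ptolemy code generation framework.

```c
/* Hypothetical hand-written version of code a PN-to-threads generator might
 * emit for a two-actor network: source -> sink over one bounded channel.
 * Compile with -pthread. */
#include <pthread.h>
#include <stdio.h>

#define CHAN_CAP 16
#define N_TOKENS 100

typedef struct {                 /* bounded FIFO channel between two actors */
    int buf[CHAN_CAP];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
} channel_t;

static channel_t ch = {
    .lock = PTHREAD_MUTEX_INITIALIZER,
    .not_full = PTHREAD_COND_INITIALIZER,
    .not_empty = PTHREAD_COND_INITIALIZER
};

static void chan_put(channel_t *c, int v) {          /* blocking write */
    pthread_mutex_lock(&c->lock);
    while (c->count == CHAN_CAP) pthread_cond_wait(&c->not_full, &c->lock);
    c->buf[c->tail] = v; c->tail = (c->tail + 1) % CHAN_CAP; c->count++;
    pthread_cond_signal(&c->not_empty);
    pthread_mutex_unlock(&c->lock);
}

static int chan_get(channel_t *c) {                  /* blocking read */
    pthread_mutex_lock(&c->lock);
    while (c->count == 0) pthread_cond_wait(&c->not_empty, &c->lock);
    int v = c->buf[c->head]; c->head = (c->head + 1) % CHAN_CAP; c->count--;
    pthread_cond_signal(&c->not_full);
    pthread_mutex_unlock(&c->lock);
    return v;
}

static void *source_actor(void *arg) {               /* produces N_TOKENS tokens */
    (void)arg;
    for (int i = 0; i < N_TOKENS; i++) chan_put(&ch, i * i);
    return NULL;
}

static void *sink_actor(void *arg) {                 /* consumes and sums tokens */
    (void)arg;
    long sum = 0;
    for (int i = 0; i < N_TOKENS; i++) sum += chan_get(&ch);
    printf("sink received %d tokens, sum = %ld\n", N_TOKENS, sum);
    return NULL;
}

int main(void) {
    pthread_t src, snk;
    pthread_create(&src, NULL, source_actor, NULL);
    pthread_create(&snk, NULL, sink_actor, NULL);
    pthread_join(src, NULL);
    pthread_join(snk, NULL);
    return 0;
}
```

The blocking channel operations are what give process networks their determinism; a distributed-memory (MPI) target would replace chan_put/chan_get with message sends and receives while keeping the actor structure.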
      6:   Using vector capabilities of the GPUs to accelerate FFT
      Vasily Volkov and Brian Kazian
      In this work we present a novel implementation of the FFT on the GeForce 8800GTX that achieves 144 Gflop/s, nearly 3x faster than the best rate achieved in the current vendor's numerical libraries. This performance is achieved by exploiting the Cooley-Tukey framework to make use of the hardware capabilities, such as the massive vector register files and small on-chip local storage. We also consider the performance of the FFT on a few other platforms.
      Supporting Documentation: Final Report (pdf) Slides (ppt)
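      For context on how an FFT rate like "144 Gflop/s" is typically computed: the customary convention counts 5 N log2(N) floating-point operations for a complex FFT of length N, and the rate is that count divided by the measured time. The snippet below only does that arithmetic; the problem size, batch count, and elapsed time are placeholders, and this is the common convention rather than a statement of exactly how this project counted its flops.

```c
/* Conventional FFT flop-rate arithmetic: 5 * N * log2(N) ops per complex FFT.
 * The N, batch, and elapsed-time values are placeholders, not measurements. */
#include <stdio.h>
#include <math.h>

int main(void) {
    double n = 1024.0;        /* FFT length (placeholder) */
    double batch = 16384.0;   /* number of independent FFTs (placeholder) */
    double seconds = 0.0058;  /* measured wall-clock time (placeholder) */

    double flops = 5.0 * n * log2(n) * batch;
    printf("%.1f Gflop/s\n", flops / seconds / 1e9);
    return 0;
}
```

With these made-up numbers the formula happens to land near the quoted 144 Gflop/s; the project's actual problem sizes and timings may differ.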
      7:   Accelerating Machine Learning Applications on Graphics Processors
      Narayanan Sundaram
      The Scientific Computing community has successfully used parallel programs to get better performance. But the recent shift to multicore and manycore architectures is forcing other application developers to redesign their applications for parallelism. Machine Learning is one area that can benefit from this shift. 

      Graphics Processing Units (GPUs) have emerged as a new class of massively parallel and high performance architectures. They are, however, marred by a difficult programming model and uncommon architectural choices (relative to general purpose CPUs). NVIDIA’s CUDA is a programming model that seeks to make programming GPUs simpler.

       

      I have considered Support Vector Machine training and classification problems as case studies to demonstrate how parallelism can drive the choice of algorithms and lead to better implementations of existing algorithms. Architectural features of GPUs that pushed us towards our choice of algorithms and implementations are discussed. Significant speedups have been achieved on these algorithms using GPUs.

      Supporting Documentation: Final Report (pdf) Slides (ppt)
      8:   Address Translation for Manycore
      Scott Beamer and Henry Cook
      Multiprocessor systems present a challenge for computer architects because they are by their nature designed to concurrently update machine state. This leads to a fundamental struggle throughout the memory hierarchy between replicating state and providing coherence for all the replicas, versus simply synchronizing on a single shared copy. One type of stored state critical to performance is the physical address translation and protection information (TLB) for a location in virtual memory. In a large multiprocessor system, if translation information for a shared memory location is not replicated, the translation information itself may become a memory hot spot. However, if multiple copies of translation information are to exist, we must ensure that they are kept consistent with each other and with the shared page table structure. Achieving this correctness in an efficient way in the context of ParLab’s scalable, manycore architecture, with both dynamic hardware partitioning and cheap crossing of protection domains, presents a significant challenge. We evaluate the performance and costs of multiple schemes with the aid of software simulation. We use Simics to simulate a system with up to 128 cores, and run some of the PARSEC benchmarks as real parallel workloads. Since TLB information is invalidated rarely, we found that the schemes with a fast common case performed much better and had similar performance to one another. Choosing between these better schemes will probably be done on a cost basis, and especially an energy basis.
      Supporting Documentation: Final Report (pdf) Slides (ppt)

       

      Report and slide files (directory listing from the course site):

      Name | Last modified | Size
      project1_report.pdf | 19-May-2008 22:43 | 800K
      project1_talk.ppt | 19-May-2008 22:43 | 2.8M
      project2_report.pdf | 19-May-2008 23:59 | 1.6M
      project2_talk.ppt | 19-May-2008 23:59 | 889K
      project3_report.pdf | 20-May-2008 00:32 | 443K
      project3_report_ver2..> | 20-May-2008 00:37 | 444K
      project3_talk.ppt | 20-May-2008 00:29 | 1.2M
      project3_talk_ver2.ppt | 20-May-2008 00:32 | 1.2M
      project4_report.pdf | 19-May-2008 22:16 | 542K
      project4_talk.ppt | 19-May-2008 22:16 | 599K
      project5_report.pdf | 20-May-2008 00:24 | 639K
      project5_talk.ppt | 20-May-2008 00:24 | 2.5M
      project6_report.pdf | 19-May-2008 23:55 | 1.3M
      project6_talk.ppt | 19-May-2008 20:10 | 434K
      project7_report.pdf | 19-May-2008 23:20 | 295K
      project7_report_ver2..> | 21-May-2008 22:56 | 248K
      project7_talk.ppt | 19-May-2008 21:16 | 2.1M
      project7_talk_ver2.ppt | 19-May-2008 23:20 | 2.1M
      project8_report.pdf | 20-May-2008 00:03 | 612K
      project8_report_ver2..> | 20-May-2008 00:07 | 613K
      project8_report_ver3..> | 20-May-2008 14:05 | 612K
      project8_talk.ppt | 20-May-2008 00:07 | 2.6M
      project8_talk_ver2.ppt | 20-May-2008 00:07 | 2.6M
      project8_talk_ver3.ppt | 20-May-2008 00:07 | 2.6M

      Source:

      http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/index.html

      http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/
