‎"Behind every stack of books there is a flood of knowledge."

Embedded Computer Architecture



When looking at future embedded systems and their design, especially (but not exclusively) in the multi-media domain, we observe several problems:

  • high performance (100 GOPS and far beyond) has to be combined with low power (many systems are mobile);
  • time-to-market (the time to get your design done) constantly shrinks;
  • most embedded processing systems have to be extremely low cost;
  • applications show more dynamic behavior (resulting in greatly varying quality and performance requirements);
  • implementers increasingly require flexible and programmable solutions;
  • there is a huge latency gap between processors and memories; and
  • design productivity does not keep pace with the increasing design complexity.

In order to solve these problems we foresee the use of programmable multi-processor platforms with an advanced memory hierarchy, together with an advanced design trajectory. These platforms may contain different processors, ranging from general-purpose processors to processors highly tuned for a specific application or application domain. This course treats several processor architectures, shows how to program and generate (compile) code for them, and compares their efficiency in terms of cost, power, and performance. Furthermore, the tuning of processor architectures is treated.

Several advanced Multi-Processor Platforms, combining discussed processors, are treated. A set of lab exercises complements the course.

This course aims at getting an understanding of the processor architectures which will be used in future multi-processor platforms, including their memory hierarchy, especially for the embedded domain. Treated processors range from general purpose to highly optimized ones. Tradeoffs will be made between performance, flexibility, programmability, energy consumption and cost. It will be shown how to tune processors in various ways.

Furthermore this course looks into the required design trajectory, concentrating on code generation, scheduling, and on efficient data management (exploiting the advanced memory hierarchy) for high performance and low power. The student will learn how to apply a methodology for a step-wise (source code) transformation and mapping trajectory, going from an initial specification to an efficient and highly tuned implementation on a particular platform. The final implementation can be an order of magnitude more efficient in terms of cost, power, and performance.


In this course we treat different processor architectures: DSPs (digital signal processors), VLIWs (very long instruction word architectures, including Transport Triggered Architectures), ASIPs (application-specific instruction-set processors), and highly tuned, weakly programmable processors. In all cases it is shown how to program these architectures. Code generation techniques, especially for VLIWs, are treated, including methods to optimize code at source or assembly level. Furthermore, the design of advanced data and instruction memory hierarchies is detailed. A methodology is discussed for the efficient use of the data memory hierarchy.
Most of the topics will be supplemented by hands-on exercises.
For a preliminary schedule see: schedule.


The lecture slides will be made available during the course; see also below.
Papers and other reading material 

  • Learn Chapter 2 on Computer Architecture Trends
    from “Microprocessor Architectures: From VLIW to TTA” by Henk Corporaal, John Wiley, 1998.
  • Related to Data Memory Management:
    • A paper about data reuse: “Formalized methodology for data reuse exploration in hierarchical memory mapping”,
      J.Ph. Diguet et al.
    • Code transformations: “Code transformations for data transfer and storage exploration preprocessing in multimedia processors”,
      Francky Catthoor, Nikil D. Dutt, Koen Danckaert and Sven Wuytack,
      IEEE Design and Test of Computers, May-June 2001
    • Data storage components: “Random-access data storage components in customized architectures”,
      Lode Nachtergaele, Francky Catthoor and Chidamber Kulkarni,
      IEEE Design and Test of Computers, May-June 2001
    • Data optimizations: “Data memory organization and optimizations in application-specific systems”,
      P.R. Panda et al.,
      IEEE Design and Test of Computers, May-June 2001

Slides (per topic; see also the course description)

** Slides as far as available; they will be updated regularly during the course.


Student presentation guidelines

As part of this lecture you have to study a hot topic related to this course, and make a short slide presentation about this topic.
The slides have to be presented during the oral exam.

Guidelines are as follows:

  • Choose one hot topic which interests you and which is highly related to this course.
  • Select one technical (in depth) research paper from the web, based on this topic.
  • The paper should have sufficient technical depth; i.e. it should clearly explain all the details of the proposed method or solution. So do not, e.g., choose company white papers or business papers. Also check whether the paper is from a well-regarded journal or conference, such as IEEE or ACM conferences and journals; e.g., have a look at the following conferences:
  • A larger list can be found here.
  • The paper should have been published in 2010 or later (try to choose a very recent paper).
  • You should make a PowerPoint presentation on your topic; max. 10 minutes per presentation (so about 10 slides): e.g. a few slides introducing the problem, then the approach and results of the paper, and finally conclusions and suggestions from your side on this topic. Add and use clear pictures to explain the approach.
  • The presentation should contain at least the following:
    • Summary of the paper contribution (including technical details)
    • Your evaluation of the paper and topic
      • strong points
      • weak points
      • applicability of proposed methodology / solution
      • indicate new / future directions of research
  • In order to evaluate the paper you may have to read related material on the same topic.
  • Your presentation will be evaluated by us. This evaluation will be taken into account for the final grading.

Hands-on lab work

To become a very good embedded computer architect you have to practice a lot. Therefore, as part of this course, we have put a lot of effort into preparing three very interesting lab assignments. For each lab there is a website with all the required documentation and preparation material. These lab assignments can be done on your own laptop, with, for certain parts, remote access to our server systems.
For every lab you have to write a report, which has to be sent to one of the course assistants.

Hands-on 1: Processor Design Space Exploration, based on the Silicon Hive Architecture

In the past we had several architecture design space exploration (DSE) labs: one using the Transport Triggered Architecture (TTA) framework, one using the Imagine processor, and one using the AR|T tools. This year we base the first lab on the reconfigurable processor from Silicon Hive.
For this exercise:

  • Check the link
    You’ll find several pdf files. Have a look at all of them first.
  • Then check the start-up guide in detail.
  • Thereafter start with the assignment. It also describes the deliverables you have to send in (as a small report). The report should be sent to Akash Kumar. He can also help you with questions.

Hands-on 2: Platform Programming

In this lab you are asked to program a (multi-)processor platform. In the past we developed various labs:

  • using the Wica platform (with a 320-PE SIMD processor, the Xetal); see (c) below.
  • using the CELL platform (the CELL is used e.g. in the PlayStation 3; besides a PowerPC RISC processor it has up to 8 other processors for the high-performance kernels; these processors also exploit sub-word SIMD); see (b) below.

This year, 2012, we will take an x86 plus a graphics processing unit (GPU) as the platform.

Programming Graphics Processing Units

Graphics processing units (GPUs) can contain up to hundreds of Processing Engines (PEs). They achieve performance levels of hundreds of GFLOPS (10^9 floating point operations per second). In the past GPUs were very dedicated, not generally programmable, and could only be used to speed up graphics processing. Today they become more and more general purpose. The latest GPUs of ATI and NVIDIA can be programmed in C and OpenCL. For this lab we will use NVIDIA GPUs together with the CUDA (based on C) programming environment. Start with setting up the CUDA environment, studying the available learning materials, and running the example programs.
We added one extensive example program, about matrix multiplication, which demonstrates various GPU programming optimizations.
You will see that getting something running using CUDA is not so difficult, but getting it running efficiently will take quite some effort.
After studying the example and learning material you have to carry out your own assignment and hand in a small report. The purpose is to use your GPU as efficiently as possible.
All the details about this assignment can be found on the GPU-assignment site.
The assignment is made by Dongrui She and Zhenyu Ye. For questions contact d.she _at_
When finished, send in a small report about your result and various applied optimizations to Dongrui She.

Hands-on 3: Exploiting the data memory hierarchy for high performance and low power

In this exercise you are asked to optimize a C algorithm by using the discussed data management techniques. This should result in an implementation with much improved memory behavior, which improves both performance and energy consumption. In this exercise we mainly concentrate on reducing energy consumption. You need to download the following, and follow the instructions.

The 2011 assignment can be found here. The algorithm is based on Harris corner detection.
You will start with a default platform containing 2 levels of cache. First calculate the results of your code optimizations for this platform. Thereafter you are free to tune the platform for the given application, e.g. by changing the caches, or even by using scratchpad memory (SRAM) instead of, or in addition to, the caches.
Success!


The examination will be oral, covering the treated course theory, the lab report(s), and the studied articles.
Likely week: 4th week of January 2013.
Grading depends on your results on theory, lab exercises and your presentation.

Related material and other links

Interesting processor architectures:

  • The Cell architecture, made by Sony, IBM and Toshiba, and used e.g. in the PlayStation 3
  • TRIPS architecture, combining several types of parallelism
  • The tile based RAW architecture from MIT
  • Imagine, a hybrid SIMD – VLIW architecture from Stanford
  • Merrimac, the successor of the Imagine
  • ChipCon, check e.g. their system-on-chip: CC1110
  • MAXQ from MAXIM, Dallas; a Transport Triggered Architecture
  • Aethereal, a Network-on-Chip from Philips



  • Epicurus Research Program of our Electronic Systems group
  • PARSE: Parallel Architecture Research Eindhoven
    On this website you can also find:

  • NEST consortium: Netherlands Streaming project
    In this project the major Dutch expert groups on data flow processing and architectures are researching future streaming systems and corresponding design flows. Key issues are SoC, NoC, predictable and composable design.
    Tools will be made available.
  • EVA: Embedded Vision Architectures. EVA is the successor of the SmartCam project.
  • Book on Transport Triggered Architectures
    Mirroring the programming paradigm from operation triggering to transport triggering
    The VLIW solution for embedded systems.
  • Dataflow Benchmark Suite
    This project aims to provide a uniform set of dataflow models for streaming applications.
  • Computing fabric for high performance applications
    European Catrene Project
  • Motion analysis with contactless camera sensing for professional polysomnography and baby sleep-watching at home
    Point-One project

Finished Projects

  • PreMaDoNA project
    Predictable Matching of Demands on Networked Architectures
    About predictable design of Networks-on-Chip (NoC) based systems
  • SmartCam Project
    Multi-Processor Architecture and Application Design Environment for Smart Cameras
  • Soft Reliability Project
    Control soft reliability problems in technical, end-user, and business process terms.
  • MOVE project home page (TU Delft)
    TCE site (TTA-based Codesign Environment) of Tampere University
    Research on the semi-automatic design of embedded processors and systems, based on Transport Triggered Architectures


  • An overview of my current and previous Courses
  • Interested in a mini Network-on-Chip targeted towards FPGAs? Check out our mMIPS Network page.
    We used this in several courses on computer architecture.
  • PhD students at TU/e
  • Master students at TU/e (* to be filled in *)
  • Previous PhD Students at TUDelft
  • Previous Master Students at TUDelft
  • Literature suggestions on computer architecture and embedded systems. Overview of interesting conferences and journals.


Software and Tools from our group

  • Is your application dynamism getting out of control? Check our Wiki Scenario page.
  • Extension of SDF: SADF (Scenario Aware Data Flow), combining the analytic power of (C)SDF with dynamism
  • MAMPS Multi-Application Multi-Processor Synthesis
  • SDF3: SDF For Free, a set of freely available SDF (Synchronous Data Flow) and CSDF (Cyclo-Static Data Flow) analysis, transformation, graph generation and visualization tools
  • POOSL, System-Level Design with the SHE Methodology.
    POOSL (Parallel Object-Oriented Specification Language) is used to describe, analyse, and synthesize hard and soft real-time systems. Synthesis is correct modulo a small timing deviation, i.e. the implemented system has the same timing properties as the high-level description in POOSL.
  • PREMADONA tools: Generate simulation models from XML specifications of NoC-based MPSOC systems.
  • Always wanted to know what users do with your products? Check our tools within the soft reliability project.
