# International Workshop on Design Principles for Next Generation Embedded Computing Systems

#### Amit Kumar Singh

**School of Computer Science and Electronic Engineering** 

University of Essex

United Kingdom

W: http://aksingh.co.uk/

E: a.k.singh@essex.ac.uk


# Multi-core Based Next Generation Embedded Systems

#### **Multi-core Systems Revolution**

#### Single Core Performance:

- Steady until 2002
- Performance has fallen off Moore's Law
  - Maximum operational frequency has hit the roof



Parallel processing is the only choice

#### **Evolution in number of cores**




#### How many cores in a modern chip?




#### **Multi-core based Systems**







#### **Multi-core wearable devices**

| Product<br>(Announced)                                | SoC                           | CPU<br>(#core)                    | Freq<br>(MHz) | Memory                 | Typical<br>CPU<br>Power<br>(mW) |
|-------------------------------------------------------|-------------------------------|-----------------------------------|---------------|------------------------|---------------------------------|
| Google Glass<br>(Apr, 2012)                           | TI<br>OMAP4430                | ARM<br>Cortex-A9<br>(dual-core)   | 1000          | 2GB RAM<br>16GB Flash  | 350                             |
| Vuzix M100<br>(Jan, 2013)                             | TI<br>OMAP4460                | ARM<br>Cortex-A9<br>(dual-core)   | 1200          | 1GB RAM<br>4GB Flash   | 400                             |
| Qualcomm<br>toq<br>(Oct, 2013)                        | ST<br>STM32                   | ARM<br>Cortex-M3<br>(single-core) | 200           | 16MB SRAM<br>2GB Flash | 10                              |
| Optinvent<br>ORA-1<br>(Aug, 2014)                     | Not available                 | ARM Cortex<br>(dual-core)         | 1200          | 4GB Flash              | Not available                   |
| Sony<br>Smartwatch 3<br>(Sep, 2014)                   | Qualcomm<br>Snapdragon<br>400 | ARM<br>Cortex-A7<br>(quad-core)   | 1200          | 512MB RAM<br>4GB Flash | 450                             |
| LG<br>G watch R<br>(Sep, 2014)                        | Qualcomm<br>Snapdragon<br>400 | ARM<br>Cortex-A7<br>(quad-core)   | 1200          | 512MB RAM<br>4GB Flash | 450                             |
| Samsung<br>Gear S2 3G<br>(Aug, 2015)                  | Not available                 | ARM<br>Cortex-A7<br>(dual-core)   | 1000          | 512MB RAM<br>4GB Flash | Not available                   |
| Motorola<br>Moto 360 2nd<br>generation<br>(Sep, 2015) | Qualcomm<br>Snapdragon<br>400 | ARM<br>Cortex-A7<br>(quad-core)   | 1200          | 512MB RAM<br>4GB Flash | 450                             |

Source: Tan et al., "LOCUS: Low-Power Customizable Many-Core Architecture for Wearables"

# **Multi-core Platform Examples**




#### ODROID XU3 – 8 core big.LITTLE CPU + 6 cores GPU



#### Parallella - Dual core CPU + FPGA + 16 cores NoC



#### Heterogeneous Multi-core Usage



#### Exynos 5422 SoC




# What if there is no heterogeneity in terms of processing capability of cores?

There will be no impact

It can impact performance

#### **Heterogeneity Exploitation**



#### **Applications Execution on Multi-cores**




- Applications Representation
  - Task parallelism





[Figure: example task graph with tasks labelled from T0 up to T6, annotated with bracketed numeric labels such as [2], [3], [5] and [7].]


# If an application cannot have parallel representation, how to best improve its performance?

Performance cannot be improved **A** 

By assigning to a suitable processing element **B** 

By designing a dedicated circuit for it **C** 

By running on a GPU **D** 

#### **Important metrics for Embedded Systems**

• Performance



• Energy



• Temperature









• Security

#### Can you think of any other metric?

# **Optimisation Knobs/Controls**

#### **Applications Mapping on Multi-cores**



- The mapping process defines the assignment and ordering of tasks and their communications onto the platform resources, with respect to optimisation criteria such as energy consumption and compute performance.
- It answers three questions: where each task runs, when it runs, and why (the optimisation objective).
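
To make the where/when/why view concrete, here is a minimal sketch that scores one candidate mapping. The core names, relative speeds, power figures and task workloads are hypothetical values chosen for illustration; they do not come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Core:
    name: str
    speed: float  # relative speed (hypothetical units)
    power: float  # active power (hypothetical units)


def evaluate(mapping, order, work, deps, cores):
    """Return (makespan, energy) of a mapping.

    mapping: task -> core name (the "where")
    order:   tasks in a valid topological order (the "when")
    The returned pair is what an objective (the "why") would compare.
    """
    core_free = {name: 0.0 for name in cores}
    finish = {}
    energy = 0.0
    for task in order:
        core = cores[mapping[task]]
        ready = max((finish[d] for d in deps.get(task, ())), default=0.0)
        start = max(ready, core_free[core.name])
        duration = work[task] / core.speed
        finish[task] = start + duration
        core_free[core.name] = finish[task]
        energy += duration * core.power
    return max(finish.values()), energy


CORES = {"big": Core("big", 2.0, 4.0), "little": Core("little", 1.0, 1.0)}
WORK = {"T0": 4.0, "T1": 6.0, "T2": 2.0}
DEPS = {"T2": ("T0", "T1")}  # T2 waits for T0 and T1
ORDER = ["T0", "T1", "T2"]

all_big = evaluate({"T0": "big", "T1": "big", "T2": "big"}, ORDER, WORK, DEPS, CORES)
mixed = evaluate({"T0": "big", "T1": "little", "T2": "big"}, ORDER, WORK, DEPS, CORES)
```

With these assumed numbers the all-big mapping is faster, while the mixed mapping uses less energy, so the preferred mapping depends on the chosen objective.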

#### **DVFS for Optimisation**

- Dynamic Voltage and Frequency Scaling (DVFS)
  - Supported in many advanced processors
    - Marvell StrongARM, Intel XScale, Transmeta Crusoe, ARM1176, and many other modern processors



P: Processor; M: Memory; NI: Network Interface

#### **DVFS for Energy Savings**
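
As a rough illustration of why DVFS saves energy, the sketch below uses the standard dynamic-power approximation P = C·V²·f and ignores static (leakage) power. The effective capacitance, the two voltage/frequency pairs and the workload size are assumptions made for this example only.

```python
def dynamic_energy(cycles, voltage, freq_hz, c_eff=1e-9):
    """Return (energy in joules, time in seconds) for a workload of `cycles`."""
    power = c_eff * voltage ** 2 * freq_hz  # dynamic power only (assumption)
    time_s = cycles / freq_hz
    return power * time_s, time_s


def pick_operating_point(cycles, deadline_s, points):
    """Pick the (voltage, freq) pair with the least energy that meets the deadline."""
    feasible = []
    for voltage, freq_hz in points:
        energy, time_s = dynamic_energy(cycles, voltage, freq_hz)
        if time_s <= deadline_s:
            feasible.append((energy, voltage, freq_hz))
    if not feasible:
        return None  # no operating point is fast enough
    _, voltage, freq_hz = min(feasible)
    return voltage, freq_hz


# Hypothetical (voltage in V, frequency in Hz) operating points.
POINTS = [(1.2, 2.0e9), (0.9, 1.0e9)]

relaxed = pick_operating_point(1e9, deadline_s=1.0, points=POINTS)
tight = pick_operating_point(1e9, deadline_s=0.6, points=POINTS)
```

Under this model the slower point takes twice as long but uses less energy because energy scales with V². It is selected whenever the deadline allows it.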



#### **Dynamic power management (DPM)**

- Shuts down processing elements (PEs) when they are inactive
  - Greedy: go to sleep as soon as processing is finished
  - Timeout: stay on, expecting a new request; after time t of idleness, go to sleep
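
The two policies can be compared with a small energy model. This is a sketch under assumed parameters: the idle power, sleep power and wake-up energy below are hypothetical, and the greedy policy is modelled as a timeout of zero.

```python
def idle_energy(gaps, timeout, p_idle=1.0, p_sleep=0.1, e_wake=2.0):
    """Energy spent across idle gaps under a timeout policy.

    gaps:    lengths of idle periods between requests
    timeout: idle time tolerated before sleeping (0 models the greedy policy)
    """
    total = 0.0
    for gap in gaps:
        if gap <= timeout:
            total += p_idle * gap  # the PE never went to sleep
        else:
            # idle until the timeout, sleep for the rest, then pay to wake up
            total += p_idle * timeout + p_sleep * (gap - timeout) + e_wake
    return total


short_gaps = [1.0, 1.0, 1.0]
long_gaps = [10.0, 10.0]

greedy_short = idle_energy(short_gaps, timeout=0.0)
timeout_short = idle_energy(short_gaps, timeout=2.0)
greedy_long = idle_energy(long_gaps, timeout=0.0)
timeout_long = idle_energy(long_gaps, timeout=2.0)
```

With these assumed values, the timeout policy wins when idle gaps are short (the wake-up cost dominates), and the greedy policy wins when gaps are long, so neither is better in every workload.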


#### Which approach do you think is better?

Greedy

Timeout

#### **Application Partitioning for Optimisation**

#### Partitioning for Single (data-parallel) Application



**OpenCL** provides this opportunity

# Any other ways to optimise any metrics (e.g. performance, energy, temperature, reliability and security)?

#### **Example Application Domains for Optimisations with Earlier Principles**

#### **Multimedia Applications**



• Load and adapt the application tasks on system resources at run-time

#### **Automotive Applications**



• Load active tasks during various modes without any deadline violation

#### **Server and Cloud Computing Applications**



# Can you think of any other application domains for optimisations with the earlier principles?




# Can we have any other option apart from design-time and run-time optimisations?

Yes

No

## **Design-time Optimisations**


- The optimisation is performed with a global and thorough view of system resources
- Normally, a better quality of result is achieved than with run-time optimisations
- A large body of literature falls under the design-time optimisation category

## **Design-time Optimisation Principles**

- Well-established search approaches have been used extensively
  - Simulated Annealing (SA)
  - Tabu Search
  - Integer Linear Programming (ILP)
  - Genetic Algorithm (GA)
  - ...
- Pruning strategies have been incorporated to reduce the search space and hence the computational cost.
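
As one illustration of these search approaches, here is a minimal simulated-annealing sketch for task-to-core mapping. The cost function (the load of the busiest core), the cooling schedule and the workload values are assumptions for the example, not the formulation used in any particular paper.

```python
import math
import random


def max_load(mapping, work, n_cores):
    """Cost of a mapping: the load of the busiest core."""
    loads = [0.0] * n_cores
    for task, core in enumerate(mapping):
        loads[core] += work[task]
    return max(loads)


def anneal(work, n_cores, steps=5000, t_start=10.0, t_end=0.01, seed=0):
    rng = random.Random(seed)
    current = [0] * len(work)  # start with every task on core 0
    cur_cost = max_load(current, work, n_cores)
    best, best_cost = list(current), cur_cost
    for step in range(steps):
        # geometric cooling from t_start down to t_end
        temp = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        task = rng.randrange(len(work))
        old_core = current[task]
        current[task] = rng.randrange(n_cores)  # propose moving one task
        new_cost = max_load(current, work, n_cores)
        delta = new_cost - cur_cost
        # always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cur_cost = new_cost
            if cur_cost < best_cost:
                best, best_cost = list(current), cur_cost
        else:
            current[task] = old_core  # reject the move
    return best, best_cost


WORK = [5, 3, 8, 2, 7, 4, 6, 1]  # hypothetical task workloads
best_mapping, best_cost = anneal(WORK, n_cores=4)
```

Accepting occasional worse moves at high temperature is what lets the search escape local minima; a pure greedy descent would use the same loop with the probabilistic branch removed.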


# Are there any disadvantages to pruning the search (design) space?

No

Yes

## **Design-time Exhaustive DSE**



#### Pruning-based DSE



## **DSE Extension to Heterogeneous Tiles**

**DSE** for an application modeled with 3 tasks (actors): a1, a2 and a3



General Purpose Processor (GPP)



Digital Signal Processor (DSP)



Accelerator (ACC)

## **Pruning Advantage**

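| Number of tasks | Mappings: homogeneous tiles, exhaustive | Mappings: homogeneous tiles, pruning | Mappings: heterogeneous (2 tile types), exhaustive | Mappings: heterogeneous (2 tile types), pruning |
|-----------------|-----------------------------------------|--------------------------------------|----------------------------------------------------|-------------------------------------------------|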
| 1               | 1                  | 1       | 2                               | 2       |
| 2               | 2                  | 2       | 6                               | 6       |
| 3               | 5                  | 5       | 22                              | 15      |
| 4               | 15                 | 11      | 94                              | 31      |
| 5               | 52                 | 21      | 454                             | 56      |
| 6               | 203                | 36      | 2,430                           | 92      |
| 7               | 877                | 57      | 14,214                          | 141     |
| 8               | 4,140              | 85      | 89,918                          | 205     |
| 9               | 21,147             | 121     | 610,182                         | 286     |
| 10              | 115,975            | 166     | 4,412,798                       | 386     |
| 14              | 190,899,322        | 456     | 20,732,504,062                  | 1,016   |

- Exhaustive evaluation takes more than 24 hours beyond 10 tasks
- Pruning-based design space exploration significantly accelerates the process
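
The exhaustive columns of the table grow very quickly. The slides do not state the counting rule, but the numbers are consistent with counting every way to group the tasks onto tiles (set partitions) and then choosing one tile type per group. The sketch below follows that inferred rule; it is an observation about the numbers, not necessarily how the original tool enumerates mappings.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of ways to split n labelled tasks into k non-empty groups."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    # the n-th task either joins one of k existing groups or opens a new one
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def exhaustive_mappings(n_tasks, n_tile_types=1):
    """Count groupings of tasks onto tiles, with one tile type chosen per group."""
    return sum(
        stirling2(n_tasks, k) * n_tile_types ** k
        for k in range(1, n_tasks + 1)
    )


homogeneous = [exhaustive_mappings(n) for n in range(1, 11)]
heterogeneous = [exhaustive_mappings(n, 2) for n in range(1, 11)]
```

For one tile type this gives the Bell numbers (1, 2, 5, 15, 52, ...), which match the homogeneous exhaustive column, and for two tile types it matches the heterogeneous exhaustive column, including the 14-task row.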

#### **Communication-aware DSE**



#### **Only connected tasks are mapped onto the same core**
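
A minimal sketch of this pruning rule: enumerate every grouping of tasks onto cores and keep only those in which each group forms a connected piece of the task graph. The four-task chain used here is a hypothetical example, and homogeneous cores are assumed.

```python
def partitions(items):
    """Yield every way to split `items` into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # place `first` into each existing group in turn
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # or give `first` a group of its own
        yield [[first]] + part


def is_connected(group, edges):
    """True if the tasks in `group` are connected using only edges inside the group."""
    members = set(group)
    start = group[0]
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for a, b in edges:
            for u, v in ((a, b), (b, a)):  # treat edges as undirected
                if u == node and v in members and v not in seen:
                    seen.add(v)
                    stack.append(v)
    return seen == members


def communication_aware(tasks, edges):
    """Keep only groupings where every core holds a connected set of tasks."""
    return [p for p in partitions(tasks)
            if all(is_connected(g, edges) for g in p)]


TASKS = ["T0", "T1", "T2", "T3"]
EDGES = [("T0", "T1"), ("T1", "T2"), ("T2", "T3")]  # hypothetical chain

all_groupings = list(partitions(TASKS))
kept = communication_aware(TASKS, EDGES)
```

For this chain, 8 of the 15 groupings survive: groupings such as {T0, T2} on one core are discarded because T0 and T2 do not communicate directly.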

#### Facts about each design point evaluation

• The simulation time needed to evaluate the design points is the real bottleneck in DSE

# What can be done to accelerate the DSE process?

Analytical estimations can be considered.

A combined simulation and estimation can be considered.


### **Accelerating Pruning-based DSE process**

#### **Trace-based Analysis and Simulation**



Execution trace (for readability, shown for one period, with rates of 5 in place of 2376):






### **Estimation method**

The period (1/throughput) of a mapping using (p - 1) tiles is estimated from the period of the mapping using p tiles as follows:

$$period_{\beta} = period_{\alpha} + gain_{\alpha,\beta} + loss_{\alpha,\beta}$$

• *gain* and *loss* are, respectively, the increase and the decrease in the period of the mapping using p tiles when the new mapping is generated by moving actors from one tile to another.

#### **Example Demonstration**

• **Period increase:** occurs when actors that execute in parallel on the selected pair of tiles are forced to execute sequentially because they are mapped onto the same tile.



• **Period decrease:** occurs when the execution of the edge(s) between the selected pair of tiles is not in parallel with other actors and edges.

#### References

- Singh, Amit Kumar, Akash Kumar, and Thambipillai Srikanthan. "Accelerating throughput-aware runtime mapping for heterogeneous MPSoCs." *ACM Transactions on Design Automation of Electronic Systems (TODAES)* 18, no. 1 (2013): 1-29.
- Singh, Amit Kumar, Muhammad Shafique, Akash Kumar, and Jörg Henkel. "Mapping on multi/many-core systems: Survey of current and emerging trends." In *2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC)*, pp. 1-10. IEEE, 2013.
- Singh, Amit Kumar, Anup Das, and Akash Kumar. "RAPIDITAS: RAPId design-space-exploration incorporating trace-based analysis and simulation." In *2013 Euromicro Conference on Digital System Design*, pp. 836-843. IEEE, 2013.


# The design-time DSE approaches above take only the number of cores into account. What if each core supports DVFS?

Design points will remain the same.

Design points will increase exponentially.


#### **Example DSE with DVFS**
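
A small sketch of how DVFS enlarges the design space: for one fixed mapping, every core can independently run at any of its frequency levels, so the number of configurations is levels raised to the number of cores. The four frequency levels below are hypothetical.

```python
from itertools import product

# Hypothetical per-core frequency levels in MHz.
LEVELS = (200, 600, 1000, 1400)


def dvfs_design_points(n_cores, levels=LEVELS):
    """Enumerate every per-core frequency assignment for one fixed mapping."""
    return list(product(levels, repeat=n_cores))


# Number of DVFS configurations for 1 to 4 cores.
counts = [len(dvfs_design_points(n)) for n in range(1, 5)]
```

With four levels, the count goes 4, 16, 64, 256 as cores are added, and this factor multiplies the number of mappings already being explored.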



#### Genetic Algorithm (GA) with an Example (considering both Mapping and DVFS)

#### -> Next topic

# Questions?