# Performance Enhancement of NoCs using Single Cycle Deflection Routers and Adaptive Priority Schemes

Midhula K.S<sup>1</sup>, Sarath Babu<sup>1</sup>, John Jose<sup>2</sup>, and Sangeetha Jose<sup>1</sup>

<sup>1</sup> Government Engineering College Idukki, Kerala, India {midhulaks01,sarathbabu0410,sangeethajosem}@gmail.com <sup>2</sup> Indian Institute of Technology Guwahati, Assam, India johnjose@iitg.ernet.in

Abstract. It is important to design an energy efficient underlying communication framework for multicore systems. The communication framework must satisfy the requirements of NoC (Network on Chip) such as minimum latency and minimum critical path delay. Routing on multicore framework help to compute the route to which the flit wants to reach its destination. Buffered routing consumes more power and area of the chip due to the presence of in-router buffers and buffer-less routing causes more number of deflections due to unavailability of productive port on contention. Hence, design of a minimally buffered deflection router having reduced power consumption and deflection rate is critical. There are minimally buffered deflection routers, which are characterized with minimum buffering and reduced latency. Nevertheless, limitations still exist such as higher flit latency, deflection rate. In this paper, we propose a single cycle minimally buffered deflection router with a good prioritization mechanism which leads to minimum latency and reduced deflection rate than conventional minimally buffered deflection router (MinBD). This improves the quality of NoC by prioritizing aged flits which are side buffered, redirected and re-injected in the router pipeline.

**Keywords:** Network on Chip · Minimally Buffered Deflection Router · Single cycle Minimally Buffered Deflection Router.

# 1 Introduction

A processor is an electronic circuitry integrated inside a chip that resides in a computer. The processor fetches, decode, execute instructions and give back its appropriate results. Based on the number of cores inside the processor, there are two different types of processors known as, unicore processor and multicore processor. Unicore processors have single processing element which can handle single thread at a time. Whereas, multicore processors contain multiple processing elements inside a processor and can process different tasks simultaneously. The process allocation to multiple cores is done by the operating system. The

multicore processor can reduce power consumption which leads to its performance enhancement. Hence, the need for faster processors is become more and more apparent and these multicore processors are popular in the world.

Network on Chip (NoC) is the communication paradigm for the underlying communication framework of integrated circuits. This communication subsystem consists of a number of cores connected with a network of routers. Routers are used to connect the channels at junctions in the multicore processor. Each router having 'p' number of input channels and output channels. Normally, the number of channels, 'p' is five. It is important to design energy efficient router microarchitecture for communication.

There are many situations where flits from different input ports request the same output port simultaneously. It can be handled by using either buffering or deflection. A buffered router can store and forward the flits using buffers. However, in most of the cases, low injection rate application uses only less than 25% of the buffers [15]. This exposes the over provisioning of buffers in the router. This causes wastage of the die area and static power. Buffer-less deflection router save the power and die area over the buffered router. However, buffer-less deflection router increases the number of flits which get deflected at high injection rate. Hence, minimally buffered deflection router is the best choice for energy efficient router design. Minimally buffered deflection router incorporates the advantages of both buffered and buffer-less router. This consists of a minimum buffer which act as 'side buffer' to store some deflected flits in order to reduce the deflection rate and thereby also reduces the average latency.

In this paper, we propose a single cycle minimally buffered deflection router with different prioritizations concerns at different stages. Experiments on an 8x8 mesh with synthetic traffic patterns [3] show that single cycle MinBD performs better than the existing Minimally Buffered Deflection router (MinBD) in terms of average flit latency and deflection rate.

# 2 Buffer-less Deflection Router : Related Work

Traditionally routers are designed with buffers (buffered routers) to store and forward the incoming flits to the next output router. However, these routers are power hungry routers as they consume more power and reduces the performance. Buffer-less deflection routers are introduced to address the limitations of buffered routers and reduces the die area. Buffer-less routers can handle the contention in network links with two mechanisms such as, drop one mechanism and deflection mechanism [3]. Drop one mechanism has a negative impact on re-transmission of the dropped flit. There is no guarantee for the source, which is near to the contention router. Hence, the complexity increases. In the case of deflection mechanisms, some amount of network traffic is sent to another link instead of using a buffer to store the traffic. The deflected flit will reache the destination with extra link traversal [9].

The first buffer-less deflection router introduced in BLESS [8] by Moscibroda et al., with sequential port allocation for all the incoming flits. It increases the

critical path latency of flit in the router. C. Fallin et al. introduced CHIPPER [10] with parallel port allocation and golden packet prioritization in buffer-less deflection router. It did not reduce the deflection rate despite of the use of golden packet prioritization scheme. The golden packet prioritization scheme is only beneficial for a single flit with in specific period of time. There is no effective impact on ordinary flits for reducing its chances of deflection.

Minimally buffered deflection router (MinBD) is a promising solution which effectively combines the best features of both buffered and buffer-less routers. Rather than using in-router buffers, MinBD uses a simple side buffer. The side buffer is used to store only one flit among the deflected flits in a single cycle and the side buffered flit will be re-injected into the pipeline again in subsequent cycles.

The best available two-stage deflection router is DeBAR [15] by John Jose et al. have better prioritization scheme, hybrid ejection and parallel execution of independent operations. SLIDER [12] by B. Nayak et al. is also, a two-stage deflection router with smart late injection and selective flit pre-emption features.

There have been prior works on single cycle router architecture. However, their design methodologies and intentions are different. SCARAB [17] by Hayenga et al. is the buffer-less adaptive single cycle router architecture. Minimally buffered single cycle deflection router (MinBSD) [13] by John Jose et al. is the single cycle deflection router having an innovative module, which is capable of handling all the operations in a single cycle. While these architectures have structural limitations and latency problems.

# 3 Motivation



Fig. 1. MinBD pipeline architecture

MinBD [11] is the two-cycle deflection router with different modules of core injection, redirection, re-injection, 2-stage ejection etc shown in Fig. 1. Injection from the core occurs only if there is an empty port and core injection module is placed after the re-injection module. Re-injection module is used for the reinjection of side buffered flits from the side buffer. Ejection of flit from the router to the local port is handled by a two-stage ejection unit. Redirection module is used for the pre-emption of flits from input ports when the four input ports and side buffer are busy. The side buffer can re-inject the side buffered flit to the empty port. The second stage of the MinBD pipeline starts with silver flit selection module followed by permutation deflection stage. Silver flit selection is the prioritization scheme [16] used in MinBD. MinBD randomly selects a flit as silver, which is the most prioritized flit for the local router. Permutation deflection network used for the parallel allocation of output ports. This module checks all the four incoming flits for the presence of silver marking in order to give higher prioritization. We identify four performance limitations in the MinBD design that motivated us for the proposed work. We analyse these limitations and its after effects, and suggest suitable solutions.

# 3.1 The Lifetime of a flit in a router

In MinBD, an incoming flit will take two-cycles to complete its traversal within a router. Average flit latency is very much related with this time spent by flit in the router. Flit delay can be reduced by reducing the router delay and the router delay can be reduced by designing a single-cycle deflection router. The average flit latency can be reduced with single cycle deflection router by reducing the time taken by the flit for completing its operations within the router.

### 3.2 Aged flits are side buffered

The purpose of side buffer in MinBD is to store some flits which otherwise would be deflected. The selection of those flits are random and side buffering reduces the deflection rate. However, the problem is that flit from the side buffer is reinjected only if there is an empty port. Otherwise, it is side buffered for a long time. Also, there is no guarantee for getting productive port after re-injection of the buffered flit. Our experiments with uniform traffic at pre-saturation load in 8x8 mesh network employing MinBD show that 49% of aged flits are side buffered. This problem arises due to the random flit selection among those flits for side buffering.

### 3.3 Aged flits are deflected

As per the silver flit prioritization in MinBD, only one silver flit gets prioritized over other flits in the corresponding router locally. The possibility for aged flit as silver flit is 25% only. The aged flits get deflected in most of the cases. Our experiments using uniform traffic at pre-saturation load on 8x8 mesh network employing MinBD show 48% deflected flits are aged ones. This increases the average flit latency and degrades the overall performance.

# 3.4 Aged flits are pre-empted/redirected

Based on the prioritization scheme used in MinBD, silver flit gets prioritization over other flits. Consider the case when all the four input ports of the router are busy with incoming flits and none of the flits are destined to be ejected, injection or re-injection of flits from respective input queue or side buffer is not possible. However, the aged flit in the side buffer get freezes and re-injection is not possible. This situation can be handled by redirection/pre-emption of a random flit from an input channel to side buffer. Since redirected flit selected randomly, there is a chance for aged flit redirection. Our experiments using uniform traffic at pre-saturation load on 8x8 mesh network employing MinBD gives a significant number of redirected flits are aged. Fig. 2 shows that the intensity of these limitations are increases with injection rate.



Fig. 2. Statics of aged flit side buffered and aged flits deflected for varying injection rates using uniform traffic in 8x8 mesh with MinBD.

Our experimental studies on various synthetic traffics show that the metric that decides when and whether deflect a flit or side buffer a flit has a significant impact on the average flit latency and overall performance. To overcome above mentioned limitations, we suggest a modification in the design of MinBD that brings in a major improvement in performance by achieving reduction in average latency and in deflection rate. We can solve all the poroblems by changing the silver flit prioritization scheme for permutation stage, and random flit selection for side buffering and pre-emption. We propose a single cycle minimally buffered deflection router with enhanced adaptive prioritization schemes to improve performance.

# 4 Single Cycle Minimally Buffered Deflection Router Architecture

Our proposed system is an advancement of minimally buffered deflection router (MinBD). A block diagram of various stages of the router pipeline of single cycle minimally buffered deflection router is shown in Fig. 3. Single cycle minimally buffered deflection router differs from MinBD as the following ways:

- Total cycles used by the flit in a router is reduced to one in single cycle minimally buffered deflection router. Conventional MinBD works as a twocycle deflection router.
- The priority scheme used for routing is modified such a way that the flit with nearest destination among the four flits will get highest priority. This will avoids deflection of flit which is have nearest destination.
- The priority scheme used for the selection of flit for side buffering is changed from random flit selection to least aged flit selection. This scheme helps to save the aged flits from side buffering and it will reduces the latency.
- Introduces a prioritization scheme which selects flit with minimum deflection for redirection. This scheme avoids the chances of redirection of highly deflected flits to side buffer.
- The priority scheme provides high priority for redirected flits than other flits. This gives importance to the flits which are either side buffered or redirected.

Internal architecture of the proposed system as follows.



Fig. 3. Proposed Architecture

# 4.1 Double Ejection Unit

Double ejection unit is used for the ejection of flits from the router to local port. Flits stored in the pipeline are passed through the double ejection unit. This module have separate two stages of ejection. Double ejection unit can eject up to two locally destined flits in a cycle. Otherwise, a locally addressed flit would be deflected rather than ejected if only one ejection unit used in a single cycle. One of the significance of this module is, it helps to make empty ports in the router by router ejection. That empty ports are used by the core for injecting flits to the router. Double ejection unit will avoid the ejection bottleneck in the network.

#### 4.2 Injection Unit

Injection unit is used to inject flits from the injection queue. Local node can only inject flit to an empty port. This unit can push only one flit per cycle to the router pipeline.

# 4.3 Prioritization Unit

Prioritization unit is used for fixing the priority of each incoming flit based on hop-to destination value. Prioritization unit computes the priority value by extracting the destination address from each incoming flit and assigns its priority. This prioritization unit gives high priority to the flit with nearest destination address.

### 4.4 Permutation Deflection Network (PDN)

The PDN is the two-stage arbitration circuit used for the parallel allocation of output ports. Arbitration circuit has four arbitrs which check the priority and output port of each incoming flit. Comparing the priority value assigned by prioritization unit, the arbiter will allocate to productive port or non-productive port (deflected) for every flit.

#### 4.5 Buffer Ejection Unit

Buffer ejection unit is used for storing the flits to the side buffer. This module reduces the deflection rate by selecting the least aged flit for side buffering among the deflected flits. It avoids side buffering of aged flits gently.

### 4.6 Side buffer

Side buffer stores the flit and reduces the deflection rate. It only side buffers least aged flit that otherwise should have been deflected to non-productive port. In later time, the flit will be re-injected into the router pipeline, when there is an empty port.

#### 4.7 Re-injection Unit

Re-injection unit is used to re-inject the flits from side buffer. Re-injection operation only occur when there is a free port. This unit is placed prior to the injection unit and it makes the re-injected flit with the highest priority among all other flits in the arbitration.

## 4.8 Pre-emption Unit

NoC at higher injection rate experiences penalization of flits by all input ports busy with neighbouring flits and no flit ejection. This situation prevents the injection of flit from injection queue and re-injection of flit from side buffer. The pre-emption unit picks minimum deflected flit among the incoming flits and places it in the side buffer. This flit will returned to the router pipeline by re-injection with highest priority.

# 5 Experimental Analysis

# 5.1 Simulation Setup

We modified the traditional VC based, cycle accurate input buffered NoC simulator Booksim [3] to model the single-cycle minimally buffered deflection router microarchitecture mentioned in CHIPPER [10]. We considered flits with necessary header information for independent routing of flits used for deflection routing. We made changes on the microarchitecture of baseline deflection router simulator to model MinBD router and proposed single cycle minimally buffered deflection router, and conducted the experimental result analysis.

We evaluated the performance of MinBD and proposed router in an 8x8 mesh network using four synthetic traffic patterns: *uniform, transpose, tornado, and bit-complement.* We captured the results on the behalf of average flit latency, deflection rate, count of aged flits deflected, redirected and side buffered by varying the injection rate from very low load till saturation and analyze the results. We also evaluated the proposed router design using multiprogrammed workloads consisting of SPEC2006 CPU benchmark mixes.

We also examined the results of the proposed system in detail with the MinBD using multiprogrammed SPEC CPU2006 benchmark mixes. We model 64-core CMP setup with cache hierarchy and coherent protocols using gem5 [18] simulator. Each core be composed of an out-of-order x86 processing unit with a 64KB, dual ported, unified, private L1 cache. We use 4-way associative L1 cache with block size of 32B and a shared distributed 16-way associative L2 cache with block size of 64B.

SPEC CPU2006 benchmark applications are used to run on each core for the evaluation. Based on the *misses per kilo instructions* (MPKI) on L1 cache, benchmarks are classified into *Low* (less than 5), *Medium* (between 10 and 20), and *High* (greater than 25) as shown in Table 1.

| Percentage miss rate            | Benchmarks                       |
|---------------------------------|----------------------------------|
| Low MPKI (less than 5)          | specrand, sjeng, calculix, namd  |
| Medium MPKI (between 10 and 20) | aster, sphinx, libquantum, bzip2 |
| High MPKI (greater than 25)     | lbm,soplex, mcf, leslie3d        |

Table 1. Classification of applications based on MPKI

We create 15 multiprogrammed workloads, each having 64 applications chosen from the SPEC CPU2006 benchmark suite. Based on the network injection intensity (Low/Medium/High), the workloads are categorized into 3 mixes (M1 to M3) as shown in Table 2. Consider mix 1 (M1); where out of 64 cores that we model, 16 cores run specrand, 16 cores run sjeng, 16 cores run calculix and last 16 cores run namd benchmark. Similarly, other workload mixes (M2 and M3) can also be described.

Table 2. Workload Constitution

| Workload $\#$ | SPEC 2006 Benchmarks |                             |                               |              |
|---------------|----------------------|-----------------------------|-------------------------------|--------------|
| M1            | specrand(16)         | sjeng(16)                   | $\operatorname{calculix}(16)$ | namd(16)     |
| M2            | aster(16)            | $\operatorname{sphinx}(16)$ | libquantum(16)                | bzip2(16)    |
| M3            | lbm(16)              | $\operatorname{soplex}(16)$ | mcf(16)                       | leslie3d(16) |

# 5.2 Results & Discussions

We compared the performance of proposed single cycle minimally buffered deflection router with conventional two-cycle MinBD with 4-flit side buffer. We use xy-routing with adaptive prioritization schemes.

# Effect on number of aged flits side buffered, redirected and deflected

While changing the prioritization scheme used in MiBD to an adaptive prioritization, the percentage of conflicts against aged flits are reduced. By reverting the aged flits from deflection, redirection and side buffering will help to reduce the average latency and deflection rate. Fig. 4 shows the reduction in number of aged flits side buffered, deflected and redirected in proposed system with respect to MinBD in *uniform* traffic.

10 K.S. Midhula et al.



**Fig. 4.** Reduction in number of aged flits side buffered, deflected and redirected in proposed system with respect to MinBD in *uniform* traffic.



Fig. 5. Average flit latency comparison for various synthetic traffic patterns in 8x8 mesh network.

### Effect on Average Latency

The Latency of a flit is defined as the total amount of time taken by the flit from its source to reach its ultimate destination. Fig. 5 shows the plots of injection rate vs average flit latency for various synthetic patterns of an 8x8 mesh network. Average latency increases with injection rate and at a specific point (saturation point) in each synthetic traffic, average latency will increase exponentially. Proper flit management in terms of side buffering, redirection and deflection for right flit extends the saturation point further, which gives the network more load handling capacity than MinBD.

We observed that for all synthetic traffic patterns, the single cycle minimally buffered deflection router having an extended saturation point and low average latency than MinBD during pre-saturation load. This is due to the techniques applied for selecting the flits for side buffering, deflection and redirection.

Fig. 6 shows the latency results using multiprogrammed workloads with low (M1), medium (M2), high(M3) and average(Avg) network intensity workloads. The plot shows the percentage reduction in average latency with respect to MinBD for SPEC CPU 2006 benchmark mixes. The values ploted in the graph are calculated using the equation:

$$\begin{array}{l} Percentage \ Reduction \ w.r.t \ MinBD = (Latency \ of \ MinBD - \\ Latency \ of \ Proposed \ system) / \ Latency \ of \ MinBD \end{array}$$
(1)

We observed that proposed system in all the workloads shows reduction in average flit latency.



Fig. 6. Reduction in average flit latency rate for proposed system with respect to MinBD for multiprogrammed workloads.

#### Effect on Deflection Rate

Deflection rate is defined as the average number of deflections per flit. Reduction in deflection rate not only reduces the network activity but also reduces the dynamic power consumption. Fig. 7 shows deflection rate comparison for various synthetic traffic patterns. When the injection rate is low, then deflection rate is also less because there are not many conflicts occurring between flits for the output port or chances for contention is very less. But at higher injection rate, chances for contention is high. The proposed system will check the deflected flit and gives more preference for older flits.



Fig. 7. Deflection rate comparison for various synthetic traffic patterns in 8x8 mesh network.

Fig. 8 shows normalized deflection rate of proposed system with respect to MinBD for SPEC CPU 2006 benchmark mixes. M1 indicates low network intensity workload, M2 indicates medium network intensity workload, M3 indicates high network intensity workload and Avg indicates average of all. In every workloads, the proposed system maintains a significant reduction in deflection rate. Proposed system only deflects the flit with less priority based on the hop to destination prioritization scheme. Hence, deflection rate shows decrement than MinBD with random deflection flit selection.



Fig. 8. Normalized deflection rate for proposed system with respect to MinBD for multiprogrammed workloads.

# 6 Conclusion

In this work, we analyzed the existing state-of-the-art MinBD and identified few drawbacks due to insufficient prioritization schemes and improper consideration of aged flits at several stages of the pipeline with multiple cycles. Since, aged flits are the main cause of increased average latency, it is important to give attention for those flits. We provided an efficient single cycle minimally buffered deflection router with adaptive prioritization schemes that reduces the average latency and deflection rate to a significant extent. We further showed that the NoCs using single cycle minimally buffered deflection router can operate and achieve a considerable improvement in the network level performance.

# References

- Rantala, V., Lehtonen, T., Plosila, J.: Network on Chip Routing Algorithms. TUCS Technical Report No 779, pp. 10-12, 2006.
- Muralidharan, D., Muthaiah, R.: Bufferless Routing Algorithms: A survey. In: International Journal of Applied Engineering Research, pp. 3811-3813, 2016.
- Dally, W. J., Towles, B.: Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers, San Francisco, CA, 2004.
- Peh, L. -S., Keckler, S. W., Vangal, S.: On-Chip Networks for Multicore Systems. In: Keckler, S. W. Olukotun, K. Hofstee, H. P. (eds.) *Multicore Processors and Systems, Integrated Circuits and Systems*, pages 35-71. Springer US, 2009.
- Stojanovic, I., Jovanovic, M., Djosic, S. Djordjevic, G.: Improved deflection routing method for bufferless networks-on-chip. IEEE, 2013.
- Baran, P.: On distributed communications networks. In: *IEEE Transactions on Communications Systems*, pages 1-9, 1964.
- Cai, Y., Mai, K., Mutlu, O.: Comparative Evaluation of FPGA and ASIC Implementations of Bufferless and Buffered Routing Algorithms for On-Chip Networks. In: 16th International Symposium on Quality Electronic Design (ISQED), 2015.
- Moscibroda, T., Mutlu, O.: A case for bufferless routing in on-chip networks. In: Proceedings of the Fourth International Symposium on Networks-on-Chip, pages 9-16, 2010.
- Fallin, C., Nazario, G., Yu, X., Chang, K., Ausavarungnirun, R., Mutlu, O.: Bufferless and Minimally-Buffered Deflection Routing. In: *Routing Algorithms in Networks-on-Chip, Palesi and Daneshtalab (eds)*, pp. 241-275, 2014.
- Fallin, C., Craik., Mutlu, O.: CHIPPER: A low-complexity bufferless deflection router. In: International Symposium on High Performance Computer Architecture (HPCA), pp. 144155, 2011.
- Fallin, C., Nazario, G., Yu, X., Chang, K., Ausavarungnirun, R., Mutlu, O.: MinBD: Minimally-buffered deflection routing for energy-efficient interconnect. In: International Symposium on Networks-on-Chip (NOCS), pp. 110, 2012.
- Nayak, B., Jose, J., Mutyam, M.: SLIDER: Smart Late Injection DEflection Router for Mesh NoCs. In: *International Conference on Computer Design (ICCD)*, pp. 377383, 2013.
- 13. Jose, J., Jonna, G. R., Radhakrishnan, R., Mutyam, M.: Minimally Buffered Single-Cycle Deflection Router. In: Design, Automation and Test in Europe (DATE), 2014.

- 14 K.S. Midhula et al.
- Michelogiannakis, G., Sanchez, G. D., Dally, W. J., Kozyrakis, C.: Evaluating bufferless flow-control for on-chip networks. In: *Proceedings of the Fourth International Symposium on Networks-on-Chip*, pages 9-16, 2010.
- Jose, J., Nayak, B., Kumar, K., Mutlu, O.: DeBAR: Deflection Based Adaptive Router With Minimal Buffering. In: Design, Automation and Test in Europe (DATE)13, pp. 1583-1588, 2013.
- Atagoziyev, M.: Routing Algorithms For On Chip Networks. M. Sc. Thesis Middle East Technical University, 2007.
- Hayenga, M., Jerger, N. E., Lipasti, M.: SCARAB: A single cycle adaptive routing and bufferless routing. In: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 42, pages 244254, 2009.
- Binkert, N., Beckmann, B., Black, G., Reinhardt, S. K., Saidi, A., Basu, A., Hestness, J., Hower, D. R., Krishna, T., Sardashti, S., Sen, R., Sewell, K., Shoaib, M., Vaish, N., Hill, M. D., Wood, D., A.: The gem5 Simulator. In: ACM SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1-7, 2011.