Contents lists available at ScienceDirect





# Nano Communication Networks

journal homepage: www.elsevier.com/locate/nanocomnet

# Traffic aware routing in 3D NoC using interleaved asymmetric edge routers



Rose George Kunthara<sup>a,\*</sup>, Rekha K. James<sup>a</sup>, Simi Zerine Sleeba<sup>b</sup>, John Jose<sup>c</sup>

<sup>a</sup> Division of Electronics, School of Engineering, CUSAT, Cochin, India

<sup>b</sup> Department of Electronics & Communication Engineering, Rajagiri School of Engineering & Technology, Cochin, India

<sup>c</sup> Department of Computer Science & Engineering, Indian Institute of Technology Guwahati, India

#### ARTICLE INFO

Article history: Received 28 February 2020 Received in revised form 21 November 2020 Accepted 28 November 2020 Available online 15 December 2020

Keywords: Network-on-Chip Bufferless router Average latency Deflection rate Throughput

#### ABSTRACT

Network-on-Chip (NoC) has emerged as a cost-effective on-chip interconnect solution for Tiled Chip Multi Processors (TCMP), where many computational cores occupy a single chip. The performance of an NoC can be greatly enhanced by incorporating 3D IC technology, formed by stacking several active NoC layers using Through Silicon Via (TSV) vertical interconnections. 3D NoC routers improve network throughput and have minimal latency at the cost of increased router area and power dissipation. Performance degradation can occur in 3D structures due to unequal traffic distribution across the chip, leading to larger power density and higher on-chip temperature that affect system reliability. In this paper, we propose an interleaved vertical edge routing design approach in 3D NoC that employs an asymmetrical routing algorithm and uses a unique flit prioritization unit to improve the performance of bufferless mesh NoC. Experimental results indicate that our proposed router has better network performance with minimal hardware overhead when compared with conventional bufferless networks using the same number of routers.

© 2020 Elsevier B.V. All rights reserved.

#### 1. Introduction

With the recent advances in technology scaling, a large number of processing cores can be integrated into a single package. NoC, a packet switched network, has evolved as a viable alternative to overcome the communication issues occurring between different cores in an SoC that employ traditional on-chip interconnect structures. Higher bandwidth, scalability, modular topology, better parallelism, load handling ability and performance are some important advantages that make NoC most preferred choice for TCMP design. In a planar TCMP, various processing elements (PE) are interconnected using a two dimensional mesh topology based NoC. Each PE is connected to a router and all such routers are interconnected using bidirectional links in a 2D mesh. Each router consists of 5 input–output ports which enable packet transmission between neighbouring routers in north, south, east, west directions and to the local PE [1–3].

A packet, which is the basic unit of data transmission between tiles, is further subdivided into flow control units called flits. A flit hops through multiple routers and links during its traversal from source to destination [1]. As the number of cores increases, a 3D mesh NoC is a better design choice than a planar 2D mesh network. Conventional 3D mesh NoCs make use of 7-port router structures where two TSV based vertical links are used for interlayer communication in addition to the 5 ports [4]. But TSVs incur area overhead, which results in reduced wafer utilization and misalignment problems. Thus the number of TSVs used in 3D NoC topologies affects the design and performance of 3D based architectures [5].

\* Corresponding author.

The routing mechanisms employed in NoC may be based on two architecture types: buffered and bufferless. Buffered routers use a store-and-forward wormhole routing mechanism, storing input flits inside the router until they are forwarded through productive output ports [2,3]. The second category, bufferless routers, eliminates buffers together with the complexity of buffer management circuitry and the area and power overheads associated with it [6,7]. Deflection routing is the most widely used method for output port allocation in bufferless routers [8,9]. CHIPPER [9] is an efficient bufferless deflection router in terms of network latency, operating speed and power consumption. Hence, we adopt this 5-port router microarchitecture, with enhancements, in the proposed work.

This paper proposes a hybrid design that combines the power and area benefits of the 2D CHIPPER design, which has a 5-port structure, with 3D mesh NoC. The objective is to minimize the number of vertical interconnections without sacrificing performance. Several layers of 2D mesh NoC are stacked utilizing 3D integration, and interlayer communication is realized with TSV based interconnection links that are made only at edge routers in an interleaved fashion. A novel flit prioritization unit is incorporated in the proposed router pipeline, which minimizes flit deflection rate. Compared with its 2D and 3D counterparts and with M-3D (Modified Three Dimensional) CHIPPER [10], our proposed design displays better network performance with balanced link utilization, minimal area footprint and router overhead.

*E-mail addresses:* rosegeorgekunthara@cusat.ac.in (R.G. Kunthara), rekhajames@cusat.ac.in (R.K. James), simizs@rajagiritech.edu.in (S.Z. Sleeba), johnjose@iitg.ac.in (J. Jose).

The rest of this paper is organized as follows: Section 2 outlines the related work and Section 3 discusses the motivation for our proposed design. The details of the proposed design are given in Section 4, whereas Section 5 discusses the experimental methodology adopted. Results and analysis are given in Section 6 and finally Section 7 concludes the paper.

#### 2. Related work

3D NoC structures enhance system performance manifold by virtue of their better connectivity, decreased hop count, better noise immunity, packaging density and lower power consumption due to short interconnect links [11–13]. Pavlidis et al. compare 2D mesh networks with their 3D counterparts to show the superiority of 3D NoC by analysing power consumption and zero-load latency of each network [14]. Li et al. propose a Hybrid 3D NoC-Bus mesh, or stacked mesh architecture, which is formed by combining a packet-switched network and a bus link [15]. They replace 7-port conventional 3D routers with 6-port hybrid NoC-Bus 3D routers to improve network performance. An additional arbiter per pillar is also employed for better integration of the NoC interface and bus structure.

Matsutani et al. proposes a class of 3D topologies known as Xbar-connected network-on-tiers (XNoTs) for optimum usage of short delay and large density of inter-wafer links [16]. It consists of multiple network layers connected using crossbar switches. Several forms of XNoTs based topologies are analysed with various performance metrics. XNoT reports better throughput, though it has reduced power efficiency due to large vertical switches. MIRA is a 3D stacked NoC router architecture which assumes that the processing cores are also designed in 3D. It uses several active NoC layers and is optimized to have decreased power dissipation and area overhead [17].

Xu et al. evaluates effect of reducing number of TSV links to half and quarter on 3D NoC system performance and functionality [18]. Variable delays and unbalanced allocation of 3D switches in their proposed architectures obstruct network performance. Wang et al. uses partition islands of switches to form areas for allotting same TSV pad for interlayer communication which are handled with serialization logic [19]. With increase in number of switches per TSV bundle, the average packet delay tends to exponentially rise due to serialization across TSV bundle.

Partially connected 3D NoC designs have emerged to overcome high manufacturing cost, large area overhead and low fabrication yield, which are associated with TSV [20,21]. Fu et al. presents a Congestion-aware Dynamic elevator Assignment (CDA) method that considers distance factors and network congestion information to improve network performance [22]. Vahdatpanah et al. proposes an adaptive, deadlock-free and livelock-free routing algorithm to improve performance of partially connected 3D NoC under network congestion by evenly distributing the network traffic [23].

Most of the work on reducing the number of vertical links to enhance 3D NoC performance targets buffered NoC networks that employ 6-port or 7-port 3D routers. Very few works address bufferless NoC designs in the 3D domain. Some of the relevant ones are discussed below. TSV serialization is a bottleneck to NoC performance as it affects the latency and available bandwidth of TSV links. Lee et al. suggest a deflection routing mechanism which permits full TSV link utilization using a TSV ejection/injection mechanism to achieve better performance and low latency at high traffic [24]. 3DPERM [25] is a single cycle deflection router that uses an output port allocation scheme similar to CHIPPER [9]. Reduced power and area overhead are the highlights of this design when compared to the corresponding single-cycle 3D CHIPPER.

Tatas et al. proposes 3DBUFFBLESS, an asymmetrical 3D NoC router, which is buffered in z-dimension and bufferless in x and y dimensions [26]. By effectively combining advantages of buffered and bufferless router architectures, their novel router shows improved routing efficiency with minimum power and area overhead. DoLaR is a two layer NoC design formed by stacking two identical layers of planar 2D mesh network. The router architecture is inherited from 2D CHIPPER and uses a modified routing algorithm for efficient packet routing [27]. In comparison with 2D mesh and torus bufferless NoCs, it exhibits better performance, while operating at same frequency as 2D CHIPPER design.

#### 3. Motivation

Three major factors influencing NoC performance that we have explored are router microarchitecture, topology and routing algorithm. Considering area and power efficiency at low and medium network load, bufferless deflection router like CHIPPER is mostly preferred. Fig. 1 shows CHIPPER router microarchitecture with single cycle latency at each stage. CHIPPER ensures that golden flit, which is the highest priority flit in the entire network, gets required output port in each router. Thus golden flit scheme guarantees livelock avoidance. After the golden flit reaches its destination, priority is then passed on to another flit in progress. The remaining non-golden flits undergo random port allocation thereby increasing deflection rate and latency. Golden flit concept alone is not quite efficacious as majority of flits reach their destination without becoming golden. So we propose an additional flit prioritization unit to reorder the non-golden flits and thereby reduce number of deflections.

In a 2D NoC, as network size increases, the average number of intermediate routers traversed by packets rises drastically. Increased hops per packet lead to higher latency and larger power dissipation, thereby affecting network performance. These limitations are overcome in a 3D NoC, which is formed by stacking several NoC layers with interlayer communication carried over TSV based vertical interconnects. Fig. 2a and b depict 2D mesh and 3D mesh NoC structures respectively. 3D mesh incurs 32% area and 43% power overhead per router compared to its bufferless 2D counterpart.
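The hop-count argument can be illustrated with a small brute-force calculation. The sketch below is not taken from the paper's experiments: it assumes minimal (Manhattan) routing, uniform traffic over all ordered source/destination pairs (including a router sending to itself), and a fully connected mesh in every dimension. The function name `average_hops` is hypothetical.

```python
from itertools import product


def average_hops(dims):
    """Mean Manhattan distance over all ordered (source, destination) pairs."""
    nodes = list(product(*(range(n) for n in dims)))
    total = sum(
        sum(abs(a - b) for a, b in zip(src, dst))
        for src in nodes
        for dst in nodes
    )
    return total / (len(nodes) ** 2)


print(average_hops((8, 8)))     # 8 x 8 2D mesh      -> 5.25
print(average_hops((4, 4, 4)))  # 4 x 4 x 4 3D mesh  -> 3.75
```

Under these assumptions, 64 routers arranged as a full 3D mesh need fewer hops on average than the same 64 routers in a plane. A partially connected topology would fall somewhere in between, depending on where the vertical links are placed.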

We perform simulations on 3D mesh NoC using CHIPPER for a uniform traffic pattern and analyse link utilization. Link utilization is the amount of traffic flowing through router links. Fig. 3 shows normalized link utilization across all layers of a $4 \times 4 \times 4$ 3D CHIPPER where traffic is dispensed evenly across the network. In Fig. 3a we observe that link utilization for central routers is very high compared to edge routers in all the layers of the NoC. The variation in link utilization is highest for Layer 0, at 55%. This is because of the very nature of the XYZ routing algorithm, where central routers route the majority of traffic to their destination routers. Fig. 3b depicts link utilization through vertical links between adjacent layers for the uniform traffic pattern. Even though vertical links are equally distributed inside the chip, there is a variation of 81% between the highest utilized vertical link and the lowest utilized one. Unbalanced link utilization causes a proportionate rise in power density and the formation of thermal hotspots at the centre of the network, which adversely affect reliability and average lifespan of the chip.



Fig. 1. Two stage pipeline architecture of CHIPPER [9]. Ejection and Injection units constitute first stage of router pipeline. Permutation Deflection Network (PDN) performs parallel port allocation in second stage leading to low delay inside the router.



**Fig. 2.** Mesh topology based NoC architectures with the same number of routers. (a)  $8 \times 8$  2D mesh employing 5-port router structure (b)  $4 \times 4 \times 4$  3D mesh where all routers are 7-port: one each for north, south, east, west, up, down and one connected to local core.



Fig. 3. Link utilization in a 4  $\times$  4  $\times$  4 3D Mesh NoC for uniform traffic.

We propose a multilayer network built by stacking multiple layers of 2D mesh NoCs. State-of-the-art bufferless 2D mesh NoC systems employ 5-port routers where the edge routers have at least one unused port. We exploit unconnected ports of edge routers to create TSV based vertical interconnections between adjoining layers. As shown in Fig. 4, we make following vertical interconnections to form three types of multilayer connections for  $4 \times 4 \times 4$  mesh NoC:

• Type I - Layer 1 edge routers 16, 20, 24 and 28 are connected to Layer 0 edge routers 0, 4, 8 and 12 through their unused west ports. Layer 1 edge routers 28, 29, 30 and 31 are attached to Layer 2 edge routers 44, 45, 46 and 47 through south ports. Similarly other edge routers are also interconnected.

• Type II - Layer 1 edge routers 28, 29, 30 and 31 are connected to Layer 0 edge routers 12, 13, 14 and 15 through their unused south ports. Layer 1 edge routers 16, 20, 24 and 28 are linked to Layer 2 edge routers 32, 36, 40 and 44 through their west ports. Subsequent vertical interconnections are made at other edge routers also.



• Type III - Consider Layer 1 edge routers 28, 29, 30 and 31. Layer 0 edge routers 12 and 14 are interconnected to 28 and 30 respectively whereas Layer 2 edge routers 45 and 47 are connected to 29 and 31 respectively through their south ports. Also corner routers 28 and 31 are interconnected to Layer 1 and 2 through their west and east ports respectively. Thus adjacent layers are connected through unused ports of edge routers in an interleaved manner.

**Fig. 4.** Mesh topology based multilayer connections formed by stacking four layers of  $4 \times 4$  mesh NoC: (a) Type I (b) Type II (c) Type III (M-3D Mesh). All routers are 5-port and the same number of vertical links is used in all three topologies. In M-3D mesh, edge routers follow interleaved asymmetrical vertical interconnection across adjacent layers.

Fig. 5. Two stage pipeline architecture of improved M-3D CHIPPER. The ejection and injection units employed in the first pipeline stage are the same as those of the CHIPPER architecture [9]. The second stage comprises a Priority Allocation Unit (PAU) followed by a PDU for output port allocation.

We carry out simulations on Type I, Type II and Type III NoC using uniform traffic to analyse average flit latency and link utilization. Compared to 3D CHIPPER mesh NoC, Type I, Type II and Type III have latency improvements of 13%, 11% and 18% respectively whereas link utilization reduces by 12%, 10% and 17% respectively. The superior results obtained for Type III multilayer connection makes it a clear design choice in our proposed work. This multilayer connection is referred to as M-3D mesh topology in remaining part of the paper. The interleaved asymmetrical connection followed in M-3D mesh topology is beneficial when source and destination routers are in different layers as flits can have much shorter routes. Thus all the routers follow 5-port structure with an additional low overhead flit prioritization mechanism and an asymmetric routing algorithm which can evenly balance the link utilization and thereby improve network performance.

#### 4. Proposed design

Fig. 5 shows the two stage pipeline router architecture of the proposed design: improved M-3D CHIPPER. Flits from nearby routers reach input ports at the beginning of each clock cycle. The four internal flit channels transport incoming flits that progress through different functional units of the router pipeline. At the end of every clock cycle, flits are stored in respective pipeline registers. The essential features of the different functional blocks employed in improved M-3D CHIPPER are detailed as follows.

*Ejection and injection unit.* Ejection block ejects flit which is destined to local processing core and supports only one flit ejection per router in a clock cycle. When there are multiple flits to be ejected to same destination local core, only highest priority flit will get ejected through ejection port whereas others are deflected to nearby routers and will eventually come back to the same router in ensuing clock cycles. The injection unit injects new flits from local processing core if any one of internal flit channels is free. Otherwise flits will get queued up at processor core level as they cannot be stored in a bufferless router.
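The behaviour of the ejection and injection unit described above can be sketched as a single pipeline step. This is only an illustrative model, not the authors' hardware: the flit representation (a dictionary with a `dest` router id and a `rank`, where a smaller rank means higher priority), the function name `eject_then_inject` and the use of a software queue for the processor-side injection queue are all assumptions.

```python
from collections import deque


def eject_then_inject(channels, router_id, queue):
    """One ejection/injection step over the four internal flit channels.

    channels: list of four entries, each a flit dict or None (free channel).
    queue:    flits waiting at the local core (hypothetical injection queue).
    Returns the ejected flit, or None if nothing was ejected.
    """
    # Channels holding flits destined to the local core.
    local = [i for i, f in enumerate(channels)
             if f is not None and f["dest"] == router_id]
    ejected = None
    if local:
        # Only one ejection per cycle: the highest priority flit wins.
        best = min(local, key=lambda i: channels[i]["rank"])
        ejected, channels[best] = channels[best], None
    # Inject one new flit only if some internal channel is free.
    if queue and None in channels:
        channels[channels.index(None)] = queue.popleft()
    return ejected


channels = [{"dest": 5, "rank": 2}, {"dest": 5, "rank": 0},
            {"dest": 9, "rank": 1}, {"dest": 7, "rank": 3}]
waiting = deque([{"dest": 1, "rank": 4}])
out = eject_then_inject(channels, router_id=5, queue=waiting)
print(out)       # the rank-0 flit destined to router 5
print(channels)  # the other flit for router 5 stays in its channel and will be deflected
```

In this model the second flit destined to router 5 is not stored; it simply continues to port allocation and leaves through some output port, which matches the deflection behaviour described in the text.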

Improved M-3D CHIPPER architecture employs asymmetric routing algorithm described in Algorithm 1 for computing desired output port for flits, as we follow M-3D mesh topology given in Fig. 4(c) [10]. Deflection routing algorithms are generally deadlock free as flits are not buffered inside the router. The simple deterministic XY routing algorithm is used for intra-layer routing. For inter-layer routing, flits are first routed to the nearest edge router which is interconnected to corresponding layer with shortest Manhattan distance.

Consider two different cases to illustrate asymmetry in routing path due to interleaved vertical interconnections across adjacent layers of M-3D mesh topology shown in Fig. 4(c):

• Flit 1 (Src router: 14, Dest router: 22): As 14 is at the last row and an even column of an outer layer, the south port is taken to reach 30. Now 30 and 22 are in the same layer, so the flit follows the XY routing algorithm to reach its destination.

• Flit 2 (Src router: 42, Dest router: 6): It passes through routers 43, 27, 23, 7 and 6 by taking east, east, north, east, west and local ports respectively so as to have shortest Manhattan distance.

**Algorithm 1** Asymmetric routing algorithm [10]

Input: current\_router, destination\_router

Output: output port

- **if** (current\_router\_layer == destination\_router\_layer) **then**
  - XY routing algorithm
- **else if** (current\_router\_layer is outer or odd layer) **then**
  - **if** (current router is at first column and even row) **then** output port = west
  - **else if** (current router is at last column and odd row) **then** output port = east
  - **else if** (current router is at first row and odd column) **then** output port = north
  - **else if** (current router is at last row and even column) **then** output port = south
  - **else** output port taken as east or west depending on destination column
  - **end if**
- **else**
  - **if** (current router is at first column and odd row) **then** output port = west
  - **else if** (current router is at last column and even row) **then** output port = east
  - **else if** (current router is at first row and even column) **then** output port = north
  - **else if** (current router is at last row and odd column) **then** output port = south
  - **else** output port taken as east or west depending on destination column
  - **end if**
- **end if**
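A literal Python transcription of the conditions in Algorithm 1 may help when reading the examples. Several points are assumptions rather than details given in the paper: routers are numbered as `layer * 16 + row * 4 + col` with zero-based rows and columns (this is consistent with the router numbers quoted in the text), row 0 is the northern edge, and a tie in the final east/west choice is broken toward the interior of the mesh. The sketch only reproduces the printed conditions; it does not model which neighbouring layer a vertical link leads to, nor the shortest-Manhattan-distance selection of the edge router mentioned in the text. All function names are hypothetical.

```python
N = 4       # routers per row and per column within a layer (4 x 4)
LAYERS = 4  # number of stacked layers


def locate(router):
    """Split a router id into (layer, row, col). The numbering is an assumption."""
    layer, rest = divmod(router, N * N)
    row, col = divmod(rest, N)
    return layer, row, col


def xy_route(row, col, dest_row, dest_col):
    """Deterministic XY routing inside one layer: fix the column, then the row."""
    if dest_col > col:
        return "east"
    if dest_col < col:
        return "west"
    if dest_row > row:
        return "south"
    if dest_row < row:
        return "north"
    return "local"


def asymmetric_route(current, destination):
    """Desired output port, following the printed conditions of Algorithm 1."""
    layer, row, col = locate(current)
    dest_layer, dest_row, dest_col = locate(destination)
    if layer == dest_layer:
        return xy_route(row, col, dest_row, dest_col)

    outer_or_odd = layer in (0, LAYERS - 1) or layer % 2 == 1
    p = 0 if outer_or_odd else 1  # parity pattern is mirrored in the other branch
    if col == 0 and row % 2 == p:
        return "west"
    if col == N - 1 and row % 2 == 1 - p:
        return "east"
    if row == 0 and col % 2 == 1 - p:
        return "north"
    if row == N - 1 and col % 2 == p:
        return "south"
    # East or west depending on destination column.
    if dest_col > col:
        return "east"
    if dest_col < col:
        return "west"
    return "east" if col < N - 1 else "west"  # tie-break: an assumption


print(asymmetric_route(14, 22))  # first hop of Flit 1 -> south
print(asymmetric_route(30, 22))  # same layer, XY routing -> north
print(asymmetric_route(43, 6))   # second hop of Flit 2 -> east
```

With this numbering, the first hop of Flit 1 and the vertical hops of Flit 2 at routers 43 and 23 match the ports listed in the two cases above. The intermediate hop of Flit 2 at router 27 depends on the nearest-edge-router selection, which this literal transcription does not capture.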

*Priority allocation unit.* Our proposed router uses golden flit concept for ensuring livelock avoidance, as in CHIPPER. The function of PAU is to allocate an ordering to all non-golden flits present in pipeline register B. This is done by taking out their destination address fields to compute the number of layers that are to be traversed from current router to reach their respective destinations. So flits are prioritized in such a way that highest priority is assigned to ones that are closer to destination layer or which require least number of layer transitions. This guarantees fairness and progress in flit movement by minimizing unnecessary deflections of non-golden flits.
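The ordering performed by the PAU can be sketched as a sort on the number of layer transitions a flit still needs. This is a behavioural illustration under stated assumptions, not the hardware implementation: the `Flit` class, its fields and the function `pau_order` are hypothetical, and ties are left in arrival order because the text does not specify a tie-break.

```python
from dataclasses import dataclass


@dataclass
class Flit:
    name: str
    dest_layer: int
    golden: bool = False


def pau_order(flits, current_layer):
    """Return flits from highest to lowest priority.

    The golden flit (if any) comes first; non-golden flits are ordered by
    the number of layer transitions still required, fewest first.
    """
    return sorted(
        flits,
        key=lambda f: (not f.golden, abs(f.dest_layer - current_layer)),
    )


flits = [Flit("A", 3), Flit("B", 1), Flit("G", 3, golden=True), Flit("C", 2)]
print([f.name for f in pau_order(flits, current_layer=1)])  # ['G', 'B', 'C', 'A']
```

Python's `sorted` is stable, so flits needing the same number of layer transitions keep their original relative order in this sketch.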

*Permutation Deflection Unit (PDU).* When more than one flit requires the same output port, port allocation issues emerge in bufferless routers due to the absence of buffers. The role of the PDU is similar to that of the PDN block employed in CHIPPER, where every router input port is efficiently mapped to every single output port. The PDU contains four permuter units arranged in two stages as shown in Fig. 5. Each permuter block consists of a  $2 \times 2$  crossbar. Flits coming from the North and East ports are connected to permuter block P1 whereas flits from the South and West ports are linked to P2.

#### Table 1

Categorization of benchmark applications based on cache misses per kilo instructions (MPKI) values.

| MPKI category                  | Benchmark applications            |
|--------------------------------|-----------------------------------|
| Low MPKI (less than 5)         | calculix, gobmk, gromacs, h264ref |
| Medium MPKI (between 5 and 25) | bwaves, bzip2, gamess, gcc        |
| High MPKI (greater than 25)    | hmmer, lbm, leslie3d, mcf         |

At each permuter block, the flit with the highest priority wins the arbitration and gets its productive output while the other flit is sent to the remaining output. Consider an example where two non-golden flits, Flit1 and Flit2, arrive through the North and East internal flit channels respectively. Also assume that the number of layer transitions required by Flit1 is smaller than that of Flit2 and that the desired output port for both Flit1 and Flit2 is West. According to flit priority (Flit1 > Flit2), Flit1 will be routed to permuter block P4, where the West port is connected, and Flit2 will be automatically deflected to permuter block P3. Thus improved M-3D CHIPPER ensures that, in addition to the golden flit, flits that are near their destination layer are not deflected, by virtue of the priority scheme based on the number of layer transitions required. The latency of the port allocation stage is low due to the parallel structure of permuter blocks in the PDU.
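The arbitration at a single $2 \times 2$ permuter block, as in the Flit1/Flit2 example, can be sketched as follows. The tuple representation of a flit, the numbering of the two outputs (output 0 toward P3, output 1 toward P4) and the function name `permuter_block` are assumptions made for illustration; golden-flit handling is omitted.

```python
def permuter_block(first, second):
    """Arbitrate two inputs of a 2x2 permuter block.

    Each input is None or a tuple (name, layer_transitions, wanted_output),
    where wanted_output is 0 or 1. Fewer layer transitions means higher
    priority. Returns the names placed on (output 0, output 1).
    """
    flits = [f for f in (first, second) if f is not None]
    outputs = [None, None]
    # Serve flits in priority order; the loser takes whatever output is left.
    for name, _, wanted in sorted(flits, key=lambda f: f[1]):
        slot = wanted if outputs[wanted] is None else 1 - wanted
        outputs[slot] = name
    return tuple(outputs)


# Flit1 needs fewer layer transitions than Flit2; both want the output
# that leads toward the West port (output 1 in this sketch).
print(permuter_block(("Flit1", 1, 1), ("Flit2", 2, 1)))  # ('Flit2', 'Flit1')
```

Here Flit1 obtains the output it asked for and Flit2 is deflected to the other output, mirroring the P4/P3 outcome in the example. A full PDU would chain four such blocks in two stages.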

#### 5. Experimental methodology

We employ Booksim 2.0 [28], an open source cycle accurate NoC simulator that models a conventional VC based NoC router [1]. We modify it to build a two-cycle CHIPPER router [9]. Requisite information is attached to every flit so as to facilitate independent routing of all flits present inside a packet, which is the standard norm in deflection routers. We use the necessary reassembly mechanism for efficiently handling out-of-order flit delivery. The 140-bit wide flit channel contains a 128-bit data field and a 12-bit wide control field. On this standard bufferless deflection router, we make modifications to prototype 3D CHIPPER, M-3D CHIPPER and improved M-3D CHIPPER for experimental analysis.

#### 5.1. Synthetic traffic

We use standard synthetic traffic patterns such as uniform, transpose, bit-complement and bit-reverse to evaluate the performance of improved M-3D CHIPPER against baseline 2D CHIPPER, 3D CHIPPER and M-3D CHIPPER with 64 routers. The traffic pattern decides the destination router for each generated flit. For uniform traffic, every router has the same probability of being selected as a destination. For transpose traffic, the destination router for a source router at $(i, j, k)$ is $(N_x-1-i,\ N_y-1-j,\ N_z-1-k)$, where $N_x$, $N_y$ and $N_z$ are the number of routers along each network dimension. Bit permutation traffic patterns such as bit-complement and bit-reverse are found by permuting and then selectively reversing or complementing bits of the source router address. After sufficient warm up time, by varying the injection rate from zero to the network saturation point, network performance metrics such as average latency, deflection rate, link utilization and throughput are computed for all traffic patterns.
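The destination functions described above can be written down compactly. The sketch below uses the transpose definition exactly as stated in the text and commonly used definitions of bit-complement and bit-reverse on a 6-bit router address (64 routers); the simulator's own implementations may differ in detail, and the function names are hypothetical.

```python
import random


def uniform_dest(num_routers, rng):
    """Uniform traffic: every router is equally likely to be the destination."""
    return rng.randrange(num_routers)


def transpose_dest(src, dims):
    """Transpose as defined in the text: (i, j, k) -> (Nx-1-i, Ny-1-j, Nz-1-k)."""
    return tuple(n - 1 - c for c, n in zip(src, dims))


def bit_complement(router, bits):
    """Flip every bit of the source router address."""
    return ~router & ((1 << bits) - 1)


def bit_reverse(router, bits):
    """Reverse the bit order of the source router address."""
    return int(format(router, f"0{bits}b")[::-1], 2)


rng = random.Random(0)                        # fixed seed, for repeatability only
print(uniform_dest(64, rng))                  # some router id in 0..63
print(transpose_dest((0, 1, 3), (4, 4, 4)))   # (3, 2, 0)
print(bit_complement(5, 6))                   # 0b000101 -> 0b111010 = 58
print(bit_reverse(5, 6))                      # 0b000101 -> 0b101000 = 40
```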

#### 5.2. Real traffic

We compare the performance of our proposed design against 2D CHIPPER, 3D CHIPPER and M-3D CHIPPER for real application workloads. For that, we use the Gem5 simulator to model a 64-core multiprocessor system [29]. We assume that each processing core comprises an out-of-order x86 processing unit with a 4-way set associative, 64 KB private L1 cache and a 16-way set associative, 512 KB shared distributed L2 cache. Each processing unit is assigned to run one of the SPEC CPU2006 benchmark application programs [30]. We classify the benchmark applications into various network injection intensity categories as given in Table 1.

#### Table 2

Various benchmark combinations.

| Mix # | SPEC CPU 2006 Benchmarks                                         |
|-------|------------------------------------------------------------------|
| M1    | calculix(16) gobmk(16) gromacs(16) h264ref(16)                   |
| M2    | bwaves(16) bzip2(16) gamess(16) gcc(16)                          |
| M3    | hmmer(16) lbm(16) mcf(16) leslie3d(16)                           |
| M4    | hmmer(16) lbm(16) gromacs(16) h264ref(16)                        |
| M5    | bwaves(16) bzip2(16) mcf(16) leslie3d(16)                        |
| M6    | calculix(16) gobmk(16) gamess(16) gcc(16)                        |
| M7    | calculix(10) gromacs(10) bwaves(10) gamess(10) hmmer(12) mcf(12) |


Fig. 6. Average flit latency comparison for various synthetic traffic patterns.

Depending on the network injection intensity combination of the component benchmarks, we produce 7 multiprogrammed workload mixes as listed in Table 2. Consider Mix M1: out of the 64 cores that we prototype, it contains 16 cores running *calculix*, 16 cores running *gobmk*, 16 cores running *gromacs* and the last 16 cores running *h264ref* benchmark programs. Likewise, other workload mixes are also formed. The network traffic generated by running real workloads is fed into Booksim to simulate network operations.
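The workload mixes of Table 2 can be expressed as a small data structure, which also makes it easy to check that every mix fills all 64 cores. The contiguous core-by-core placement in `assign_cores` is an assumption made for illustration; the text only states how many cores run each benchmark.

```python
MIXES = {
    "M1": {"calculix": 16, "gobmk": 16, "gromacs": 16, "h264ref": 16},
    "M2": {"bwaves": 16, "bzip2": 16, "gamess": 16, "gcc": 16},
    "M3": {"hmmer": 16, "lbm": 16, "mcf": 16, "leslie3d": 16},
    "M4": {"hmmer": 16, "lbm": 16, "gromacs": 16, "h264ref": 16},
    "M5": {"bwaves": 16, "bzip2": 16, "mcf": 16, "leslie3d": 16},
    "M6": {"calculix": 16, "gobmk": 16, "gamess": 16, "gcc": 16},
    "M7": {"calculix": 10, "gromacs": 10, "bwaves": 10, "gamess": 10,
           "hmmer": 12, "mcf": 12},
}


def assign_cores(mix):
    """Benchmark name for each core, filled in table order (placement assumed)."""
    return [name for name, count in mix.items() for _ in range(count)]


for label, mix in MIXES.items():
    cores = assign_cores(mix)
    print(label, len(cores), cores[0], cores[-1])
```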

#### 6. Results and analysis

We compare performance of improved M-3D CHIPPER against baseline 2D CHIPPER, 3D CHIPPER and M-3D CHIPPER in terms of network performance metrics. Simulations are conducted on  $8 \times 8$  mesh with 2D CHIPPER using XY routing,  $4 \times 4 \times 4$  mesh with 3D CHIPPER and XYZ routing,  $4 \times 4 \times 4$  mesh with M-3D CHIPPER and  $4 \times 4 \times 4$  mesh with proposed design employing asymmetric routing algorithm described in Algorithm 1.

#### 6.1. Effect on average flit latency

Fig. 6 shows the comparison of average flit latency using various synthetic traffic patterns. Across all the traffic patterns, our improved M-3D CHIPPER has lower average flit latency by virtue of the additional prioritization unit that gives a total ordering among all flits. 3D CHIPPER would be expected to reduce average latency owing to its larger number of links and lower number of hops. Instead, 3D CHIPPER shows larger average flit latency values for all traffic patterns. This is due to the higher number of deflections caused by random port assignment of non-golden flits.

The superior load handling ability of a router is indicated by its larger saturation injection rate. Improved M-3D CHIPPER extends the saturation injection point, thereby making it a good design option for high injection rate applications. Even so, for network intensive traffic patterns like transpose and bitcomp, where only certain regions are stressed, improved M-3D saturates earlier than 3D CHIPPER. This is because 3D CHIPPER is able to sustain more traffic as it has more ports.

Average flit latency comparison with various network sizes is shown in Fig. 7. From the graph we can observe that improved M-3D CHIPPER has notable reduction in latency compared to other designs even when the network is scaled up. Fig. 8 depicts average flit latency comparison for various real workloads. Across all benchmark mixes presented in Table 2, we can notice a reduction in average flit latency for our proposed design. As expected, M1 has minimum latency whereas M3 has maximum latency as they consist of low MPKI and high MPKI applications respectively.



Fig. 7. Average flit latency comparison with different network sizes.



Fig. 8. Average latency comparison for real applications.

#### 6.2. Effect on deflection rate

Deflection rate refers to the average number of deflections undergone by each injected flit. A lower deflection rate is desirable as it indicates smaller dynamic power across NoC links due to minimal activity in the network. As the injection rate rises, the deflection rate also goes up because of higher port contention. Fig. 9 displays the deflection rate comparison for different synthetic traffic patterns. Improved M-3D CHIPPER has a lower deflection rate, especially at higher injection rates, due to prioritization of non-golden flits. The deflection rate comparison for SPEC CPU2006 benchmark applications is shown in Fig. 10. For all mixes, our proposed approach has a notable reduction in average flit deflection rate compared to 2D CHIPPER and M-3D CHIPPER. As the deflection rate of 3D CHIPPER is very high owing to pseudorandom arbitration of non-golden flits, it is excluded from the plot.

#### 6.3. Effect on throughput

Throughput is a measure of ability of the system to handle peak data rate. In any multilayer NoC network, throughput improvement depends on total number of physical links (both horizontal and vertical) and average hop count. From Fig. 11, we notice that proposed improved M-3D CHIPPER gives better throughput compared to other 5-port router based designs such as 2D CHIPPER and M-3D CHIPPER. As improved M-3D CHIPPER uses higher number of physical links than 2D CHIPPER, there is throughput growth of 17%, 12%, 15% and 33% for uniform, transpose, bitcomp and bitrev traffic respectively. The additional flit prioritization scheme employed in improved M-3D CHIPPER leads to lower average hop count resulting in throughput improvement of 4%, 2%, 7% and 3% over M-3D CHIPPER for above traffic patterns. However, throughput will be best for 3D CHIPPER due to two extra links at each router.

#### 6.4. Effect on link utilization

Fig. 12 shows the comparison of normalized link utilization for various synthetic traffic patterns. For 3D CHIPPER, link utilization is very high at central routers of upper layers as it follows the standard XYZ routing algorithm. Compared to 3D CHIPPER, average link utilization reduces by 17.1% and 17.8% for M-3D CHIPPER and improved M-3D CHIPPER respectively. Both M-3D CHIPPER and improved M-3D CHIPPER display similar link utilization across all the layers. This is because of the asymmetrical routing algorithm employed in both design approaches, due to which the traffic is distributed more evenly across the chip, leading to better link utilization and performance improvement for bufferless mesh NoC.

Fig. 9. Average flit deflection rate comparison for various synthetic traffic patterns.

Fig. 10. Average deflection rate comparison for real applications.

Fig. 11. Throughput comparison for various synthetic traffic patterns.

#### 6.5. Thermal analysis

We analyse thermal distribution through the NoC using the HotSpot 6.0 tool [31]. We use ORION [32] to model our router microarchitecture and extract the dynamic power dissipation of all the routers in a $4 \times 4 \times 4$ mesh NoC under varying load. The power traces found using ORION are fed into HotSpot to assess the variation in transient temperature due to flit flow load across NoC routers. Our empirical analysis with synthetic and real workloads reaffirms that the temperature is reduced by up to 6 K for central routers of upper layers, eliminating the formation of thermal hotspots using our proposed design in comparison to 3D CHIPPER.

#### 6.6. Hardware overhead

Verilog HDL models of 2D CHIPPER, 3D CHIPPER, M-3D CHIPPER and improved M-3D CHIPPER are synthesized using Xilinx Vivado 2018.3, targeting the Xilinx Zynq UltraScale+ XCZU7EV device, to calculate router pipeline latency, power and area overhead. Router delay is computed as the time taken by a flit to move from input port to output port. Similar functional units are employed in the first stage of M-3D CHIPPER and improved M-3D CHIPPER, due to which both of them have the same delay in their first router pipeline stage. The inclusion of the prioritization unit (PAU) followed by the parallel port allocator (PDU) adds a negligible logic delay of 1% in the second stage of the router pipeline when compared to 2D CHIPPER and M-3D CHIPPER. Improved M-3D CHIPPER incurs only 1% more area and 3.6% more power than 2D CHIPPER and M-3D CHIPPER. The hardware overhead is justified by notable



Fig. 12. Link utilization for 3D CHIPPER, M-3D CHIPPER and Improved M-3D CHIPPER.

improvement in network performance. When compared to 3D CHIPPER, our proposed design achieves 25%, 40% and 31% reduction in critical path latency, power consumption and router area respectively.

The overall area overhead in any NoC network consists of router area and wiring overhead. In a conventional 2D mesh NoC, the wiring overhead is due to horizontal links only (an $8 \times 8$ 2D mesh uses 112 horizontal links), whereas in a standard $4 \times 4 \times 4$ 3D mesh NoC there are 96 horizontal links and 48 vertical links. M-3D CHIPPER and our proposed design use the same number of horizontal links (96 links) as the 3D network but a reduced number of vertical links (24 links), as vertical interconnections are made only at edge routers. As TSVs consume significant metal area and silicon area, using a minimal number of TSV-based vertical interconnections yields better area savings. Overall, the chip footprint is the same as that of the 3D mesh but with 50% fewer vertical links.
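The link counts quoted above follow from simple mesh arithmetic, which the short sketch below reproduces. The helper name `mesh_link_counts` is hypothetical; the figure of 24 vertical links for the edge-router designs is taken directly from the text rather than derived, since the sketch does not model the interleaving pattern.

```python
def mesh_link_counts(x, y, z=1):
    """Count bidirectional links in a fully connected x * y * z mesh.

    Returns (horizontal, vertical). Each layer has x*(y-1) + y*(x-1)
    horizontal links; a full 3D mesh adds x*y vertical links between
    every pair of adjacent layers.
    """
    horizontal = z * (x * (y - 1) + y * (x - 1))
    vertical = x * y * (z - 1)
    return horizontal, vertical


mesh_2d = mesh_link_counts(8, 8)      # conventional 8 x 8 2D mesh
mesh_3d = mesh_link_counts(4, 4, 4)   # standard 4 x 4 x 4 3D mesh

# Vertical link count reported in the text for M-3D CHIPPER and the
# proposed design (edge routers only); assumed here, not derived.
EDGE_ONLY_VERTICAL_LINKS = 24
vertical_reduction = 1 - EDGE_ONLY_VERTICAL_LINKS / mesh_3d[1]

print("2D mesh (horizontal, vertical):", mesh_2d)
print("3D mesh (horizontal, vertical):", mesh_3d)
print(f"vertical link reduction: {vertical_reduction:.0%}")
```

The computed values match the 112, 96 and 48 links stated above, and the ratio confirms the 50% reduction in vertical links.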

# 7. Conclusion

TCMPs require efficient interconnection network designs as network scaling limits the system performance. The combination of NoC and 3D IC technology can significantly improve network performance and power dissipation. In this paper, we propose a unique approach to improve NoC performance by a 3D NoC network employing only 5-port routers. The adjoining layers are connected using unconnected ports at edge routers in an interleaved fashion to achieve shorter routes. An additional low overhead flit prioritization mechanism is employed in each router to order non-golden flits, thereby reducing deflections. By virtue of our asymmetrical algorithm, there is a more balanced traffic allocation across the chip without compromising the performance and functionality. From evaluations, we can observe that our improved M-3D CHIPPER outperforms the designs under consideration with minimum hardware modification.

# **Declaration of competing interest**

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

#### References

- [1] W. Dally, B. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann, USA, 2004.
- [2] William Dally, Route packets not wires: On-Chip interconnection networks, in: DAC, ACM Press, New York, 2001, pp. 684–689.
- [3] W. Dally, Virtual-channel flow control, IEEE Trans. Parallel Distrib. Syst. 3 (2) (1992) 194–205.
- [4] B.S. Feero, P.P. Pande, Networks-on-chip in a three-dimensional environment: A performance evaluation, IEEE Trans. Comput. (2009) 32–45.
- [5] M.O. Agyeman, et al., Performance and energy aware inhomogeneous 3D networks-on-chip architecture generation, IEEE Trans. Parallel Distrib. Syst. 3 (6) (2016) 1756–1769.
- [6] Y. Hoskote, et al., A 5-GHz mesh interconnect for a teraflops processor, IEEE Micro 27 (5) (2007) 51-61.
- [7] M.B. Taylor, et al., Evaluation of the raw microprocessor: An exposed-wire-delay architecture for ILP and streams, in: ISCA, 2004.
- [8] T. Moscibroda, O. Mutlu, A case for bufferless routing in on-chip networks, in: ISCA, 2009, pp. 196–207.
- [9] C. Fallin, et al., CHIPPER: A low complexity bufferless deflection router, in: HPCA, 2011, pp. 144–155.
- [10] R.G. Kunthara, et al., Asymmetric routing in 3D NoC using interleaved edge routers, in: NoCArc, 2019.
- [11] A.W. Topol, et al., Three-dimensional integrated circuits, IBM J. Res. Dev. 50 (4/5) (2006).
- [12] K. Siozios, et al., Three dimensional network-on-chip architectures, in: F. Gebali, H. Elmiligi, M.W. El-Kharashi (Eds.), Networks-on-Chips: Theory and Practice, CRC Press, Boca Raton, FL, USA, 2009, pp. 1–28.

- [13] W.R. Davis, et al., Demystifying 3D ICs: The pros and cons of going vertical, IEEE Des. Test Comput. 22 (6) (2005) 498–510.
- [14] V.F. Pavlidis, E.G. Friedman, 3-D topologies for networks-on-chip, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. (2007) 1081–1090.
- [15] F. Li, et al., Design and management of 3D chip multiprocessors using network-in-memory, in: Proc. Int. Symp. Comput. Archit., 2006, pp. 130–141.
- [16] H. Matsutani, et al., Tightly-coupled multi-layer topologies for 3D NoCs, in: ICPP, 2007.
- [17] D. Park, et al., MIRA: A multi-layered on-chip interconnect router architecture, in: ISCA, 2008, pp. 251–261.
- [18] T. Xu, et al., A study of through silicon via impact to 3D network-on-chip design, in: ICEIE, 2010, pp. 333–337.
- [19] Y. Wang, et al., Economizing TSV resources in 3D Network-on-chip design, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 23 (3) (2015) 493–506.
- [20] M. Bahmani, et al., A 3d-noc router implementation exploiting vertically-partially-connected topologies, in: ISVLSI 2012.
- [21] F. Dubois, et al., Elevator-first: A deadlock-free distributed routing algorithm for vertically partially connected 3d-nocs, IEEE Trans. Comput. 62 (3) (2013) 609–615.
- [22] Yuxiang Fu, et al., Congestion-aware dynamic elevator assignment for partially connected 3D-NoCs, in: ISCAS 2019.
- [23] F. Vahdatpanah, et al., 3DEP: A efficient routing algorithm to evenly distribute traffic over 3D network-on-chips, in: PDP 2019.
- [24] J. Lee, et al., Deflection routing in 3D Network-on-Chip with TSV serialization, in: ASP-DAC, 2013.
- [25] C. Feng, et al., A 1-cycle 1.25 GHz bufferless router for 3D network-on-chip, IEICE Trans. Inf. Syst. E95.D (5) (2012) 1519–1522.
- [26] K. Tatas, et al., 3DBUFFBLESS: A novel buffered-bufferless hybrid router for 3D networks-on-chip, in: PATMOS, 2017.
- [27] R.G. Kunthara, et al., DoLaR: Double layer routing for bufferless mesh network-on-chip, in: 2019 IEEE Region 10 Conference, TENCON, October 2019.
- [28] N. Jiang, et al., A detailed and flexible cycle-accurate network-on-chip simulator, in: ISPASS, 2013.
- [29] N. Binkert, et al., The gem5 simulator, SIGARCH Comput. Archit. News 39 (2) (2011) 1–7.
- [30] SPEC2006 CPU benchmark suite, http://www.spec.org.
- [31] W. Huang, et al., Compact thermal modeling for temperature-aware design, in: DAC, 2004.
- [32] A.B. Kahng, et al., ORION 2.0: A fast and accurate NoC power and area model for early-stage design space exploration, in: DATE, 2009.



**Rose George Kunthara** is a Research Scholar at Cochin University of Science & Technology (CUSAT), Kerala, India. She received her B.Tech degree in Electronics and Communication Engineering in 2009 from School of Engineering, CUSAT, followed by an M.E. in Embedded Systems from Birla Institute of Technology & Science (BITS) Pilani, India. Her areas of interest include Network on Chip architectures and low-power circuits. She is a member of ISTE and a student member of IEEE.



**Rekha K. James** is a Professor at Cochin University of Science & Technology, Kerala, India. She got her B.Tech degree in Electronics and Communication Engineering in 1989 from College of Engineering, Trivandrum (CET), University of Kerala, followed by M.Tech. in Digital Electronics and Ph.D. in Computer Engineering from Cochin University of Science and Technology (CUSAT), Kochi, India. Her research interests include the design of Multicore architectures, Decimal arithmetic, Reversible logic and Low-power circuits. She is a Member IEEE, ISTE (L) (India), and IETE (India).




Simi Zerine Sleeba is presently a faculty member at Rajagiri School of Engineering and Technology, Cochin, India. She received her B.Tech degree in Electronics & Communication Engineering from Mahatma Gandhi University, India in 1997. She did her Ph.D. and M.Tech in VLSI & Embedded Systems from Cochin University of Science and Technology (CUSAT), India in 2010 and 2018 respectively. Her research interests include on-chip interconnection network architectures and algorithms and low power MPSoC design. She is a member of IEEE.



John Jose is an Assistant Professor at the Department of CSE, Indian Institute of Technology Guwahati, India. He did his Ph.D. (CSE) from Indian Institute of Technology Madras, India in 2014. Previously he did his M.Tech from Vellore Institute of Technology, Tamil Nadu, India.

Nano Communication Networks 27 (2021) 100334