**RESEARCH ARTICLE** 

OPEN ACCESS

# **Energy Efficient Bit Extension Type Accelerator Chip for Detection Algorithms**

## Aparna.M.P, Punitha.V

<sup>1</sup>M.Tech student, <sup>2</sup>Assistant Professor, Electronics and Communication Department, AWH Engineering College

## ABSTRACT

This paper presents an energy-efficient bit extension type accelerator chip that targets the decoding tasks of multiple-input multiple-output (MIMO) orthogonal frequency-division multiplexing (OFDM) systems. The work is motivated by the adoption of MIMO and OFDM by almost all existing and emerging high-speed wireless data communication systems. MIMO is an antenna technology for wireless communications in which multiple antennas are used at both the source (transmitter) and the destination (receiver). The MIMO decoder, sometimes called the MIMO equalizer, recovers the signals transmitted from the multiple antennas. Implementing the MIMO decoding process for a particular application is hard and time consuming, which motivates a programmable accelerator block that makes the MIMO decoder task fast and easy to implement. This paper proposes a new pipeline architecture for the arithmetic units inside the processing core of the accelerator chip. The pipeline structure allows the proposed architecture to operate at a higher frequency, and a new arithmetic rotation unit, used in place of the native CORDIC algorithm, speeds up the rotation unit. The proposed architecture also helps to reduce dynamic power consumption. The accelerator is an ideal solution for today's smartphones that implement multiple MIMO-OFDM waveforms on the same platform.

*Index Terms*: MIMO, OFDM, CORDIC algorithm

## I. INTRODUCTION

Orthogonal frequency-division multiplexing (OFDM) was adopted in wireless communication mainly for its high spectral efficiency and its ability to deal with frequency-selective fading and narrowband interference. The requirement for wide bandwidth and flexibility imposes the use of efficient transmission methods, especially in the wireless environment, where the channel is very challenging. In a wireless environment the signal propagates from the transmitter to the receiver along a number of different paths, collectively referred to as multipath. While propagating, the signal power drops off due to three effects: path loss, macroscopic fading and microscopic fading. Fading of the signal can be mitigated by different diversity techniques. To obtain diversity, the signal is transmitted through multiple (ideally) independent fading paths, e.g. in time, frequency or space, and combined constructively at the receiver. Multiple-input multiple-output (MIMO) exploits spatial diversity by having several transmit and receive antennas. However, "MIMO principles" assume frequency-flat fading MIMO channels. OFDM is a modulation method known for its capability to mitigate multipath. In OFDM the high-speed data stream is divided into Nc narrowband data streams, Nc corresponding to the number of subcarriers or subchannels, i.e. one OFDM symbol consists of N modulated symbols.

In a MIMO system, multiple antennas are employed at the transmitter and the receiver. MIMO transmits and receives two or more data streams through a single radio channel, so the system can deliver two or more times the data rate per channel without additional bandwidth or transmit power. In a wireless environment the signal propagates from the transmitter to the receiver along a number of different paths, collectively referred to as multipath. While propagating, the signal power drops due to path loss and fading. Fading of the signal can be mitigated by different diversity techniques. To obtain diversity, the signal is transmitted through ideally independent fading paths, for example in time, frequency or space, and combined constructively at the receiver.



Fig 1 MIMO antenna system

MIMO operation requires parallel processing of multiple data streams at the transmitter and, more importantly, at the receiver, where the MIMO decoder is notorious for being one of the most processing-intensive blocks. A MIMO decoder is the receiver component that separates the Nss transmitted data streams from the signals received on the Nrx receive antennas. The MIMO decoding operation is matrix and vector intensive. A decoder design typically uses a single MIMO decoding algorithm such as zero forcing (ZF), minimum mean square error (MMSE), maximum likelihood (ML), or one of the many sphere decoding (SD) variants: a system designer chooses the algorithm [1], and a hardware engineer then implements it under constraints on complexity, performance, and power consumption, considering the parallel processing requirements of OFDM operation. This design cycle is typically repeated for every new communication standard. With new wireless communication standards and new MIMO decoding algorithms emerging every few years, existing systems need to be redesigned and upgraded not only to meet the newly defined standards, but also to allow integration of multiple standards onto the same platform and to improve performance via more advanced decoding algorithms.

A multimode MIMO decoder can be designed to target multiple communication standards. One approach is a programmable MIMO decoder built on a software-driven processor that employs several general-purpose floating-point processing units. The MIMO decoder, sometimes called the MIMO equalizer, recovers the signals transmitted from the multiple antennas. Implementing the MIMO decoding process for a particular application is hard and time consuming, which motivates a programmable accelerator block that makes the MIMO decoder task fast and easy to implement.

The MIMO accelerator consists of a processing core, data memory, instruction memory, core-input switch, controller, phase memory and memory input switch. The processing core is the main element of the accelerator and consists of adder, multiplier, rotation, and reciprocal units; CORDIC is used for rotation. The existing method uses a 32-bit processing core: whether an operation is 8, 16 or 32 bits wide, 32-bit data is given to the core and a 32-bit operation is performed, so more dynamic power is spent than the operation requires. The four processing units are the most power-hungry blocks in the accelerator because they are constantly performing complex matrix operations. Moreover, for any instruction all four units see, and effectively process, both operands even though the result of only one of the processing units is needed. This results in unnecessary dynamic power consumption. To avoid this unwanted power dissipation, additional hardware was introduced.
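For context, the link between redundant switching and dynamic power can be stated with the standard first-order CMOS model. This is a textbook relation and is not taken from the paper; the symbols below are introduced here only for illustration:

$$P_{\text{dyn}} = \alpha \, C_L \, V_{DD}^{2} \, f$$

Here $\alpha$ is the switching activity factor, $C_L$ the switched capacitance, $V_{DD}$ the supply voltage and $f$ the clock frequency. A unit that processes operands whose result is then discarded still toggles its nodes, so it contributes to $\alpha$ and $C_L$ without doing useful work; disabling such units or unused bit lanes lowers both terms.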

## II. PROPOSED BIT EXTENSION TYPE ACCELERATOR

This paper proposes a new pipeline structure for the arithmetic units inside the processing core of the MIMO accelerator architecture. The pipeline structure allows the proposed architecture to operate at a higher frequency, and a new arithmetic rotation unit, used in place of the native CORDIC algorithm, speeds up the rotation unit. In the proposed method the 32-bit processing core is divided into 8-bit lanes, each of which performs 8-bit operations. The core can carry out 8-, 16- or 32-bit operations, or a mix of these, according to the instruction selection. Depending on the selection, it disables the lanes that are not needed and enables only the part required for the operation. This reduces the dynamic power, which is a major source of power consumption. The CORDIC algorithm was originally designed to perform vector rotation. It is an iterative algorithm that can be used to compute trigonometric functions, multiplication and division, and it performs complex phase rotations that rotate a vector by a given angle. A new arithmetic rotation unit is therefore proposed in place of CORDIC, which improves the speed of the rotation unit and makes it easy to rotate a vector. The modified architecture is shown below.



Fig 1. Proposed bit extension accelerator



Fig 2 Elaboration of one processing core

This architecture uses a pipeline structure in the processing core unit. The processing core unit consists of a multiplier unit, rotation unit, reciprocal unit, and adder unit, all arranged in a pipelined manner. An extension unit is used in the control unit. For example, a 32-bit processing core is divided equally into four 8-bit processing cores connected in a pipeline structure. According to the incoming instruction bits, the extension unit decides which of the processing cores will operate. With this method, all four operations can be carried out at a time. The processing steps are described below.

An instruction is first fetched from the instruction memory. The control unit then decodes the incoming instruction: it converts the input instruction into control signals, which are sent to the processor, and the processor in turn tells the attached hardware what operations to carry out. The instruction selection from the instruction memory decides which arithmetic unit is needed for the operation; the other units are kept disabled, which saves power. The control unit also supplies address locations to the data memory and the instruction memory. The core-input switch is a two-level multiplexing circuit that selects and properly arranges the complex vectors needed by the processing core, and the memory-output switch takes the outputs of the processing units, packages them and writes all data into the appropriate memory locations. The four units are described below.

1) Adder Unit: The four lanes (lane0, lane1, lane2, lane3) each contain an adder. In 8-bit addition, the adder inside every lane operates. In 16-bit addition, two lanes (lane0 and lane2) are considered at a time: the adder inside lane0 operates first, and its carry is then propagated to the second adder inside lane1 to produce the output. In 32-bit addition, the carry of each adder propagates to the adder in the next lane. The instruction enable bits are in the order (add\_en, mul\_en, rot, rec). When the instruction enable is 1000, add enable is high and the adder operates. The input to the adder comes from the data memory and is a complex number with a real and an imaginary part, in the order Re\_acc0 & Im\_acc0 for operand 1 and Re\_acc1 & Im\_acc1 for operand 2. The instruction selection also comes from the instruction memory: 00 selects 8-bit addition, 01 selects 16-bit addition and 10 selects 32-bit addition; 11 is an invalid case.
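The lane-based addition described above can be illustrated with a small behavioural model. This is only a sketch under stated assumptions, not the authors' hardware description: the function name `lane_add` is hypothetical, lane0 is assumed to hold the least significant byte, and the carry out of the last lane of a group is assumed to be discarded. In the accelerator the same operation would apply separately to the real and imaginary parts of the complex operands.

```python
LANE_BITS = 8   # each lane is 8 bits wide (from the text)
NUM_LANES = 4   # a 32-bit core split into four lanes (from the text)

# Instruction selection -> number of lanes chained per addition.
# 00 = 8-bit, 01 = 16-bit, 10 = 32-bit; 11 is invalid (from the text).
LANES_PER_GROUP = {0b00: 1, 0b01: 2, 0b10: 4}


def lane_add(a, b, sel):
    """Add two 32-bit words as 4x8-bit, 2x16-bit or 1x32-bit values.

    Assumption: lane0 is the least significant byte, and the carry out
    of the last lane in a group is dropped (wrap-around arithmetic).
    """
    if sel not in LANES_PER_GROUP:
        raise ValueError("instruction selection 11 is invalid")
    group = LANES_PER_GROUP[sel]
    mask = (1 << LANE_BITS) - 1
    result = 0
    carry = 0
    for lane in range(NUM_LANES):
        if lane % group == 0:
            carry = 0  # a new group starts here, so the carry chain is cut
        shift = lane * LANE_BITS
        total = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        result |= (total & mask) << shift
        carry = total >> LANE_BITS  # passed on to the next lane in the group
    return result
```

With the same operands, the selection alone decides how far a carry travels: in 8-bit mode an overflow in lane0 never reaches lane1, while in 16-bit and 32-bit modes it does.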



Fig 3 Flow chart of adder

2) Multiplier and Reciprocal Unit: Multiplication is done by dot product.



Fig 4 flow chart of multiplier and reciprocal unit

The input to the multiplier comes from the data memory.

When the instruction enable is 0101, multiplication and reciprocal are carried out, and the mul enable and rec signals are correspondingly held high. Bit selection is the same as for the adder.

3) Rotation Unit: When the instruction enable is 0010, only rotation is carried out, and the rotation enable is held high. Bit selection is the same as in the cases above. First, the real and imaginary parts of the input are read from memory. In 8-bit rotation, the 8-bit value of the input is rotated, meaning that the LSB is rotated to the MSB. In 16-bit rotation, the LSB of the input of lane1 is moved to the MSB of lane0. In 32-bit rotation, LSB bit of the input of lane0.
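The LSB-to-MSB rotation described for the 8-bit and 16-bit cases can be sketched as a one-bit rotate right. The function name `rotate_lsb_to_msb` is hypothetical. Treating the 16-bit case as a single word with lane0 as the upper byte is an assumption, and the 32-bit behaviour is assumed to follow the same pattern because the description of that case in the text is incomplete.

```python
def rotate_lsb_to_msb(value, width):
    """Rotate right by one bit within `width` bits: the LSB becomes the MSB.

    Assumption: the 32-bit mode behaves like the 8- and 16-bit modes;
    the source text does not fully describe it.
    """
    if width not in (8, 16, 32):
        raise ValueError("width must be 8, 16 or 32")
    mask = (1 << width) - 1
    value &= mask                      # keep only the bits of this mode
    lsb = value & 1                    # bit that wraps around
    return (value >> 1) | (lsb << (width - 1))
```

In this model the real and imaginary parts would each be passed through the function separately.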



Fig 5 Flow chart of rotation
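The instruction-enable and bit-selection encodings given for the adder, multiplier/reciprocal and rotation units can be summarised in a small decoder model. This is a sketch only: the names `Decoded`, `decode`, `rot_en` and `rec_en` are hypothetical, and the 4-bit enable word is assumed to be ordered (add\_en, mul\_en, rot, rec) from most to least significant bit, as the text lists it.

```python
from typing import NamedTuple


class Decoded(NamedTuple):
    add_en: bool
    mul_en: bool
    rot_en: bool
    rec_en: bool
    width: int  # operand width in bits


# Instruction selection -> operand width (11 is invalid, per the text).
WIDTHS = {0b00: 8, 0b01: 16, 0b10: 32}


def decode(instr_enable, instr_sel):
    """Split the 4-bit enable word into per-unit enables and pick the width."""
    if instr_sel not in WIDTHS:
        raise ValueError("instruction selection 11 is invalid")
    return Decoded(
        add_en=bool(instr_enable & 0b1000),
        mul_en=bool(instr_enable & 0b0100),
        rot_en=bool(instr_enable & 0b0010),
        rec_en=bool(instr_enable & 0b0001),
        width=WIDTHS[instr_sel],
    )
```

Units whose enable is low would stay idle for that instruction, which is the mechanism the text relies on to save dynamic power.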

## III. RESULT ANALYSIS

A comparison of the proposed method with the existing method shows improvements in power consumption, speed and throughput. In the proposed method the 32-bit processing core is divided equally into four 8-bit lanes, so it can carry out 8-, 16- and 32-bit operations, or a mix of these, according to the incoming instruction selection. Depending on the selection, it disables the lanes that are not needed and enables only the part required for the operation. This reduces the dynamic power, which is a major source of power consumption. The existing system, in contrast, uses only a 32-bit processor: even for an 8-bit operation, 32-bit data is given to the module and a 32-bit operation is performed, which costs a large amount of dynamic power. In addition, for any instruction the four units see, and effectively process, both operands even though the result of only one of the processing units is needed, which results in unnecessary power consumption. In the proposed method a pipelined architecture is used inside the ALU units and the lanes, which increases the speed of the operations, and an arithmetic rotation unit replaces the native CORDIC, which improves speed further. In the existing system, the arithmetic units perform operations according to the incoming instruction, and the next instruction is accepted only after the current operation is done. This is time consuming and delays the next instruction, so the speed of operation is reduced. Latency is the time interval between the arrival of an input and the corresponding output, whereas throughput is set by the time interval between the arrival of consecutive inputs. The latency of the proposed system is better than that of the existing system.

Table 1 shows that the MMSE (minimum mean square error) accelerator has a clock frequency of 166 MHz and an average power consumption of 300.9 mW. The SVD (singular value decomposition) accelerator likewise has a clock frequency of 166 MHz and an average power consumption of 300.9 mW. The proposed bit extension accelerator achieves a clock frequency of 924.556 MHz and an average power consumption of 42 mW. The proposed system therefore has lower dynamic power consumption and a higher clock frequency than the other systems, and hence a higher speed of operation.

| Name                      | Clock frequency | Average power consumption |
|---------------------------|-----------------|---------------------------|
| Bit extension accelerator | 924.556 MHz     | 42 mW                     |
| MMSE accelerator          | 166 MHz         | 300.9 mW                  |
| QRD accelerator           | 166 MHz         | 300.9 mW                  |

Table 1 Comparison with other ASIC designs
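The improvement factors implied by Table 1 can be worked out directly from its entries. The short sketch below only restates the table values and divides them; the dictionary `designs` and the function `compare` are hypothetical names used for illustration.

```python
# (clock frequency in MHz, average power in mW), copied from Table 1.
designs = {
    "Bit extension accelerator": (924.556, 42.0),
    "MMSE accelerator": (166.0, 300.9),
    "QRD accelerator": (166.0, 300.9),
}


def compare(proposed, baseline):
    """Return (clock frequency ratio, power reduction ratio) vs. a baseline."""
    f_prop, p_prop = designs[proposed]
    f_base, p_base = designs[baseline]
    return f_prop / f_base, p_base / p_prop


freq_ratio, power_ratio = compare("Bit extension accelerator", "MMSE accelerator")
print(f"clock: {freq_ratio:.2f}x higher, power: {power_ratio:.2f}x lower")
```

Against either baseline row, the table entries correspond to a clock frequency roughly 5.6 times higher and an average power roughly 7.2 times lower.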

www.ijera.com

## IV. CONCLUSION

This paper focuses on the design of a new accelerator chip that reduces power consumption and improves speed. The overall architecture consists of data memory, instruction memory and a processing core unit. The processing core unit consists of adder, rotation, multiplier and reciprocal units. The 32-bit core is divided equally into four 8-bit lanes: lane0, lane1, lane2 and lane3. Data arrives from the data memory as complex values with a real part and an imaginary part. The operation carried out inside the ALU is determined by the incoming instruction from the instruction memory. With this method, the parts not needed for an operation can be disabled, which reduces the dynamic power, a major source of power consumption. Pipelined arithmetic units are used inside the core, which improves the speed of the operations.

## REFERENCES

- [1]. M. I. A. Mohamed, K. Mohammed, and B. Daneshrad, "Energy efficient programmable MIMO decoder accelerator chip in 65-nm CMOS," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 2013.
- [2]. A. J. Paulraj, D. A. Gore, R. U. Nabar, and H. Bolcskei, "An overview of MIMO communications - A key to gigabit wireless," Proc. IEEE, vol. 92, no. 2, pp. 198–218, Feb. 2004.
- [3]. K. Mohammed and B. Daneshrad, "A MIMO decoder accelerator for next generation wireless communications," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 11, pp. 1544–1555, Nov. 2010.
- [4]. C. Studer, P. Blosch, P. Friedli, and A. Burg, "Matrix decomposition architecture for MIMO systems: Design and implementation tradeoffs," in Proc. Conf. Rec. 41st Asilomar Conf. ACSSC, Nov. 2007, pp. 1986–1990.
- [5]. G. L. Stüber, J. R. Barry, S. W. McLaughlin, Y. (G.) Li, M. A. Ingram, and T. G. Pratt, "Broadband MIMO-OFDM wireless communications," Proc. IEEE, vol. 92, no. 2, Feb. 2004.
- [6]. A. Omri and R. Bouallegue, "New transmission scheme for MIMO-OFDM system," International Journal of Next-Generation Networks (IJNGN), vol. 3, no. 1, Mar. 2011.
- [7]. S. Jang, S. Lee, and Y. Jung, "Low-complexity and low-power MIMO symbol detector for mobile devices with two TX/RX antennas," Journal of Semiconductor Technology and Science, vol. 15, no. 2, Apr. 2015.
- [8]. H. S. Kim, W. Zhu, J. Bhatia, K. Mohammed, A. Shah, and B. Daneshrad, "A practical, hardware friendly MMSE detector for MIMO-OFDM based systems," EURASIP J. Adv. Signal Process., vol. 2008, p. 94, Jan. 2008.
- [9]. C.-J. Huang, C.-W. Yu, and H.-P. Ma, "A power-efficient configurable low-complexity MIMO detector," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 56, no. 2, pp. 485–496, Feb. 2009.
- [10]. M. Ali, K. Mohammed, and B. Daneshrad, "MIMO accelerator: A design flow for a programmable MIMO decoder architecture," in Proc. Conf. Rec. 43rd Asilomar Conf. Signals, Syst. Comput., Nov. 2009, pp. 1292–1296.