

Volume-11, Issue 11, November 2022 JOURNAL OF COMPUTING TECHNOLOGIES (JCT)

International Journal

Page Number: 01-06

## DESIGN AND IMPLEMENTATION OF FPGA FOR 32 BIT APPROXIMATE SQUARE ROOT APPLICATIONS

Ginni Rajpal, Prof. Shivraj Singh, Dr. Vikas Gupta <sup>1</sup>M.Tech Scholar, <sup>2</sup>Assistant Professor, <sup>3</sup>Professor & Head <sup>1,2,3</sup>Department of Electronics and Communication Engineering (*ECE*) <sup>1,2,3</sup>Technocrats Institute of Technology (*TIT*), Bhopal, (M.P.), India <sup>1</sup>mailstoashmit@gmail.com, <sup>2</sup>86.shivrai@gmail.com

Abstract—Although the sight of a square root symbol may intimidate the mathematically challenged, square root problems are not as hard to solve as they first seem. Simple square root problems can often be solved as easily as basic multiplication and division problems. More complex ones require more work, but with the right approach even these become manageable. This paper presents a VLSI architecture for a 32-bit approximate square root for DSP and FPGA applications. We propose a new non-restoring square root algorithm that requires neither square roots nor multiplexors, which makes it more efficient for VLSI implementation than previous square root algorithms. It generates the correct result value at each iteration and does not require extra circuitry for adjusting the result bit. The operation at each iteration is simple: an addition or a subtraction, selected by the result bit generated in the previous iteration. Simulation and synthesis are done in Xilinx ISE 14.7 with Verilog coding. The proposed design extends the existing work from a 16-bit to a 32-bit square root. It uses 245 components (2.66% utilization) compared with 536 for the previous design, has a delay of 3.01 ns compared with 3.69 ns, and reaches a frequency of 293 MHz compared with 107.50 MHz. The proposed square root design therefore performs better on the calculated parameters.

Keywords - Field Programmable Gate Arrays (FPGAs); Integrated Development Environment (IDE); Root-Mean-Square (RMS); Delta Sigma Modulator (DSM); Register Transfer Level (RTL), etc.

## I. INTRODUCTION

Variable precision floating point operations are widely used in many fields of computer and IoT engineering. Floating point arithmetic operations are included in most processing units. They include addition, subtraction, multiplication, division, reciprocal and square root. The Northeastern Reconfigurable Computing Lab has developed its own variable precision floating point library called VFLOAT, which is vendor agnostic, easy to use and offers a good tradeoff between hardware resources, maximum clock frequency and latency. Field Programmable Gate Arrays (FPGAs) are chosen as the platform for the VFLOAT library because of their flexibility, low power consumption and short development time compared to Application Specific ICs (ASICs). Very-high-speed integrated circuits Hardware Description Language (VHDL) is used to describe these components. Xilinx and Altera are the two main suppliers of programmable logic devices. Each company has its own Integrated Development Environment (IDE). The IDEs from both Altera and Xilinx have been used to implement this cross-platform project.

## FLOATING POINT REPRESENTATION

There are two main methods to represent a numeric value in a computer: fixed point format and floating point format. The difference is that the fixed point format has a radix point at a fixed location that separates the integer and fractional parts of the numeric value. The floating point format has three parts to represent the numeric value: the sign bit, the exponent and the mantissa. The advantage of the floating point format is that it can represent a wider range of values than the fixed point format with the same number of bits. The floating point format has been standardized and is widely used for scientific computation. Standardization minimizes anomalies and improves numerical quality. IEEE 754 is the technical standard for floating point computation, created by the Institute of Electrical and Electronics Engineers (IEEE) in 1985 and updated in 2008 [2]. In this standard, the arithmetic format of a floating point number is defined as follows. We define b as the base, which is either 2 or 10, s as the sign bit, e as the exponent and c as the mantissa. For our project, b = 2. The three basic binary floating point formats are single precision, double precision and quadruple precision.
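As a minimal illustration of the sign, exponent and mantissa fields described above, the following Python sketch unpacks a single precision (binary32) value. The helper names `decode_binary32` and `value_from_fields` are hypothetical and are not part of the VFLOAT library; the field widths (1, 8 and 23 bits) and the exponent bias of 127 are those of the IEEE 754 single precision format.

```python
import struct

def decode_binary32(x):
    """Split a number into IEEE 754 single precision fields (illustrative helper)."""
    # Pack as big-endian binary32, then reinterpret the same 4 bytes as an unsigned int.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                # 1 sign bit
    exponent = (bits >> 23) & 0xFF   # 8 exponent bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF       # 23 fraction bits (normal numbers have an implicit leading 1)
    return sign, exponent, mantissa

def value_from_fields(sign, exponent, mantissa):
    """Rebuild the value of a normal number: (-1)^s * 1.f * 2^(e - 127)."""
    return (-1) ** sign * (1 + mantissa / 2**23) * 2.0 ** (exponent - 127)
```

For instance, `decode_binary32(-2.5)` gives sign 1, biased exponent 128 and fraction field `0x200000`, since 2.5 = 1.25 × 2^1.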

## **OPERATION**

To convert a number from a fixed point type with scaling factor R to another type with scaling factor S, the underlying integer must be multiplied by R and divided by S; that is, multiplied by the ratio R/S. Thus, for example, to convert the value 1.23 = 123/100 from a type with scaling factor R = 1/100 to one with scaling factor S = 1/1000, the underlying integer 123 must be multiplied by (1/100)/(1/1000) = 10, yielding the representation 1230/1000. If S does not divide R (in particular, if the new scaling factor S is greater than the original R), the new integer will have to be rounded. The rounding rules and methods are usually part of the language's specification.

To add or subtract two values of the same fixed-point type, it is sufficient to add or subtract the underlying integers, and keep their common scaling factor. The result can be exactly represented in the same type, as long as no overflow occurs (i.e. provided that the sum of the two integers fits in the underlying integer type). If the numbers have different fixed-point types, with different scaling factors, then one of them must be converted to the other before the sum.
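The conversion and addition rules above can be sketched in a few lines of Python. The function names are hypothetical, and the rounding rule (Python's round-half-to-even) is chosen only for illustration, since the text leaves the rounding method to the language specification.

```python
from fractions import Fraction

def rescale(value, r, s):
    """Convert the underlying integer `value` from scaling factor r to scaling factor s."""
    exact = value * Fraction(r) / Fraction(s)  # multiply by the ratio R/S
    # Rounding only changes the result when S does not divide R.
    # ASSUMPTION: round-half-to-even, as provided by Python's round().
    return round(exact)

def add_same_scale(a, b):
    """Add two values of the same fixed-point type: add the underlying integers."""
    return a + b  # the common scaling factor is unchanged

# 1.23 stored as 123 with R = 1/100, converted to S = 1/1000
converted = rescale(123, Fraction(1, 100), Fraction(1, 1000))
```

Here `converted` is 1230, matching the 1230/1000 representation in the example above. Going the other way, `rescale(1234, Fraction(1, 1000), Fraction(1, 100))` has to round 123.4 to 123.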

## **II. LITERATURE REVIEW**

C. Popa et al. The main advantages of the proposed CMOS active resistor are its improved linearity, small area consumption and improved frequency response. An original technique for linearizing the I(V) characteristic of the active resistor is proposed, based on a new linear differential amplifier and on a current-pass circuit. The linearization of the original differential structure is achieved by compensating the quadratic characteristic of the MOS transistor operating in the saturation region with original square-root circuits. The errors introduced by second-order effects are evaluated, while the frequency response of the circuit is very good because all MOS transistors are biased in the saturation region and the square-root circuits operate in current mode. The active resistor is implemented in 0.35 µm CMOS technology, with the SPICE simulation confirming the theoretically estimated results and showing a linearity error under one percent for an extended input range (±400 mV) and a small supply voltage (±2.5 V).

**Tung-Chien et al.,[19]** On-chip spike detection and principal component analysis (PCA) sorting hardware in an integrated multi-channel neural recording system is highly desired to ease the bandwidth bottleneck from a high-density microelectrode array implanted in the cortex. This work proposes the first leading eigenvector generator, the key hardware module of PCA, to enable the whole framework. Based on an iterative eigenvector distilling algorithm, the proposed flipped structure enables a low-cost and low-power implementation by discarding the division and square root hardware units. Further, the proposed adaptive level shifting scheme optimizes the accuracy and area trade-off by dynamically increasing the quantization parameter according to the signal level. With a specification of four principal components per channel, 32 samples per spike, and nine bits per sample, the proposed hardware can train 312 channels per minute at a 1 MHz operating frequency. It requires 0.13 mm² of silicon area and 282 µW of power in a 90 nm 1P9M CMOS process.

F. Chen et al., [18] Low-k dielectrics, which are beneficial for chip resistance-capacitance (RC) delay improvement, crosstalk-noise minimization, and power-dissipation reduction, are indispensable for the continued scaling of advanced VLSI circuits, particularly high-performance logic circuits. This work investigates several critical challenges for Cu/low-k time-dependent dielectric-breakdown (TDDB) reliability qualification. First, a low-k TDDB field-acceleration model and its determination are discussed. Second, the macroscopic interconnect line-to-line spacing variation across the wafer, the microscopic line-to-line spacing nonuniformity induced by line-edge roughness within the same test structure, and their impacts on low-k TDDB reliability are carefully examined. The dependence of the Weibull shape parameter on applied stress voltage due to such global and local spacing variations is analyzed. Finally, the moisture effect on low-k TDDB and capacitance stability is reported as an example of the impact of process integration, demonstrating that low-k TDDB is sensitive to back-end-of-line integration.

N. Arya et al.,[1] Approximate computing is a driving force in the design of accuracy-trading, energy-configurable circuits for error-tolerant applications. This work presents a low-power, reduced-area square root (SQR) circuit that achieves remarkable area and energy efficiency while introducing insignificant errors in the results. Two approximate designs are proposed for a restoring array based square-root circuit for error-resilient applications. In the first design, approximate restoring subtractor cells replace the exact SQR subtractor cells by simplifying the Boolean expressions. The second design reduces the design complexity and increases energy efficiency by using the classic approximate computing technique of bit truncation. Both designs are implemented for 8- and 16-bit square-root circuits with different values of the approximation parameter 'd' to reach various design trade-off points. The results demonstrate that the proposed approximate SQR designs improve delay, power consumption, energy and area on average by 37%, 24%, 18%, and 44%, respectively, for 16-bit SQR designs implemented on a CMOS 45-nm technology node, without compromising much on accuracy. The proposed approximate designs are also tested on error-tolerant applications, including contrast enhancement for medical images and envelope detection in AM (Amplitude Modulation) communication systems.

R. Nayar et al.,[2] This work presents a new hardware optimized approximate adder that has practically zero average error and a normal, i.e., Gaussian, error distribution. The proposed approximate adder is called HOAANED, which stands for hardware optimized approximate adder with a normal error distribution. The authors considered the use of HOAANED for digital image processing alongside the accurate adder and many other approximate adders for a fair comparison. In particular, the additions involved in performing fast Fourier transform and inverse fast Fourier transform operations to reconstruct the images were performed using accurate and approximate adders separately. HOAANED was observed to improve the peak signal-to-noise ratio of the reconstructed images more than the other approximate adders. Further, HOAANED has optimized design metrics. This observation is based on physical implementation using a 32/28nm CMOS standard digital cell library. Moreover, based on the error analysis of many 32-bit approximate adders using one million random input vectors, HOAANED was observed to have practically zero average error, an improved root mean square error and a normal error distribution.

Y. Fu et al.,[3] This article presents a 32-GHz frequency-modulated continuous-wave (FMCW) modulator based on a phase-locked loop (PLL) with a nested sub-PLL structure in a 65-nm CMOS process. With the sub-PLL, a low-pass effect in the phase domain is realized, reducing the noise folding effect, quantization noise, and spurs due to the delta sigma modulator (DSM). To achieve good stability and phase noise performance, both the phase domain model and the phase noise model are analyzed and simulated. Based on these models, the chirp linearity is discussed and simulated, which helps to determine the design parameters and verifies the linearity improvement. The measurement results show that in fractional-N mode, the nested PLL achieves a phase noise of -91 dBc/Hz at 1-MHz offset frequency and fractional spurs of less than -54 dBc at 30.78-GHz output frequency. In FMCW mode, the proposed modulator achieves a triangular chirp with 1.08-2.16-GHz bandwidth at around 32-GHz center frequency. In addition, measured root mean square (rms) frequency errors of 400 and 770 kHz are achieved with ramp slopes of 1.08 GHz/93 µs and 2.16 GHz/93 µs, respectively.

T. Fujibayashi et al.,[4] A precisely phase-controlled transmitter operating at 76 to 81 GHz for automotive radar applications is presented. To achieve precise phase control, a novel phase detector using third-order distortion is used to compensate the transmitter phase error. The multi-channel transmitter using this detector achieves less than 0.6° root-mean-square (RMS) phase error in the 76- to 81-GHz frequency range. Since the proposed phase detector does not rely on the other TX channels, it is easy to increase the number of channels. The proposed transmitter is implemented in 65-nm CMOS technology. The phase detector consumes 1.8 mW per channel.

## III. PROBLEM IDENTIFICATION & OBJECTIVE

A variety of computer arithmetic techniques can be used to implement a square root. Most techniques involve computing a set of partial products and then summing them. In existing systems, computing the square root takes three steps, similar to the reciprocal computation: reduction, evaluation, and postprocessing. In the design of most commercial RISC processors, a square root unit is used for all iterations of div or sqrt instructions.

Existing square root designs still have some limitations, which lead to the following problem formulation:

- Circuits based on the existing technique are complex and generate a large number of carry adders.
- The square root requires a rather large gate count.
- It is impractical to place as many square root units as would be required to realize fully pipelined operation for division and square root instructions.
- The existing circuits have higher latency and consume more power during operation.

## IV. PROPOSED METHODOLOGY

The proposed system uses a new non-restoring square root algorithm that requires neither square roots nor multiplexors. Compared with previous square root algorithms, the proposed algorithm is very efficient for VLSI implementation, which makes it suitable for digital signal processing and field programmable gate array applications.
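The paper does not list its Verilog source, so the following Python sketch is only an assumption about how a non-restoring integer square root of this kind can work. It follows the classic digit-by-digit formulation: each iteration brings down two input bits and performs a single addition or subtraction, chosen by the sign of the previous partial remainder, and a negative remainder is carried forward without being restored. The function name `nonrestoring_sqrt` and the parameter `width` are hypothetical.

```python
def nonrestoring_sqrt(d, width=32):
    """Return (root, remainder) of an unsigned `width`-bit integer (illustrative sketch)."""
    q = 0  # root, built up one bit per iteration
    r = 0  # signed partial remainder; it is allowed to go negative
    for i in range(width // 2 - 1, -1, -1):
        pair = (d >> (2 * i)) & 0b11          # next two bits of the input
        if r >= 0:
            r = r * 4 + pair - (q * 4 + 1)    # previous remainder non-negative: subtract
        else:
            r = r * 4 + pair + (q * 4 + 3)    # previous remainder negative: add
        # The sign of the new remainder gives the next root bit.
        q = q * 2 + (1 if r >= 0 else 0)
    if r < 0:
        r += q * 2 + 1                        # one final addition yields the precise remainder
    return q, r
```

For a 32-bit input the loop runs 16 times and the root fits in 16 bits. As a usage example, `nonrestoring_sqrt(16)` returns `(4, 0)` and `nonrestoring_sqrt(17)` returns `(4, 1)`: for an input that is not a perfect square, this sketch returns the integer part of the root together with the remainder.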

Steps:

- Assign the input number in binary, decimal or hexadecimal form.
- Run the implementation process, which generates the register transfer level (RTL) and technological views.
- Simulate the design with a test bench in Xilinx and run the simulation. The square root output is generated.



Figure 4.1: Flow Chart

If the input is a perfect square, the square root output is a fixed number and is exact. If the input is not a perfect square, so that its true root has a fractional part, the square root output is the nearest fixed number.

## V. RESULT AND SIMULATION

## SIMULATION SOFTWARE

The ISE Design Suite is the Xilinx® design environment, which allows you to take your design from design entry to Xilinx device programming. With specific editions for logic, embedded processor, or Digital Signal Processing (DSP) system designers, the ISE Design Suite provides an environment tailored to meet your specific design needs. Xilinx ISE (Integrated Software Environment) is a software tool produced by Xilinx for synthesis and analysis of HDL designs. It enables developers to synthesize their designs, perform timing analysis, examine RTL diagrams, simulate a design's reaction to different stimuli, and configure the target device with the programmer.



Figure 5.1: Snapshot of Xilinx ISE Design Suite: Logic Edition

The ISE Design Suite: Logic Edition allows you to go from design entry, through implementation and verification, to device programming from within the unified environment of the ISE Project Navigator or from the command line. This edition includes exclusive tools and technologies to help achieve optimal design results, including the following:

**PlanAhead software** - allows you to do advanced FPGA floor planning. The PlanAhead software includes PinAhead, an environment designed to help you import or create the initial I/O Port list, group the related ports into separate folders called "Interfaces" and assign them to package pins. PinAhead supports fully automatic pin placement or semi-automated interactive modes to allow controlled I/O Port assignment. With early, intelligent decisions in FPGA I/O assignments, you can more easily optimize the connectivity between the PCB and FPGA.

## SIMULATION RESULTS AND DISCUSSION



Figure 5.2: RTL View of Top module

Figure 5.2 presents the top level view of the proposed square root VLSI implementation. P is the 32-bit input and U is the 16-bit output.



Figure 5.3: Complete RTL View

Figure 5.3 presents the complete register transfer level (RTL) view of the proposed square root circuit.



Figure 5.4: Technological View

Figure 5.4 shows the technological view of the 32-bit square root. Red indicates the wires and green indicates the logic gates.



Figure 5.5: Look up table 4

Figure 5.5 shows the lookup table LUT4\_AA48. The chip uses lookup tables to carry out its logic functions: the inputs select one entry of a stored truth table, so any combinational logic function of the inputs can be implemented in a lookup table.
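To make the lookup-table idea concrete, the Python sketch below models a 4-input LUT as a 16-bit truth table in which the four inputs form the index of the output bit. It assumes that the name LUT4\_AA48 denotes a 4-input LUT whose 16-bit initialization value is 0xAA48, with `i0` as the least significant index bit; the paper does not state this, and the function names are hypothetical.

```python
def lut4(init, i3, i2, i1, i0):
    """Evaluate a 4-input LUT: the inputs select one bit of the 16-bit init value."""
    index = (i3 << 3) | (i2 << 2) | (i1 << 1) | i0
    return (init >> index) & 1

def init_from_function(f):
    """Build the 16-bit init value that makes a LUT4 compute the Boolean function f."""
    init = 0
    for index in range(16):
        bits = [(index >> k) & 1 for k in (3, 2, 1, 0)]  # i3, i2, i1, i0
        if f(*bits):
            init |= 1 << index
    return init

# ASSUMPTION: 0xAA48 is the truth table encoded in the name LUT4_AA48.
truth_table = [lut4(0xAA48, (n >> 3) & 1, (n >> 2) & 1, (n >> 1) & 1, n & 1)
               for n in range(16)]
```

Under this assumption the table outputs 1 for input indices 3, 6, 9, 11, 13 and 15. In the other direction, `init_from_function(lambda a, b, c, d: a & b & c & d)` yields `0x8000`, the table of a 4-input AND.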

Device Utilization Summary (estimated values)

| Logic Utilization                 | Used | Available | Utilization |
|-----------------------------------|------|-----------|-------------|
| Number of Slice LUTs              | 197  | 204000    | 0%          |
| Number of fully used LUT-FF pairs | 0    | 197       | 0%          |
| Number of bonded IOBs             | 48   | 600       | 8%          |

Figure 5.6: Component or area utilization

Table 5.1: Simulation Results

| Sr No | Parameter           | Value         |
|-------|---------------------|---------------|
| 1     | Area                | 245 or 2.66 % |
| 2     | Total Delay         | 34.21 ns      |
| 3     | Logic delay         | 3.01 ns       |
| 4     | Power               | 0.45 mW       |
| 5     | Power Delay Product | 1539          |
| 6     | Frequency           | 293 MHz       |
| 7     | Throughput          | 9.3 GHz       |

Table 5.1 shows the simulation parameters obtained while running the design in Xilinx. The total utilization, or area, of the VLSI architecture is 245 components or 2.66%. The total delay, or latency, is 34.21 ns. The frequency is 293 MHz and the overall throughput is 9.3 GHz.
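Table 5.1 does not state how the power delay product and the throughput were derived, or the unit of the former. One reading that reproduces the reported digits, offered here only as an assumption, is that the power delay product is the power multiplied by the total delay, and that the throughput is the clock frequency multiplied by the 32-bit input width. The function names below are hypothetical.

```python
def power_delay_product(power_w, delay_s):
    """Energy per operation in joules: power multiplied by delay."""
    return power_w * delay_s

def throughput_bits_per_s(freq_hz, bits_per_cycle):
    """Bits processed per second, ASSUMING one operand is accepted every clock cycle."""
    return freq_hz * bits_per_cycle

# Values taken from Table 5.1
pdp_j = power_delay_product(0.45e-3, 34.21e-9)    # 0.45 mW and 34.21 ns
throughput = throughput_bits_per_s(293e6, 32)     # 293 MHz and a 32-bit input
```

With these inputs `pdp_j` is about 1.539e-11 J (15.39 pJ), whose leading digits match the 1539 in the table, and `throughput` is about 9.38e9 bits per second, close to the reported 9.3.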

Table 5.2: Result Comparison

| Sr No. | Parameters | Previous Result [1] | Proposed Result    |
|--------|------------|---------------------|--------------------|
| 1      | Order      | 16 bit square root  | 32 bit square root |
| 2      | Area       | 536                 | 245 or 2.66 %      |
| 3      | Delay      | 3.69 ns             | 3.01 ns            |
| 4      | Power      | 0.87 mW             | 0.45 mW            |
| 5      | Frequency  | 107.50 MHz          | 293 MHz            |

The proposed square root VLSI design therefore performs better on the calculated parameters. It extends the existing work: the proposed square root is implemented for 32 bits, while the previous one was designed for 16 bits. The total area is 245 components or 2.66%, compared with 536 previously. The delay is 3.01 ns, compared with 3.69 ns previously. The frequency is 293 MHz for the proposed design and 107.50 MHz for the previous one.
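The relative changes implied by Table 5.2 can be computed directly from the tabulated values. The short Python sketch below does only that arithmetic; the helper name `percent_reduction` is hypothetical.

```python
def percent_reduction(previous, proposed):
    """Reduction of `proposed` relative to `previous`, in percent."""
    return (previous - proposed) / previous * 100.0

# Values from Table 5.2 (previous result [1] versus proposed result)
area_reduction = percent_reduction(536, 245)      # component count
delay_reduction = percent_reduction(3.69, 3.01)   # ns
power_reduction = percent_reduction(0.87, 0.45)   # mW
frequency_ratio = 293 / 107.50                    # MHz / MHz
```

This gives reductions of roughly 54.3% in area, 18.4% in delay and 48.3% in power, and a frequency about 2.73 times higher than the previous result.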

## VI. CONCLUSION AND FUTURE WORK

Variable precision fixed and floating point operations have many fields of application, including scientific computing and signal processing. Field Programmable Gate Arrays (FPGAs) are a good platform to accelerate such applications because of their flexibility, their low development time and cost compared to Application Specific Integrated Circuits (ASICs), and their low power consumption compared to Graphics Processing Units (GPUs). Among those operations, square root implementations differ with the algorithm used, and the choice can strongly affect the overall performance of the application running them. This research proposes a new non-restoring square root algorithm that requires neither square roots nor multiplexors. Compared with previous square root algorithms, our algorithm is very efficient for VLSI implementation. It generates the correct result value at each iteration and does not require extra circuitry for adjusting the result bit. The operation at each iteration is simple: an addition or a subtraction, selected by the result bit generated in the previous iteration. The remainder of the addition or subtraction is fed via registers directly to the next iteration, even if it is negative. At the last iteration, if the remainder is non-negative, it is a precise remainder. Otherwise, a precise remainder can be obtained by one addition operation.

### **Future Work**

Future work includes performance analysis through other new approaches, which would allow more parameters to be calculated, and experimental testing in a real-time environment. Multi-error detection and correction implemented after the square root could be used in real-time IoT-based wireless sensor network applications.

## REFERENCES

- [1] N. Arya, T. Soni, M. Pattanaik and G. K. Sharma, "Area and Energy Efficient Approximate Square Rooters for Error Resilient Applications," 2020 33rd International Conference on VLSI Design and 2020 19th International Conference on Embedded Systems (VLSID), 2020, pp. 90-95, doi:10.1109/VLSID49098.2020.00033.
- [2] R. Nayar, P. Balasubramanian and D. L. Maskell, "Hardware Optimized Approximate Adder with Normal Error Distribution," 2020 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2020, pp. 84-89, doi: 10.1109/ISVLSI49217.2020.00025.
- [3] Y. Fu, L. Li, Y. Liao, X. Wang, Y. Shi and D. Wang, "A 32-GHz Nested-PLL-Based FMCW Modulator With 2.16-GHz Bandwidth in a 65-nm CMOS Process," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 28, no. 7, pp. 1600-1609, July 2020, doi: 10.1109/TVLSI.2020.2992123.
- [4] T. Fujibayashi and Y. Takeda, "A 76- to 81-GHz, 0.6° rms Phase Error Multi-channel Transmitter with a Novel Phase Detector and Compensation Technique," 2019 Symposium on VLSI Circuits, 2019, pp. C16-C17, doi: 10.23919/VLSIC.2019.8778158.
- [5] S. U. Rehman, M. M. Khafaji, C. Carta and F. Ellinger, "A 10-Gb/s 20-ps Delay-Range Digitally Controlled Differential Delay Element in 45-nm SOI CMOS," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 27, no. 5, pp. 1233-1237, May 2019, doi: 10.1109/TVLSI.2019.2894736.
- [6] S. Yang, J. Yin, P. Mak and R. P. Martins, "A 0.0056mm2 –249-dB-FoM All-Digital MDLL Using a Block-Sharing Offset-Free Frequency-Tracking Loop and Dual Multiplexed-Ring VCOs," in IEEE Journal of Solid-State Circuits, vol. 54, no. 1, pp. 88-98, Jan. 2019, doi: 10.1109/JSSC.2018.2870551.
- [7] J.-H. Hsieh, K.-C. Hung, Y.-L. Lin and M.-J. Shih, "A Speed- and Power-Efficient SPIHT Design for Wearable Quality-On-Demand ECG Applications," in IEEE Journal of Biomedical and Health Informatics, vol. 22, no. 5, pp. 1456-1465, Sept. 2018, doi: 10.1109/JBHI.2017.2773097.
- [8] H. Fuketa, S.-i. O'uchi and T. Matsukawa, "A Closed-Form Expression for Minimum Operating Voltage of CMOS D Flip-Flop," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 7, pp. 2007-2016, July 2017, doi: 10.1109/TVLSI.2017.267797.