# High Speed and Time Efficient 1-D DWT on Xilinx Virtex4 DWT Using 9/7 Filter Based NEDA Technique

Ambikesh Prasad Gupta, Prof. Shweta Singh IES College of Technology Bhopal M.P.

**Abstract**: In this paper, we describe an efficient implementation of the discrete wavelet transform (DWT) on a Xilinx Virtex4 device, using a 9/7 filter based on the new efficient distributed arithmetic (NEDA) technique. We demonstrate that NEDA is a very efficient architecture with adders as the main component and free of ROM, multiplication, and subtraction. The technique supports any size of image pixel value and any level of decomposition. The bit-parallel structure has 100% hardware utilization efficiency. Compared with existing multiplier-less structures, the proposed structures offer a significantly higher throughput rate and a lower area-delay product.

**Keywords:** Discrete Wavelet Transform (DWT), NEDA, Xilinx Simulation, Synopsys Simulation.

#### I. Introduction

In contrast to traditional transforms such as the Fast Fourier Transform (FFT), the Short Time Fourier Transform (STFT), and the Discrete Cosine Transform (DCT), the Discrete Wavelet Transform (DWT) is based on a multiresolution analysis framework and holds both time and frequency information.

The discrete wavelet transform has been used in many fields, such as image and signal processing, signal compression, interstellar data analysis, digital fingerprints, noise reduction, bioinformatics, and geophysics [1]. The well-known image coding standards MPEG-4 and JPEG 2000 have adopted the DWT as the transform coder due to its remarkable advantages over other transforms. For lossy compression, the Daubechies 9/7 biorthogonal filter is used as the default wavelet filter in JPEG 2000. Efficient implementation of the DWT using 9/7 filters in resource-constrained hand-held devices, with the capability for real-time processing of computation-intensive multimedia applications, is therefore a necessary challenge. The multiplier-less hardware implementation approach offers a solution to this problem because of its scope for lower hardware complexity and higher computational throughput.

Several parallel and pipelined systems that meet the computational requirements of the discrete wavelet transform have been proposed. Some of them need multiple processors, which makes the system complex, time consuming, and costly [1]. The field programmable gate array (FPGA) provides a new approach to digital signal processing [2].

Several designs have been proposed for multiplier-based and multiplier-less implementation of the 1-D DWT, based on the principles of multiplier-based design (MBD), distributed arithmetic (DA), and canonic signed digit (CSD) representation [1]–[3]. One such structure distributes the bits of the fixed coefficients instead of the bits of the input samples. Consequently, its adder complexity depends on the DA matrix of the fixed coefficients [2].

Canonic signed digit (CSD) representations are popular for representing a number with the fewest non-zero digits. The CSD representation of a number contains the minimum possible number of non-zero digits, hence the name canonic. The CSD representation of a number is unique, and CSD numbers cover the range $(-4/3, 4/3)$, out of which the values in the range $[-1, 1)$ are of greatest interest.
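The CSD property described above can be illustrated with a short Python sketch. It is not part of the proposed architecture; it assumes integer inputs (rather than the fractional range quoted above), and the function names `to_csd` and `from_csd` are hypothetical.

```python
def to_csd(n: int) -> list[int]:
    """Return the CSD digits of n, least significant first, each in {-1, 0, 1}."""
    digits = []
    while n != 0:
        if n % 2 == 0:
            d = 0
        else:
            # Choose +1 or -1 so that the remainder is divisible by 4;
            # this forces the next digit to be zero (no adjacent non-zeros).
            d = 2 - (n % 4)
        digits.append(d)
        n = (n - d) // 2
    return digits


def from_csd(digits: list[int]) -> int:
    """Rebuild the integer from its LSB-first CSD digits."""
    return sum(d * (1 << i) for i, d in enumerate(digits))


if __name__ == "__main__":
    for value in (55, -29, 7):
        digits = to_csd(value)
        print(value, digits, from_csd(digits))
```

For the scaled coefficient 55, plain binary (110111) has five non-zero bits, while the CSD form produced by this sketch has three non-zero digits, which is why CSD reduces the number of adders in a shift-and-add multiplier.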

Martina *et al.* [5] have approximated the 9/7 filter coefficients; the performance of a hardware implementation of the 9/7 filter bank depends on the accuracy of the coefficient representation. With that approach, they have significantly reduced the adder complexity of the 9/7 DWT. Gaurav *et al.* [7] have suggested an LUT-less DA-based design for the implementation of the 1-D DWT. They have eliminated the ROM cells required by DA-based structures at the cost of additional adders and multiplexers.

Some of these designs need ROM, which makes the system complex, time consuming, and costly [4]. The adder complexity of the LUT-less structure is significantly higher than that of the other multiplier-less structures. In this paper, we propose an efficient scheme to derive NEDA-based bit-parallel structures for low-hardware, high-speed computation of the DWT using 9/7 filters [4].

The remainder of the paper is organized as follows: mathematical formulation of NEDA-based computation of DWT using 9/7 filter is presented in Section II. The proposed structures are presented in Section III. Hardware and time complexity of the proposed structures are discussed and compared with the existing structures in Section IV. Conclusion is presented in Section V.

#### II. Mathematical Derivation of NEDA

Let us consider the following sum of products [4]:

$$G = \sum_{k=1}^{L} B_k \times C_k \tag{1}$$

where $B_k$ are the fixed coefficients and $C_k$ are the input data words. Equation (1) can also be written in the form of a matrix product as:

$$G = \begin{bmatrix} B_1 & B_2 & \dots & B_L \end{bmatrix} \begin{bmatrix} C_1 \\ C_2 \\ \vdots \\ C_L \end{bmatrix} \tag{2}$$

Both  $B_k$  and  $C_k$  are in two's complement format. The two's complement representation of  $B_k$  may be expressed as

$$B_{k} = -B_{k}^{M} 2^{M} + \sum_{i=N}^{M-1} B_{k}^{i} 2^{i} \tag{3}$$

where $B_k^i = 0$ or 1 for $i = N, N+1, \dots, M$; $B_k^M$ is the sign bit and $B_k^N$ is the least significant bit (LSB). Equation (3) can be expressed in matrix form as:

$$B_{k} = \begin{bmatrix} 2^{N} & 2^{N+1} & \dots & 2^{M} \end{bmatrix} \begin{bmatrix} B_{k}^{N} \\ B_{k}^{N+1} \\ \vdots \\ -B_{k}^{M} \end{bmatrix} \tag{4}$$
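As a small illustration of Equations (3) and (4), the following Python sketch splits a signed integer into its two's complement bits and rebuilds it using a negative weight for the sign bit. It assumes integer coefficients with $N = 0$ and $M = \text{width} - 1$; the helper names are hypothetical.

```python
def twos_complement_bits(value: int, width: int) -> list[int]:
    """Return the bits B^N ... B^M of value, LSB first (N = 0, M = width - 1)."""
    if not -(1 << (width - 1)) <= value < (1 << (width - 1)):
        raise ValueError("value does not fit in the requested width")
    unsigned = value & ((1 << width) - 1)
    return [(unsigned >> i) & 1 for i in range(width)]


def from_bits(bits: list[int]) -> int:
    """Evaluate Equation (3): the sign bit carries a negative weight."""
    *lower, sign = bits
    return -sign * (1 << len(lower)) + sum(b << i for i, b in enumerate(lower))


if __name__ == "__main__":
    bits = twos_complement_bits(-29, 6)
    print(bits, from_bits(bits))
```

With a width of 6, the value -29 yields the bit pattern 100011 (MSB first), and the weighted sum recovers -29.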

Similarly, $C_k$ can be represented in two's complement format as:

$$C_{k} = -C_{k}^{X} 2^{X} + \sum_{i=W}^{X-1} C_{k}^{i} 2^{i} \tag{5}$$

where $C_k^i = 0$ or 1 for $i = W, W+1, \dots, X$; $C_k^X$ is the sign bit and $C_k^W$ is the least significant bit (LSB). Combining Equations (1) and (3), we get:

$$G = -\left(G^{M} \cdot 2^{M}\right) + \sum_{i=N}^{M-1} \left(G^{i} \cdot 2^{i}\right) \tag{6}$$

where

$$G^{i} = \sum_{k=1}^{L} B_{k}^{i} C_{k}, \quad i = N, N+1, \dots, M$$
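The derivation above can be checked numerically. The following Python sketch evaluates Equation (6) using only additions of data words and power-of-two weights, and compares the result with the direct sum of products of Equation (1). It is a software illustration under stated assumptions, not the hardware design: integer coefficients with $N = 0$ and $M = \text{width} - 1$ are assumed, and the name `neda_inner_product` is hypothetical.

```python
def neda_inner_product(coeffs: list[int], data: list[int], width: int) -> int:
    """Compute sum(B_k * C_k) from the two's complement bits of the coefficients."""
    low, high = -(1 << (width - 1)), (1 << (width - 1)) - 1
    if any(not low <= b <= high for b in coeffs):
        raise ValueError("coefficient does not fit in the requested width")
    mask = (1 << width) - 1
    total = 0
    for i in range(width):
        # G^i: add the data words whose coefficient has bit i set.
        g_i = sum(c for b, c in zip(coeffs, data) if ((b & mask) >> i) & 1)
        # The sign-bit row (i = M) carries a negative weight, as in Equation (6).
        total += -(g_i << i) if i == width - 1 else (g_i << i)
    return total


if __name__ == "__main__":
    coeffs = [55, -29, -2, 4]   # example integer coefficients
    data = [1, 2, 3, 4]         # example data words
    direct = sum(b * c for b, c in zip(coeffs, data))
    print(neda_inner_product(coeffs, data, width=8), direct)
```

An 8-bit width is used here because the value 55 does not fit in a 6-bit two's complement word; this width is a choice of the sketch, not a parameter taken from the paper.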

#### III. Proposed Architecture

In this paper, we propose a multiplier-less architecture for the 9/7 wavelet filter using NEDA. The filter coefficients of the 9/7 wavelet filter are given in Table 1. For simplicity, the filter coefficients are multiplied by 100 and the integer part is kept. The calculation of the high-pass output is explained by an example.

| Coefficient | Value             | Multiplied by 100 | 6-bit binary representation (2's complement for negative numbers) |
|-------------|-------------------|-------------------|-------------------------------------------------------------------|
| h(0)        | 0.60294901823636  | 60                | 111100                                                            |
| h(1)        | 0.26686411844287  | 26                | 011010                                                            |
| h(2)        | -0.07822326652899 | -7                | 001001                                                            |
| h(3)        | -0.01686411844287 | -1                | 000011                                                            |
| h(4)        | 0.02674875741081  | 2                 | 000010                                                            |
| g(0)        | 0.5575435262285   | 55                | 110111                                                            |
| g(1)        | -0.29563588155713 | -29               | 100011                                                            |
| g(2)        | -0.02877176311425 | -2                | 000110                                                            |
| g(3)        | 0.04563588155713  | 4                 | 000100                                                            |

Table 1: Filter Coefficients of 9/7 Wavelet Filter

Here h(0), h(1), ..., h(4) are the low-pass filter coefficients and g(0), g(1), ..., g(3) are the high-pass filter coefficients.
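The integer column of Table 1 is consistent with multiplying each coefficient by 100 and truncating toward zero. The Python sketch below reproduces that column under this assumption; the coefficient values are rounded to six decimal places for brevity, and the dictionary name `COEFFS` is hypothetical.

```python
import math

# 9/7 filter coefficients from Table 1, rounded to six decimal places.
COEFFS = {
    "h(0)": 0.602949, "h(1)": 0.266864, "h(2)": -0.078223,
    "h(3)": -0.016864, "h(4)": 0.026749,
    "g(0)": 0.557544, "g(1)": -0.295636, "g(2)": -0.028772,
    "g(3)": 0.045636,
}

# Scale by 100 and truncate toward zero (assumed rounding rule).
scaled = {name: math.trunc(100 * value) for name, value in COEFFS.items()}

if __name__ == "__main__":
    for name, value in scaled.items():
        print(name, value)
```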

If we take the high-pass coefficients g(0), g(1), g(2), and g(3) and multiply them by r(1), r(2), r(3), and r(4), we get the high-pass output $Y_{H}$ of the 9/7 filter as [6]:

$$Y_{H} = \begin{bmatrix} g(0) & g(1) & g(2) & g(3) \end{bmatrix} \begin{bmatrix} r(1) \\ r(2) \\ r(3) \\ r(4) \end{bmatrix} \tag{7}$$

where r(1) = x(n) + x(n-6), r(2) = x(n-1) + x(n-5), r(3) = x(n-2) + x(n-4), and r(4) = x(n-3). Let r(1) = 1, r(2) = 2, r(3) = 3, and r(4) = 4; then

$$Y_{H} = \begin{bmatrix} 55 & -29 & -2 & 4 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} = 7 \tag{8}$$
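The folding of the seven input samples into four terms can be sketched in Python as follows. The sketch assumes that the first folded term is r(1) = x(n) + x(n-6), in line with the pattern of the other terms, and that the window is ordered from x(n) to x(n-6); the function names are hypothetical.

```python
def fold(window: list[int]) -> list[int]:
    """Fold [x(n), x(n-1), ..., x(n-6)] into [r(1), r(2), r(3), r(4)]."""
    if len(window) != 7:
        raise ValueError("expected a window of 7 samples")
    return [
        window[0] + window[6],  # r(1) = x(n)   + x(n-6)  (assumed)
        window[1] + window[5],  # r(2) = x(n-1) + x(n-5)
        window[2] + window[4],  # r(3) = x(n-2) + x(n-4)
        window[3],              # r(4) = x(n-3)
    ]


def high_pass_output(g: list[int], window: list[int]) -> int:
    """Evaluate Equation (7) on the folded samples."""
    return sum(gi * ri for gi, ri in zip(g, fold(window)))


if __name__ == "__main__":
    g = [55, -29, -2, 4]              # scaled high-pass coefficients
    window = [1, 1, 1, 4, 2, 1, 0]    # folds to r = [1, 2, 3, 4]
    print(fold(window), high_pass_output(g, window))
```

Because the taps are symmetric, folding first means four coefficient products are needed instead of seven, at the cost of three pre-additions.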

Now if we implement this with NEDA then

$$Y_{H} = \begin{bmatrix} 110111 & 100011 & 000110 & 000100 \end{bmatrix} \begin{bmatrix} r(1) \\ r(2) \\ r(3) \\ r(4) \end{bmatrix} \tag{9}$$



Figure 1: Proposed multiplier-less 9/7 wavelet filter using the NEDA technique

Now we can form the DA matrix from the filter coefficients, with one row per bit position starting from the LSB:

$$\begin{bmatrix} B_k \end{bmatrix} = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 \\ 1 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \end{bmatrix} \tag{10}$$

$$Y_H = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 \\ 1 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \end{bmatrix} \begin{bmatrix} r(1) \\ r(2) \\ r(3) \\ r(4) \end{bmatrix} = \begin{bmatrix} r(1) + r(2) \\ r(1) + r(2) + r(3) \\ r(1) + r(3) + r(4) \\ 0 \\ r(1) \\ r(1) + r(2) \end{bmatrix} \tag{11}$$
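The construction of the DA matrix can be sketched in Python as follows. The sketch takes the 6-bit patterns of g(0) to g(3) from Table 1, places bit $i$ of every coefficient in row $i$ (LSB in row 0), and forms the row-wise partial sums of Equation (11). The function names are hypothetical.

```python
# 6-bit patterns of g(0)..g(3) as listed in Table 1 (MSB first).
G_BITS = ["110111", "100011", "000110", "000100"]


def da_matrix(bit_strings: list[str]) -> list[list[int]]:
    """Row i holds bit i (LSB = row 0) of every coefficient."""
    width = len(bit_strings[0])
    return [[int(s[width - 1 - i]) for s in bit_strings] for i in range(width)]


def partial_sums(matrix: list[list[int]], r: list[int]) -> list[int]:
    """Each row adds the inputs selected by its 1 entries (adders only)."""
    return [sum(x for bit, x in zip(row, r) if bit) for row in matrix]


if __name__ == "__main__":
    matrix = da_matrix(G_BITS)
    for row in matrix:
        print(row)
    print(partial_sums(matrix, [1, 2, 3, 4]))
```

Rows 0 and 5 select the same inputs, r(1) + r(2), so in hardware that sum can be computed once and shared.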

The proposed multiplier-less 9/7 wavelet filter using the NEDA technique is shown in Figure 1. It uses five low-pass and four high-pass filter coefficients. The generation of the high-pass filter output with the NEDA technique is shown in Figure 2 (a).

Figure 2 (b) shows the mathematical calculation of the NEDA scheme for the high-pass wavelet filter output. In step 1, every input is sign-extended: a 0 is added as the sign-extension bit if the input is positive, and a 1 if the input is negative. In step 2, the sign-extended inputs are applied to the adder (butterfly) matrix of Equation (11). In step 3, $Y_P(0)$ is right-shifted by 1 bit and added to $Y_P(1)$, and so on, as shown in Figure 2 (b).
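Step 3 can be sketched in Python as a shift-and-accumulate loop over the partial sums. This is an illustration under assumptions rather than the circuit of Figure 2 (b): exact fractions stand in for the extra low-order bits a hardware accumulator would keep, 8-bit two's complement coefficients are used so that all example values fit, and the function names are hypothetical. The subtraction of the sign-bit row corresponds to adding its two's complement in an adder-only design.

```python
from fractions import Fraction


def partial_sums(coeffs: list[int], data: list[int], width: int) -> list[int]:
    """Y_P(i): sum of the data words whose coefficient has bit i set."""
    mask = (1 << width) - 1
    return [
        sum(x for c, x in zip(coeffs, data) if ((c & mask) >> i) & 1)
        for i in range(width)
    ]


def shift_accumulate(y_p: list[int]) -> int:
    """Right-shift the running sum by one bit, then add the next partial sum."""
    acc = Fraction(y_p[0])
    for y in y_p[1:-1]:
        acc = acc / 2 + y
    # The last row belongs to the sign bit and enters with a negative weight.
    acc = acc / 2 - y_p[-1]
    # Undo the accumulated right shifts to obtain the integer result.
    return int(acc * 2 ** (len(y_p) - 1))


if __name__ == "__main__":
    y_p = partial_sums([55, -29, -2, 4], [1, 2, 3, 4], width=8)
    print(y_p, shift_accumulate(y_p))
```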

#### IV. Simulation Results

The proposed architecture has very low hardware complexity compared to DA-based structures, because DA requires ROM whereas NEDA does not. In the proposed architecture, the high-pass and low-pass wavelet filter outputs are calculated using the NEDA scheme. Furthermore, only one type of operation, addition, takes place during the intermediate stages of computation, which greatly simplifies the hardware design. What needs special care is the sign output from the adder array, which is handled simply by taking the two's complement. In the above example, invert-and-add-1 is all one needs to convert "0011" to "1101" = $Y_{P4}$. The proposed structure consists of only 29 adders, no multiplexers, and 27 registers. A comparison with other architectures is given in Table 2.



Figure 2 (a): Generation calculation of the NEDA Technique of the High-pass Wavelet Filter Output



Figure 2 (b): Mathematical calculation of the NEDA Scheme of the High-pass Wavelet Filter Output

Table 2: Comparison of the proposed architecture with existing architectures (MUX: Multiplexer, REG: Register, CP: Cyclic Period)

| Architecture       | Adder | MUX | REG | CP              |
|--------------------|-------|-----|-----|-----------------|
| Alam et al. [2]    | 43    | 0   | 9   | 6 T<sub>A</sub> |
| Martina et al. [5] | 27    | 0   | 9   | 6 T<sub>A</sub> |
| Martina et al. [6] | 19    | 8   | 9   | 6 T<sub>A</sub> |
| Gaurav et al. [7]  | 15    | 40  | 9   | 6 T<sub>A</sub> |
| Proposed           | 29    | 0   | 27  | 6 T<sub>A</sub> |

Xilinx Simulation

The architecture of Gaurav et al. [7] and the proposed architecture have been described in VHDL, and their functionality has been verified by RTL and gate-level simulation. A comparison of the DA-based and NEDA-based architectures for the 9/7 wavelet filter is given in Table 3.

Table 3: Xilinx simulation of the DA-based and NEDA-based architectures for the 9/7 wavelet filter.

| Architecture      | Number of slices | Number of slice flip-flops | Number of 4-input LUTs | Maximum path delay (ns) |
|-------------------|------------------|----------------------------|------------------------|-------------------------|
| Gaurav et al. [7] | 189              | 49                         | 347                    | 21.245                  |
| Proposed          | 135              | 32                         | 225                    | 18.165                  |

Synopsys Simulation

To estimate the timing, area, and power for an ASIC design, we used Synopsys Design Compiler to synthesize the design to gate level. A comparison of the Synopsys results for the DA-based and NEDA-based architectures is given in Table 4.

Table 4: Comparison of the DA-based and NEDA-based architectures for the 9/7 wavelet filter (ADP: area-delay product).

| Architecture      | Required time (ns) | Power (µW) | Area (µm<sup>2</sup>) | ADP (µm<sup>2</sup>·ns) |
|-------------------|--------------------|------------|-----------------------|-------------------------|
| Gaurav et al. [7] | 20.50              | 78.471     | 10572.45              | 216735.23               |
| Proposed          | 19.80              | 43.314     | 9553.80               | 189165.24               |
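The ADP column of Table 4 is the product of the area and the required time. The Python sketch below recomputes it from the tabulated values and provides a helper for relative reductions; the names `RESULTS`, `adp`, and `reduction_percent` are hypothetical.

```python
# (required time in ns, power in uW, area in um^2), copied from Table 4.
RESULTS = {
    "Gaurav et al.": (20.50, 78.471, 10572.45),
    "Proposed": (19.80, 43.314, 9553.80),
}


def adp(time_ns: float, area_um2: float) -> float:
    """Area-delay product in um^2 * ns."""
    return time_ns * area_um2


def reduction_percent(old: float, new: float) -> float:
    """Relative reduction of new with respect to old, in percent."""
    return 100.0 * (old - new) / old


if __name__ == "__main__":
    for name, (time_ns, power_uw, area_um2) in RESULTS.items():
        print(f"{name}: ADP = {adp(time_ns, area_um2):.2f}")
    old, new = RESULTS["Gaurav et al."], RESULTS["Proposed"]
    for label, index in (("time", 0), ("power", 1), ("area", 2)):
        print(f"{label} reduction: {reduction_percent(old[index], new[index]):.1f}%")
```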

#### V. Conclusion

We propose a novel distributed arithmetic paradigm named NEDA for VLSI implementation of digital signal processing (DSP) algorithms involving inner product of vectors and vector-matrix multiplication.

Mathematical proof is given for the validity of the NEDA scheme. We demonstrate that NEDA is a very efficient architecture with adders as the main component and free of ROM (free memory), multiplication, and subtraction. For the adder array, a systematic approach is introduced to remove the potential redundancy so that minimum additions are necessary. NEDA is an accuracy preserving scheme and capable of maintaining a satisfactory performance even at low DA precision.

The Xilinx and Synopsys simulation results show that the proposed architecture is suitable for high-speed online applications. With this architecture, the speed of the 9/7 wavelet filter transform is increased, the occupied area of the circuit is reduced by about 18-25%, and the power is reduced by about 16-22% compared with the previous DA-based architecture. It has 100% hardware utilization efficiency.

#### References

- [1] S. G. Mallat, "A Theory for Multiresolution Signal Decomposition: The Wavelet Representation," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 11, no. 7, pp. 674-693, July 1989.
- [2] M. Alam, C. A. Rahman, and G. Jullian, "Efficient distributed arithmetic based DWT architectures for multimedia applications," in Proc. IEEE Workshop on SoC for real-time applications, pp. 333-336, 2003.
- [3] X. Cao, Q. Xie, C. Peng, Q. Wang, and D. Yu, "An efficient VLSI implementation of distributed architecture for DWT," in Proc. IEEE Workshop on Multimedia and Signal Process., pp. 364-367, 2006.
- [4] Archana Chidanandan and Magdy Bayoumi, "Area-Efficient NEDA Architecture for the 1-D DCT/IDCT," ICASSP 2006.
- [5] M. Martina and G. Masera, "Low-complexity, efficient 9/7 wavelet filters VLSI implementation," IEEE Trans. on Circuits and Syst. II, Express Brief, vol. 53, no. 11, pp. 1289-1293, Nov. 2006.
- [6] M. Martina and G. Masera, "Multiplierless, folded 9/7-5/3 wavelet VLSI architecture," IEEE Trans. on Circuits and Syst. II, Express Brief, vol. 54, no. 9, pp. 770-774, Sep. 2007.
- [7] Gaurav Tewari, Santu Sardar, and K. A. Babu, "High-Speed & Memory Efficient 2-D DWT on Xilinx Spartan3A DSP using scalable Polyphase Structure with DA for JPEG2000 Standard," 978-1-4244-8679-3/11/\$26.00 ©2011 IEEE.