# High performance DA-based DCT, DWT and DHT

K.Satya Sujith<sup>1</sup>, M.Lavanya Latha<sup>2</sup>

<sup>1, 2</sup>(Electronics And Communication Engineering, Audisankara Institute of Technology/J.N.T.U., INDIA)

**Abstract:** Thee transform coding is a major block in any compression technique. Transform coding helps in decorrelating the input data and for energy compaction which are required to implement compression. DCT is standard transform used in JPEG image compression standard and DWT is used in JPEG2000image compression standard. The DHT is a transform which provides flexibility in terms of hardware as well as reduces complexity in evaluation inverse transform because it is inverse is same as forward transform hence recently DHT is also using in image compression techniques so low hardware required, high speed architecture is required for DCT, DWT and DHT. In this paper DA based optimized adder tree architecture was presented **Keywords:** Adders, DCT- Discrete Cosine Transform, DWT- Discrete Wavelet Transform, DHT- Discrete Hartley Transform, DA- Distributed Arithmetic, OAT- optimized adder tree.

### I. Introduction

Today we are talking about digital networks, digital representation of images, movies, video, TV, voice, digital library-all because digital representation of the signal is more robust than the analog counterpart for processing, manipulation, storage, recovery, and transmission over long distances, even across the globe through communication networks. In recent years, there have been significant advancements in processing of still image, video, graphics, speech, and audio signals through digital computers in order to accomplish different application challenges. As a result, multimedia information comprising image, video, audio, speech, text, and other data types has the potential to become just another data type. Development of efficient image compression techniques continues to be an important challenge to us, both in academia and in industry. Digital image compression is the most important part in the multimedia applications which aims to reduce the number of bits in an image data for its efficient storage (less storage area). JPEG based still image compression follows three steps i.e. transform, quantization and coding to compress an image [1]. Reverse process comprising decoding, quantization and inverse transform is used for image de-compression [2, 3]. Discrete cosine transform (DCT) is used to transform the image from spatial domain to frequency domain. During quantization less important frequencies are discarded and is termed as loss image compression. Brace well has drawn attention to the discrete Hartley transform (DHT) as a substitute for the Discrete Fourier transform (DFT) [4, 5]. Many applications of DHT in signal processing and communications have been presented in the literatures [6-8]. DHT is used in JPEG based image compression.

The discrete wavelet transform (DWT) has been widely used in many areas of science and engineering, e.g., signal and image processing, bio-informatics, geophysics, and meteorology etc. for the applications involving compression and analysis of various forms of data. The well-known image coding standards, namely, MPEG-4 and JPEG2000 have adopted DWT as the transform coder due to its remarkable advantages over the other transforms. Multiplier-less hardware implementation approach provides a kind of solution to this problem due to its scope for lower hardware-complexity and higher throughput of computation. Several designs have been proposed for the multiplier-less implementation of DWT based on the principle of distributed arithmetic (DA).

In section-II DA principle is explained in brief, in section-III optimized adder tree is given

### II. MATHEMATICAL DERIVATION OF DISTRIBUTED ARITHMETIC

Distributed Arithmetic (DA) has been one of the popular techniques to compute the inner product equation in many DSP FPGA applications. It is applicable in cases where the filter coefficients are known a priori. The inner sum of products is rearranged so that the multiply and accumulate (MAC) operation is reduced the inner product is an important tool in digital signal processing applications. It can be written as follows:

$$Y = A^{T} X = \sum_{i=1}^{L-1} A_i X_i \_\_\_\_(1)$$

Where  $A_{i,} X_{i}$  and L are ith fixed coefficient, ith input data, and number of inputs, respectively. Assume that coefficient  $A_{i}$  is Q-bit two's complement binary fraction number. Equation (1) can be expressed as follows:

$$\mathbf{Y} = \begin{bmatrix} 2^{0} & 2^{-1} & 2^{-2} & \dots & 2^{-(Q-1)} \end{bmatrix} \begin{bmatrix} A_{10} & A_{20} & \dots & A_{10} \\ A_{11} & A_{21} & \dots & A_{11} \\ \vdots & \vdots & \ddots & \vdots \\ A_{1(Q-1)} & A_{2(Q-1)} & \dots & A_{L(Q-1)} \end{bmatrix} \begin{bmatrix} \mathbf{X}^{1} \\ \mathbf{X}^{2} \\ \vdots \\ \vdots \\ \mathbf{X}^{2} \\ \vdots \\ \mathbf{X}^{2} \end{bmatrix}$$
$$\mathbf{Y} = \begin{bmatrix} 2^{0} & 2^{-1} & 2^{-2} & \dots & 2^{-(Q-1)} \end{bmatrix} \begin{bmatrix} \mathbf{y}^{0} \\ \mathbf{y}^{1} \\ \vdots \\ \vdots \\ \mathbf{y}_{(Q-1)} \end{bmatrix}$$

 $A_{i,j}$  stay between [1, 0] Note that  $y_0$  may be 0 or a negative number due to two's complement representation. In (2),  $y_0$  can be calculated by adding all X*i* values when  $A_{i,j}=1$  and then the transform output Y can be obtained by shifting and adding all nonzero  $y_i$  values. Thus the inner product computation in (1) can be implemented by using shifting and adders instead of multipliers. Therefore, low hardware cost can be achieved by using DA-based architecture.

#### **III. OPTIMIZED ADDER TREE ARCHITECTURE**

In general, the shifting and addition computation uses a shift-and-add operator in VLSI implementation in order to reduce hardware cost. However, when the number of the shifting and addition words increases, the computation time will also increase. Therefore, the shift-adder-tree (SAT) presented operates shifting and addition in parallel by unrolling all the words needed to be computed for high-speed applications. However, a large truncation error occurs in SAT, and optimized adder tree architecture is proposed in this brief to compensate for the truncation error in high-speed applications



Fig:1: Q,P bit words shifting and addition operations in parallel.

In Fig. 1, the Q P-bit words operate the shifting and addition in parallel by unrolling all computations. Furthermore, the operation in Fig. 1 can be divided into two parts: the main part (MP) that includes \_ most significant bits (MSBs) and the truncation part (TP) that has least significant bits (LSBs). a large truncation error occurs due to the neglecting of carry propagation from the TP to MP.

The proposed optimized adder tree architecture is illustrated in Fig. 2 for (P,Q)=(12,6), where block FA indicates a full-adder cell with three inputs (a, b, and c) and two outputs, a sum (s) and a carry-out (co). Also, block HA indicates half-adder cell with two inputs (a and b) and two outputs, a sum (s) and a carry-out (co).



### IV. PROPOSED 8X8 2-D DCT DESIGN

The 1-D DCT employs the DA-based architecture and the proposed Optimized adder tree to achieve a high-speed, small area, and low-error design. The 1-D 8-point DCT can be expressed as follows:

$$Z_n = \frac{1}{2} \kappa_n \sum_{m=0}^{N-1} x_m \times \cos\left(\frac{(2m+1)n\Pi}{16}\right)$$

Where  $X_m$  denotes the input data;  $Z_n$  denotes the transform output.

By neglecting the scaling factor 1/2, the 1-D 8-point DCT in above equation can be divided into even and odd parts: Ze and Zo as listed in below equations, respectively

$$Z_{e} = \begin{bmatrix} Z_{0} \\ Z_{2} \\ Z_{4} \\ Z_{6} \end{bmatrix} \begin{bmatrix} C_{4} & C_{4} & C_{4} & C_{4} \\ C_{2} & C_{6} & -C_{6} & -C_{2} \\ C_{4} & -C_{4} & -C_{4} & C_{4} \\ C_{6} & -C_{2} & C_{2} & -C_{6} \end{bmatrix} = \begin{bmatrix} a_{0} \\ a_{1} \\ a_{2} \\ a_{2} \end{bmatrix}$$
$$Z_{0} = \begin{bmatrix} Z_{1} \\ Z_{3} \\ Z_{5} \\ Z_{7} \end{bmatrix} \begin{bmatrix} C_{1} & C_{3} & C_{5} & C_{7} \\ C_{3} & -C_{7} & -C_{1} & -C_{5} \\ C_{5} & -C_{1} & C_{7} & C_{3} \\ C_{7} & -C_{5} & C_{3} & -C_{1} \end{bmatrix} = \begin{bmatrix} b_{0} \\ b_{1} \\ b_{2} \\ b_{2} \end{bmatrix}$$

Where  $C_i = \cos(i\pi/16)$  below shows bit level formulation for  $Z_0$  and  $Z_4$ . Let see  $Z_4$  evaluation

|          | $Z_1$                          | Z4       |                |
|----------|--------------------------------|----------|----------------|
| weight   | value                          | weight   | value          |
| $-2^{0}$ | 0                              | $-2^{0}$ | A <sub>1</sub> |
| 2-1      | $B_0 + B_1 + B_2$              | 2-1      | A <sub>0</sub> |
| 2-2      | $B_0+B_1$                      | 2-2      | A <sub>1</sub> |
| 2-3      | B <sub>0</sub> +B <sub>3</sub> | 2-3      | A <sub>0</sub> |
| 2-4      | $B_0 + B_1 + B_3$              | 2-4      | A <sub>0</sub> |
| 2-5      | $B_0 + B_2$                    | 2-5      | A <sub>1</sub> |

#### TABLE: 1: bit level formulation

Where  $A_0 = (X_0 + X_7) + (X_4 + X_3) = a_{0+}a_3$  $A_1 = (X_1 + X_6) + (X_2 + X_5) = a_{1+}a_2$ 

$$A_1 = (X_1 + X_6) + (X_2 + X_5) = a_1 + a_2$$
  
 $B_0 = (X_0 + X_2) - (X_4 + X_2) = a_0 a_2$ 

$$B_1 = (X_1 + X_6) - (X_2 + X_5) = a_1 a_2$$

Input data  $A_0$  and  $A_1$ , the transform output  $Z_0$  needs only one adder to compute  $(A_0 + A_1)$  and two separated optimized adder trees to obtain the results of  $Z_0$  and  $Z_4$ . Similarly, the other transform outputs  $Z_0$  and  $Z_4$  can be implemented in DA-based forms using 10(=1 + 9) adders and corresponding optimized adder trees. Consequently, the proposed 1-D 8-point DCT architecture can be constructed as illustrated in Fig. 3 using a DA-Butterfly-Matrix, that includes two DA even processing elements (DAEs), a DA odd processing element (**DAO**) and 12 adders/sub tractors, and 8 optimized adder trees (one optimized adder tree for each transform output  $Z_n$ ). The eight separated optimized adder trees work simultaneously, enabling high-speed applications to be achieved. After the data output from the DA-Butterfly-Matrix is completed, the transform output Z will be completed during one clock cycle by the proposed OPTIMIZED ADDER TREEs. In contrast, the traditional shift-and-add architecture requires Q clock cycles to complete the transform output Z if the DA-precision is Q-bits here is 9 bits

#### V. Dwt Using Da

The Haar wavelet is also the simplest possible wavelet. The technical disadvantage of the Haar wavelet is that it is not continuous, and therefore not differentiable. The Haar wavelet transformation is composed of a sequence of low-pass and high-pass filters, known as a filter bank. In mathematics, the Haar wavelet is a certain sequence of functions.

The haar kernel matrix is given below

| $\mathbf{H}_8 = \frac{1}{\sqrt{8}}$ | $\sqrt{2}$<br>0<br>2<br>0 | $\sqrt{2}$<br>0<br>-2<br>0 | $-\sqrt{2}$<br>0<br>0<br>2 | $-\sqrt{2}$<br>0<br>0<br>-2 |        |             |                                       | $     \begin{array}{c}       0 \\       -\sqrt{2} \\       0 \\       0     \end{array}   $ | $\begin{array}{c} \varphi_{0}(t) \\ \psi_{0}(t) \\ \psi_{1,0}(t) \\ \psi_{1,1}(t) \\ \psi_{2,0}(t) \\ \psi_{2,1}(t) \\ \psi_{2,2}(t) \end{array}$ |
|-------------------------------------|---------------------------|----------------------------|----------------------------|-----------------------------|--------|-------------|---------------------------------------|---------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|
|                                     | 0<br>0                    | 0<br>0                     | 0<br>0                     | 0<br>0                      | 2<br>0 | $^{-2}_{0}$ | $\begin{array}{c} 0 \\ 2 \end{array}$ | $^{0}_{-2}$                                                                                 | $\psi_{2,2}(t) \\ \psi_{2,3}(t)$                                                                                                                  |

The DA is applied to above kernel matrix the coefficient term are evaluated those are applied to OAT to get transformed value.

| WEIGHT | Z1    | 72          | Z3    | Z4    |
|--------|-------|-------------|-------|-------|
| -2^0   | 0     | 0           | 0     | 0     |
| 2^-1   | 0     | 0           | a1-a2 | a3-a4 |
| 2^-2   | A0+A1 | a1+a2-a3-a4 | 0     | 0     |
| 2^-3   | 0     | 0           | 0     | b0    |
| 2^-4   | A0+A1 | a1+a2-a3-a4 | 0     | 0     |
| 2^-5   | A0+A1 | a1+a2-a3-a4 | 0     | 0     |
| 2^-6   | 0     | 0           | 0     | 0     |
| 2^-7   | A0+A1 | a1+a2-a3-a4 | 0     | 0     |
| 2^-8   | 0     | 0           | 0     | 0     |

Table2: DA based coefficients

#### VI. **S-DHT BASED ON DA**

DHT belongs to the family of frequency transforms that map temporal or spatial functions into frequency functions. The DHT accomplishes this in a manner similar to the better-known Fourier Transform. The significant difference between the Discrete Fourier Transform (DFT) and DHT 's alternative is that the DHT uses only real values, i.e., no complex numbers.

The kernel matrix of DHT I given by

| 1.0000 | 1.0000  | 1.0000  | 1.0000  | 1.0000   | 1.0000  | 1.0000    | 1.0000  |
|--------|---------|---------|---------|----------|---------|-----------|---------|
| 1.0000 | 1.4140  | 1.0000  | 0       | -1.0000  | -1.414  | 0 -1.0000 | 0 (     |
| 1.0000 | 1.0000  | -1.0000 | -1.0000 | 1.0000   | 1.0000  | -1.0000   | -1.0000 |
| 1.0000 | 0       | -1.0000 | 1.4140  | -1.0000  | 0       | 1.0000    | -1.4140 |
| 1.0000 | -1.0000 | 1.0000  | -1.0000 | 1.0000 · | -1.0000 | 1.0000    | -1.0000 |
| 1.0000 | -1.4140 | 1.0000  | 0       | -1.0000  | 1.4140  | -1.0000   | 0       |
| 1.0000 | -1.0000 | -1.0000 | 1.0000  | 1.0000 · | -1.0000 | -1.0000   | 1.0000  |
| .0000  | 0       | -1.0000 | -1.4140 | -1.0000  | 0       | 1.0000    | 1.4140  |

Apply DA on above kernel then we get coefficients as,

|                |      |      |      |      |      |      | <i></i> |      |
|----------------|------|------|------|------|------|------|---------|------|
|                | Y(0) | Y(1) | Y(2) | Y(3) | Y(4) | Y(5) | Y(6)    | Y(7) |
| ALUI           | +    | -    | +    | -    | +    | -    | +       | -    |
| ALU2           | +    | -    | +    | NO   | +    | -    | +       | NO   |
| ALU3           | +    | -    | +    | -    | +    | -    | +       | -    |
| ALU4           | +    | NO   | +    | -    | +    | NO   | +       | -    |
| Y-1            | 0    | 0    | R3   | R10  | R7   | R2   | R8      | R3   |
| $Y^0$          | R5   | R1   | R5   | R4   | R5   | R9   | R5      | R9   |
| $Y^1$          | 0    | 0    | 0    | 0    | 0    | R2   | 0       | R6   |
| $Y^2$          | 0    | R2   | 0    | R6   | 0    | 0    | 0       | 0    |
| Y <sup>3</sup> | 0    | R2   | 0    | R6   | 0    | 0    | 0       | 0    |
| $Y^4$          | 0    | 0    | 0    | 0    | 0    | R2   | 0       | R6   |
| Y              | 0    | R2   | 0    | R6   | 0    | 0    | 0       | 0    |
| $Y^6$          | 0    | 0    | 0    | 0    | 0    | R2   | 0       | R6   |
| Y?             | 0    | R2   | 0    | R6   | 0    | R2   | 0       | R6   |

"+" means addition and "-" means subtraction and "NO" means no operation.

Here R1,R2...R10 are given by



#### VII. Result And Discussions

The proposed DCT, DHT and DWT core synthesized by using Xilinx ISE 13.4, simulated by using modelsim6.4d and the Xilinx XC2VP30 FPGA. The corresponding results are shown below

| Table 5. error          | Table 5: error due to precision taken |  |  |  |  |  |
|-------------------------|---------------------------------------|--|--|--|--|--|
| Transform coding scheme | Error after reconstruction(for        |  |  |  |  |  |
|                         | 256x256 image)                        |  |  |  |  |  |
| DCT                     | 240                                   |  |  |  |  |  |
| DWT                     | 82.92                                 |  |  |  |  |  |
| DHT                     | 25.68                                 |  |  |  |  |  |

## Table 3: error due to precision taken

#### Table 4: logic utilization

| 1                  | able 11 logie u | tinzation |     |
|--------------------|-----------------|-----------|-----|
| Logic utilization  | DCT             | DWT       | DHT |
| No. of slices      | 11%             | 3%        | 1%  |
| No. of 4 input     | 10%             | 2%        | 1%  |
| LUT's              |                 |           |     |
| No. of bounded I/O | 72%             | 75%       | 34% |

Here the differentiating parameters are hardware efficiency and energy compaction as well as computation complexity. The computation complexity is less in S-DHT but provides less energy compaction compared to remaining. Throughput achieved is

#### DCT-380MP/S

Haar wavelet transform-389MP/S

DHT-571MP/S DHT modelsim result:

|                   |                | DHT modelsim re        |                     |
|-------------------|----------------|------------------------|---------------------|
| Objects           | ``````         |                        | 📰 wave - default    |
| Name              | Value          | Kind                   | Messages            |
| <b>⊞-</b> ∲ X0    | 7              | Net                    | ■                   |
| <b>⊞-</b> ∲ X1    | 10             | Net                    | P-4 /BMATRIX/X1 10  |
| <b>⊞-</b> ∲ X2    | 7              | Net                    | P-4 /BMATRIX/X2 7   |
| <b>⊞-</b> ∲ X3    | 26             | Net                    | → /BMATRIX/X3 26    |
| <b>⊞-</b> ∲ Z0    | 50             | Net                    |                     |
| <b>⊡-</b>         | -16            | Net                    |                     |
| <b>⊡-</b> ∲ Z2    | -22            | Net                    | • /BMATRIX/Z2 -22   |
| <b>⊡-</b>         | 16             | Net                    | • /BMATRIX/Z3 16    |
| <b>⊞-</b> ∲ a0    | 50             | Net                    | • /BMATRIX/a0 50    |
| <b>⊞-</b> ∲ a1    | -16            | Net                    | • /BMATRIX/a1 -16   |
| <b>⊞-</b> ∲ a2    | -22            | Net                    | P-√ /BMATRIX/a2 -22 |
| <b>⊞-</b> ∲ a3    | 16             | Net                    |                     |
| . <b>⊡-</b> → mem | x000000000 x00 | xxxxx Fixed-size Array | √glbl/GSR We0       |
|                   |                |                        | ygibi/GSK Web       |
|                   |                |                        |                     |
|                   |                |                        |                     |
|                   |                |                        |                     |

DHT matlab result:



|                  | Haar wa       | velet me | delsim result:         |                |
|------------------|---------------|----------|------------------------|----------------|
| Objects          |               |          | 🛨 🖻 🗙 📭 wave - default |                |
| ▼ Name           | Value         | Kind     |                        |                |
| <b>≖-</b> ↔ X0   | 010101010     | Net      | HAAR/X0 0101           | 01010          |
| <b>⊞-</b>        | 011110101     | Net      |                        | 10101          |
| <b>⊞-</b> ∲ X2   | 000011111     | Net      |                        | 11111          |
| <b>≖⊢</b> ↔ X3   | 000000111     | Net      |                        | 00111          |
| <b></b> ↓ X4     | 010110101     | Net      |                        | 10101          |
| <b></b> ≁ x5     | 000111000     | Net      |                        | 10101<br>11000 |
| <b>≖⊢</b> (>     | 00000001      | Net      |                        |                |
| . <b>≖-</b> ◆ X7 | 000011111     | Net      |                        | 00001          |
| <b>⊥</b> 20      | 0000011110111 | Net      |                        | 11111          |
| . <b></b> ≁ Z1   | 0000000111110 | Net      |                        | 011110111      |
| <b>≖-</b> ₹2     | 0000010111100 | Net      |                        | 000111110      |
| <b>≖-</b> ₹3     | 0000001100110 | Net      |                        | 010111100      |
| <b>≖-</b> ₹      | 1111111001011 | Net      |                        | 001100110      |
| <b>≖</b> 25      | 000000010000  | Net      |                        | 111001011      |
| 🛨 🔷 Z6           | 0000001010100 | Net      |                        | 000010000      |
|                  | 1111111101011 | Net      |                        | 001010100      |
|                  | 415           | Net      |                        | 111101011      |
|                  | 38            | Net      |                        | 000000 ps      |
|                  | 237           | Net      | Cursor 1               | 0 ps           |
| ■                | 32            | Net      |                        | •              |
|                  | 700           | Not      |                        |                |
| •                |               |          | wave                   |                |

#### тт - . . . .

Haar wavelet matlab result:



#### VI. **CONCLUSION**

The paper contributed with specific simplifications in the multiplier stage, by using shift and adds method, which lead to hardware simplification and speed up over architecture.

#### **References:**

- [1] S. Uramoto, Y. Inoue, A. Takabatake, J. Takeda, Y. Yamashita, H. Yerane, and M. Yoshimoto, "A 100-MHz 2-D discrete cosine transforms core processor," *IEEE J. Solid-State Circuits*, vol. 27, no. 4, pp. 492–499, Apr. 1992. S. Yu and E. E. S. , Jr., "DCT implementation with distributed arithmetic," *IEEE Trans. Comput.*, vol. 50, no. 9, pp. 985–991, Sep. 2001.
- [2] 714 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 19, NO. 4, APRIL 2011
- A. M. Shams, A. Chidanandan, W. Pan, and M. A. Bayoumi, "NEDA: A low-power high-performance DCT architecture," IEEE Trans. [3] Signal Process., vol. 54, no. 3, pp. 955-964, Mar. 2006.
- C. Peng, X. Cao, D. Yu, and X. Zhang, "A 250 MHz optimized distributed architecture of 2D 8 B DCT," in Proc. Int. Conf. ASIC, 2007, [4] pp. 189-192.
- P. K. Meher, T. Srikanthan, J. C. Patra, "Scalable and Modular Memory-Based Systolic Architectures for Discrete Hartley Transform," [5] IEEE Transactions on Circuits and Systems, vol.53, no.5, pp. 1065 – 1077, May 2006.
- P. Longa, A. Miri, and M. Bolic, "Modified distributed arithmeticased architecture for discrete wavelet transforms," Electronics Letters [6] vol. 44, no. 4, Feb. 2008.
- Y.Wang, J. Ostermann, and Y. Zhang, Video Processing and Communications, 1st ed. Englewood Cliffs, NJ: Prentice-Hall, 2002. [7]