In a previous post I implemented an ** iterative **version of a square root algorithm described in the article: “A New Non-Restoring Square Root Algorithm and Its VLSI Implementations” written by Yamin Li and Wanming Chu. Here I present the

**version from the same article.**

*pipelined*### Pipelined version

*Pipelined *version, compared to *iterative*, uses a separated add/sub block in every step. This obviously has a negative impact on used resources, but on the other hand, it allows to calculate square root value for a new data **in every cycle**. Algorithm does not have to wait for finish previous calculation to start the new one.

Algorithm of *pipelined *square root is depicted in Fig. 1.

#### Implementation

Below is shown implementation of the *pipelined *version. I placed here only the most interesting parts. To investigate all codes, please refer to the sources on gitlab.com.

Block uses two variables (*va_AluInR*, *va_AluInQ*) for preparing input values for ALU (add/sub operation), one variable to keep temporary result of ALU (*va_AluOut*) and a few registers (*a_data*, *a_R*, *a_Q*, *nextOp*) to store results used in the next stage. All variables and registers are declared as arrays.

Following code calculates values for the first stage. Input variable *va_AluInR *is filled with zeros and two MSB of input data. Variable *va_AluInQ *is filled with default value. Then in the line 77 is done arithmetic operation. For the first stage it is always subtracting. The last step saves result to all registers:

*a_R*– result of the arithmetic operation*a_Q*– final result – updated with one bit in every cycle*nextOp*– it defines operation (add/sub) for the next step*a_data*– stores input data, shifts 2 bits left in every cycle

-- stage 0 start conditions, ALU inputs va_AluInR(0) := (others => '0'); va_AluInR(0)(1 downto 0) := unsigned(iv_data(G_DATA_W-1 downto G_DATA_W-1-1)); va_AluInQ(0) := (others => '0'); va_AluInQ(0)(0) := '1'; -- stage 0 calculations va_AluOut(0) := va_AluInR(0) - va_AluInQ(0); -- stage 0 result registers, ALU output a_data(0) <= shift_left(unsigned(iv_data), 2); a_R(0) <= (others => '0'); a_R(0)(G_DATA_W-1 downto G_DATA_W-1-1) <= va_AluOut(0)(1 downto 0); a_Q(0)(0) <= not va_AluOut(0)(2); nextOp(0) <= not va_AluOut(0)(2);

Below is presented main loop which calculates in parallel data for all next stages. First are updated inputs to ALU. *Variable *v*a_AluInR* is a concatenation of two MSB bits of data from register *a_data* and partial result of ALU from previous stage (*a_R*). Variable v*a_AluInQ *mainly keeps value from the final result register (*a_Q*). Then based on *nextOp* is executed arithemtics operation. The last step writes results to registers described before.

-- next stages for i in 1 to C_PIPE_L-1 loop -- prepare inputs for next stage va_AluInR(i) := (others => '0'); va_AluInR(i)(C_OFFSET+i-1 downto 2) := a_R(i-1)(G_DATA_W-(i-1)-1 downto G_DATA_W-(2*i)); va_AluInR(i)(2-1 downto 0) := a_data(i-1)(G_DATA_W-1 downto G_DATA_W-1-1); va_AluInQ(i) := (others => '0'); va_AluInQ(i)(C_OFFSET+(i-1)-1 downto 2) := a_Q(i-1)(i-1 downto 0); va_AluInQ(i)(1) := not a_Q(i-1)(0); va_AluInQ(i)(0) := '1'; -- ALU ADD/SUB if (nextOp(i-1) = '1') then va_AluOut(i) := va_AluInR(i) - va_AluInQ(i); else va_AluOut(i) := va_AluInR(i) + va_AluInQ(i); end if; -- resultr registers a_data(i) <= shift_left(unsigned(a_data(i-1)), 2); a_R(i) <= (others => '0'); a_R(i)(G_DATA_W-i-1 downto G_DATA_W-2*(i+1)) <= va_AluOut(i)(i+1 downto 0); a_Q(i) <= shift_left(unsigned(a_Q(i-1)), 1); a_Q(i)(0) <= not va_AluOut(i)(i+2); nextOp(i) <= not va_AluOut(i)(i+2); end loop;

#### Simulation

- It is worth to notice that input data is being changed in every cycle. It was not possible to do it in the
*iterative*version. - The partial remainder is not calculated and connected to the output port, but it can be done quite easily.
- It is not seen on the screens, but at beginning, block produces some not relevant data for initial conditions. This happens because the block works all the time, if it is not in the reset state.
- Because the block works all the time and there is no flag for end of calculations, user has to take care about correct receiving the data.

*** *** ***

All source codes used in that post you can find on gitlab.com.

*** *** ***