# Square Root. Part 2 – pipelined version.

In a previous post I implemented an iterative version of a square root algorithm described in the article: “A New Non-Restoring Square Root Algorithm and Its VLSI Implementations” written by Yamin Li and Wanming Chu. Here I present the pipelined version from the same article.

### Pipelined version

Pipelined version, compared to iterative, uses a separated add/sub block in every step. This obviously has a negative impact on used resources, but on the other hand, it allows to calculate square root value for a new data in every cycle. Algorithm does not have to wait for finish previous calculation to start the new one.

Algorithm of pipelined square root is depicted in Fig. 1. Fig. 1. Pipelined version of square root algorithm.src.: ”A New Non-Restoring Square Root Algorithm and Its VLSI Implementations”

#### Implementation

Below is shown implementation of the pipelined version. I placed here only the most interesting parts. To investigate all codes, please refer to the sources on gitlab.com.

Block uses two variables (va_AluInR, va_AluInQ) for preparing input values for ALU (add/sub operation), one variable to keep temporary result of ALU (va_AluOut) and a few registers (a_data, a_R, a_Q, nextOp) to store results used in the next stage. All variables and registers are declared as arrays.

Following code calculates values for the first stage. Input variable va_AluInR is filled with zeros and two MSB of input data. Variable va_AluInQ is filled with default value. Then in the line 77 is done arithmetic operation. For the first stage it is always subtracting. The last step saves result to all registers:

• a_R – result of the arithmetic operation
• a_Q – final result – updated with one bit in every cycle
• nextOp – it defines operation (add/sub) for the next step
• a_data – stores input data, shifts 2 bits left in every cycle
``` -- stage 0 start conditions, ALU inputs
va_AluInR(0)             := (others => '0');
va_AluInR(0)(1 downto 0) := unsigned(iv_data(G_DATA_W-1 downto G_DATA_W-1-1));
va_AluInQ(0)             := (others => '0');
va_AluInQ(0)(0)          := '1';

-- stage 0 calculations
va_AluOut(0) := va_AluInR(0) - va_AluInQ(0);

-- stage 0 result registers, ALU output
a_data(0)                              <= shift_left(unsigned(iv_data), 2);
a_R(0)                                 <= (others => '0');
a_R(0)(G_DATA_W-1 downto G_DATA_W-1-1) <= va_AluOut(0)(1 downto 0);
a_Q(0)(0)                              <= not va_AluOut(0)(2);
nextOp(0)                              <= not va_AluOut(0)(2);```

Below is presented main loop which calculates in parallel data for all next stages. First are updated inputs to ALU. Variable va_AluInR is a concatenation of two MSB bits of data from register a_data and partial result of ALU from previous stage (a_R). Variable va_AluInQ mainly keeps value from the final result register (a_Q). Then based on nextOp is executed arithemtics operation. The last step writes results to registers described before.

``` -- next stages
for i in 1 to C_PIPE_L-1 loop
-- prepare inputs for next stage
va_AluInR(i)                            := (others => '0');
va_AluInR(i)(C_OFFSET+i-1 downto 2)     := a_R(i-1)(G_DATA_W-(i-1)-1 downto G_DATA_W-(2*i));
va_AluInR(i)(2-1 downto 0)              := a_data(i-1)(G_DATA_W-1 downto G_DATA_W-1-1);
va_AluInQ(i)                            := (others => '0');
va_AluInQ(i)(C_OFFSET+(i-1)-1 downto 2) := a_Q(i-1)(i-1 downto 0);
va_AluInQ(i)(1)                         := not a_Q(i-1)(0);
va_AluInQ(i)(0)                         := '1';

if (nextOp(i-1) = '1') then
va_AluOut(i) := va_AluInR(i) - va_AluInQ(i);
else
va_AluOut(i) := va_AluInR(i) + va_AluInQ(i);
end if;

-- resultr registers
a_data(i)                                    <= shift_left(unsigned(a_data(i-1)), 2);
a_R(i)                                       <= (others => '0');
a_R(i)(G_DATA_W-i-1 downto G_DATA_W-2*(i+1)) <= va_AluOut(i)(i+1 downto 0);
a_Q(i)                                       <= shift_left(unsigned(a_Q(i-1)), 1);
a_Q(i)(0)                                    <= not va_AluOut(i)(i+2);
nextOp(i)                                    <= not va_AluOut(i)(i+2);

end loop;```