Recently I had to implement square root algorithm in an FPGA. During research I found quite old, but interesting article: “A New Non-Restoring Square Root Algorithm and Its VLSI Implementations” written by Yamin Li and Wanming Chu. For educational purposes I decided to implement presented algoritm in VHDL.
Authors propose various versions of the algorithm – pipelined and iterative. Both versions calculate one bit per cycle and use only ADD/SUB operation. The difference is in resources and performance. Pipelined version can produce the new result in every cycle, but it uses much resources. For every stage are needed separated registers and ALU units. Iterative version is very lightweight, because in every stage are used the same registers and only one single ALU unit. However it can initiate a new square root instruction only when the old one is finished.
How the algorithm works in iterative version can be seen in Fig. 1. For details please refer directly to the article.
Below you can find implementation in VHDL. I placed here only the most interesting parts. To investigate all codes, please refer to the sources on gitlab.com.
Start conditions are displayed below. Register aluInLoopR keeps values which go out from ALU unit. At beginning it is filled with ‘0’. Only two LSB bits are driven by two MSB bits from data input. Accordingly to specification, register aluInQ is also filled with ‘0’, except bit . Register cyc_cnt counts the cycles to give output about finished calculation.
if (i_startCal = '1') then data <= shift_left(unsigned(iv_data), 2); aluInLoopR <= (others => '0'); aluInLoopR(1 downto 0) <= unsigned(iv_data(G_DATA_W-1 downto G_DATA_W-1-1)); aluInQ <= (others => '0'); aluInQ(0) <= '1'; v_resAlu_tmp := (others => '0'); nextOp <= '0'; endCal <= '0'; cyc_cnt <= to_unsigned(C_CLKCYC_NR, cyc_cnt'length); end if;
This is the heart of the block. ADD/SUB operation depends on the last result from ALU and is driven by nextOp register. Output from ALU is kept in variable and based on it, all registers are updated in every cycle:
- data register is shifted by 2 bits
- aluInLoopR register is updated with last ALU result and 2 bits from data register on LSB positions
- aluInQ is shifted and updated with MSB from ALU
- nextOp is updated with MSB from ALU
if cyc_cnt > 0 then if (nextOp = '1') then v_resAlu_tmp := aluInLoopR + aluInQ; else v_resAlu_tmp := aluInLoopR - aluInQ; end if; cyc_cnt <= cyc_cnt - 1; -- update registers data <= shift_left(data, 2); aluInLoopR(C_ALU_W-1 downto 2) <= v_resAlu_tmp(C_ALU_W-2-1 downto 0); aluInLoopR(1 downto 0) <= data(G_DATA_W-1 downto G_DATA_W-1-1); aluInQ(C_ALU_W-1 downto 2) <= shift_left(aluInQ(C_ALU_W-1 downto 2), 1); aluInQ(2) <= not v_resAlu_tmp(v_resAlu_tmp'high); aluInQ(1) <= v_resAlu_tmp(v_resAlu_tmp'high); nextOp <= v_resAlu_tmp(v_resAlu_tmp'high); end if;
At last I have to point out two details about sources:
- signals startCal and endCal inform about start and end of calculation.
- the partial remainder is not calculated and connected to the output port, but it can be done quite easily. That process is explained in the paper.
Results of simulation for one calculation is depicted in Fig. 2.
*** *** ***
All source codes used in that post you can find on gitlab.com .
*** *** ***