ultrafast arithmetic

Addition with fast lookahead-carry,
Fast multi-way addition,
Multiplication by selective-multi-way-addition,
Low-complexity squaring in VHDL,
Multiplication by squaring,
Divison by binary-scaled super-radix,
Complex-exponentiation like old-multiplication,
Trigonometrics by successive approximation ...

BACKGROUND:

We can estimate the metric efficiency of complex hardware circuitry, as gatestime: cf on the large-scale, doubling-the-hardware accomplishes twice-as-much in the-same-time or equally-as-much in half-the-time on-the-average; (A better estimate might add-up gatestime 'entropies')....
Blended-cyclic-and-complex processes may oft be simplified by pipelining --allowing new data to begin a cycle while the current is advancing,-- but which adds a gate-delay for each single-level fast-catch-latch-register and which must be either cleared before switching evironments or tagged, stages self-directing, to deposit its results across-environments....

ADDITION with FAST LOOKAHEAD CARRY:

An already well-known result--

The basic 3-bit-'full'-adder compacts 2-bits of a 2-way-sum-input with a carry-propagate-in into a 1-bit-sum-output and a carry-propagate-out, s_i = a_ib_ic_i + a_ib_ic_i + a_ib_ic_i + a_ib_ic_i, and c_i+1 = a_ib_i + a_ic_i + b_ic_i , (averaging 2.3-AND-terms-in plus 1.0-OR-terms-out per-bit, implemented typically in 1-gate-delay per-AND-OR), But such propagation is slow, so--

The remedy implements a fast Lookahead-Carry, but which, to keep the complexity low, requires a second gate-level: of AND-OR chains of lower-order pre-OR's of input-bits and initial carry-in's, c_i = Σ_k (a_i-1+b_i-1)···(a_i-k+b_i-k)(a_i-k-1b_i-k-1 = c_i-k else c₀), (whereas one-level-without-the-AND-OR-of-pre-OR's explodes AND-terms by the distribution of the same pre-OR AND-products needed)....

The result is, that, while the 3-bit-adder does-compact 2:1, 2-bits-in-to-1-bit-out, that second level Lookahead costs a gate-delay and so on-the-average compacts only 1.41-bits-average-in per gate-delay, And, for long words e.g. 63-bit, the Lookahead builds input-terms to ~63³-huge: The remedy in most implementations combines partials, local-carry-out and local-propagate, and their complements, from groups-of-groups, which is single-level possible because OR-AND is equivalent to INVERT-AND-OR-INVERT, (Note that CARRY-INVERT is equivalent to INVERT-CARRY which chains INVERT-PROPAGATE though itself not equivalent to PROPAGATE-INVERT), to simplify the higher-bit AND-OR terms to ~63² more-manageably ... Some implementations deliver co-complement outputs, while others use reference-differential-inputs... (Some implementations can also use multiple-isolated-outputs to be wire-AND'ed under a constant-current pullup, and thence each pre-OR may put out upto-6-lookahead-lines, one for itself, one a pairing, plus 4-, 8-, 16-, and 32-blocks.)

FAST(ER) MULTI-WAY ADDITION:

In comparison to the 2-way-add, the 3-bit-adder itself, compacts 3:2, 1.5-bits-average-in-to-1-bit-out, and obviates all-but-one-many Lookahead-Carries until the last, form-normalizing sum-with-Lookahead-Carry, (Note 1.5⁷ > 1.41⁸) ... so-- we may find applications that are more-efficient by retaining the spray of carry-bits vertically-locally until the final add-with-Lookahead-Carry-to-normal-form... Such is multi-way addition--

Using 3:2's only, optimal compactions are: (by 1:2-with-Lookahead-Carry) 2-bits (2:3) 3 bits (2:3 +1) 4 (2× 2:3) 6 (3× 2:3) 9 (4× 2:3 +1) 13 (6× 2:3 +1) 19 (9× 2:3 +1) 28 (14× 2:3) 42 (21× 2:3) 63-bits ... (1-bit-shy of 64-bit but fully usable as-is in signed-multiplication), and slightly faster at 11-gate-delays instead-of-12, and much-simpler with 1-instead-of-62 Lookahead-Carry's....

FAST(ER)(ER) MULTI-WAY ADDITION:

Noting first the near-exponentially increasing circuitry-complexity at greater compaction, the next, fast-compactors are 5:3, 6:3, and fastest is 7:3. The immediate next-two, 8:4, 9:4, are less than 7:3 yet cost 2× and 4× more circuitry (And 8:4 compacts like 6:3, at 2:1).

1. The last-compaction is necessarily the 2-way-add-with-Lookahead-Carry 2:1-compaction in 2-gate-delays,
2. Its preceding is necessarily (at-best) the 3:2-compaction in 1-gate-delay,
3. Its prior-preceding is necessarily (at-best) the 7:3-compaction in 1-gate-delay,
4. And thereafter multiple 7:3-compactions as needed for the original multi-way-add....

Building-up from a final normal-form-output, shows this is optimal at--
2-gate-delays: 2-bits (the basic 1:2-with-Lookahead-Carry),
3-gate-delays: 3-bits (1:2-with-Lookahead-Carry, 2:3),
4-gate-delays: 7-bits (1:2-with-Lookahead-Carry, 2:3, 3:7),
5-gate-delays: 15-bits (1:2-with-Lookahead-Carry, 2:3, 3:7, 2× 3:7 + 1),
6-gate-delays: 35-bits (1:2-with-Lookahead-Carry, 2:3, 3:7, 2× 3:7 + 1, 5× 3:7),
7-gate-delays: 80-bits (1:2-with-Lookahead-Carry, 2:3, 3:7, 2× 3:7 + 1, 5× 3:7, 11× 3:7 + 2:3),
... etc., or, 1-fewer-bit-per for matrix-vector-products-(sums), multiply-and-cumulate, etc....

MULTIPLICATION BY SELECTIVE-MULTI-WAY-ADDITION:

Since the multiplication-parallelogram (rhombus if both words are the same bit-width) is computationally thickest around the vertical 'diagonal' column, we can use 7:3's there, but less-massive 6:3's around that etc. down to minimally-massive 3:2's at the tail-ends....

So, for example, 63-bit-unsigned-multiplication the old-way takes 2×6 = 12 level-gate-delays, but we can do it in 7-levels: to wit: 63 bits = 9× 7:3 = 27 = 4× 7:3 = 12 = 2× 6:3 = 6 = 1× 3:2 = 2 = 1× 2:1-with-Lookahead-Carry.

[UNDER CONSTRUCTION]

MULTIPLICATION BY HALFWORDS:

An already well-known result--

Using 3-instead-of-4 halfword-multiplies... Or, we can do it in 2-instead-of-3-or,-4 halfword-multiplies....

....

SQUARING:

In the same way, 80-bit-squaring and 70-bit-multiplication take 6-gate delays....

Squaring is low-complexity because it takes one, parameter, the difference between successive squares is increasing, it has some supplementary 'symmetry' (1-n)²....

Compared to full multiplication, Squaring is half-or-less complex, [A1.A0]² = [A1]² + 2 [A1*A0] + [A0]² where its complexity-pyramid around the middle term peaks at half-or-less for half-as-many bits and so is quarter-or-less complex over all bits....

In fact for example, a 4-bit squaring logic is single-level:

00000	00000 00000
00001	00000 00001
00010	00000 00100
00011	00000 01001
00100	00000 10000
00101	00000 11001
00110	00001 00100
00111	00001 10001
01000	00010 00000
01001	00010 10001
01010	00011 00100
01011	00011 11001
01100	00100 00000
01101	00101 01001
01110	00110 00100
01111	00111 00001
10000	01000 00000
10001	01001 00001
10010	01010 00100
10011	01011 01001
10100	01100 10000
10101	01101 11001
10110	01111 00100
10111	10000 10001
11000	10010 00000
11001	10011 10001
11010	10101 00100
11011	10110 11001
11100	11000 10000
11101	11010 01001
11110	11100 00100
11111	11110 00001

Q0 = ?D0

Q1 = 0

Q2 = ?D1!D0

Q3 = ?D2!D1?D0 + !D2?D1?D0

Q4 = ?D2!D1!D0 + ?D3!D2?D0 + !D3?D2?D0

Q5 = ?D4?D3!D2?D1!D0 + ?D4!D2!D1?D0 + ?D4!D3?D2?D1!D0 + ?D4!D3!D1?D0 + ?D4!D3!D2?D0 + !D4?D3?D2?D0 + !D4?D3!D2?D1 + !D4!D3?D2?D1

Q6 = ?D4?D3?D0 + ?D3!D2!D1 + ?D4!D3?D1!D0 + ?D4!D3!D2?D1 + !D4?D3?D1 + !D4?D3!D2

Q7 = ?D4?D3?D1 + ?D4!D3?D2!D1 + ?D4!D3?D2!D0 + !D4?D3?D2
or = ?D4?D3?D1 + ?D4!D3?D2*![?D1?D0] + !D4?D3?D2

Q8 = ?D4?D3?D2 + ?D4!D3!D2 + ?D4!D3!D1 + ?D4!D3!D0
or = ?D4?D3?D2 + ?D4!D3*![?D2?D1?D0]

Q9 = ?D4?D3 + ?D4?D2?D1?D0

ULTRAFAST MULTIPLY: [*]

A*B = (A+B)² - (A-B)² / 4

...

FAST LONG-DIVISION:

By coarse-estimating the inverse of the divisor, in an 8-11-bit-index lookup table, with the high-bit always '1', and fast multiplying-back-through-with-subtraction, by a partial-word-multiplier (selective-multi-way-adder), 'long' division can be reduced to about-6-8 cycles, each about-7-levels, for about-35 gate-delays total....

[UNDER RECONSTRUCTION]

Multiplication-like CORDIC * (chordic)

multiplicative-like, progressive cumulation

Multiplication on early digital computers was typically by sequential approximation and repetition of addition, progressive cumulation like manual long-multiplication and long-division implemented in hardware (equipment) for integers, and in software emulation aided by floating-point addition, -or fully hardware depending on cost, space, availability, factors-... In the present era it has become replaced by parallel micro-circuitry.

But by freeing this procedure from fixed multiplier and multiplicand, and speed-ups with simple fast bit-scaler ("barrel-shifter") and sequential-access memory ("SAM") or other, more functions can be efficiently calculated:-- Exponentiation; circular-function sinusoid trigonometric; hypertrigonometric; arithmetic-like data-compression, ... by algorithm. (Even square-root is such an algorithm.)

Exponentiation could be accomplished by cumulative successive partial multiplications, Π(1 + d x_i) where x_i are chosen to simplify the multiplication to just addition of d=2^-i, a bit-scaling (shift) and add or subtract ... A table of x_i is typically n-bits long ... It might be even practical to build the calculation directly in hardware, even as multiplication was so built-- and useful, as sinusoids are useful in video construction.

The fastest bit-scaler is built on base-8 (near base e² ~7.39), -requiring the least hardware reduplication and substrate 'realestate'.

* [Author's note: My original article was more inclusive than the ca-1950's utility, CORDIC CoOrdinate Rotation Digital Computer [algorithm] used for circular and hyperbolic x-y pair rotations,- but as it was conceptually related as progressive cumulation of additions, I kept the other's -title- as a subject-reference: believing it had meant the -subject- of successive narrowing of measure along a cord aka chord, a progressive binary-search-like calculus-analytic method for measuring nonlinear arcs by computation, and especially for reducing its error coefficient. (cf Then-new college subject, my UCSD Department of Applied Physics and Information Sciences APIS, also-noted Egyption bull, only years after I graduated, became EECS Electronic Engineering and Computer Sciences; but we'd built computers and interfaces from small-scale-integration in that era; and SAR Successive Approximation Register was ordinary technology, decades before calling that, pyramidal.) In my algorithm discovered-on-demand for myself in college I'd used the process for exponentials, and later heard and read of the CORDIC describing the same method with paired complex rotation angle increments chosen similarly to simplify the whole to multiplication by successively selected single-additions]

* [When an engineer at Linkabit asked whether there were a faster way to implement a 4-bit multiplication for our microprocessor, I replied, (A+B)² - A² - B² / 2, because our microprocessor had a 32-place lookup table for fast squaring]