ultrafast arithmetic

Addition with fast lookahead-carry,
Fast multi-way addition,
Multiplication by selective-multi-way-addition,
Low-complexity squaring in VHDL,
Multiplication by squaring,
Divison by binary-scaled super-radix,
Complex-exponentiation like old-multiplication,
Trigonometrics by successive approximation ...


We can estimate the metric efficiency of complex hardware circuitry, as gates*time: cf on the large-scale, doubling-the-hardware accomplishes twice-as-much in the-same-time or equally-as-much in half-the-time on-the-average; (A better estimate might add-up gates*time 'entropies')....

Blended-cyclic-and-complex processes may oft be simplified by pipelining --allowing new data to begin a cycle while the current is advancing,-- but which adds a gate-delay for each single-level fast-catch-latch-register and which must be either cleared before switching evironments or tagged, stages self-directing, to deposit its results across-environments....


An already well-known result--

The basic 3-bit-'full'-adder compacts 2-bits of a 2-way-sum-input with a carry-propagate-in into a 1-bit-sum-output and a carry-propagate-out, si = aibici + aibici + aibici + aibici, and ci+1 = aibi + aici + bici , (averaging 2.3-AND-terms-in plus 1.0-OR-terms-out per-bit, implemented typically in 1-gate-delay per-AND-OR), But such propagation is slow, so--

The remedy implements a fast Lookahead-Carry, but which, to keep the complexity low, requires a second gate-level: of AND-OR chains of lower-order pre-OR's of input-bits and initial carry-in's, ci = Σk (ai-1+bi-1)···(ai-k+bi-k)(ai-k-1bi-k-1 = ci-k else c0), (whereas one-level-without-the-AND-OR-of-pre-OR's explodes AND-terms by the distribution of the same pre-OR AND-products needed)....

The result is, that, while the 3-bit-adder does-compact 2:1, 2-bits-in-to-1-bit-out, that second level Lookahead costs a gate-delay and so on-the-average compacts only 1.41-bits-average-in per gate-delay, And, for long words e.g. 63-bit, the Lookahead builds input-terms to ~63³-huge: The remedy in most implementations combines partials, local-carry-out and local-propagate, and their complements, from groups-of-groups, which is single-level possible because OR-AND is equivalent to INVERT-AND-OR-INVERT, (Note that CARRY-INVERT is equivalent to INVERT-CARRY which chains INVERT-PROPAGATE though itself not equivalent to PROPAGATE-INVERT), to simplify the higher-bit AND-OR terms to ~63² more-manageably ... Some implementations deliver co-complement outputs, while others use reference-differential-inputs... (Some implementations can also use multiple-isolated-outputs to be wire-AND'ed under a constant-current pullup, and thence each pre-OR may put out upto-6-lookahead-lines, one for itself, one a pairing, plus 4-, 8-, 16-, and 32-blocks.)


In comparison to the 2-way-add, the 3-bit-adder itself, compacts 3:2, 1.5-bits-average-in-to-1-bit-out, and obviates all-but-one-many Lookahead-Carries until the last, form-normalizing sum-with-Lookahead-Carry, (Note 1.57 > 1.418) ... so-- we may find applications that are more-efficient by retaining the spray of carry-bits vertically-locally until the final add-with-Lookahead-Carry-to-normal-form... Such is multi-way addition--

Using 3:2's only, optimal compactions are: (by 1:2-with-Lookahead-Carry) 2-bits (2:3) 3 bits (2:3 +1) 4 (2× 2:3) 6 (3× 2:3) 9 (4× 2:3 +1) 13 (6× 2:3 +1) 19 (9× 2:3 +1) 28 (14× 2:3) 42 (21× 2:3) 63-bits ... (1-bit-shy of 64-bit but fully usable as-is in signed-multiplication), and slightly faster at 11-gate-delays instead-of-12, and much-simpler with 1-instead-of-62 Lookahead-Carry's....


Noting first the near-exponentially increasing circuitry-complexity at greater compaction, the next, fast-compactors are 5:3, 6:3, and fastest is 7:3. The immediate next-two, 8:4, 9:4, are less than 7:3 yet cost 2× and 4× more circuitry (And 8:4 compacts like 6:3, at 2:1).

1. The last-compaction is necessarily the 2-way-add-with-Lookahead-Carry 2:1-compaction in 2-gate-delays,
2. Its preceding is necessarily (at-best) the 3:2-compaction in 1-gate-delay,
3. Its prior-preceding is necessarily (at-best) the 7:3-compaction in 1-gate-delay,
4. And thereafter multiple 7:3-compactions as needed for the original multi-way-add....

Building-up from a final normal-form-output, shows this is optimal at--
2-gate-delays: 2-bits (the basic 1:2-with-Lookahead-Carry),
3-gate-delays: 3-bits (1:2-with-Lookahead-Carry, 2:3),
4-gate-delays: 7-bits (1:2-with-Lookahead-Carry, 2:3, 3:7),
5-gate-delays: 15-bits (1:2-with-Lookahead-Carry, 2:3, 3:7, 2× 3:7 + 1),
6-gate-delays: 35-bits (1:2-with-Lookahead-Carry, 2:3, 3:7, 2× 3:7 + 1, 5× 3:7),
7-gate-delays: 80-bits (1:2-with-Lookahead-Carry, 2:3, 3:7, 2× 3:7 + 1, 5× 3:7, 11× 3:7 + 2:3),
... etc., or, 1-fewer-bit-per for matrix-vector-products-(sums), multiply-and-cumulate, etc....


Since the multiplication-parallelogram (rhombus if both words are the same bit-width) is computationally thickest around the vertical 'diagonal' column, we can use 7:3's there, but less-massive 6:3's around that etc. down to minimally-massive 3:2's at the tail-ends....

So, for example, 63-bit-unsigned-multiplication the old-way takes 2×6 = 12 level-gate-delays, but we can do it in 7-levels: to wit: 63 bits = 9× 7:3 = 27 = 4× 7:3 = 12 = 2× 6:3 = 6 = 1× 3:2 = 2 = 1× 2:1-with-Lookahead-Carry.



An already well-known result--

Using 3-instead-of-4 halfword-multiplies... Or, we can do it in 2-instead-of-3-or,-4 halfword-multiplies....



In the same way, 80-bit-squaring and 70-bit-multiplication take 6-gate delays....

Squaring is low-complexity because it takes one, parameter, the difference between successive squares is increasing, it has some supplementary 'symmetry' (1-n)²....

Compared to full multiplication, Squaring is half-or-less complex, [A1.A0]² = [A1]² + 2 [A1*A0] + [A0]² where its complexity-pyramid around the middle term peaks at half-or-less for half-as-many bits and so is quarter-or-less complex over all bits....

In fact for example, a 4-bit squaring logic is single-level:

00000 00000 00000
00001 00000 00001
00010 00000 00100
00011 00000 01001
00100 00000 10000
00101 00000 11001
00110 00001 00100
00111 00001 10001
01000 00010 00000
01001 00010 10001
01010 00011 00100
01011 00011 11001
01100 00100 00000
01101 00101 01001
01110 00110 00100
01111 00111 00001
10000 01000 00000
10001 01001 00001
10010 01010 00100
10011 01011 01001
10100 01100 10000
10101 01101 11001
10110 01111 00100
10111 10000 10001
11000 10010 00000
11001 10011 10001
11010 10101 00100
11011 10110 11001
11100 11000 10000
11101 11010 01001
11110 11100 00100
11111 11110 00001
Q0 = ?D0
Q1 = 0
Q2 = ?D1!D0
Q3 = ?D2!D1?D0 + !D2?D1?D0
Q4 = ?D2!D1!D0 + ?D3!D2?D0 + !D3?D2?D0
Q5 = ?D4?D3!D2?D1!D0 + ?D4!D2!D1?D0 + ?D4!D3?D2?D1!D0 + ?D4!D3!D1?D0 + ?D4!D3!D2?D0 + !D4?D3?D2?D0 + !D4?D3!D2?D1 + !D4!D3?D2?D1
Q6 = ?D4?D3?D0 + ?D3!D2!D1 + ?D4!D3?D1!D0 + ?D4!D3!D2?D1 + !D4?D3?D1 + !D4?D3!D2
Q7 = ?D4?D3?D1 + ?D4!D3?D2!D1 + ?D4!D3?D2!D0 + !D4?D3?D2
or = ?D4?D3?D1 + ?D4!D3?D2*![?D1?D0] + !D4?D3?D2
Q8 = ?D4?D3?D2 + ?D4!D3!D2 + ?D4!D3!D1 + ?D4!D3!D0
or = ?D4?D3?D2 + ?D4!D3*![?D2?D1?D0]
Q9 = ?D4?D3 + ?D4?D2?D1?D0


A*B = (A+B)² - (A-B)² / 4



By coarse-estimating the inverse of the divisor, in an 8-11-bit-index lookup table, with the high-bit always '1', and fast multiplying-back-through-with-subtraction, by a partial-word-multiplier (selective-multi-way-adder), 'long' division can be reduced to about-6-8 cycles, each about-7-levels, for about-35 gate-delays total....


Multiplication-like CORDIC * (chordic)

multiplicative-like, progressive cumulation

Multiplication on early digital computers was typically by sequential approximation and repetition of addition, progressive cumulation like manual long-multiplication and long-division implemented in hardware (equipment) for integers, and in software emulation aided by floating-point addition, -or fully hardware depending on cost, space, availability, factors-... In the present era it has become replaced by parallel micro-circuitry.

But by freeing this procedure from fixed multiplier and multiplicand, and speed-ups with simple fast bit-scaler ("barrel-shifter") and sequential-access memory ("SAM") or other, more functions can be efficiently calculated:-- Exponentiation; circular-function sinusoid trigonometric; hypertrigonometric; arithmetic-like data-compression, ... by algorithm. (Even square-root is such an algorithm.)

Exponentiation could be accomplished by cumulative successive partial multiplications, Π(1 + d xi) where xi are chosen to simplify the multiplication to just addition of d=2-i, a bit-scaling (shift) and add or subtract ... A table of xi is typically n-bits long ... It might be even practical to build the calculation directly in hardware, even as multiplication was so built-- and useful, as sinusoids are useful in video construction.

The fastest bit-scaler is built on base-8 (near base e² ~7.39), -requiring the least hardware reduplication and substrate 'realestate'.

* [Author's note: My original article was more inclusive than the ca-1950's utility, CORDIC CoOrdinate Rotation Digital Computer [algorithm] used for circular and hyperbolic x-y pair rotations,- but as it was conceptually related as progressive cumulation of additions, I kept the other's -title- as a subject-reference: believing it had meant the -subject- of successive narrowing of measure along a cord aka chord, a progressive binary-search-like calculus-analytic method for measuring nonlinear arcs by computation, and especially for reducing its error coefficient. (cf Then-new college subject, my UCSD Department of Applied Physics and Information Sciences APIS, also-noted Egyption bull, only years after I graduated, became EECS Electronic Engineering and Computer Sciences; but we'd built computers and interfaces from small-scale-integration in that era; and SAR Successive Approximation Register was ordinary technology, decades before calling that, pyramidal.) In my algorithm discovered-on-demand for myself in college I'd used the process for exponentials, and later heard and read of the CORDIC describing the same method with paired complex rotation angle increments chosen similarly to simplify the whole to multiplication by successively selected single-additions]

* [When an engineer at Linkabit asked whether there were a faster way to implement a 4-bit multiplication for our microprocessor, I replied, (A+B)² - A² - B² / 2, because our microprocessor had a 32-place lookup table for fast squaring]

Grand-Admiral Petry
'Majestic Service in a Solar System'
Nuclear Emergency Management

1978±, 1996, 2005,2011 GrandAdmiralPetry@Lanthus.net