|a viably attainable technology for stereo image capture|
[topically related to stereo-eyes-ed image-viewing by sexichrome and flood-focusing; and to television's original NTSC concept]
[NB: This article concerns current technology, and refers to products available from commercial operations not affiliated with NEMO]
The present color video technology supplies successive images on a near-flat, fixed-depth display screen - camera'ed through a single optical color-channel. But conceivably a color-camera can scope on dual, binocular optical channels, electronically (numerically) re-aligned, without changing the normal video-scan operation: that re-alignment signal contains the image depth information.
Consider each point in a binocular line-pair of vision: most points are seen along both lines - though not at the same angle, and that difference determines the apparent depth - while points stereoscopically half-hidden behind the edge of a foreground opaque object are seen along only one line. [Points behind transparencies have multiple-image focal-line categories, typically three significant in the case of a simple window-pane; behind translucencies the focal-line categories may be jumbled and unequally blurred]
Each camera'ed point may be represented by a fairly simple function of its stereoscopic position-of-origin: its consequent angle of focal dispersion, plus a masking start-stop function for maintaining object-edge opacity as needed - kept fairly simple: one foreground object, slow-scanned (low signal bandwidth). The primary exceptions suggested to this paradigm are multi-planed views through windows: a window reflects images of the viewer and what is behind; upholds dust, smears, film, or drawn images on its surface, near; and opens to the view - room, case - beyond: three significant perceivable image depths. But a director of video may choose for the vidience: focus near, mid, or far.
The technological expense of multiple color-camera elements for HDDVideo production equipment can be simplified and reduced to 2-CCD or 2-CMOS arrangements, utilizing 2.07 Mpx 16:9 (1920×1080) CCD or CMOS: one with the full 2.07 Mpx luminance resolution, the other with an array of RGBG "Bayer" color-pattern pixel-quad filter-masks, for chrominance (color) resolution coarser by a factor of two than luminance. However, this simplest stereo-eyes-ed arrangement - luminance on one channel, color on the other - suffers co-domain-shadowing of the background along foreground-object edges: one shadow renders reduced luminance definition, which is non-objectionable; the other edge is color-depleted, which is objectionable. (This is Color-Right I.)
Better balancing of co-domain-shadowing, to preferentially retain chrominance, may be attained by using only RGBG color-CCD/CMOS on both optical channels: the overall luminance resolution is mostly maintained by the high population of luminance-dominant green, constituting half the pixel count in each CCD/CMOS - as red and blue require a quarter or less of the spatial (and temporal) resolution (both are given a half) - averaging 75% (compared to monocular) due to pixel-overlap between the left and right channels (presuming 100% stereo alignment). This is Color-Right II.
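The claim that green, at half the pixel count, carries most of the luminance can be illustrated with a toy demosaic sketch. This is a minimal assumption-laden illustration, not the proposal's optics: `bayer_green_luma` is a hypothetical name, RGGB quad order is assumed, and the four-neighbour averaging rule is the simplest possible fill (real demosaicing is edge-aware).

```python
import numpy as np

def bayer_green_luma(mosaic):
    """Recover a full-resolution luminance estimate from an RGBG mosaic:
    green sites (half the pixels; RGGB quad order assumed) are taken
    directly, and red/blue sites are filled with the mean of their four
    green neighbours."""
    h, w = mosaic.shape
    green = np.zeros((h, w), dtype=bool)
    green[0::2, 1::2] = True   # green sites in the red rows
    green[1::2, 0::2] = True   # green sites in the blue rows
    luma = np.empty((h, w), dtype=float)
    luma[green] = mosaic[green]
    # Every red/blue site has four green 4-neighbours in this layout.
    padded = np.pad(mosaic, 1, mode="edge")
    neigh = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
             padded[1:-1, :-2] + padded[1:-1, 2:]) / 4.0
    luma[~green] = neigh[~green]
    return luma
```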
Stereo-eyes-ed video pick-up needs fairly low overall resolution, as the number of objects is an order lower than the number of pixels: even small objects, such as a character of text, connect tens of pixels, and the shape of objects usually confines them to categorical focus distances with small depth-angle changes - i.e. a slope-limited depth-of-view signal - the exception being straight pipe-shapes receding into the distance; whereas a face needs only its roundness to appear stereo-eyes-ed: the nose has the same categorical distance as the face, except very close-up. However, the placement-angle precision of overlapping objects is more important: and this is determined more by amplitude precision than by amplitude accuracy or object-definition precision.
[Alternatively, only a few depth-levels are of interest to the eye; these can be parameterized and individual pixels assigned from this narrower selection - or separate scans, each at a separate depth, which needs something akin to run-length compression coding]
I am proposing for SesQuaTercet a stereo-eyes-ed color camera, Color-Right II, with 2 RGBG-color-masked 2 Mpx CMOS (lower cost and more directly compatible circuitry than the prior CCD), one each on two coplanar optical-channel lines of view. The stereo definition is then derived, at lower luminance resolution, from mathematical correlation of overlapping 16-pixel-rectangles, binomially amplitude-weighted sums (approximating the normal distribution), from both left and right optical channels - this is a sliding function usably precise to about 2-pixel width, as the luminance detail has been reduced by a factor of about five (to half-amplitude). [Alternatively 8×2- or 4×4-pixel-rectangles may be used for estimating luminance for stereo-alignment processing - but the verticality of all non-singular images must be maintained, to maintain stereo-eyes-ed image realism]
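The sliding binomially weighted correlation can be sketched in one scan-line dimension. This is an illustrative assumption, not the proposed hardware path: `disparity_1d` is a hypothetical name, the weighted squared difference stands in for "mathematical correlation", and the shift search range is arbitrary.

```python
import numpy as np
from math import comb

def binomial_weights(n=16):
    # A row of Pascal's triangle, normalized: approximates the normal
    # distribution, as the text notes for the 16-pixel-rectangle sum.
    w = np.array([comb(n - 1, k) for k in range(n)], dtype=float)
    return w / w.sum()

def disparity_1d(left, right, max_shift=8, n=16):
    """Slide a binomially weighted n-pixel window along the left scan line
    and find, per position, the right-channel shift minimizing the
    weighted squared difference."""
    w = binomial_weights(n)
    out = []
    for x in range(len(left) - n):
        best_s, best_err = 0, float("inf")
        for s in range(-max_shift, max_shift + 1):
            if 0 <= x + s and x + s + n <= len(right):
                diff = left[x:x + n] - right[x + s:x + s + n]
                err = float(np.dot(w, diff * diff))
                if err < best_err:
                    best_s, best_err = s, err
        out.append(best_s)
    return np.array(out)
```

On a right channel that is simply the left shifted by a few pixels, the recovered disparity is constant, illustrating the "sliding function" behaviour.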
This proposal intends 2 CMOS, one per optical channel - albeit its optics for the Color-Right II application should be upgraded: a balance between HDDV/HDTV 2 Mpx maximized luminance resolution and precisely deresolved chrominance, focus-spread normally (coarsely triangularly), so that each distant photon point-source illumines 2×2 pixels across (half-width; pyramid area-response: 4×4 pixels, full base-width). [Additionally it may be preferable that the left and right channels be oppositely Bayer-pattern polarized, by shifting one up-left a row and a column (tbd)]
... [under further construction] ...
The stereo-coefficients reduction algorithm may be assisted by several rules (presuming precalibrated CCD/CMOS): Sums of consecutive pixels must be near-equal, left and right, except at the end-points, and except at steep depths, where the subtended angle differs significantly. A series (summation) algorithm which walks, stepping left, right, etc., must alternately supersede, except at the end-points. More accurately, fine-tuning the left-right alignment, each pixel in one channel (left or right) must sit proportionally between two in the other (except when exactly aligned on pixels).
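One possible reading of the walking-summation rule is a two-pointer walk that advances whichever channel's running (prefix) sum is behind, so the two sides alternately take the lead. This is an assumption - the text leaves the rule's details open - and `align_by_prefix_sums` is a hypothetical name:

```python
def align_by_prefix_sums(left, right):
    """Walk the left/right scan lines together: at each step, record the
    current index pairing, then advance whichever channel's running sum
    is behind (ties advance the left). The end-points are the exceptions
    the text notes."""
    pairs, i, j = [], 0, 0
    sl = sr = 0.0
    while i < len(left) and j < len(right):
        pairs.append((i, j))
        if sl + left[i] <= sr + right[j]:
            sl += left[i]; i += 1
        else:
            sr += right[j]; j += 1
    return pairs
```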
Object depths fit categories: about 10-16 depths of interest, from near-at-hand to far stars; most foci exhibit constant depth-levels over the entire object - a possible exception being, very close-up, a road of houses receding into the distance, which has many more depth-steps, or approaches a continuum of receding depth.
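The depth-category idea, together with the run-length coding the earlier bracketed note suggests, can be sketched as follows; the level list and function names are illustrative only:

```python
def quantize_depths(depths, levels):
    """Snap each per-pixel depth to the nearest of a short list of
    categorical depth-levels (the text suggests ~10-16 of interest)."""
    return [min(range(len(levels)), key=lambda i: abs(levels[i] - d))
            for d in depths]

def run_length(codes):
    """Run-length encode a scan line of depth-category indices: objects
    hold constant depth-levels, so runs are long and cheap to code."""
    runs = []
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return [(c, n) for c, n in runs]
```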
A note on video signal reduction:
The current high-end technology for compressing photographic content (patented, not by NEMO) is SPIHT and its (suboptimal) variants, which (on my best research at length) is the intuitively simple derivative arithmetic on quads of elements: sum or total (++++), slopes or tilts, horizontal and vertical (+-+-), (++--), and saddle or twist (+--+). (The literature seems to call these wavelets, but I'd believed wavelets were truncated sinc-like functions; I'd sooner call these SPIHT coefficients derivlets, slopelets or tiltlets - or tendlets: zeroth order and up.) Four coefficients replace four pixels, an increase of 2 bits of precision each per stage, applied first to quads of the most basic pixels, points; then regrouping coefficients like-with-like and repeatedly applying the same simple arithmetic, stacking up levels like a pyramid, increasing precision by 21 bits total for 2 Mpx, or energetically by sigma = 9.5 bits (sqrt N / 2). Then entropic coding of successive bit-slices, high-order first: adapted firstly to sparse '1's at the highest bits, because very few coefficients are full amplitude; secondly to a smallish average in the second-significant bit, less than 50%, as amplitudes are overall bounded and statistically avoid the larger values [a small statistical improvement]; thirdly likewise at the mid-high bits; and fourthly uncompressed, at 50:50% '1's and '0's in the lower significant bits, essentially uncompressible. The SPIHT method gleans a list of significants - coefficients already bit-started - to adjust their compression accordingly and keep the unstarted in open '0'-meadows. (My choreonumeric entropic coding might do nicely [tbd].)
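The quad arithmetic described above can be sketched directly. This shows only the unnormalized transform and pyramid stacking, not SPIHT's significance-list entropic coding; `quad_stage` and `quad_pyramid` are illustrative names, and the 2×2 block is taken in [a b; c d] order:

```python
import numpy as np

def quad_stage(x):
    """One stage: each 2x2 quad [a b; c d] becomes sum (++++),
    horizontal tilt (+-+-), vertical tilt (++--) and twist/saddle (+--+)."""
    a = x[0::2, 0::2]; b = x[0::2, 1::2]
    c = x[1::2, 0::2]; d = x[1::2, 1::2]
    return a + b + c + d, a - b + c - d, a + b - c - d, a - b - c + d

def quad_pyramid(x):
    """Regroup the sums like-with-like and repeat, stacking up levels
    like a pyramid, until one top coefficient remains (square input,
    power-of-two side assumed). Each stage widens precision by 2 bits."""
    detail = []
    while x.shape[0] > 1:
        x, th, tv, tw = quad_stage(x)   # the sums feed the next stage
        detail.append((th, tv, tw))
    return x, detail
```

On a flat 4×4 field of ones, the two stages leave all tilts and twists zero and a top sum of 16, consistent with the 2-bits-per-stage growth the text counts up to 21 bits for 2 Mpx.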
I've noticed (in research) that the least significant bits are not described as receiver-dithered; but they should be, by .35 LSB for all but the zero coefficients - not by .5 LSB, which is the next-beyond-LSB and half the time too much, giving objectionable peaking by .25 LSB in 25% of all cases (peaking tends to be more objectionable than softening, because peaking introduces artifacts and cumulative artifact energy; softening does not). And .5 LSB is also excessive compared to the abundant unstarted near-zero adjacent coefficients, which cannot receive dither at all, certainly not by .5 LSB, and must be remanded to later starting and dither; while .25 dither on started significants is unnecessarily small for the largest needing it, compared to those unstarted which may need as much, or more - and therefore .35 is "just right". (The dither can also be shown to better serve predictive bit-estimation, as well as value-estimation.) SPIHT also emphasizes the coefficient amplitudes of the tilts by double, and of the twist by quadruple. As a general rule in video representation, the busier the image, the less accurate its amplitude: signal bandwidth tends to be constant across all coefficients together. SPIHT is also progressive-resolution, extensible by appending cascaded detail - though in most implementations it also starts with a larger-than-unit full-color miniature.
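One reading of the .35 LSB rule, as receiver-side dither, might look like the following; the uniform-noise choice, the name `receiver_dither`, and the per-coefficient treatment are assumptions for illustration:

```python
import random

def receiver_dither(coeffs, lsb, amp=0.35, seed=0):
    """Add uniform receiver-side dither of +/- amp*LSB to every started
    (non-zero) coefficient; unstarted zero coefficients are left
    untouched, as the text argues they cannot receive dither. amp
    defaults to the 0.35 the argument above calls 'just right'."""
    rng = random.Random(seed)
    return [c if c == 0 else c + rng.uniform(-amp, amp) * lsb
            for c in coeffs]
```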