Code generation is better in VS.NET 2003, but it still isn't able to resolve binary ops of the form a = b * c without extra register moves:

    push      ebp
    mov       ebp,esp
    pxor      xmm0,xmm0
    movdqa    xmm1,xmm0
    movd      xmm0,dword ptr [ebp+8]
    punpcklbw xmm0,xmm1
    pshuflw   xmm1,xmm0,0FFh
    pmullw    xmm0,xmm1
    psrlw     xmm0,8
    movdqa    xmm1,xmm0
    packuswb  xmm1,xmm0
    and       esp,0FFFFFFF0h
    movd      eax,xmm1
    mov       esp,ebp
    pop       ebp
    ret

The code is at least correct this time, but it is still full of unnecessary data movement, which consumes decode and execution bandwidth.
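For reference, here is a sketch of the kind of SSE2 intrinsics source that produces a dump like the one above (a hypothetical reconstruction, not the actual code): it scales the color channels of a 32-bit pixel by its own alpha.

    #include <emmintrin.h>

    // Hypothetical reconstruction: scale the channels of a 32-bit BGRA pixel
    // by its alpha, matching the structure of the compiler output above.
    unsigned ScaleByAlpha(unsigned px) {
        __m128i zero = _mm_setzero_si128();
        __m128i c = _mm_unpacklo_epi8(_mm_cvtsi32_si128(px), zero); // bytes -> words
        __m128i a = _mm_shufflelo_epi16(c, 0xFF);                   // broadcast alpha (pshuflw)
        c = _mm_srli_epi16(_mm_mullo_epi16(c, a), 8);               // (c * a) >> 8
        c = _mm_packus_epi16(c, c);                                 // words -> bytes
        return _mm_cvtsi128_si32(c);
    }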

The code has shipped and is in 1.5.10, but is currently hard-coded off.

I might resurrect it, as NVIDIA reportedly exposes a number of hardware features in OpenGL that are not available in Direct3D, such as the full register combiners, and particularly the final combiner.

One of the features I've been working on for 1.6.0 is the ability to do bicubic resampling in the video displays using hardware 3D support.

We've been using simple bilinear for too long, and it's time we had better-quality zooms accelerated on the video card.

The problem is that different drivers and applications are inconsistent about how they treat or format odd-width and odd-height YV12 images. Since YV12 subsamples chroma by two in each axis, an odd width or height leaves the size of the chroma planes ambiguous.

Some support it by truncating the chroma planes (dumb), which leaves the last row or column of luma without chroma. Now, if people had sense, they would have handled this the way that MPEG and JPEG do, and simply required that the bitmap always be padded to the nearest even boundaries and that the extra pixels be ignored on decoding.
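As a concrete illustration of the pad-to-even convention, here is a small sketch (my own, not any particular API's rule) that computes YV12 plane sizes by rounding the chroma dimensions up instead of truncating:

    #include <cstdint>

    // Plane sizes for an odd-sized YV12 image under the round-up (pad-to-even)
    // convention described above; a sketch, not any specific driver's rule.
    struct YV12Layout {
        uint32_t ySize;   // bytes in the Y plane
        uint32_t cSize;   // bytes in each chroma plane
    };

    YV12Layout ComputeYV12Layout(uint32_t w, uint32_t h) {
        uint32_t cw = (w + 1) >> 1;   // chroma width: round up, not truncate
        uint32_t ch = (h + 1) >> 1;   // chroma height: round up, not truncate
        return { w * h, cw * ch };    // e.g. 99x49 -> Y: 4851, each chroma: 50*25
    }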

I know a few of you are going to yell out "use compiler intrinsics," but please look at this first:

    push      ebp
    mov       ebp,esp
    and       esp,0FFFFFFF8h
    sub       esp,8
    movd      mm1,dword ptr [ebp+8]
    pxor      mm0,mm0
    punpcklbw mm1,mm0
    movq      mm0,mm1
    movq      mm2,mm0
    punpckhwd mm2,mm1
    movq      mm1,mm2
    punpckhwd mm1,mm2
    pmullw    mm0,mm1
    psrlw     mm0,8
    movq      mmword ptr [esp],mm0
    emms
    movq      mm0,mmword ptr [esp]
    movq      mm1,mm0
    packuswb  mm0,mm1
    movd      eax,mm0
    mov       esp,ebp
    pop       ebp
    ret

Note where the emms landed: the compiler spills the result to the stack, clears MMX state, and then issues more MMX instructions, so the function returns with the FPU tag word still in MMX mode. This, historically, is why I have not bothered to use MMX/SSE/SSE2 compiler intrinsics in VirtualDub: the code generation sucks.
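For contrast, here is roughly what a hand-written version of the same operation looks like (a hypothetical sketch in VC-style inline assembly; the register choices and function name are mine): no stack spill, no redundant moves, and emms where it belongs.

    // Sketch of a hand-written equivalent (hypothetical, VC-style inline asm).
    unsigned __declspec(naked) ScaleByAlphaMMX(unsigned px) {
        __asm {
            movd      mm0, [esp+4]    ; load 32-bit pixel
            pxor      mm7, mm7
            punpcklbw mm0, mm7        ; bytes -> words
            movq      mm1, mm0
            punpckhwd mm1, mm1        ; [r r a a]
            punpckhwd mm1, mm1        ; [a a a a] broadcast alpha
            pmullw    mm0, mm1
            psrlw     mm0, 8
            packuswb  mm0, mm0        ; words -> bytes
            movd      eax, mm0
            emms                      ; leave MMX state before returning
            ret
        }
    }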

The VC6 processor pack was quite bad and tended to generate about two move instructions for every ALU op; this was improved in VS.NET 2003.

Although I've had it installed for some time now, I've been avoiding Visual Studio .NET. The incremental improvements in the compiler simply aren't worth putting up with the braindead, butt-slow IDE.

Thus, I've been continuing to use Visual C++ 6.0 SP5 with the Processor Pack.

We can do this on a GPU by rendering the horizontal pass into a render-target texture, then using that texture as the source for the vertical pass; because the bicubic kernel is separable, the two 1-D passes together are equivalent to the full 2-D filter.
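To make the two-pass structure concrete, here is a minimal CPU sketch of the same idea (my own illustration, assuming a Catmull-Rom kernel; the intermediate buffer plays the role of the render-target texture):

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Catmull-Rom weights for sampling phase t in [0,1); one common bicubic kernel.
    static void CubicWeights(float t, float w[4]) {
        float t2 = t * t, t3 = t2 * t;
        w[0] = -0.5f * t3 +        t2 - 0.5f * t;
        w[1] =  1.5f * t3 - 2.5f * t2 + 1.0f;
        w[2] = -1.5f * t3 + 2.0f * t2 + 0.5f * t;
        w[3] =  0.5f * t3 - 0.5f * t2;
    }

    // One 1-D pass: resample each row of a sw x sh grayscale image to width dw.
    static std::vector<float> PassX(const std::vector<float>& src,
                                    int sw, int sh, int dw) {
        std::vector<float> dst(dw * sh);
        for (int y = 0; y < sh; ++y) {
            for (int x = 0; x < dw; ++x) {
                float sx = (x + 0.5f) * sw / dw - 0.5f;   // map to source coords
                int ix = (int)std::floor(sx);
                float w[4];
                CubicWeights(sx - ix, w);
                float acc = 0.0f;
                for (int k = -1; k <= 2; ++k) {           // 4-tap window, clamped
                    int c = std::min(std::max(ix + k, 0), sw - 1);
                    acc += w[k + 1] * src[y * sw + c];
                }
                dst[y * dw + x] = acc;
            }
        }
        return dst;
    }

    static std::vector<float> Transpose(const std::vector<float>& src, int w, int h) {
        std::vector<float> dst(w * h);
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x)
                dst[x * h + y] = src[y * w + x];
        return dst;
    }

    // Two-pass resize: horizontal pass into an intermediate image (the analogue
    // of the render-target texture), then the same 1-D pass run vertically.
    std::vector<float> BicubicResize(const std::vector<float>& src,
                                     int sw, int sh, int dw, int dh) {
        std::vector<float> horiz = PassX(src, sw, sh, dw);    // dw x sh
        std::vector<float> t = Transpose(horiz, dw, sh);      // now sh x dw
        std::vector<float> vert = PassX(t, sh, dw, dh);       // dh x dw
        return Transpose(vert, dh, dw);                       // dw x dh
    }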