* Experimental changes to reduce allocations in kernel threading and the DMA handler
* Simplify the changes in this branch to just two: (1) don't make unnecessary copies of data for texture-to-texture transfers, and (2) add a fast path for 1bpp linear byte copies
* Check src and dst linearity in the 1bpp DMA fast path (previously forgotten). Fixes the UE4 regression.
* Remove a dev log I left in
* Generalize the DMA linear fast path to cases other than 1bpp copies
* Revert kernel changes
* Revert whitespace
* Remove unneeded references
* Address PR feedback
Co-authored-by: Logan Stromberg <lostromb@microsoft.com>
Co-authored-by: gdk <gab.dark.100@gmail.com>