Week 4 - SIMD, SVE, SVE2
Newer processors have to spend some of their space on the code and logic needed to stay backwards compatible with older instruction sets.
Every platform exposes a set of flags, and they differ from platform to platform. The flags show what the CPU's capabilities are. The flags are grouped into levels (on x86_64, the microarchitecture levels x86-64-v1 through x86-64-v4); the higher the level, the more capabilities the CPU has. You can choose which level to target when developing.
On and off are like 1 and 0: if electricity flows through a component, it's on. But maintaining a flow costs power, so it can be more efficient to use stored state instead. A component is charged up, and if it holds a charge it's a 1; the charge is refreshed whenever it gets low. Storing a charge is a better use of power than maintaining a flow. The smaller the component, the less power it uses, but there is a limit to how small it can be: make it too small and we can no longer tell whether it's a 0 or a 1.
Autovectorization - the compiler does most of the work of using vectors for us. It looks for cases where vector instructions can be applied and applies them automatically. This is the most commonly used approach, and it has improved a lot recently.
Enable it with "-ftree-vectorize".
Diagnostics: "-fopt-info-vec-all" or "-fopt-info-vec-missed".
Profile-guided optimization – use data from real runs to see how often each operation is used, and based on that the compiler can optimize the common paths.
Inline assembly – use conditional compilation (if/else on the target) to pick which assembly to use. The __asm__ or asm keyword marks it as inline. The form is asm(template : outputs : inputs : clobbers).
Example (x86, where addl takes two operands): asm("addl %1, %0" : "+r" (destination) : "r" (source));
Intrinsics – extensions to C that add vector capabilities. They look and behave like ordinary functions.
Inline assembly and intrinsics are applied and maintained at the library level.
SIMD - single instruction, multiple data (e.g. AVX)
The traditional approach processes data one element at a time. With SIMD, the CPU issues a single instruction and it is applied simultaneously across a vector/array of data. This way we don't have to raise the clock speed; we can widen the processing instead. A 128-bit register can be split into 2, 4, 8, or 16 pieces (V0.2D – double words, V0.4S – single words, V0.8H – half words, V0.16B – bytes). It is a bit like loop unrolling: do more in one instruction. The pieces the 128 bits are split into are called lanes.
SVE – scalable vector extension
This is different: it pairs a predicate register with the vector, and the lowest bit for each element controls whether that lane is on or off (1 or 0). That is how the hardware can tell whether a lane is in use. The predicate can be set with the WHILELO instruction, so the loop only processes the elements it actually needs. This is some next-level vectorization.
SVE2 – extensions to SVE
Adds more capabilities on top of SVE, such as instructions aimed at AI-type workloads.
qemu-aarch64 – this software emulation kicks in for instructions that can't run on the native processor. It gives you the option of running AArch64 code on an x86 machine.
It is crazy how fast this technology grows: there were things GCC couldn't do last semester that it can do this semester. The people who work on GCC must be incredibly good at this stuff. We are only focusing on learning the optimization and portability of software. Doing vectorization makes me think of multithreading and async functions. Seeing how vectorization grows and how people keep coming up with better solutions is amazing; for example, how AVX10 on x86_64 stays portable to older and earlier SIMD, similar to SVE on the AArch64 platform. I find SVE so cool because it can run on any register width, even future sizes that haven't been made yet.