0:02
Let's see how automatic vectorization works in code.
Automatic vectorization is a feature of
compilers that allows them to translate your scalar code into vector instructions.
So a vectorizing compiler will look for loops with data parallelism,
and it will try to lump together multiple scalar iterations into a single vector iteration.
With Intel compilers, automatic vectorization
is enabled at the default optimization level,
so you don't need to do anything for it to happen.
This means that your code may already be vectorized without you even knowing that.
If you want to see what the compiler did to your code, use the argument -qopt-report.
It tells the compiler to produce a text file with the extension .optrpt.
In this text file, you will find blocks delimited by LOOP BEGIN and LOOP END
describing what the compiler did to your loops and functions.
There are two numbers to pay attention to in each block:
the first is the line number and the second is the column number.
So here 12 is the line and 3 is the column,
and the statement beginning at this position was vectorized.
So it means success:
the compiler was able to take this loop and translate it into vector instructions.
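To make this concrete, here is a minimal sketch, not the actual code from the lecture: the file name, function name, and report excerpt below are illustrative, and the exact remark numbers and wording of the report vary between compiler versions.

    // example.cpp -- hypothetical file and function names, for illustration only
    void vector_add(float* a, const float* b, const float* c, int n) {
        for (int i = 0; i < n; i++)      // the report cites this statement's line and column
            a[i] = b[i] + c[i];
    }

    // Compiled with, e.g., "icpc -c -qopt-report example.cpp", the file example.optrpt
    // would contain a block roughly like:
    //   LOOP BEGIN at example.cpp(3,5)
    //      remark #15300: LOOP WAS VECTORIZED
    //   LOOP END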
When the compiler looks for loops to vectorize,
it will only try to vectorize the innermost loops.
So if you have multiple layers of loop nesting,
it will only look at the innermost loop, and there it will try to
lump together multiple scalar iterations into a single vector iteration.
You can override this behavior with #pragma omp simd, which we will discuss later.
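As a hedged preview of that override (the function and array names below are made up, and with Intel compilers the OpenMP SIMD directive may need -qopenmp or -qopenmp-simd to take effect):

    // By default only the innermost loop over j is a vectorization candidate.
    // Placing #pragma omp simd on the outer loop asks the compiler to vectorize
    // the loop over i instead.
    void scale_rows(float a[][64], const float b[][64], int n) {
        #pragma omp simd
        for (int i = 0; i < n; i++)
            for (int j = 0; j < 64; j++)
                a[i][j] = 2.0f * b[i][j];
    }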
The number of iterations in the loop must be known when you begin the loop.
If you know it at compilation time, this is ideal,
but even if you only know it at runtime, this is still good.
For loops are going to be vectorized as long as they don't have premature exits.
While loops, on the other hand,
are out of the question,
because you don't know in advance when a while loop will end.
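For example (a minimal sketch with made-up names), the first loop below has a trip count n that only needs to be known when the loop starts, while the second loop's exit depends on the data itself:

    // Vectorizable: the trip count n is known at loop entry, even if only at run time.
    void double_all(float* a, int n) {
        for (int i = 0; i < n; i++)
            a[i] *= 2.0f;
    }

    // Not vectorizable: the number of iterations depends on data examined inside the loop.
    int count_leading_positive(const float* a) {
        int i = 0;
        while (a[i] > 0.0f)
            i++;
        return i;
    }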
The compiler will also check for vector dependences in your loops,
and vectorization will fail if a dependence is detected.
We will talk about it later.
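To give one classic example ahead of that discussion (array names are made up): in the first loop below each iteration reads a value written by the previous iteration, which is exactly the kind of vector dependence that stops vectorization; the second loop has no such dependence.

    // Loop-carried dependence: a[i] needs a[i-1] from the previous iteration,
    // so the iterations cannot be lumped into one vector instruction.
    void running_sum(float* a, const float* b, int n) {
        for (int i = 1; i < n; i++)
            a[i] = a[i - 1] + b[i];
    }

    // Independent iterations: safe to vectorize.
    void elementwise_add(float* a, const float* b, int n) {
        for (int i = 1; i < n; i++)
            a[i] = a[i] + b[i];
    }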
If you have function calls in your loops, these functions
must be SIMD-enabled, and we will talk about SIMD-enabled functions soon.
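As a hedged preview (the function name is hypothetical; #pragma omp declare simd is the standard OpenMP way to create such functions), a SIMD-enabled function gets a vector version that can be called from inside a vectorized loop:

    // The pragma asks the compiler to also generate a vector variant of this function.
    #pragma omp declare simd
    float my_kernel(float x) {          // hypothetical function name
        return x * x + 1.0f;
    }

    // Calling a SIMD-enabled function inside the loop does not break vectorization.
    void apply_kernel(float* a, const float* b, int n) {
        for (int i = 0; i < n; i++)
            a[i] = my_kernel(b[i]);
    }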
The beauty of automatic vectorization
is that you don't need to target your code for a particular architecture.
You can recompile a single code for multiple architectures by
just changing one compiler argument: -x followed by the code name.
For example, if you want to compile your code for the vector instructions
found in Intel Xeon Phi processors, use -xMIC-AVX512.
You can also use -xHost, and it will target
the same architecture that is found on your compilation node.
So if you're compiling on a certain computer,
-xHost will target that same computer.
If you want to target multiple architectures and implement
runtime code dispatch, use -ax followed by the code name.
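For example (the source file name is made up, and the available code names depend on your compiler version), the same source can be rebuilt for different targets by changing only that one argument:

    icpc -xMIC-AVX512   nbody.cpp -o nbody    # vector instructions of Intel Xeon Phi (AVX-512)
    icpc -xHost         nbody.cpp -o nbody    # whatever the compilation node supports
    icpc -axCORE-AVX512 nbody.cpp -o nbody    # baseline path plus an AVX-512 path, dispatched at run time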
Even though earlier I demonstrated
just a primitive example of automatic vectorization, A plus B,
in fact auto-vectorized loops may be quite complex.
Here is a snippet from an N-body simulation.
In this simulation, we have n particles
interacting with each other through Newton's law of universal gravitation,
and there is a complex expression for the force of this interaction.
The compiler will be able to recognize this loop in i,
and it will lump together iterations for
different values of i into a single vector iteration.
To do that, it will have to recognize that x[j] does not depend on i,
so this scalar quantity will be
translated into a vector where every vector lane contains the same value x[j].
It is different for x[i]:
it will be translated into a vector that contains x[i],
x[i+1], and so on up to
x[i+15], and then this subtraction can be executed with a vector instruction.
Down the line it will use these expressions to compute the transcendental function,
square root, and proceed with the rest of the calculation.
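Since the slide itself is not reproduced in this transcript, here is a minimal sketch of the kind of loop being described; the variable names and the softening constant are assumptions, not the lecture's exact code.

    #include <math.h>

    // Force on particle j from all n particles under Newton's law of universal gravitation.
    // The loop over i auto-vectorizes: x[j], y[j], z[j] are broadcast to every vector lane,
    // while x[i], x[i+1], ..., x[i+15] fill the lanes of one vector register.
    void force_on_particle(int j, int n, const float* x, const float* y, const float* z,
                           const float* m, float* Fx, float* Fy, float* Fz) {
        const float softening = 1e-6f;   // assumed constant that avoids division by zero at i == j
        float fx = 0.0f, fy = 0.0f, fz = 0.0f;
        for (int i = 0; i < n; i++) {
            const float dx = x[i] - x[j];
            const float dy = y[i] - y[j];
            const float dz = z[i] - z[j];
            const float r2 = dx*dx + dy*dy + dz*dz + softening;
            const float invR  = 1.0f / sqrtf(r2);     // the square root is done with vector instructions
            const float invR3 = invR * invR * invR;
            fx += dx * m[i] * invR3;    // gravitational constant and m[j] factored out for brevity
            fy += dy * m[i] * invR3;
            fz += dz * m[i] * invR3;
        }
        *Fx = fx; *Fy = fy; *Fz = fz;
    }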
Automatic vectorization is a powerful and flexible tool.
It allows you to have a single code for multiple computer architectures.
All you have to do to compile for a different architecture is change a compiler argument.
Keep watching to learn what controls you have to direct automatic vectorization.