.. _performance:

Improving CUDA Performance
==========================

One can regard the action of a CUDA kernel on a mesh as the distribution of
elementary tasks (one mesh cell = one elementary task) to the CUDA cores of
the GPU. The CUDA cores are grouped into *streaming multiprocessors* (SMPs)
on board the GPU. For instance, a Kepler K20 GPU has 13 SMPs with 192 cores
each (for single-precision data), hence 2496 CUDA cores in total. In a
similar manner, splitting the whole task into threads that perform the
elementary tasks on the CUDA cores obeys a two-level hierarchy: the global
mesh must be split into *logical blocks*, and the blocks are then split into
threads. The user has to choose the size of the blocks in X, Y and Z (a
minimal launch sketch is given at the end of this section).

A given block runs on a single SMP. If you choose blocks that are too small,
the SMPs are underused and performance is degraded. If you choose blocks
that are too large, the small amount of memory within each SMP (48 kB) is
saturated and the extra data spills into the device's global memory (the
"video RAM"), with a dramatic performance penalty. Other considerations also
affect the choice of the CUDA block size (for instance memory alignment),
but in short there is an optimal block size that maximizes performance. This
size depends on:

* the GPU;
* the kernel itself (it is not the same for all kernels of FARGO3D).

By default, the block sizes used to launch a kernel are the numbers provided
in the ``.opt`` file. These are "reasonable" values, but they are the same
for all kernels, hence they cannot be optimal for every one of them. A
makefile rule combined with Python scripting has been developed to perform a
systematic test of the performance of each kernel, individually, as a
function of the size of the CUDA blocks.

.. warning::
   ``make blocks`` does not work when the ``.opt`` option ``-DMPIIO`` is
   enabled.

At compilation time, a file called ``setup.blocks`` (where ``setup`` is the
name of your setup) is looked for in the corresponding setup directory
(e.g. ``setups/fargo``) in order to provide ``c2cuda.py`` with the best
block size for each kernel. You could write this file by hand, but in
practice it is generated automatically by the makefile when you execute the
rule called ``blocks``::

  make blocks setup=SETUP

Note that ``setup`` must be written in lower case, to avoid confusion with
the ``SETUP`` variable. Example::

  make blocks setup=fargo

You will see lines similar to::

  CompPresIso 64 8 1 appended
  CompPresAd was skipped.
  compute_slopes was skipped.
  compute_star was skipped.
  compute_emf was skipped.
  update_magnetic was skipped.
  substep1_x 16 8 1 appended
  substep1_y 32 4 1 appended
  substep1_z was skipped.
  substep2_a 64 8 1 appended
  ...

and a file called ``fargo.blocks`` is created inside ``setups/fargo``,
filled with this information, which represents the best block size found for
each kernel. The skipped functions are those not used in this particular
setup. The whole test generally takes a few minutes. At the end, you have a
``.blocks`` file similar to::

  CompPresIso 64 8 1
  substep1_x 16 8 1
  substep1_y 32 4 1
  substep2_a 64 8 1
  ...

From then on, each time you compile the code, this file is read by the
``c2cuda.py`` script. In the best cases, this can increase performance by
10-20%; the largest gains are obtained in massive 3D MHD problems.

.. note::
   The ``.blocks`` file can be kept for future runs if you want to save
   time. Note, however, that it is in principle hardware dependent, so be
   careful about sharing the same file between different platforms.
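To make the two-level hierarchy concrete, here is a minimal, standalone CUDA
sketch (not FARGO3D code; the kernel, field layout and mesh dimensions are
hypothetical) showing how the block size in X, Y and Z, i.e. the numbers
stored per kernel in the ``.blocks`` file, determines the grid of logical
blocks that covers a mesh with one thread per cell::

  /* Hypothetical kernel: one thread handles one mesh cell
     (the elementary task). */
  __global__ void scale_field (float *field, float factor,
                               int nx, int ny, int nz) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;
    int k = blockIdx.z*blockDim.z + threadIdx.z;
    if (i < nx && j < ny && k < nz)
      field[(k*ny + j)*nx + i] *= factor;   /* flat 3D indexing */
  }

  int main () {
    const int nx = 384, ny = 128, nz = 32;  /* made-up mesh size */
    float *field;
    cudaMalloc (&field, (size_t)nx*ny*nz*sizeof(float));

    /* The tunable parameter: block size in X, Y and Z. Too small
       underuses the SMPs; too large exhausts their memory. */
    dim3 block (64, 8, 1);

    /* The grid of logical blocks is derived from the mesh and the
       block size, rounding up so that every cell is covered (the
       kernel's bound check discards the excess threads). */
    dim3 grid ((nx + block.x - 1)/block.x,
               (ny + block.y - 1)/block.y,
               (nz + block.z - 1)/block.z);

    scale_field<<<grid, block>>> (field, 2.0f, nx, ny, nz);
    cudaDeviceSynchronize ();
    cudaFree (field);
    return 0;
  }

Timing such a launch for several candidate block sizes, as ``make blocks``
does for each FARGO3D kernel, is what singles out the optimal triplet.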
MPICUDA
-------

GPU Direct and the improvement of MPI communications between GPUs are
discussed in section :ref:`mpicuda`.