Developing a complex setup =========================== This is the advanced part of the manual. After this section, you will be able to develop kernels, and complex routines inside FARGO3D. One of the most useful features of FARGO3D is the fact that you do not have to develop CUDA kernels yourself, neither even learn CUDA. All the CUDA code is written automatically with the help of ``scripts/c2cuda.py``. But developing FARGO3D is not only a matter of kernels. We will not write an extensive documentation on each topic, but instead, we will develop a routine in detail and, at the same time, give a brief explanation . We will develop routines for simulating an exploding 2D periodic media, where the explosions are governed by a random generator, and we will develop a cooling function for the gas. This cooling function will be a kernel. The explosions will be modeled by energy spheres, appearing at random positions, and at random times. Also, we will include a magnetic field. The name of our setup will be "explosions". Setup folder ------------- The very first step is to create the directory to store the setup. We will store all the files inside setups/explosions. This setup is similar to the otvortex setup, so we can copy all the files inside this setup and modify them. We need to keep ``explosions.bound.0,explosions.opt, explosions.par`` and ``condinit.c`` files:: $ cd setups $ mkdir explosions $ cp otvortex/* explosions/ $ cd explosions $ mv otvortex.par explosions.par $ mv otvortex.opt explosions.opt ... etc for all files that begin with 'otvortex' Do not forget to change the ``outputdir`` and the ``SetUp`` parameter of the .par file (OutputDir outputs/explosions, SetUp explosions). Also, we will add some viscosity and resistivity to our setup (add the option ``-DVISCOSITY`` to the .opt file, and set nu=0.001 and eta=0.001 in the parfile). We can check that the setup can run:: $ make SETUP=explosions view $ ./fargo3d setups/explosions/explosions.par We see the otvortex setup, but with the name explosions. Initial Condition ----------------- Because we want explosions at random times, this kind of setup cannot be entirely dealt with in ``condinit.c``. We will develop in the next section an additional routine, called from the main loop, to handle explosions. In the initial conditions we define a first explosion, at a random position. The routine that we will create in the next section will resemble this one, which we write in the file ``setups/explosions/condinit.c``. It should be similar to:: #include "fargo3d.h" void Init() { int i,j,k; real yr, zr, yr0, zr0; int yi, zi; real* vx = Vx->field_cpu; real* vy = Vy->field_cpu; real* vz = Vz->field_cpu; real* bx = Bx->field_cpu; real* by = By->field_cpu; real* bz = Bz->field_cpu; real* rho = Density->field_cpu; real* e = Energy->field_cpu; real sphere_radius = 0.05; yr = YMIN+drand48()*(YMAX-YMIN); zr = ZMIN+drand48()*(ZMAX-ZMIN); yr0 = yr; zr0 = zr; if (yr > YMIN + (YMAX-YMIN)/2.0) yr -= (YMAX-YMIN); else yr += (YMAX-YMIN); if (zr > ZMIN + (ZMAX-ZMIN)/2.0) zr -= (ZMAX-ZMIN); else zr += (ZMAX-ZMIN); sphere_radius *= sphere_radius; //we take the square. for (k=0; kfield_cpu; real Rate = 1.0; //Average number of explosions per unit time real sphere_radius = 0.05; sphere_radius *= sphere_radius; //we take the square. p = drand48(); if (p < 1.0-exp(-DT*Rate)) { real sphere_radius = 0.1; yr = YMIN+drand48()*(YMAX-YMIN); zr = ZMIN+drand48()*(ZMAX-ZMIN); yr0 = yr; zr0 = zr; if (yr > YMIN + (YMAX-YMIN)/2.0) yr -= (YMAX-YMIN); else yr += (YMAX-YMIN); if (zr > ZMIN + (ZMAX-ZMIN)/2.0) zr -= (ZMAX-ZMIN); else zr += (ZMAX-ZMIN); sphere_radius *= sphere_radius; //we take the square. for (k=0; k //#define __GPU //#define __NOPROTO //<\FLAGS> // #include "fargo3d.h" //<\INCLUDES> void Edamp_cpu(real dt) { // INPUT(Energy); OUTPUT(Energy); //<\USER_DEFINED> // int i; int j; int k; //<\INTERNAL> // real* e = Energy->field_cpu; real edamp = EDAMP; int pitch = Pitch_cpu; int stride = Stride_cpu; int size_x = Nx; int size_y = Ny+2*NGHY; int size_z = Nz+2*NGHZ; //<\EXTERNAL> // for (k=0; k e[l] *= 1.0/(1.0+edamp*dt); //<\#> } } } //<\MAIN_LOOP> } Now, you must add this new routine to the makefile. In order to do this without breaking the generality of the code, a file called setup.objects was created. This file must to has two make variables:: MAINOBJ += routine_filename.o GPUOBJ += routine_filename_gpu.o where the second line is optional. In our case should be:: MAINOBJ += edamp.o GPUOBJ += edamp_gpu.o Before the compilation, do not forget to add in the parameter file (``setups/explosions/explosions.par``) the variable called ``EDAMP``. Its first value could be 0.1. Finally, we must add the execution line. In ``algogas.c``, before the invocation substep3(), we can add an invocation to our newly created function:: Edamp_cpu(dt); The code should compile in both CPU (sequential & MPI) and GPU platforms, including a cluster of GPU's. If you compile and run the code in GPU mode, you will see some lines similar to:: OUTPUTS 0 at Physical Time t = 0.000000 OK TotalMass = 2.0000000000 !!!!!!!!!!!!!!!!!!!!!!!!!!!!! All the "!" symbols mean that at every time step, you have "volumic" communications host<-->device (see sections :ref:`gpucomm` and :ref:`mpicuda`.) This is very expensive. This is because we force the invocation of the **CPU** version of our new function, because we call ``Edamp_cpu(dt)``, which triggers device to host and host to device communications by means of the INPUT/OUTPUT directives. If you want to avoid this, we must call the automatically generated CUDA kernel. Until now, we are running the CPU version of our kernel. Incorporating our kernel ---------------------------------------- Now, we will incorporate our new kernel into all the GPU-machinery inside FARGO3D. The set of rules here is very general, and in theory, if you follow them you should be able to develop any complex kernel. First, we will keep a clean version of FARGO3D. In order to do that you must to copy the file std/func_arch.cfg to your setup directory (setups/explosions/). Also, we will need the file src/change_arch.c:: $ cp std/func_arch.cfg setups/explosions/ $ cp src/change_arch.c setups/explosions/ We will alter the files of the setup directory, but we will leave the files of the main distribution untouched. If we issue an "ls" command inside setups/explosions, we should see something similar to:: algogas.c condinit.c explosions.bound.0 explosions.opt explosions.units change_arch.c edamp.c explosions.objects explosions.par func_arch.cfg main.c Now, we will add the prototype of this function in ``src/prototypes.h``. After the line:: ex void init_var(char*, char*, int, int, char*); add:: ex void Edamp_cpu(real); and after the line:: ex void addviscosity_sph_gpu(real); add:: ex void Edamp_gpu(real); .. warning:: The declaration of the ``_gpu`` must not be in the same block as the ``_cpu`` functions, otherwise the code will not build. Keep them grouped as they are. .. note:: We cannot copy the file ``prototypes.h`` to out setup directory because in the present version header files are parsed from the ``src/`` directory. However, it is harmless to declare extra functions in prototypes.h. If they are unused with other setups, no error nor warning message is issued. Now, we need to be able to select which function is called (``_gpu()`` or ``_cpu()``) by means of a function pointer. In the file ``src/global.h``, add the following line at the end:: void (*Edamp)(real); We now have all the variables required to edit the ``ChangeArch()`` function. This function allows to you to switch between a CPU or GPU execution of your new kernel, without recompiling the code. In the file ``change_arch.c``, add the following lines: Before the line:: while (fgets(s, MAXLINELENGTH-1, func_arch) != NULL) { add:: Edamp = Edamp_cpu; This line set the defaults value of the function pointer Edamp (it calls the ``_cpu`` function). If we want to use the GPU version of our function, we need to point to ``Edamp_gpu()``. In practice this is done with the ``func_arch.cfg`` file. To activate this possibility, add the following lines at the end (before the last #endif):: if (strcmp(name, "edamp") == 0) { if(strval[0] == 'g'){ Edamp = Edamp_gpu; printf("Edamp runs on the GPU\n"); } } We eventually add the following line into the ``func_arch.cfg`` file:: Edamp GPU Finally, we change the invocation of the energy damping into a more general invocation (before substep3(), inside algogas.c):: Edamp(dt); //Calls either _cpu or _gpu, depending on its value If you run the code again, you will see that nothing changed: "!" are issued, indicating expensive communications, which are a clue that our new function still runs on the CPU. This is because the ``funch_arch.cfg`` file is taken by default from ``std/``. In order to change that, you must include the value of the ``FuncArchFile`` parameter in explosions.par:: FuncArchFile setups/explosions/func_arch.cfg Now, if you run the code, you will see lines similar to:: OUTPUTS 0 at Physical Time t = 0.000000 OK TotalMass = 2.0000000000 !:::::::::::::::::::::::::::: And we have no more communications device<->host between outputs. The "::" means that actually we still have some less expensive communications between host and device for the periodicity. This can be avoided with a proper build, but no further implementation is required at this stage. See :ref:`mpicuda`. .. figure:: ../images/explosions.png :width: 280 :align: center :figwidth: 450 *A snapshot of the density field with the 'explosions' setup.* .. _fargodebug: Using FARGO_DEBUG ----------------------------------- We can use the macrocommand ``FARGO_DEBUG`` to check that the GPU kernel and its CPU counterpart yield same results to machine accuracy. Here this is done as follows: you simply have to wrap the invocation of ``Edamp(dt)`` in ``algogas.c`` within the macrocommand. As for the macrocommand ``FARGO_SPEEDUP`` presented in the section :ref:`benchmark`, we need to insert a comma at the end of the function name, so as to help the C preprocessor which cannot do string analysis:: FARGO_DEBUG (Edamp,(dt)) // <<=== Notice the comma We then compile the code with a GPU built (``make SETUP=explosions GPU=1``) and run it:: $ ./fargo3d setups/explosions/explosions.par .... .... ****** Check point created ****** Executing Edamp_cpu(dt) Dumping at #999 divb Emfy bz by bx QRight gasenergy gasdens Pressure Qs DensStar potential Moment_Minus_Z Moment_Plus_Z Vz_temp gasvz Moment_Minus_Y Moment_Plus_Y Vy_temp gasvy Moment_Minus_X Moment_Plus_X Vx_temp gasvx ****** Secondary Check point created ****** ****** Check point restored ******* Executing Edamp_gpu(dt) Dumping at #998 divb Emfy bz by bx QRight gasenergy gasdens Pressure Qs DensStar potential Moment_Minus_Z Moment_Plus_Z Vz_temp gasvz Moment_Minus_Y Moment_Plus_Y Vy_temp gasvy Moment_Minus_X Moment_Plus_X Vx_temp gasvx List of fields that differ: Skipping comparison of field Emfz used as a temporary work array in file (as declared at line 0) Skipping comparison of field DivRho used as a temporary work array in file ../src/reduction_generic_gpu.cu (as declared at line 86) What the code does is as follows: * Prior to entering ``Edamp_cpu(dt)``, it creates a checkpoint of all HD/MHD arrays. * It then executes Edamp_cpu(dt) and dumps all arrays with the arbitrary output number 999 (so that we can examine it in case of problem). *All* arrays are dumped, not only the primitive variables. * It creates a secondary checkpoint with all the data updated by ``Edamp_cpu(dt)``. * It "rewinds" the execution flow by restoring the first checkpoint created prior to the execution of ``Edamp_cpu(dt)``. * It now executes ``Edamp_gpu(dt)`` (note that automatic GPU-CPU communication is dealt with thanks to the INPUT/OUTPUT directives as explained in :ref:`gpucomm`). * It dumps all arrays, this time in output number 998. These arrays should be the same as those dumped in 999, if the CPU and GPU calculations yield indistinguishable results. * It performs a comparison of the arrays with the secondary checkpoint created previously. If any difference is found, a message is printed on the terminal. Some arrays are skipped from the comparison because they are used as temporary work arrays and may be different on the CPU and GPU, without any impact on the calculation. Here we see that all arrays are the same: GPU and CPU yield indistinguishable results. * In some other cases we may have differences to the machine accuracy. The example below shows the output when wrapping ``Substep1_x`` in ``FARGO_DEBUG``:: Fields Vx_temp differ: Minimum on GPU: -5.2721583979362551e-22 Minimum on CPU: 0 Maximum on GPU: 0 Maximum on CPU: 0 Minimum of GPU/CPU-1: -1 Maximum of GPU/CPU-1: -1 Minimum of GPU-CPU: 0 Maximum of GPU-CPU: 5.27216e-22 (Minimum of GPU-CPU)/max(abs(CPU)) 0 (Maximum of GPU-CPU)/max(abs(CPU)): 1 ********** We show hereafter how the macrocommand is expanded at build time:: { SaveState (); printf ("Executing %s_cpu%s\n","Edamp","(dt)"); Edamp_cpu (dt); DumpAllFields (999); SaveStateSecondary (); RestoreState (); printf ("Executing %s_gpu%s\n","Edamp","(dt)"); Edamp_gpu (dt); DumpAllFields (998); CompareAllFields (); prs_exit (0); }; .. note:: Why would GPU and CPU routines give different results if the GPU kernel is produced automatically from the CPU function ? Apart from issues related to passing values to the kernel (in particular the block), it may happen if you have a *race condition* inherent to your kernel. What is a race condition? It happens whenever the outcome of your kernel depends on the order in which threads are executed. If the INPUT and OUTPUT fields of your kernel are different, it is impossible to have a race condition. If, however, a field appears both as INPUT and OUTPUT, the values of its cells are used in the kernel, and also modified by it. This is the case of the kernel of ``Edamp()``. In this case, however, everything is local: the final value of one zone only depends on the value of that zone only, so it does not matter in which order the CUDA threads process the mesh. If this value also depended on the neighbors, we would have race conditions, and we would need to split the kernel in two and use an intermediary array. For instance, this is what we have done with ``SubStep2()``. Summary ---------- We have developed a somehow complex routine, interacting with the main parts of FARGO3D. This example shows how to write a kernel in five minutes, only taking care of prototypes and function pointers. Here is a brief summary of this process: 1) make a directory for the setup 2) copy the important files you need into the new directory. 3) If you will include routines, add the setup.objects file. 4) Add the prototypes (to the file ``src/directory.h``). 5) Add the global function pointer (to ``global.h``). 6) Modify ``change_arch.c`` 7) Add the function to ``func_arch.cfg`` and point correctly to this file in your parameter file with ``FuncArchFile()``. 8) Validate your new kernel using ``FARGO_DEBUG``.