Cuda translator (C2CUDA.py). ============================ One of the most interesting features of FARGO3D is that it can run on GPUs. If you look the ``src/`` directory, there are only very few routines related with CUDA. The philosophy of the development of FARGO3D was to avoid the kernel-writing process. Instead, following a set of very simple rules, we were able to develop a python script that translates the time-consuming C routines into CUDA kernels. This section is a brief description of the rules and the process of building CUDA kernels from C functions. FARGO3D Mesh Functions ----------------------- In FARGO3D all the time-consuming functions that are called during a hydro or MHD time step involve, without exception, a 3D nested loop over the computational mesh. We call hereafter these functions "mesh functions". The C to CUDA translator has been developed to convert these "mesh functions" to CUDA kernels. The components of any mesh function are: * A header * A function type and name * Arguments * Global variables * Local variables * A loop over the mesh In some cases, the structure is a bit more complicated, including some additional lines at the end of the main loop, but this will be discussed below. The easiest example of a mesh function is the Pressure computation:: #include "fargo3d.h" void ComputePressureFieldIso_cpu () { real* dens = Density->field_cpu; real* cs = Energy->field_cpu; real* pres = Pressure->field_cpu; int pitch = Pitch_cpu; int stride = Stride_cpu; int size_x = Nx; int size_y = Ny+2*NGHY; int size_z = Nz+2*NGHZ; int i; int j; int k; for (k=0; kfield_cpu; real* cs = Energy->field_cpu; real* pres = Pressure->field_cpu; int pitch = Pitch_cpu; int stride = Stride_cpu; int size_x = Nx; int size_y = Ny+2*NGHY; int size_z = Nz+2*NGHZ; This block is related to all the global variables that are not passed by argument to the function. In practice, here you have all the fields required to achieve the calculation, the size of each loop, and some useful variables related to indices. **Local variables**:: int i; int j; int k; This block is devoted to variables that are neither passed as arguments nor global. Indices of the loop are always here, but it is possible to add any variable you want. **Main Loop**:: for (k=0; k //#define __GPU //#define __NOPROTO //<\FLAGS> // #include "fargo3d.h" //<\INCLUDES> void ComputePressureFieldIso_cpu () { // INPUT(Energy); INPUT(Density); OUTPUT(Pressure); //<\USER_DEFINED> // real* dens = Density->field_cpu; real* cs = Energy->field_cpu; real* pres = Pressure->field_cpu; int pitch = Pitch_cpu; int stride = Stride_cpu; int size_x = Nx; int size_y = Ny+2*NGHY; int size_z = Nz+2*NGHZ; //<\EXTERNAL> // int i; int j; int k; int ll; //<\INTERNAL> // i = j = k = 0; #ifdef Z for (k=0; k ll = l; pres[ll] = dens[ll]*cs[ll]*cs[ll]; //<\#> #ifdef X } #endif #ifdef Y } #endif #ifdef Z } #endif //<\MAIN_LOOP> } As you see, all the main blocks are identified by some special comments (C comments on one line begin with ``//``), but also there are two additional blocks. We can make an abstract portrait of a general FARGO3D's C mesh function:: // Your preprocessor variables here (with a heading ``//`` sign). __GPU and __NOPROTO must be defined, as in the example. //<\FLAGS> // your includes here ("fargo3d.h" must be in the list) //<\INCLUDES> function_name_cpu(arguments) { // <== the name MUST end in ``_cpu`` // Some general instructions. //<\USER_DEFINED> // type internal_name = external_variable; type* internal_pointer = external_pointer_cpu; // <== the name MUST end in ``_cpu``. //<\EXTERNAL> // type internal_name1 = initialization; type internal_name2; //<\INTERNAL> // //type internal_name1(1); //Note: these lines begin with a ``//`` //type array_name2(size); //<\CONSTANT> // #ifdef Z for (k=0; k Anything you want here. <\#> #ifdef X } #endif #ifdef Y } #endif #ifdef Z } #endif //<\MAIN_LOOP> // Some final instruction(s). //<\LAST_BLOCK> } Below, there is an explanation of each field, and how to use it. You can browse the source files of mesh functions to see examples. You can have the list of the corresponding files by having a look at ``src/makefile``. There, you see a block called GPU_OBJBLOCKS. All the mesh functions are found in the files that have the same prefix as those found in this list, but with a ``.c`` suffix instead of ``_gpu.o``. FLAGS ...... **Identifier** :: ... <\FLAGS> **Parsed as**:: Textually, with heading comment sign removal. **Location**:: Normally, at the top of a C file. There are two flags that must be always included:: //#define __GPU //#define __NOPROTO They are important for a proper header building. The __GPU can be used inside a C mesh function to issue specific lines that should be run only on the GPU version, with the help of the macrocommand:: #ifdef __GPU #endif INCLUDES ......... **Identifier** :: ... <\INCLUDES> **Parsed as**:: Textually **Location**:: Normally, after the FLAGS block. The INCLUDES block is the block where all the headers are. This block must contain at least:: #include "fargo3d.h" You may include any other header file. Function name: .............. **Identifier** :: Implicit, only a string **Parsed as**:: function_name_cpu --> function_name_gpu (for the wrapper or launcher) --> function_name_kernel (for the associated CUDA kernel) **Location**:: Normally, after the INCLUDES block. Arguments: .......... **Identifier** :: Implicit, only a string **Parsed as**:: Field* f --> real f->field_gpu non pointer argument --> textually **Location**:: Normally, after the INCLUDES block. Any built-in type is allowed, including FARGO3D's real type. The only structure allowed is the "Field" structure. There are two constraints about the arguments field: * All must be on the same line. * The Field structures are the last arguments. USER DEFINED: ............. **Identifier** :: ... <\USER_DEFINED> **Parsed as**:: Textually **Location**:: After the function name. This block can be very general. You can do here a lot of things, because there is no limitation on syntax, everything inside is parsed textually. In practice, we use this block mainly to do memory transfers between host & device when they are needed, by issuing INPUT and OUTPUT directives. This block is a kind of pre kernel-execution instructions, will be executed before the kernel launch, by the launcher (or wrapper) function. In a similar way, the post kernel-execution is the block called LAST_BLOCK. EXTERNAL: ......... **Identifier** :: ... <\EXTERNAL> **Parsed as**:: All the external variables are parsed as arguments of the kernel. type variable = global_variable_cpu --> function_name_kernel(type global_variable_gpu) type* variable = global_variable_cpu --> function_name_kernel(type* global_variable_gpu) **Location**:: After the USER_DEFINED block. The CUDA kernels cannot see the global host variables. This block is meant to grant access to these variables to the kernels so that all the variables dealt with in this block are global. The main rule when your draw a list of global variables inside the EXTERNAL block is: * Avoid the use of all capital variables. Instead, declare another variable with the same name but without capitals, and declare the variable equal to the global variable. example: (real omegaframe = OMEGAFRAME). INTERNAL: ......... **Identifier** :: ... <\INTERNAL> **Parsed as**:: Textually **Location**:: Normally, after the EXTERNAL block. All the internal variables are work variables, with a very local scope. You find here the indices of the loops, but you could include any other variable that you need. Maybe, one of the most interesting examples on how to use this block is found in ``compute_force.c``. CONSTANT: ......... **Identifier** :: ... <\CONSTANT> **Parsed as**:: The variables inside are moved to the constant memory if BIGMEM is not defined. If BIGMEM is defined, the variables are moved instead to the device's global memory. Example: // real ymin(Ny+2*NGHY+1); <\CONSTANT> Is parsed as: #define ymin(i) ymin_s[(i)] ... CONSTANT(real, ymin_s, some_number); Here, some_number is a fraction of the constant memory size, calculated by the parser. ... #ifdef BIGMEM #define ymin_d &Ymin_d #endif CUDAMEMCPY(ymin_s, ymin_d, sizeof(real)*(Ny+2*NGHY+1), 0, cudaMemcpyDeviceToDevice); ... **Location**:: Normally, after the INTERNAL block. This is one of the most complex blocks. It is very complex because the process is complex. We need a way to pass our light arrays to the constant memory of the GPU (for performance reasons). But there are some cases where the problem is too big and the constant memory is not enough. In this case, one can define the build flag BIGMEM. The main portrait of the process is: Declare as constant some global variables you want to use without passing it as external, or use this block for light arrays, similar to ``xmin``, ``sxk``, etc. This block reserves a constant memory segment and copy the data to this segment. If BIGMEM is activated, the constant memory is not used but instead, the global memory is used. CUDAMEMCPY macrocommand is expanded in a manner that depends on whether BIGMEM is defined and therefore performs the copy to the correct location (constant or global memory). You can see this in ``src/define.h``. Note that there is a subtlety here: in case you use BIGMEM, the constant memory is still used to store the pointer to the array. In the other case, it stores directly the array itself. The variables for this block are created in ``src/light_global_dev.c``. The scalar variables are passed as:: // type variable(1) While the array variables are passed as:: // type variable(size_of_the_array) Note that all this block is commented out in the C file, with ``//`` at the beginning of each line. MAIN LOOP: .......... **Identifier** :: ... <\MAIN_LOOP> **Parsed as**:: The size of the loops is read and parsed. Also, the indices are initialized: Example: #ifdef Z for (k=NGHZ; k=NGHZ && k=NGHY && j <\#>". **Location**:: Before the initialization of the indices i,j,k. This block is very particular because it must to be closed after the close of the outermost loop. Remember that the content of the main loop block, defined by ``<#>..<\#>``, nested within the MAIN_LOOP block, is parsed textually, so you cannot use global variables inside it or the generated CUDA code will not work. LOOP CONTAIN: ............. **Identifier** :: <#> ... <\#> **Parsed as**:: Textually. **Location**:: After the innermost loop or where you want to have a textual parsing inside the main_loop. If you put the beginning of this block not exactly after the innermost loop, you can make some interesting things. With this technique, you can skip some lines devoted to the CPU. You can also achieve this with the ``_GPU`` flag declared at the beginning, as shown earlier. LAST_BLOCK .......... **Identifier** :: ... <\LAST_BLOCK> **Parsed as**:: Textually. **Location**:: The last block in the routine. This block will be executed after the kernel execution. Useful for reductions, for instance. Common errors: -------------- This section describes the most common errors at compilation time using the parser: * The name of a block has white spaces: all blocks declarations must be closed and after that white spaces are not allowed:: __ --> wrong --> ok * The pointer type is not a type: For the parser, the type of a variable is a string without spaces. A pointer variables must be declared as the pointer type:: real *rho --> wrong real* rho --> ok (the type is real*) * Some block was not properly closed: if you have not closed a block, undefined behavior may result:: --> wrong --> wrong <\BLOCK> --> ok * The end of the name function is not _cpu: The parser cannot find the function name if it does not have a valid name, neither can it invent a rule to make the wrapper and kernel names. * The order of the arguments in the function: Remember, pointers to Field structures are at the end. * Only one variable per line may appear in the INTERNAL and EXTERNAL fields. * Files which have been edited on a Windows machine, and in which lines end with '\r\n' instead of ending with '\n', will fail to be converted to CUDA. Use a conversion procedure such as ``tr -d '\d' < original_file.c > correct_line_ending.c`` prior to the build. * Values relative to the mesh (such as ``zmin`` or ``xmed``, etc.) should be lowercase and should be followed by parentheses, not square brackets, because they are considered as macrocommands. You can use ``substep1_x.c`` as a template for that. This list may be completed as we receive users' feedback.