ROSE Compiler Framework/Arithmetic intensity measuring tool


Overview
This tool helps measure the arithmetic intensity (floating-point operations per byte of memory traffic) of loops. It works in three steps:
 * it statically estimates the floating-point operations and load/store bytes per iteration for user-specified loops
 * it instruments the loops with statements that capture loop iteration counts and accumulate FLOPs and memory footprints (load/store bytes)
 * users then run the instrumented code to generate the final report.

Quick information
 * tool location: https://github.com/rose-compiler/rose-develop/tree/master/projects/ArithmeticMeasureTool
 * testing: type "make check" within the corresponding build tree

Download and Installation
It is recommended to obtain the tool from the rose-develop repository to get the latest updates.
 * https://github.com/rose-compiler/rose-develop

First download and install ROSE as usual, then build and install the tool from the ROSE build tree:
 * Latest ROSE installation instructions: http://rosecompiler.org/ROSE_HTML_Reference/installation.html
 * cd rose-build-tree/projects/ArithmeticMeasureTool
 * make && make install

An executable file named measureTool will then be installed within ROSE_INSTALLATION_PATH/bin

Now prepare your environment so the tool can be invoked:

# set.rose file, source it to set up environment variables
ROSE_INS=/home/liao6/workspace/masterDevClean/install
export ROSE_INS

PATH=$ROSE_INS/bin:$PATH
export PATH

LD_LIBRARY_PATH=$ROSE_INS/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH

Command line options
List
 * -help: print out help information
 * -debug: enable debugging mode, which prints progress and internal results to the screen
 * -annot your_annotation_file: accept user-specified function side effect annotations to complement compiler analysis
 * -static-counting-only: a special execution mode in which the tool scans all loop bodies and writes the counting results into a report file
 * -report-file your_report_file.txt: specify your own report file name; otherwise the default file ai_tool_report.txt is used
 * -use-algorithm-v2: use the second version of the counting algorithm in static counting mode, which counts FLOPs with a bottom-up synthesized traversal; still under development

Function side effect annotation
Compiler analysis cannot determine the side effects of all functions, either because library source code is unavailable or because pointer usage in the source code is too complex. To solve this problem, the tool accepts a function side effect annotation file via the -annot option.

Annotation file format

operator abs(int val)
{
  modify none; read{val}; alias none;
}
operator max(double val1, double val2)
{
  modify none; read{val1, val2}; alias none;
}

example command line
 * measureTool -c -annot /path/to/functionSideEffect.annot your_input.c

Execution mode 1: static analysis only
This special mode only finds all loops and counts FLOPs for the loop bodies. The reported numbers are for a single iteration only.

The load/store bytes are represented in two ways
 * expression format: such as 3*sizeof(float) + 5*sizeof(double)
 * evaluated final integer values: 52

The result is written to a text report file.

Example use
./measureTool -c -static-counting-only -annot ../../../sourcetree/projects/ArithmeticMeasureTool/src/functionSideEffect.annot -I. ../../../sourcetree/projects/ArithmeticMeasureTool/test/jacobi.c

Excerpt of the generated report. Note that the loop at line 129 has two floating-point additions and two multiplications. It loads 0 bytes and stores one double element (usually 8 bytes), so the final arithmetic intensity (AI) is 4/8 = 0.5 ops/byte.

Content of generated report file: ai_tool_report.txt

--Floating Point Operation Counts-
SgForStatement@ /home/liao6/workspace/ExReDi/ai_tool/sourcetree/projects/ArithmeticMeasureTool/test/jacobi.c:129:10
fp_plus:2
fp_minus:0
fp_multiply:2
fp_divide:0
fp_total:4
--Memory Operation Counts-
Loads: NULL
Loads int: 0
Stores:1 * sizeof(double )
Store int: 8
--Arithmetic Intensity-
AI=0.5

Currently
 * AI is set to -1.0 if it is uninitialized
 * AI is set to 9999.9 if the byte count is zero, to avoid a division by zero
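The AI value in the report follows directly from the per-iteration counts. The following is a minimal sketch of that calculation, not the tool's actual code: the function name arithmetic_intensity and the two macro names are hypothetical, and only the sentinel values -1.0 and 9999.9 come from the text above.

```c
#include <assert.h>
#include <limits.h>

/* Sentinel values described above; the macro names are hypothetical. */
#define AI_UNINITIALIZED (-1.0)
#define AI_ZERO_BYTES    (9999.9)

/* Hypothetical helper: FLOPs per iteration divided by the bytes
   loaded and stored per iteration. */
double arithmetic_intensity(unsigned long fp_total,
                            unsigned long load_bytes,
                            unsigned long store_bytes)
{
  unsigned long bytes;

  /* Guard against overflow when adding the two byte counts. */
  assert(load_bytes <= ULONG_MAX - store_bytes);
  bytes = load_bytes + store_bytes;

  if (bytes == 0)
    return AI_ZERO_BYTES;   /* no memory traffic: report the sentinel */
  return (double)fp_total / (double)bytes;
}
```

For the jacobi.c loop at line 129, arithmetic_intensity(4, 0, 8) gives 0.5, matching the report. A caller would start from AI_UNINITIALIZED and overwrite it once the counts are known.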

User pragma to verify results
In this mode, the translator can verify the tool-generated results by comparing the results to what is indicated by pragmas in the input code.

The user-provided pragma has the form

#pragma aitool fp_plus(10) fp_minus(10) fp_multiply(10) fp_divide (10) fp_total(40)
for ...

void error_check()
{
  int i,j;
  double xx,yy,temp,error;

  dx = 2.0 / (n-1);
  dy = 2.0 / (m-1);
  error = 0.0 ;

#pragma aitool fp_plus(3) fp_minus(3) fp_multiply(6)
  for (i=0;i<n;i++)
    for (j=0;j<m;j++)
    {
      xx = -1.0 + dx * (i-1);
      yy = -1.0 + dy * (j-1);
      temp = u[i][j] - (1.0-xx*xx)*(1.0-yy*yy);
      error = error + temp*temp;
    }
  error = sqrt(error)/(n*m);
  printf("Solution Error :%E \n",error);
}

fp_total is required while the clauses of other kinds of FP operations are optional.

Execution mode 2: analyze and instrument the code
This is the default mode.

Manually instrument the input code
The tool currently relies on user-added code instrumentation, using the following steps:
 * declare four global counters (chiterations, chloads, chstores, chflops) with these specific variable names, which will later be recognized by the tool
 * add chiterations = .. before each loop whose FLOPs and load/store bytes you want to count
 * print out the results: printf ("chflops =%lu chloads =%lu chstores=%lu\n", chflops, chloads, chstores);

#include <stdio.h>
#define SIZE 10

// Instrumentation 1: add a few global variables
unsigned long int chiterations = 0;
unsigned long int chloads = 0;
unsigned long int chstores = 0;
unsigned long int chflops = 0;

double ref[2] = {9.2, 5.4};
double coarse[SIZE][SIZE][SIZE];
int main()
{
  double refScale = 1.0 / (ref[0] * ref[1]);
  int iboxlo1 = 0, iboxlo0 = 0, iboxhi1 = SIZE-1, iboxhi0 = SIZE-1;
  int var;
  int ic1=0, ic0=0;
  int ip0 = ic0 * ref[0];
  int ip1 = ic1 * ref[1];
  double coarseSum = 0.0;
  int ii1, ii0;

  for (var =0; var < SIZE ; var++)
  {
    //Instrumentation 2: pass in loop iteration for the loop to be counted
    chiterations = (1 + iboxhi1 - iboxlo1) * (1 + iboxhi0 - iboxlo0);
    for (ic1 = iboxlo1; ic1< iboxhi1 +1; ic1++)
      for (ic0 = iboxlo0; ic0< iboxhi0 +1; ic0++)
      {
        int ibreflo1 = 0, ibreflo0 = 0, ibrefhi1 = SIZE-1, ibrefhi0 = SIZE-1;
        //Instrumentation 3: pass in loop iteration for the loop to be counted
        chiterations = (1 + ibrefhi1 - ibreflo1) * (1 + ibrefhi0 - ibreflo0);
        for (ii1 = ibreflo1; ii1< ibrefhi1 +1; ii1++)
          for (ii0 = ibreflo0; ii0< ibrefhi0 +1; ii0++)
          {
            coarseSum = coarseSum +  coarse[ii1][ii0][ii1] +(ip0 + ii0) + (ip1 + ii1)  + var;
          }
        coarse[ic0][ic1][var] = coarseSum * refScale;
      }
  }
  //Instrumentation 4: print out results
  printf ("chflops =%lu chloads =%lu chstores=%lu\n", chflops, chloads, chstores);
  return 0;
}

Use the tool to transform the code
./measureTool -c -annot ../../../sourcetree/projects/ArithmeticMeasureTool/src/functionSideEffect.annot nestedloops.c

The tool will
 * count the FLOPs and load store bytes for the specified loop
 * add counter accumulation statements, using different counters for different loops

#include <stdio.h>
#define SIZE 10
// Instrumentation 1: add a few global variables
unsigned long chiterations = 0;
unsigned long chloads = 0;
unsigned long chstores = 0;
unsigned long chflops = 0;
double ref[2] = {(9.2), (5.4)};
double coarse[10][10][10];

int main()
{
  double refScale = 1.0 / (ref[0] * ref[1]);
  int iboxlo1 = 0;
  int iboxlo0 = 0;
  int iboxhi1 = 10 - 1;
  int iboxhi0 = 10 - 1;
  int var;
  int ic1 = 0;
  int ic0 = 0;
  int ip0 = (ic0 * ref[0]);
  int ip1 = (ic1 * ref[1]);
  double coarseSum = 0.0;
  int ii1;
  int ii0;
  unsigned long chiterations_1;
  unsigned long chiterations_2;
  for (var = 0; var < 10; var++) {
//Instrumentation 2: pass in loop iteration for the loop to be counted
    chiterations_2 = (1 + iboxhi1 - iboxlo1) * (1 + iboxhi0 - iboxlo0);
    for (ic1 = iboxlo1; ic1 < iboxhi1 + 1; ic1++) {
      for (ic0 = iboxlo0; ic0 < iboxhi0 + 1; ic0++) {
        int ibreflo1 = 0;
        int ibreflo0 = 0;
        int ibrefhi1 = 10 - 1;
        int ibrefhi0 = 10 - 1;
//Instrumentation 3: pass in loop iteration for the loop to be counted
        chiterations_1 = (1 + ibrefhi1 - ibreflo1) * (1 + ibrefhi0 - ibreflo0);
        for (ii1 = ibreflo1; ii1 < ibrefhi1 + 1; ii1++) {
          for (ii0 = ibreflo0; ii0 < ibrefhi0 + 1; ii0++) {
            coarseSum = coarseSum + coarse[ii1][ii0][ii1] + (ip0 + ii0) + (ip1 + ii1) + var;
          }
        }
/*       aitool generated Loads counting statement ... */
        chloads = chloads + chiterations_1 * (1 * sizeof(double ));
/*      aitool generated FLOPS counting statement ... */
        chflops = chflops + chiterations_1 * 4;
        coarse[ic0][ic1][var] = coarseSum * refScale;
      }
    }
/*       aitool generated Stores counting statement ... */
    chstores = chstores + chiterations_2 * (1 * sizeof(double ));
/*      aitool generated FLOPS counting statement ... */
    chflops = chflops + chiterations_2 * 1;
  }
//Instrumentation 4: print out results
  printf("chflops =%lu chloads =%lu chstores=%lu\n",chflops,chloads,chstores);
  return 0;
}

Compile and run the transformed code
gcc -O3 rose_nestedloops.c -o nestedloops.out

./nestedloops.out

The result looks like

chflops =401000 chloads =800000 chstores=8000
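These totals can be checked by hand against the transformed code. The inner instrumented nest has chiterations_1 = 10 * 10 = 100 and its accumulation statements run once per (var, ic1, ic0) combination, i.e. 1000 times. The outer instrumented nest has chiterations_2 = 100 and its statements run once per var, i.e. 10 times. The sketch below repeats that arithmetic; the struct and function names are hypothetical, and the byte totals match the output above only where sizeof(double) is 8.

```c
#include <assert.h>

/* Hypothetical container for the three counters printed by the program. */
struct counter_totals {
  unsigned long flops;
  unsigned long loads;
  unsigned long stores;
};

/* Recompute the counters of the transformed nestedloops example for a
   given SIZE (10 in the listing above). */
struct counter_totals expected_totals(unsigned long size)
{
  struct counter_totals t = {0, 0, 0};
  unsigned long chiterations_1, chiterations_2;
  unsigned long inner_visits, outer_visits;

  assert(size > 0);

  chiterations_1 = size * size;      /* ii1 x ii0 iterations */
  chiterations_2 = size * size;      /* ic1 x ic0 iterations */
  inner_visits = size * size * size; /* once per (var, ic1, ic0) */
  outer_visits = size;               /* once per var */

  /* Statements inserted after the ii1/ii0 nest: 1 load, 4 FLOPs. */
  t.loads  = inner_visits * chiterations_1 * sizeof(double);
  t.flops  = inner_visits * chiterations_1 * 4;

  /* Statements inserted after the ic1/ic0 nest: 1 store, 1 FLOP. */
  t.stores = outer_visits * chiterations_2 * sizeof(double);
  t.flops += outer_visits * chiterations_2 * 1;

  return t;
}
```

Calling expected_totals(10) yields 401000 FLOPs, and with 8-byte doubles 800000 load bytes and 8000 store bytes, the same numbers the instrumented program prints.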

Limitations
The tool does not support Fortran loops with function calls for now
 * ROSE's Fortran procedure/routine representation is not accurate enough (missing parameter type info.) to hook up with function side effect annotations designed to match C/C++ functions.

Internals
Execution mode: the variable running_mode takes one of the following values
 * e_analysis_and_instrument
 * e_static_counting

FP operations
class FPCounters: public AstAttribute {}; to store analysis results

void CountFPOperations() from src/ai_measurement.cpp

Rose_STL_Container<SgNode*> nodeList = NodeQuery::querySubTree(input, V_SgBinaryOp);
for (Rose_STL_Container<SgNode*>::iterator i = nodeList.begin(); i != nodeList.end(); i++)
{
  fp_operation_kind_enum op_kind = e_unknown;
  // bool isFPType = false;
  // check operation type
  SgBinaryOp* bop = isSgBinaryOp(*i);
  switch (bop->variantT())
  {
    case V_SgAddOp:
    case V_SgPlusAssignOp:
      op_kind = e_plus;
      break;
    case V_SgSubtractOp:
    case V_SgMinusAssignOp:
      op_kind = e_minus;
      break;
    case V_SgMultiplyOp:
    case V_SgMultAssignOp:
      op_kind = e_multiply;
      break;
    case V_SgDivideOp:
    case V_SgDivAssignOp:
      op_kind = e_divide;
      break;
    default:
      break;
  } //end switch
  ...
}

Load/Store bytes
The main functions are defined in ai_measurement.cpp. They return expressions that calculate the byte counts rather than the actual values, since sizeof(type) is machine dependent.
 * std::pair<SgExpression*, SgExpression*> CountLoadStoreBytes (SgLocatedNode* input, bool includeScalars /* = true */, bool includeIntType /* = true */)
 * SgExpression* calculateBytes (std::set& name_set, SgStatement* lbody, bool isRead)

Configuration
 * By default: only array references are counted. Scalars are ignored.

Algorithm
 * Call side effect analysis to find the variables read and written; some references may trigger both read and write accesses. If the analysis succeeds, proceed; otherwise a warning is issued.
 * Accesses to the same array/scalar variable are grouped into one read (or write) access: e.g. array[i][j], array[i][j+1], array[i][j-1], etc. are counted as a single access
 * Group accesses by type: accesses of the same type increment the same counter, which shortens the generated expression
 * Iterate over the results to generate an expression like 2*sizeof(float) + 5*sizeof(double)
 * As an approximation, a simple analysis is used here that assumes no function calls.

// Obtain per-iteration load/store bytes calculation expressions
// excluding scalar types to match the manual version
//CountLoadStoreBytes (SgLocatedNode* input, bool includeScalars = true, bool includeIntType = true);
std::pair<SgExpression*, SgExpression*> load_store_count_pair = CountLoadStoreBytes (loop_body, false, true);
// chstores=chstores+chiterations*8
if (load_store_count_pair.second != NULL)
{
  SgExprStatement* store_byte_stmt = buildCounterAccumulationStmt("chstores", new_iter_var_name, load_store_count_pair.second, scope);
  insertStatementAfter (loop, store_byte_stmt);
  attachComment(store_byte_stmt,"     aitool generated Stores counting statement ...");
}
// handle loads stmt 2nd so it can be inserted as the first after the loop
// build chloads=chloads+chiterations*2*8
if (load_store_count_pair.first != NULL)
{
  SgExprStatement* load_byte_stmt = buildCounterAccumulationStmt("chloads", new_iter_var_name, load_store_count_pair.first, scope);
  insertStatementAfter (loop, load_byte_stmt);
  attachComment(load_byte_stmt,"     aitool generated Loads counting statement ...");
}
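The grouping and expression-building steps of the algorithm listed above can be sketched independently of ROSE. This is only an illustration under simplifying assumptions: the struct access record, the function build_bytes_expr and the MAX_GROUPS limit are hypothetical, array references are identified by plain strings instead of AST nodes, and the side effect analysis is assumed to have already produced the list of references.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One array reference found in a loop body (hypothetical record;
   the real tool works on ROSE AST nodes). */
struct access {
  const char *array_name;  /* e.g. "u" for u[i][j+1] */
  const char *elem_type;   /* e.g. "double" */
};

#define MAX_GROUPS 16

/* Write an expression such as "2*sizeof(float) + 1*sizeof(double)" into
   out. References to the same array count once; arrays with the same
   element type share one counter. Returns the number of distinct arrays. */
int build_bytes_expr(const struct access *refs, int n,
                     char *out, size_t out_size)
{
  const char *seen[MAX_GROUPS];   /* distinct array names */
  const char *types[MAX_GROUPS];  /* distinct element types */
  int counts[MAX_GROUPS];         /* arrays per element type */
  int n_seen = 0, n_types = 0;
  size_t used = 0;
  int i, j;

  assert(out_size > 0);
  for (i = 0; i < n; i++) {
    /* Skip arrays that were already counted. */
    for (j = 0; j < n_seen; j++)
      if (strcmp(seen[j], refs[i].array_name) == 0)
        break;
    if (j < n_seen)
      continue;
    assert(n_seen < MAX_GROUPS);
    seen[n_seen++] = refs[i].array_name;

    /* Find or create the counter for this element type. */
    for (j = 0; j < n_types; j++)
      if (strcmp(types[j], refs[i].elem_type) == 0)
        break;
    if (j == n_types) {
      types[n_types] = refs[i].elem_type;
      counts[n_types] = 0;
      n_types++;
    }
    counts[j]++;
  }

  /* Emit one term per element type. */
  out[0] = '\0';
  for (j = 0; j < n_types; j++) {
    int w = snprintf(out + used, out_size - used, "%s%d*sizeof(%s)",
                     j ? " + " : "", counts[j], types[j]);
    assert(w >= 0 && (size_t)w < out_size - used);
    used += (size_t)w;
  }
  return n_seen;
}
```

Given the references a (float), a (float), b (double) and c (float), the sketch counts three distinct arrays and writes 2*sizeof(float) + 1*sizeof(double). Keeping the result as an expression rather than a number mirrors the reason given above: sizeof(type) is machine dependent.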

Nested loops
Scientific applications usually have nested loops. Naive instrumentation causes two problems:
 * double counting of nested loop bodies: operations inside an inner loop are counted again when the enclosing outer loop is processed
 * the chiterations = .. statement is used for all loop levels, so the inner loop's chiterations overwrites the value meant for the outer loop.

Solutions
 * The translator uses a bottom-up traversal order: it processes inner loops first, then outer loops.
 * To avoid double counting FP operations within nested loops, all visited FP operation expressions are stored in a lookup table. Later counting checks whether an operation has already been counted and, if so, skips it.
 * To avoid double counting variables used in nested loops when counting an outer loop body, the handling differs slightly from that of FP operation expressions: all variables counted in inner loops are excluded when counting the outer loop. Note that a variable a is excluded entirely, rather than flagging individual references to a and excluding only those references later.
 * Note: static counting mode does not do this exclusion, since redundant counting during execution is no longer a concern there. The FLOPs of a loop body are still counted for both inner and outer loops if they are nested.
 * Rewrite chiterations = .. to chiterations_loopId = .., so each loop has its own iteration count variable.

// global chiterations is changed to two local variables: each for one loop
unsigned long chiterations_1;
unsigned long chiterations_2;
for (var = 0; var < 10; var++) {
//Instrumentation 2: pass in loop iteration for the loop to be counted
  chiterations_2 = ((1 + iboxhi1 - iboxlo1) * (1 + iboxhi0 - iboxlo0) * 1);
  for (ic1 = iboxlo1; ic1 < iboxhi1 + 1; ic1++) {
    for (ic0 = iboxlo0; ic0 < iboxhi0 + 1; ic0++) {
      int ibreflo1 = 0;
      int ibreflo0 = 0;
      int ibrefhi1 = 10 - 1;
      int ibrefhi0 = 10 - 1;
//Instrumentation 3: pass in loop iteration for the loop to be counted
      chiterations_1 = ((1 + ibrefhi1 - ibreflo1) * (1 + ibrefhi0 - ibreflo0) * 1);
      for (ii1 = ibreflo1; ii1 < ibrefhi1 + 1; ii1++) {
        for (ii0 = ibreflo0; ii0 < ibrefhi0 + 1; ii0++) {
          coarseSum = coarseSum + coarse[ii1][ii0][ii1] + (ip0 + ii0) + (ip1 + ii1) + var;
        }
      }
/*       aitool generated Loads counting statement ... */
      chloads = chloads + chiterations_1 * (1 * sizeof(double ));
/*      aitool generated FLOPS counting statement ... */
      chflops = chflops + chiterations_1 * 4;
      coarse[ic0][ic1][var] = coarseSum * refScale;
    }
  }
/*       aitool generated Stores counting statement ... */
  chstores = chstores + chiterations_2 * (1 * sizeof(double ));
/*      aitool generated FLOPS counting statement ... */
  chflops = chflops + chiterations_2 * 1;
}
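The reason for the per-loop counters can be seen in a small standalone sketch. It is a simplified model, not tool output: both function names are hypothetical, each loop level is reduced to the assignment that would precede it, and one double store per outer iteration is assumed.

```c
#include <assert.h>

/* Naive scheme: one shared counter. The assignment in front of the
   inner loop replaces the outer loop's iteration count before the
   outer total is accumulated. */
unsigned long outer_stores_shared(unsigned long outer_iters,
                                  unsigned long inner_iters)
{
  unsigned long chiterations = outer_iters;  /* set before the outer loop */
  unsigned long chstores = 0;
  unsigned long i;

  assert(outer_iters > 0);                   /* the outer loop must run */
  for (i = 0; i < outer_iters; i++)
    chiterations = inner_iters;              /* set before the inner loop */

  chstores += chiterations * sizeof(double); /* uses the inner count */
  return chstores;
}

/* Per-loop scheme: each loop keeps its own counter, as in the
   chiterations_1 / chiterations_2 rewrite shown above. */
unsigned long outer_stores_per_loop(unsigned long outer_iters,
                                    unsigned long inner_iters)
{
  unsigned long chiterations_1 = 0;            /* inner loop */
  unsigned long chiterations_2 = outer_iters;  /* outer loop */
  unsigned long chstores = 0;
  unsigned long i;

  for (i = 0; i < outer_iters; i++)
    chiterations_1 = inner_iters;
  (void)chiterations_1;                        /* only the inner loop needs it */

  chstores += chiterations_2 * sizeof(double); /* uses the outer count */
  return chstores;
}
```

With 4 outer and 9 inner iterations, the shared-counter version accumulates 9 * sizeof(double) bytes for the outer loop, while the per-loop version accumulates the intended 4 * sizeof(double).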

Testing
run all builtin tests
 * make check

run tests for static analysis only
 * make check-static

Manual testing


 * [liao6@tux322:~/workspace/ExReDi/ai_tool.git/translator]m && ./measureTool -c -annot ./src/functionSideEffect.annot -I. ./test/jacobi-v3.c