Methodology for code migration on many-core architecture

This wikibook describes a step-by-step methodology for porting legacy code to many-core architectures.

This methodology is used by High-performance computing (HPC) actors as part of HMPP Competence Centers. HMPP Competence Centers gather partners to address the many-core programming challenge at the technological (parallel programming, code tuning, etc.) and application levels.

Porting code to many-core systems
Porting code to many-core systems is a complex operation that requires many skills to be combined in order to achieve the planned results within the planned effort. From the computer science point of view, porting an application to a many-core target consists of providing an equivalent program that runs faster by exploiting parallelism at the hardware level. The goal is to improve performance, not necessarily to use every hardware component. If a serial implementation is the best solution, it should be considered.

There are mainly two interleaved dimensions in migrating an application:
 * 1) Achieving high performance;
 * 2) Keeping the code readable/maintainable from the application developers’ point of view. It is a major constraint that the migrated code remains understandable to the owner of the code as well as easy to maintain.

In most cases, the starting point is a sequential legacy program. Its migration is necessary mainly because no automatic process can convert sequential code into a massively parallel version that exploits a large number of cores. In addition, the rapid evolution of the processor landscape makes software development more complex, and the usual parallel programming strategies have to be modified so that existing applications can take advantage of these new processors.

Legacy Code Methodology
Besides giving a clear view of how to migrate applications onto new many-core processors (currently GPUs), the main objective of a methodology is to reduce risks and to improve efficiency. It is not economically viable to start a project and to realize a few months later, after having spent engineering resources and money, that the project cannot succeed.

The code migration process defined here is nothing more than a common-sense approach that segments a development cycle into steps with associated durations. Each step indicates which tools to use and produces a go/no-go decision before the next step starts.

The three steps are:
 * 1) Parallel project definition: in this step, a diagnosis of the application is performed to evaluate the potential speedup and to determine the main porting operations and their associated cost. As a prerequisite, a validation process is set up to ensure the validity of the numerical results.
 * 2) Application porting: in a few weeks, a first functional GPU version of the code is developed, and a GPU execution profile is collected to identify the bottlenecks to address in the following step.
 * 3) Application optimization: bottlenecks are analyzed and the code is optimized to obtain a finely tuned production code. Because the main risks of failure have been lifted by this point, this step can last longer than the previous ones.
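
The speedup evaluation in step 1 can be approximated with Amdahl's law, a standard first estimate that this text does not name explicitly. The sketch below is illustrative only: the function names, the hotspot fraction, the kernel speedup and the target speedup are all assumptions, not figures from the methodology.

```python
def amdahl_speedup(hotspot_fraction, kernel_speedup):
    """Overall speedup when only `hotspot_fraction` of the runtime is accelerated.

    hotspot_fraction: share of the sequential runtime spent in the hotspots (0..1).
    kernel_speedup:   assumed speedup of those hotspots on the accelerator (> 0).
    """
    if not 0.0 <= hotspot_fraction <= 1.0:
        raise ValueError("hotspot_fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    remaining = (1.0 - hotspot_fraction) + hotspot_fraction / kernel_speedup
    return 1.0 / remaining


def go_no_go(hotspot_fraction, kernel_speedup, target_speedup):
    """Return True ("go") if the estimated speedup reaches the project target."""
    return amdahl_speedup(hotspot_fraction, kernel_speedup) >= target_speedup


if __name__ == "__main__":
    # Hypothetical diagnosis: 90% of the runtime in hotspots, kernels 10x faster.
    print(f"estimated speedup: {amdahl_speedup(0.90, 10.0):.2f}x")
    print("go" if go_no_go(0.90, 10.0, 5.0) else "no go")
```

Under this model, a hotspot fraction of 50% caps the overall speedup below 2x however fast the kernels become, which is one reason the diagnosis comes before any porting work.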

The first two steps form the initial phase, which aims at exhibiting heterogeneous parallelism. They are performed by programmers who have an intimate knowledge of the application's algorithms and computing methods. The third step, which forms the second phase, requires more skill in code tuning techniques.

These steps are defined with cost control in mind. As the migration process goes on, the risk of failure decreases and more manpower can be spent on the final operations. The migration methodology is oriented toward a “best effort” approach for a given period of time.



The tools presented in the figures are examples; the references at the end of this book list further ones.

Step 1: Parallel Project Definition
The top part in Figure 1 details the steps to perform in order to analyze the code and to define the main migration operations:
 * Hotspot identification: using profiling tools, this first phase aims at finding the critical hotspots that might benefit from GPU acceleration. Code rewriting might be necessary to increase data parallelism.
 * CPU analysis: CPU analysis is required to ensure that the original code is sufficiently optimized to serve as a fair baseline for performance comparison. Tuning the CPU code also usually leads to an efficient starting point for the migration.
 * Parallelism discovery: this step ensures that the kernels can be executed in parallel. If this is not the case, the accelerator will not be able to achieve a performance gain. Algorithms should then be reconsidered to exhibit parallelism.
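
Hotspot identification can be sketched with any profiler. The example below uses Python's standard cProfile module as a stand-in for the profilers listed in the references (Gprof, Oprofile and others). The toy application and all function names are hypothetical, chosen only to show how a ranking by time spent points at the code worth porting.

```python
import cProfile
import pstats


def load_input(n):
    """Cheap setup phase: build the input data."""
    return [float(i % 7) for i in range(n)]


def smooth(values, sweeps):
    """Repeated three-point averaging: the intended hotspot of this toy code."""
    current = list(values)
    for _ in range(sweeps):
        updated = current[:]
        for i in range(1, len(current) - 1):
            updated[i] = (current[i - 1] + current[i] + current[i + 1]) / 3.0
        current = updated
    return current


def application():
    """Hypothetical legacy application: setup followed by the compute phase."""
    return smooth(load_input(2000), 50)


def rank_hotspots(func, top=3):
    """Profile `func` and return the `top` (function name, own time) pairs."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    # Stats.stats maps (file, line, name) to
    # (primitive calls, total calls, own time, cumulative time, callers).
    raw = pstats.Stats(profiler).stats
    rows = [(name, own)
            for (_file, _line, name), (_cc, _nc, own, _ct, _callers) in raw.items()]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:top]


if __name__ == "__main__":
    for name, seconds in rank_hotspots(application):
        print(f"{name:40s} {seconds:.4f} s")
```

Ranking by own time rather than cumulative time keeps the driver routine from masking the loop nest that actually consumes the cycles.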



Step 2: Application Porting
The bottom box of Figure 2 gives the steps to develop and build a first functional GPU version of an application. They mainly consist of generating and calling the GPU kernels by annotating the previously identified hotspots with HMPP directives.

These steps are performed incrementally. Kernels are ported and validated one by one. Their performance is compared with the original CPU performance to check whether they are suited to GPU execution. Basic code transformations, as advised by the HMPP Wizard, are applied to the kernel computations to make them GPU-friendly. Data transfers receive a first level of optimization: data is preloaded before codelet execution, and redundant transfers of constant data are suppressed. This preliminary port serves to identify GPGPU issues and to validate the parallel properties of the implementation. Because changes can be tracked, working incrementally makes it easier to find and correct bugs and, at worst, to roll back. It is very easy to lose one's way when applying too many transformations at once.
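
The port-and-validate loop described above can be sketched as a small harness that runs the reference kernel and its ported version on the same inputs, compares the results within a tolerance, and records both timings. This is a minimal sketch under stated assumptions: `check_port`, the SAXPY kernels and the tolerances are hypothetical, and the "ported" kernel is a plain CPU rewrite standing in for a generated GPU codelet so that the example stays runnable.

```python
import math
import time


def saxpy_reference(a, x, y):
    """Original CPU kernel: element-wise a * x + y."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def saxpy_ported(a, x, y):
    """Stand-in for the ported kernel (a real port would call the GPU codelet)."""
    out = [0.0] * len(x)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]
    return out


def check_port(reference, ported, args, rel_tol=1e-9, abs_tol=1e-12):
    """Run both versions on `args`; report validity and elapsed times in seconds."""
    start = time.perf_counter()
    expected = reference(*args)
    middle = time.perf_counter()
    actual = ported(*args)
    end = time.perf_counter()
    valid = len(expected) == len(actual) and all(
        math.isclose(e, a, rel_tol=rel_tol, abs_tol=abs_tol)
        for e, a in zip(expected, actual)
    )
    return {"valid": valid, "reference_s": middle - start, "ported_s": end - middle}


if __name__ == "__main__":
    n = 100_000
    inputs = (2.0, [float(i) for i in range(n)], [1.0] * n)
    report = check_port(saxpy_reference, saxpy_ported, inputs)
    # A kernel is kept only if it is valid; its timing then decides whether
    # it is worth running on the accelerator.
    print(report)
```

Running the check after every single kernel port keeps each change small enough that a failed validation points directly at the last transformation.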

Step 3: Application Optimization


In this last step, the whole hybrid application is optimized by further reducing data transfers, by fine-tuning GPU kernel performance, and by moving GPU device allocation to application startup.
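
Why reducing data transfers matters can be illustrated with a deliberately simple cost model: total time is kernel time plus transfer time, and data that stays resident on the device is transferred once instead of at every call. The model, the function name and every number below are assumptions made for illustration; real transfers also involve latency and possible overlap with computation, which this sketch ignores.

```python
def hybrid_time(cpu_call_s, kernel_speedup, bytes_per_call,
                bandwidth_bytes_per_s, calls, resident):
    """Estimated time in seconds for `calls` kernel invocations on the accelerator.

    cpu_call_s:            time of one call on the CPU.
    kernel_speedup:        assumed speedup of the kernel itself on the GPU.
    bytes_per_call:        data volume the kernel needs on the device.
    bandwidth_bytes_per_s: assumed host-to-device transfer bandwidth.
    resident:              True if the data is uploaded once and reused.
    """
    compute_s = calls * cpu_call_s / kernel_speedup
    uploads = 1 if resident else calls
    transfer_s = uploads * bytes_per_call / bandwidth_bytes_per_s
    return compute_s + transfer_s


if __name__ == "__main__":
    # Hypothetical kernel: 10 ms per call on the CPU, 20x faster on the GPU,
    # 80 MB of input, 8 GB/s transfer bandwidth, called 100 times.
    params = dict(cpu_call_s=0.010, kernel_speedup=20.0, bytes_per_call=80e6,
                  bandwidth_bytes_per_s=8e9, calls=100)
    print("CPU only:          ", 100 * 0.010, "s")
    print("transfer each call:", hybrid_time(resident=False, **params), "s")
    print("data kept resident:", hybrid_time(resident=True, **params), "s")
```

With these assumed figures, the kernel that re-uploads its input at every call ends up slightly slower than the CPU version, while the same kernel with resident data is far faster, so transfer reduction can decide whether a port pays off at all.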

Hardware/Software environment

 * What are the target architectures and operating systems?
 * What are the constraints on compilers, libraries, software licenses (e.g. GPL)?

Application code

 * Are all codes and necessary libraries available on the target machine(s)?
 * Are there representative input data sets available?
 * Do the parallel execution results need to be bitwise equivalent to the sequential ones?
 * Is the procedure to validate the execution results defined (taking into account changes in floating point rounding)?
 * Is there a reference person able to answer questions about the application code and algorithm?
 * Are the performance goals (and on which execution profiles) clearly defined?
 * Is there a functional description and documentation of the code available?
 * What kinds of production runs are usually performed (e.g. throughput mode, under deadline, etc.)?
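
The questions about bitwise equivalence and floating-point rounding can be made concrete with a small experiment. Floating-point addition is not associative, so a parallel reduction that combines partial sums can differ in the last bits from a sequential loop over the same data. The sketch below is illustrative: the function names and the tolerance are assumptions, and the input values are chosen specifically so that the effect is visible with only four numbers.

```python
import math


def sequential_sum(values):
    """Left-to-right accumulation, as in the original sequential loop."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, chunks):
    """Sum each chunk separately, then combine the partial sums.

    This mimics the evaluation order of a parallel reduction.
    """
    size = math.ceil(len(values) / chunks)
    partials = [sequential_sum(values[i:i + size])
                for i in range(0, len(values), size)]
    return sequential_sum(partials)


def results_match(expected, actual, rel_tol=1e-12):
    """Tolerance-based validation instead of a bitwise comparison."""
    return math.isclose(expected, actual, rel_tol=rel_tol)


if __name__ == "__main__":
    # Near 1e16 the spacing between adjacent doubles is 2, so adding 1.0
    # three times in sequence is lost, while adding 1.0 + 1.0 first is not.
    data = [1e16, 1.0, 1.0, 1.0]
    seq = sequential_sum(data)
    par = chunked_sum(data, 2)
    print("bitwise equal:   ", seq == par)
    print("within tolerance:", results_match(seq, par))
```

If bitwise equivalence is required, the parallel version has to reproduce the sequential evaluation order, which usually limits the achievable parallelism; a tolerance agreed with the application owner is often the more practical validation criterion.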

Books and Papers

 * Banerjee, U., Bliss, B., Ma, Z., and Petersen, P., “Unraveling Data Race Detection in the Intel® Thread Checker,” presented at the First Workshop on Software Tools for Multi-core Systems (STMCS), in conjunction with IEEE/ACM International Symposium on Code Generation and Optimization (CGO), March 26, 2006, Manhattan, New York, NY.
 * D.F. Bacon, S.L. Graham, O.J. Sharp, Compiler Transformations for High-Performance Computing, "ACM Computing Surveys", December 1994, vol. 26, no 4, pp 345-420
 * David Blair Kirk, Wen-mei W. Hwu: Programming Massively Parallel Processors - A Hands-on Approach. Morgan Kaufmann 2010: I-XVIII, 1-258
 * F. Bodin, S. Bihan, “Heterogeneous Multicore Parallel Programming for Graphics Processing Units”, the Scientific Programming Journal, Volume 17, Number 4, 2009.
 * G. Hager and G. Wellein: Introduction to High Performance Computing for Scientists and Engineers. CRC Computational Science Series, 2010. ISBN 978-1439811924
 * Herlihy, M. and Shavit, N., The Art of Multiprocessor Programming, Morgan Kaufmann, 2008.
 * John L. Hennessy and David A. Patterson. 2003. Computer Architecture; a Quantitative Approach (3rd ed.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
 * Kennedy, K. and Allen, J. R. 2002 Optimizing Compilers for Modern Architectures: a Dependence-Based Approach. Morgan Kaufmann Publishers Inc.
 * Performance Tuning of Scientific Applications, David H. Bailey, Robert F. Lucas, Samuel Williams
 * S. Akhter: Multicore Programming: Increasing Performance Through Software Multi-threading. Intel Press, 2006. ISBN 978-0976483243
 * Timothy Mattson, Beverly Sanders, and Berna Massingill. 2004. Patterns for Parallel Programming (First ed.). Addison-Wesley Professional.
 * U. Banerjee. Dependence Analysis for Supercomputing. Kluwer Academic Publishers, Norwell, Mass., 1988.

Online Resources
Many resources are available online; here are a few.


 * ACM Parallel Computing Tech Pack: http://techpack.acm.org/parallel/JourneymanTour.pdf
 * Allinea DDT: http://www.allinea.com/products/ddt/
 * Allinea OPT: http://www.allinea.com/products/opt/
 * Automatic parallelization from compilers such as Intel compilers (http://software.intel.com/en-us/articles/automatic-parallelization-with-intel-compilers/), PGI compiler (http://www.pgroup.com/products/pgicdk.htm), PathScale compiler (http://www.pathscale.com/pdf/QuickReference.pdf), etc.
 * Automatically Tuned Linear Algebra Software : http://math-atlas.sourceforge.net/
 * Designing and Building Parallel Programs: http://www.mcs.anl.gov/~itf/dbpp/
 * Discrete Fourier transform: http://www.fftw.org/
 * GPGPU.org is a central resource for GPGPU news and information: http://gpgpu.org
 * Gprof: the GNU Profiler. http://www.cs.utah.edu/dept/old/texinfo/as/gprof_toc.html.
 * GPUCV (GPU-accelerated image processing): https://picoforge.int-evry.fr/cgi-bin/twiki/view/Gpucv/Web/
 * HMPP Workbench: http://www.caps-entreprise.com/fr/page/index.php?id=49&p_p=36
 * HpcToolkit: http://hpctoolkit.org/
 * http://bebop.cs.berkeley.edu/oski/
 * http://developer.amd.com/gpu/acmlgpu/Pages/default.aspx
 * http://developer.amd.com/zones/openclzone/pages/default.aspx
 * http://developer.amd.com/zones/OpenCLZone/pages/toolsandlibraries.aspx
 * http://en.wikipedia.org/wiki/Automatic_parallelization
 * http://en.wikipedia.org/wiki/Data_dependency
 * http://en.wikipedia.org/wiki/HMPP_Open_Standard
 * http://en.wikipedia.org/wiki/Loop_nest_optimization
 * http://en.wikipedia.org/wiki/Parallel_computing
 * http://golem5.org/gatlas/
 * http://icl.cs.utk.edu/magma/
 * http://math.nist.gov/sparselib++/
 * http://openmp.org/wp/
 * http://software.intel.com/en-us/articles/intel-mkl/
 * http://support.amd.com/us/Processor_TechDocs/40546.pdf
 * http://www-users.cs.umn.edu/~karypis/parbook/
 * http://www.akkadia.org/drepper/cpumemory.pdf
 * http://www.cs.berkeley.edu/~volkov/volkov09-optimizing.pdf
 * http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
 * http://www.culatools.com/
 * http://www.hipeac.net/system/files/NemaLabs_0.pdf
 * http://www.khronos.org/opencl
 * http://www.nas.nasa.gov/Resources/Software/npb.html
 * http://www.nemalabs.com/
 * http://www.netlib.org/lapack/
 * http://www.NVIDIA.com/content/GTC/documents/1418_GTC09.pdf
 * http://www.pathscale.com/pdf/PathScale-ENZO-1.0-UserGuide.pdf
 * http://www.vi-hps.org/
 * http://www.vi-hps.org/training/
 * https://computing.llnl.gov/tutorials/openMP/
 * Intel IPP: http://www.intel.com/software/products/ipp
 * Introduction to parallel programming: https://computing.llnl.gov/tutorials/parallel_comp/
 * Linux “time” command: https://computing.llnl.gov/tutorials/performance_tools/#time
 * Multicore association: http://www.multicore-association.org/workgroup/mpp.php
 * NVIDIA CUDA: http://developer.NVIDIA.com/object/cuda home.html
 * NVIDIA NSight: http://developer.NVIDIA.com/NVIDIA-parallel-nsight
 * Opari: http://www.fz-juelich.de/zam/kojak/opari
 * OpenCL best practice: http://www.NVIDIA.com/content/cudazone/CUDABrowser/downloads/papers/NVIDIA_OpenCL_BestPracticesGuide.pdf
 * OpenCV (Open Source Computer Vision) : http://opencv.willowgarage.com/wiki/
 * OpenHMPP: http://www.openhmpp.org/
 * Oprofile: http://oprofile.sourceforge.net/
 * PAPI (Performance API): http://icl.cs.utk.edu/papi/
 * Paraver: http://www.bsc.es/plantillaA.php?cat_id=485
 * Performance Analysis Tools: https://computing.llnl.gov/tutorials/performance_tools/
 * Rogue Wave TotalView: http://www.roguewave.com/products/totalview-family/totalview.aspx
 * TAU: http://tau.uoregon.edu
 * Valgrind: http://valgrind.org/
 * Vampir: http://www.vampir.eu