Commit

Merge pull request #17 from ecrc/acharara/batch-triangular
v2.0.0
acharara authored Nov 15, 2017
2 parents 86032bf + c4edf09 commit f3b5902
Showing 238 changed files with 39,118 additions and 9,487 deletions.
2,375 changes: 1,452 additions & 923 deletions Doxyfile


35 changes: 35 additions & 0 deletions INSTALL
@@ -0,0 +1,35 @@
KBLAS installation requires a recent **make**.
To build KBLAS, please follow these instructions:

1. Get KBLAS from the git repository

git clone git@github.com:ecrc/kblas-gpu

or

git clone https://github.com/ecrc/kblas-gpu

2. Go into the KBLAS folder

cd kblas-gpu

3. Edit the file make.inc to:
- Enable / disable KBLAS sub-modules (_SUPPORT_BLAS2_, _SUPPORT_BLAS3_, _SUPPORT_BATCH_TR_, _SUPPORT_SVD_).
- Enable / disable usage of third-party libraries (_USE_MKL_, _USE_MAGMA_) for performance comparisons.
- Provide paths for third-party libraries if required (_CUB_DIR_, _MAGMA_ROOT_).
- Specify the CUDA architecture to compile for (_CUDA_ARCH_).

or

- Provide equivalent environment variables.

(A sample make.inc sketch is given below, after step 5.)

4. Build KBLAS

make

5. Build local documentation (optional)

make docs

The KBLAS library will be built in the ./lib folder.
Enjoy.
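
For reference, a minimal make.inc sketch follows. Only the variable names come from the instructions above; the values, paths, and assignment style are illustrative assumptions and should be adapted to the local system (the actual make.inc shipped with KBLAS may differ):

    # KBLAS sub-modules to build
    _SUPPORT_BLAS2_ = TRUE
    _SUPPORT_BLAS3_ = TRUE
    _SUPPORT_BATCH_TR_ = TRUE
    _SUPPORT_SVD_ = TRUE
    # Optional third-party libraries for performance comparisons
    _USE_MKL_ = FALSE
    _USE_MAGMA_ = FALSE
    # Paths to third-party dependencies (illustrative)
    _CUB_DIR_ = $(HOME)/cub
    _MAGMA_ROOT_ = /opt/magma
    # Target CUDA architecture (illustrative value)
    _CUDA_ARCH_ = 35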
103 changes: 74 additions & 29 deletions Jenkinsfile
@@ -22,51 +22,96 @@ pipeline {
}

stages {
stage ('cuda-7.0') {
stage ('cuda-8.0') {
steps {
sh '''#!/bin/bash -le
module load gcc/4.8.5 cuda/7.0; make clean; make all
module load gcc/4.8.5;
module load cuda/8.0
module load intel/16
module list
set -x
export _MAGMA_ROOT_=/opt/ecrc/magma/2.2.0-intel-16-mkl-cuda-8.0
export _CUB_DIR_=$PWD/cub
if [ -d cub ]
then
cd cub; git pull; cd ..
else
git clone https://github.com/NVLABS/cub cub
fi
make clean
make
export CUDA_VISIBLE_DEVICES=2; export NGPUS=1
sed -i s/STEP_DIM=.*/STEP_DIM=1024/ ./kblas-test-l2.sh
sed -i s/STOP_DIM=.*/STOP_DIM=4096/ ./kblas-test-l2.sh
./kblas-test-l2.sh
sed -i s/"ranges = "/"ranges=\\[\\"--range 128:1024:128\\"\\]\\nranges = "/ ./kblas-test-l3.py
sed -i "/ranges = /,/\\]/d" ./kblas-test-l3.py
./kblas-test-l3.py
sed -i "/--range 2048:15360:1024/d" ./kblas-test-l3.py
./kblas-test-l3.py
sed -i s/STEP_DIM=.*/STEP_DIM=1024/ ./test-scripts/kblas-test-l2.sh
sed -i s/STOP_DIM=.*/STOP_DIM=4096/ ./test-scripts/kblas-test-l2.sh
./test-scripts/kblas-test-l2.sh
sed -i s/"ranges = "/"ranges=\\[\\"--range 128:1024:128\\"\\]\\nranges = "/ ./test-scripts/kblas-test-l3.py
sed -i "/ranges = /,/\\]/d" ./test-scripts/kblas-test-l3.py
./test-scripts/kblas-test-l3.py
sed -i "/--range 2048:15360:1024/d" ./test-scripts/kblas-test-l3.py
./test-scripts/kblas-test-l3.py
./test-scripts/kblas-test-batch-parallel.py
'''
}
}
stage ('cuda-7.5') {
steps {
sh '''#!/bin/bash -le
module load gcc/4.8.5 cuda/7.5; make clean; make all
module load gcc/4.8.5;
module load cuda/7.5
module load intel/16
module list
set -x
export _MAGMA_ROOT_=/opt/ecrc/magma/2.2.0-intel-16-mkl-cuda-7.5
export _CUB_DIR_=$PWD/cub
if [ -d cub ]
then
cd cub; git pull; cd ..
else
git clone https://github.com/NVLABS/cub cub
fi
make clean
make
export CUDA_VISIBLE_DEVICES=2; export NGPUS=1
sed -i s/STEP_DIM=.*/STEP_DIM=1024/ ./kblas-test-l2.sh
sed -i s/STOP_DIM=.*/STOP_DIM=4096/ ./kblas-test-l2.sh
./kblas-test-l2.sh
sed -i s/"ranges = "/"ranges=\\[\\"--range 128:1024:128\\"\\]\\nranges = "/ ./kblas-test-l3.py
sed -i "/ranges = /,/\\]/d" ./kblas-test-l3.py
./kblas-test-l3.py
sed -i "/--range 2048:15360:1024/d" ./kblas-test-l3.py
./kblas-test-l3.py
sed -i s/STEP_DIM=.*/STEP_DIM=1024/ ./test-scripts/kblas-test-l2.sh
sed -i s/STOP_DIM=.*/STOP_DIM=4096/ ./test-scripts/kblas-test-l2.sh
./test-scripts/kblas-test-l2.sh
sed -i s/"ranges = "/"ranges=\\[\\"--range 128:1024:128\\"\\]\\nranges = "/ ./test-scripts/kblas-test-l3.py
sed -i "/ranges = /,/\\]/d" ./test-scripts/kblas-test-l3.py
./test-scripts/kblas-test-l3.py
sed -i "/--range 2048:15360:1024/d" ./test-scripts/kblas-test-l3.py
./test-scripts/kblas-test-l3.py
./test-scripts/kblas-test-batch-parallel.py
'''
}
}
stage ('cuda-8.0') {
stage ('cuda-7.0') {
steps {
sh '''#!/bin/bash -le
module load gcc/4.8.5 cuda/8.0; make clean; make all
module load gcc/4.8.5;
module load cuda/7.0
module load intel/16
module list
set -x
export _MAGMA_ROOT_=/opt/ecrc/magma/2.0.1-intel-16-mkl-cuda-7.0/
export _CUB_DIR_=$PWD/cub
if [ -d cub ]
then
cd cub; git pull; cd ..
else
git clone https://github.com/NVLABS/cub cub
fi
make clean
make
export CUDA_VISIBLE_DEVICES=2; export NGPUS=1
sed -i s/STEP_DIM=.*/STEP_DIM=1024/ ./kblas-test-l2.sh
sed -i s/STOP_DIM=.*/STOP_DIM=4096/ ./kblas-test-l2.sh
./kblas-test-l2.sh
sed -i s/"ranges = "/"ranges=\\[\\"--range 128:1024:128\\"\\]\\nranges = "/ ./kblas-test-l3.py
sed -i "/ranges = /,/\\]/d" ./kblas-test-l3.py
./kblas-test-l3.py
sed -i "/--range 2048:15360:1024/d" ./kblas-test-l3.py
./kblas-test-l3.py
sed -i s/STEP_DIM=.*/STEP_DIM=1024/ ./test-scripts/kblas-test-l2.sh
sed -i s/STOP_DIM=.*/STOP_DIM=4096/ ./test-scripts/kblas-test-l2.sh
./test-scripts/kblas-test-l2.sh
sed -i s/"ranges = "/"ranges=\\[\\"--range 128:1024:128\\"\\]\\nranges = "/ ./test-scripts/kblas-test-l3.py
sed -i "/ranges = /,/\\]/d" ./test-scripts/kblas-test-l3.py
./test-scripts/kblas-test-l3.py
sed -i "/--range 2048:15360:1024/d" ./test-scripts/kblas-test-l3.py
./test-scripts/kblas-test-l3.py
./test-scripts/kblas-test-batch-parallel.py
'''
}
}
4 changes: 2 additions & 2 deletions LICENSE
@@ -1,4 +1,4 @@
Copyright (c) 2016, Extreme Computing Research Center
Copyright (c) 2012-, King Abdullah University of Science and Technology
All rights reserved.

Redistribution and use in source and binary forms, with or without
@@ -11,7 +11,7 @@ modification, are permitted provided that the following conditions are met:
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

* Neither the name of KBLAS-GPU nor the names of its
* Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

2 changes: 1 addition & 1 deletion Makefile
@@ -3,7 +3,7 @@
all:
(cd src && make -j)
(cd testing && make -j)

clean:
rm -f -v ./lib/*.a
(cd src && make clean)
114 changes: 103 additions & 11 deletions README.md
@@ -1,15 +1,107 @@
# kblas-gpu
=========================
KBLAS README FILE
=========================

KBLAS is an optimized library for a subset of Basic Linear Algebra Subroutines (BLAS) on NVIDIA GPUs.
What is KBLAS
=============

KAUST BLAS (KBLAS) is a high-performance CUDA library implementing a subset of BLAS as well as Linear Algebra PACKage (LAPACK) routines on NVIDIA GPUs. Using recursive and batch algorithms, KBLAS maximizes GPU bandwidth, reuses locally cached data, and increases device occupancy. KBLAS is therefore a comprehensive and efficient framework that remains versatile across various workload sizes. Located at the bottom of the usual software stack, KBLAS enables higher-level numerical libraries and scientific applications to extract the expected performance from GPU hardware accelerators.

KBLAS is written in CUDA C. It requires CUDA Toolkit for installation.

* Installation
To install KBLAS, you need to have CUDA Toolkit installed (version 5.0 or higher is recommended)
All that is required is to edit the make.inc file and then type make. Specify the following in
your make.inc file:
- The directory of the CUDA Toolkit installation (default: /usr/local/cuda)
- The target GPU architecture: currently "fermi" or "kepler". KBLAS was not tested on pre-fermi GPUs


Current Features of KBLAS
=========================

KBLAS provides highly optimized routines from various levels of BLAS and LAPACK, including:

1. Legacy Level-2 BLAS: (⇟⎐ ⚭ ⚬) SYMV, GEMV, HEMV.
2. Legacy Level-3 BLAS: (⇟⎐ ⚭ ⚬) TRSM, TRMM, GEMM (⚭ only).
3. Batch Level-3 BLAS: (⇟⎏ ⚭ ⚬= ✼) TRSM, TRMM, SYRK.
4. Batch Triangular: (⎏⇞ ⚭ ⚬= ✼) TRTRI, LAUUM.
5. Batch Symmetric: (⎏⇞ ⚭ ⚬= ✼) POTRF, POTRS, POSV, POTRI, POTI.
6. Batch General: (⎐⇟ ⚭ ⚬= ✼) GESVJ, GERSVD, GEQRF.

⇟ Standard precisions: s/d/c/z.
⇞ Real precisions: s/d.
⎏ Very small matrix sizes.
⎐ Arbitrary sizes.
⚬ Single-GPU support.
⚭ Multi-GPU support.
= Uniform batch sizes.
✼ Non-Strided and Strided variants
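
As background on the batch interfaces, the sketch below (plain CUDA C, not KBLAS API; names and sizes are illustrative) shows the two memory layouts that "Strided" and "Non-Strided" batch routines generally operate on: one contiguous buffer with a fixed stride between matrices, versus an array of per-matrix device pointers.

    /* Background sketch (not KBLAS API): strided vs. pointer-array storage
     * for a uniform batch of n-by-n matrices, using only the CUDA runtime. */
    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(void)
    {
        const int n = 32;              /* uniform matrix size (illustrative) */
        const int batch = 100;         /* number of matrices (illustrative)  */
        const size_t mat_elems = (size_t)n * n;

        /* Strided variant: one contiguous allocation; matrix i starts at
         * d_strided + i * stride (here stride = n * n elements).            */
        double *d_strided = NULL;
        cudaMalloc((void **)&d_strided, mat_elems * batch * sizeof(double));

        /* Non-strided variant: one allocation per matrix plus a device array
         * of pointers, which pointer-based batch interfaces typically take. */
        double **h_ptrs = (double **)malloc(batch * sizeof(double *));
        for (int i = 0; i < batch; ++i)
            cudaMalloc((void **)&h_ptrs[i], mat_elems * sizeof(double));

        double **d_ptrs = NULL;
        cudaMalloc((void **)&d_ptrs, batch * sizeof(double *));
        cudaMemcpy(d_ptrs, h_ptrs, batch * sizeof(double *),
                   cudaMemcpyHostToDevice);

        /* A strided batch routine would consume (d_strided, lda, stride, batch);
         * a non-strided one would consume (d_ptrs, lda, batch).              */

        for (int i = 0; i < batch; ++i)
            cudaFree(h_ptrs[i]);
        free(h_ptrs);
        cudaFree(d_ptrs);
        cudaFree(d_strided);
        return 0;
    }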


Installation
============

KBLAS installation requires a recent **make**.
To build KBLAS, please follow these instructions:

1. Get KBLAS from the git repository

git clone git@github.com:ecrc/kblas-gpu

or

git clone https://github.com/ecrc/kblas-gpu

2. Go into the KBLAS folder

cd kblas-gpu

3. Edit the file make.inc to:
- Enable / disable KBLAS sub-modules (_SUPPORT_BLAS2_, _SUPPORT_BLAS3_, _SUPPORT_BATCH_TR_, _SUPPORT_SVD_).
- Enable / disable usage of third-party libraries (_USE_MKL_, _USE_MAGMA_) for performance comparisons.
- Provide paths for third-party libraries if required (_CUB_DIR_, _MAGMA_ROOT_).
- Specify the CUDA architecture to compile for (_CUDA_ARCH_).

or

- Provide equivalent environment variables (a shell sketch of this alternative is given after step 5).

4. Build KBLAS

make

5. Build local documentation (optional)

make docs
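
As an alternative to editing make.inc, step 3 notes that equivalent environment variables can be provided instead. A minimal shell sketch of this route follows; the variable names match those above (and the _MAGMA_ROOT_ / _CUB_DIR_ exports used in the Jenkinsfile), while the paths and values are illustrative assumptions:

    # Illustrative paths and values; adjust to the local installation.
    export _CUB_DIR_=$HOME/cub          # CUB headers, e.g. a clone of the cub repository
    export _MAGMA_ROOT_=/opt/magma      # only needed if _USE_MAGMA_ is enabled
    export _CUDA_ARCH_=35               # target GPU architecture
    make clean
    make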


Testing
=======

The folder 'testing' includes a set of sample programs that illustrate the usage of each KBLAS routine and test its performance and accuracy against other vendor libraries.
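
A minimal invocation sketch, based on the CI commands in the Jenkinsfile above (the script names and environment variables appear there; the device index and GPU count are illustrative):

    # Select the GPU(s) to use (illustrative values).
    export CUDA_VISIBLE_DEVICES=0
    export NGPUS=1
    # Level-2 and Level-3 BLAS test drivers
    ./test-scripts/kblas-test-l2.sh
    ./test-scripts/kblas-test-l3.py
    # Batched-routine tests
    ./test-scripts/kblas-test-batch-parallel.py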


Related Publications
====================

1. A. Charara, D. Keyes, and H. Ltaief, Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs,
*Submitted to ACM Trans. Math. Software*, http://hdl.handle.net/10754/622077, 2017.

2. W. H. Boukaram, G. Turkiyyah, H. Ltaief, and D. Keyes, Batched QR and SVD algorithms on GPUs with applications in hierarchical matrix
compression, *J. Parallel Comput.*, Special Edition, 2017.

3. A. Abdelfattah, D. Keyes, and H. Ltaief, KBLAS: an optimized library for dense matrix-vector multiplication on GPU accelerators, *ACM
Trans. Math. Software 42(3)*, DOI: http://dx.doi.org/10.1145/2818311, 2016.

4. A. Charara, D. Keyes, and H. Ltaief, A Framework for Dense Triangular Matrix Kernels on Various Manycore Architectures, *Concurr.
Comput.: Prac. Experience*, http://hdl.handle.net/10754/622077, 2016.

5. A. Charara, H. Ltaief, and D. Keyes, Redesigning Triangular Dense Matrix Computations on GPUs, *22nd International Euro-Par Conference
on Parallel and Distributed Computing*, Best papers, DOI: http://dx.doi.org/10.1007/978-3-319-43659-3_35, 2016.

6. A. Abdelfattah, H. Ltaief, and D. Keyes, High Performance Multi-GPU SpMV for Multi-component PDE-Based Applications, *21st
International Euro-Par Conference on Parallel and Distributed Computing*, 2015.

7. A. Abdelfattah, D. Keyes, and H. Ltaief, Systematic Approach in Optimizing Numerical Memory-Bound Kernels on GPU, *18th
International Euro-Par Conference on Parallel and Distributed Computing*, 2013.

8. A. Abdelfattah, J. Dongarra, D. Keyes, and H. Ltaief, Optimizing Memory-Bound SyMV Kernel on GPU Hardware Accelerators, *10th
International Conference High Performance Computing for Computational Science - VECPAR*, DOI: http://dx.doi.org/10.1007/978-3-642-38718-0_10, 2012.


Handout
=======
![Handout](docs/KBLAS_handout.png)
Binary file added docs/KBLAS-brochure.pdf
Binary file added docs/KBLAS_handout.png
Binary file added docs/kblas_logo_mini.png