Commit

Merge pull request #17 from ecrc/acharara/batch-triangular
v2.0.0
acharara authored Nov 15, 2017
2 parents 86032bf + c4edf09 commit f3b5902
Showing 238 changed files with 39,118 additions and 9,487 deletions.
2,375 changes: 1,452 additions & 923 deletions Doxyfile


35 changes: 35 additions & 0 deletions INSTALL
@@ -0,0 +1,35 @@
KBLAS installation requires a recent **make**.
To build KBLAS, please follow these instructions:

1. Get KBLAS from the git repository

git clone git@github.com:ecrc/kblas-gpu

or

git clone https://github.com/ecrc/kblas-gpu

2. Go into the KBLAS folder

cd kblas-gpu

3. Edit the file make.inc to:
- Enable / disable KBLAS sub-modules (_SUPPORT_BLAS2_, _SUPPORT_BLAS3_, _SUPPORT_BATCH_TR_, _SUPPORT_SVD_).
- Enable / disable usage of third-party libraries (_USE_MKL_, _USE_MAGMA_) for performance comparisons.
- Provide paths for third-party libraries if required (_CUB_DIR_, _MAGMA_ROOT_).
- Specify the CUDA architecture to compile for (_CUDA_ARCH_).

or

- Provide equivalent environment variables.

(A sample make.inc sketch is given below, after step 5.)

4. Build KBLAS

make

5. Build local documentation (optional)

make docs

The KBLAS library will be built in the ./lib folder.
Enjoy.
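
For reference, a minimal make.inc sketch follows. Only the variable names come from the instructions above; the values, paths, and assignment style are illustrative assumptions and should be adapted to the local system (the actual make.inc shipped with KBLAS may differ):

    # KBLAS sub-modules to build
    _SUPPORT_BLAS2_ = TRUE
    _SUPPORT_BLAS3_ = TRUE
    _SUPPORT_BATCH_TR_ = TRUE
    _SUPPORT_SVD_ = TRUE
    # Optional third-party libraries for performance comparisons
    _USE_MKL_ = FALSE
    _USE_MAGMA_ = FALSE
    # Paths to third-party dependencies (illustrative)
    _CUB_DIR_ = $(HOME)/cub
    _MAGMA_ROOT_ = /opt/magma
    # Target CUDA architecture (illustrative value)
    _CUDA_ARCH_ = 35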
103 changes: 74 additions & 29 deletions Jenkinsfile
@@ -22,51 +22,96 @@ pipeline {
}

stages {
stage ('cuda-7.0') {
stage ('cuda-8.0') {
steps {
sh '''#!/bin/bash -le
module load gcc/4.8.5 cuda/7.0; make clean; make all
module load gcc/4.8.5;
module load cuda/8.0
module load intel/16
module list
set -x
export _MAGMA_ROOT_=/opt/ecrc/magma/2.2.0-intel-16-mkl-cuda-8.0
export _CUB_DIR_=$PWD/cub
if [ -d cub ]
then
cd cub; git pull; cd ..
else
git clone https://github.com/NVLABS/cub cub
fi
make clean
make
export CUDA_VISIBLE_DEVICES=2; export NGPUS=1
sed -i s/STEP_DIM=.*/STEP_DIM=1024/ ./kblas-test-l2.sh
sed -i s/STOP_DIM=.*/STOP_DIM=4096/ ./kblas-test-l2.sh
./kblas-test-l2.sh
sed -i s/"ranges = "/"ranges=\\[\\"--range 128:1024:128\\"\\]\\nranges = "/ ./kblas-test-l3.py
sed -i "/ranges = /,/\\]/d" ./kblas-test-l3.py
./kblas-test-l3.py
sed -i "/--range 2048:15360:1024/d" ./kblas-test-l3.py
./kblas-test-l3.py
sed -i s/STEP_DIM=.*/STEP_DIM=1024/ ./test-scripts/kblas-test-l2.sh
sed -i s/STOP_DIM=.*/STOP_DIM=4096/ ./test-scripts/kblas-test-l2.sh
./test-scripts/kblas-test-l2.sh
sed -i s/"ranges = "/"ranges=\\[\\"--range 128:1024:128\\"\\]\\nranges = "/ ./test-scripts/kblas-test-l3.py
sed -i "/ranges = /,/\\]/d" ./test-scripts/kblas-test-l3.py
./test-scripts/kblas-test-l3.py
sed -i "/--range 2048:15360:1024/d" ./test-scripts/kblas-test-l3.py
./test-scripts/kblas-test-l3.py
./test-scripts/kblas-test-batch-parallel.py
'''
}
}
stage ('cuda-7.5') {
steps {
sh '''#!/bin/bash -le
module load gcc/4.8.5 cuda/7.5; make clean; make all
module load gcc/4.8.5;
module load cuda/7.5
module load intel/16
module list
set -x
export _MAGMA_ROOT_=/opt/ecrc/magma/2.2.0-intel-16-mkl-cuda-7.5
export _CUB_DIR_=$PWD/cub
if [ -d cub ]
then
cd cub; git pull; cd ..
else
git clone https://github.com/NVLABS/cub cub
fi
make clean
make
export CUDA_VISIBLE_DEVICES=2; export NGPUS=1
sed -i s/STEP_DIM=.*/STEP_DIM=1024/ ./kblas-test-l2.sh
sed -i s/STOP_DIM=.*/STOP_DIM=4096/ ./kblas-test-l2.sh
./kblas-test-l2.sh
sed -i s/"ranges = "/"ranges=\\[\\"--range 128:1024:128\\"\\]\\nranges = "/ ./kblas-test-l3.py
sed -i "/ranges = /,/\\]/d" ./kblas-test-l3.py
./kblas-test-l3.py
sed -i "/--range 2048:15360:1024/d" ./kblas-test-l3.py
./kblas-test-l3.py
sed -i s/STEP_DIM=.*/STEP_DIM=1024/ ./test-scripts/kblas-test-l2.sh
sed -i s/STOP_DIM=.*/STOP_DIM=4096/ ./test-scripts/kblas-test-l2.sh
./test-scripts/kblas-test-l2.sh
sed -i s/"ranges = "/"ranges=\\[\\"--range 128:1024:128\\"\\]\\nranges = "/ ./test-scripts/kblas-test-l3.py
sed -i "/ranges = /,/\\]/d" ./test-scripts/kblas-test-l3.py
./test-scripts/kblas-test-l3.py
sed -i "/--range 2048:15360:1024/d" ./test-scripts/kblas-test-l3.py
./test-scripts/kblas-test-l3.py
./test-scripts/kblas-test-batch-parallel.py
'''
}
}
stage ('cuda-8.0') {
stage ('cuda-7.0') {
steps {
sh '''#!/bin/bash -le
module load gcc/4.8.5 cuda/8.0; make clean; make all
module load gcc/4.8.5;
module load cuda/7.0
module load intel/16
module list
set -x
export _MAGMA_ROOT_=/opt/ecrc/magma/2.0.1-intel-16-mkl-cuda-7.0/
export _CUB_DIR_=$PWD/cub
if [ -d cub ]
then
cd cub; git pull; cd ..
else
git clone https://github.com/NVLABS/cub cub
fi
make clean
make
export CUDA_VISIBLE_DEVICES=2; export NGPUS=1
sed -i s/STEP_DIM=.*/STEP_DIM=1024/ ./kblas-test-l2.sh
sed -i s/STOP_DIM=.*/STOP_DIM=4096/ ./kblas-test-l2.sh
./kblas-test-l2.sh
sed -i s/"ranges = "/"ranges=\\[\\"--range 128:1024:128\\"\\]\\nranges = "/ ./kblas-test-l3.py
sed -i "/ranges = /,/\\]/d" ./kblas-test-l3.py
./kblas-test-l3.py
sed -i "/--range 2048:15360:1024/d" ./kblas-test-l3.py
./kblas-test-l3.py
sed -i s/STEP_DIM=.*/STEP_DIM=1024/ ./test-scripts/kblas-test-l2.sh
sed -i s/STOP_DIM=.*/STOP_DIM=4096/ ./test-scripts/kblas-test-l2.sh
./test-scripts/kblas-test-l2.sh
sed -i s/"ranges = "/"ranges=\\[\\"--range 128:1024:128\\"\\]\\nranges = "/ ./test-scripts/kblas-test-l3.py
sed -i "/ranges = /,/\\]/d" ./test-scripts/kblas-test-l3.py
./test-scripts/kblas-test-l3.py
sed -i "/--range 2048:15360:1024/d" ./test-scripts/kblas-test-l3.py
./test-scripts/kblas-test-l3.py
./test-scripts/kblas-test-batch-parallel.py
'''
}
}
4 changes: 2 additions & 2 deletions LICENSE
@@ -1,4 +1,4 @@
Copyright (c) 2016, Extreme Computing Research Center
Copyright (c) 2012-, King Abdullah University of Science and Technology
All rights reserved.

Redistribution and use in source and binary forms, with or without
@@ -11,7 +11,7 @@ modification, are permitted provided that the following conditions are met:
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

* Neither the name of KBLAS-GPU nor the names of its
* Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

2 changes: 1 addition & 1 deletion Makefile
@@ -3,7 +3,7 @@
all:
(cd src && make -j)
(cd testing && make -j)

clean:
rm -f -v ./lib/*.a
(cd src && make clean)
114 changes: 103 additions & 11 deletions README.md
@@ -1,15 +1,107 @@
# kblas-gpu
=========================
KBLAS README FILE
=========================

KBLAS is an optimized library for a subset of Basic Linear Algebra Subroutines (BLAS) on NVIDIA GPUs.
What is KBLAS
=============

KAUST BLAS (KBLAS) is a high-performance CUDA library implementing a subset of BLAS as well as Linear Algebra PACKage (LAPACK) routines on NVIDIA GPUs. Using recursive and batch algorithms, KBLAS maximizes GPU bandwidth, reuses locally cached data, and increases device occupancy. KBLAS is therefore a comprehensive and efficient framework that remains versatile across various workload sizes. Located at the bottom of the usual software stack, KBLAS enables higher-level numerical libraries and scientific applications to extract the expected performance from GPU hardware accelerators.

KBLAS is written in CUDA C. It requires CUDA Toolkit for installation.

* Installation
To install KBLAS, you need to have CUDA Toolkit installed (version 5.0 or higher is recommended)
All that is required is to edit the make.inc file and then type make. Specify the following in
your make.inc file:
- The directory of the CUDA Toolkit installation (default: /usr/local/cuda)
- The target GPU architecture: currently "fermi" or "kepler". KBLAS was not tested on pre-fermi GPUs


Current Features of KBLAS
=========================

KBLAS provides highly optimized routines from various levels of BLAS and LAPACK, including:

1. Legacy Level-2 BLAS: (⇟⎐ ⚭ ⚬) SYMV, GEMV, HEMV.
2. Legacy Level-3 BLAS: (⇟⎐ ⚭ ⚬) TRSM, TRMM, GEMM (⚭ only).
3. Batch Level-3 BLAS: (⇟⎏ ⚭ ⚬= ✼) TRSM, TRMM, SYRK.
4. Batch Triangular: (⎏⇞ ⚭ ⚬= ✼) TRTRI, LAUUM.
5. Batch Symmetric: (⎏⇞ ⚭ ⚬= ✼) POTRF, POTRS, POSV, POTRI, POTI.
6. Batch General: (⎐⇟ ⚭ ⚬= ✼) GESVJ, GERSVD, GEQRF.

⇟ Standard precisions: s/d/c/z.
⇞ Real precisions: s/d.
⎏ Very small matrix sizes.
⎐ Arbitrary sizes.
⚬ Single-GPU support.
⚭ Multi-GPU support.
= Uniform batch sizes.
✼ Non-Strided and Strided variants
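
As background on the batch interfaces, the sketch below (plain CUDA C, not KBLAS API; names and sizes are illustrative) shows the two memory layouts that "Strided" and "Non-Strided" batch routines generally operate on: one contiguous buffer with a fixed stride between matrices, versus an array of per-matrix device pointers.

    /* Background sketch (not KBLAS API): strided vs. pointer-array storage
     * for a uniform batch of n-by-n matrices, using only the CUDA runtime. */
    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(void)
    {
        const int n = 32;              /* uniform matrix size (illustrative) */
        const int batch = 100;         /* number of matrices (illustrative)  */
        const size_t mat_elems = (size_t)n * n;

        /* Strided variant: one contiguous allocation; matrix i starts at
         * d_strided + i * stride (here stride = n * n elements).            */
        double *d_strided = NULL;
        cudaMalloc((void **)&d_strided, mat_elems * batch * sizeof(double));

        /* Non-strided variant: one allocation per matrix plus a device array
         * of pointers, which pointer-based batch interfaces typically take. */
        double **h_ptrs = (double **)malloc(batch * sizeof(double *));
        for (int i = 0; i < batch; ++i)
            cudaMalloc((void **)&h_ptrs[i], mat_elems * sizeof(double));

        double **d_ptrs = NULL;
        cudaMalloc((void **)&d_ptrs, batch * sizeof(double *));
        cudaMemcpy(d_ptrs, h_ptrs, batch * sizeof(double *),
                   cudaMemcpyHostToDevice);

        /* A strided batch routine would consume (d_strided, lda, stride, batch);
         * a non-strided one would consume (d_ptrs, lda, batch).              */

        for (int i = 0; i < batch; ++i)
            cudaFree(h_ptrs[i]);
        free(h_ptrs);
        cudaFree(d_ptrs);
        cudaFree(d_strided);
        return 0;
    }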


Installation
============

KBLAS installation requires a recent **make**.
To build KBLAS, please follow these instructions:

1. Get KBLAS from the git repository

git clone git@github.com:ecrc/kblas-gpu

or

git clone https://github.com/ecrc/kblas-gpu

2. Go into the KBLAS folder

cd kblas-gpu

3. Edit the file make.inc to:
- Enable / disable KBLAS sub-modules (_SUPPORT_BLAS2_, _SUPPORT_BLAS3_, _SUPPORT_BATCH_TR_, _SUPPORT_SVD_).
- Enable / disable usage of third-party libraries (_USE_MKL_, _USE_MAGMA_) for performance comparisons.
- Provide paths for third-party libraries if required (_CUB_DIR_, _MAGMA_ROOT_).
- Specify the CUDA architecture to compile for (_CUDA_ARCH_).

or

- Provide equivalent environment variables (a shell sketch of this alternative is given after step 5).

4. Build KBLAS

make

5. Build local documentation (optional)

make docs
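
As an alternative to editing make.inc, step 3 notes that equivalent environment variables can be provided instead. A minimal shell sketch of this route follows; the variable names match those above (and the _MAGMA_ROOT_ / _CUB_DIR_ exports used in the Jenkinsfile), while the paths and values are illustrative assumptions:

    # Illustrative paths and values; adjust to the local installation.
    export _CUB_DIR_=$HOME/cub          # CUB headers, e.g. a clone of the cub repository
    export _MAGMA_ROOT_=/opt/magma      # only needed if _USE_MAGMA_ is enabled
    export _CUDA_ARCH_=35               # target GPU architecture
    make clean
    make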


Testing
=======

The folder 'testing' includes a set of sample programs that illustrate the usage of each KBLAS routine and test its performance and accuracy against other vendor libraries.
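
A minimal invocation sketch, based on the CI commands in the Jenkinsfile above (the script names and environment variables appear there; the device index and GPU count are illustrative):

    # Select the GPU(s) to use (illustrative values).
    export CUDA_VISIBLE_DEVICES=0
    export NGPUS=1
    # Level-2 and Level-3 BLAS test drivers
    ./test-scripts/kblas-test-l2.sh
    ./test-scripts/kblas-test-l3.py
    # Batched-routine tests
    ./test-scripts/kblas-test-batch-parallel.py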


Related Publications
====================

1. A. Charara, D. Keyes, and H. Ltaief, Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs,
*Submitted to ACM Trans. Math. Software*, http://hdl.handle.net/10754/622077, 2017.

2. W. H. Boukaram, G. Turkiyyah, H. Ltaief, and D. Keyes, Batched QR and SVD algorithms on GPUs with applications in hierarchical matrix
compression, *J. Parallel Comput.*, Special Edition, 2017.

3. A. Abdelfattah, D. Keyes, and H. Ltaief, KBLAS: an optimized library for dense matrix-vector multiplication on GPU accelerators, *ACM
Trans. Math. Software 42(3)*, DOI: http://dx.doi.org/10.1145/2818311, 2016.

4. A. Charara, D. Keyes, and H. Ltaief, A Framework for Dense Triangular Matrix Kernels on Various Manycore Architectures, *Concurr.
Comput.: Prac. Experience*, http://hdl.handle.net/10754/622077, 2016.

5. A. Charara, H. Ltaief, and D. Keyes, Redesigning Triangular Dense Matrix Computations on GPUs, *22nd International Euro-Par Conference
on Parallel and Distributed Computing*, Best papers, DOI: http://dx.doi.org/10.1007/978-3-319-43659-3_35, 2016.

6. A. Abdelfattah, H. Ltaief, and D. Keyes, High Performance Multi-GPU SpMV for Multi-component PDE-Based Applications, *21st
International Euro-Par Conference on Parallel and Distributed Computing*, 2015.

7. A. Abdelfattah, D. Keyes, and H. Ltaief, Systematic Approach in Optimizing Numerical Memory-Bound Kernels on GPU, *18th
International Euro-Par Conference on Parallel and Distributed Computing*, 2013.

8. A. Abdelfattah, J. Dongarra, D. Keyes, and H. Ltaief, Optimizing Memory-Bound SyMV Kernel on GPU Hardware Accelerators, *10th
International Conference High Performance Computing for Computational Science - VECPAR*, DOI: http://dx.doi.org/10.1007/978-3-642-38718-0_10, 2012.


Handout
=======
![Handout](docs/KBLAS_handout.png)
Binary file added docs/KBLAS-brochure.pdf
Binary file added docs/KBLAS_handout.png
Binary file added docs/kblas_logo_mini.png