Convolution is a widely used array operation in signal processing, digital recording, image/video processing, and computer vision. This repository provides a 2D convolution algorithm written from scratch in C++ (for CPU) and CUDA C++ (for GPU), which can be used to apply filters to high-resolution images.
Tested on an NVIDIA RTX 3090 running Ubuntu 24.04.1 LTS with nvidia-driver-560 and CUDA 12.6.
Images are first converted to grayscale, and then the filter is applied.
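The grayscale step can be illustrated with a minimal sketch. The function name `rgb_to_gray`, the interleaved 8-bit RGB layout, and the Rec. 601 luminance weights are all assumptions made for illustration; the repository may convert differently (for example with a plain channel average).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: convert interleaved 8-bit RGB pixels to one float
// luminance value per pixel. The Rec. 601 weights are an assumption, not
// necessarily what the repository uses.
std::vector<float> rgb_to_gray(const std::vector<std::uint8_t>& rgb,
                               int width, int height) {
    assert(width > 0 && height > 0);
    const std::size_t n = static_cast<std::size_t>(width) * height;
    assert(rgb.size() == 3 * n);  // three channels per pixel

    std::vector<float> gray(n);
    for (std::size_t i = 0; i < n; ++i) {
        gray[i] = 0.299f * rgb[3 * i]        // red
                + 0.587f * rgb[3 * i + 1]    // green
                + 0.114f * rgb[3 * i + 2];   // blue
    }
    return gray;
}
```

The result is a single-channel image, so the convolution that follows only has to process one value per pixel.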
Table of contents
- Naive 2D convolution on a CPU.
- Naive 2D convolution on a GPU.
- 2D convolution on a GPU using constant memory for filter matrix.
- 2D convolution on a GPU using constant memory for filter matrix and tiling for shared memory usage.
- Naive 2D convolution on a GPU (using pinned memory).
- 2D convolution on a GPU using constant memory for filter matrix (using pinned memory).
- 2D convolution on a GPU using constant memory for filter matrix and tiling for shared memory usage (using pinned memory).
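As a reference point for the variants listed above, a naive CPU 2D convolution can be sketched as follows. This is an illustrative version, not the repository's code: the name `conv2d_naive`, the row-major float layout, and zero padding at the borders are assumptions. Like most image-processing code, it applies the filter without flipping it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical naive 2D convolution of a grayscale image with a square
// filter of size (2*radius + 1). Pixels outside the image are treated as
// zero; this boundary policy is an assumption.
std::vector<float> conv2d_naive(const std::vector<float>& in, int width, int height,
                                const std::vector<float>& filter, int radius) {
    const int k = 2 * radius + 1;
    assert(in.size() == static_cast<std::size_t>(width) * height);
    assert(filter.size() == static_cast<std::size_t>(k) * k);

    std::vector<float> out(in.size(), 0.0f);
    for (int row = 0; row < height; ++row) {
        for (int col = 0; col < width; ++col) {
            float acc = 0.0f;
            for (int fr = 0; fr < k; ++fr) {
                for (int fc = 0; fc < k; ++fc) {
                    const int r = row + fr - radius;  // input row under this filter tap
                    const int c = col + fc - radius;  // input column under this filter tap
                    if (r >= 0 && r < height && c >= 0 && c < width) {
                        acc += filter[fr * k + fc] * in[r * width + c];
                    }
                }
            }
            out[row * width + col] = acc;
        }
    }
    return out;
}
```

Each output pixel depends only on its own input neighbourhood, which is what makes the one-thread-per-pixel GPU versions possible. The filter is read by every pixel, and neighbouring pixels share most of their input window, which is the reuse that the constant-memory and tiling variants aim to exploit.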
CPU/GPU Filter
- In the terminal, run `make filters_cpu` or `make filters_gpu`.
- You will be asked to enter the location of the image, for example `data/8k.jpg`.
- You will be asked to type the filter name. Supported filters are as follows:
Benchmark timings in seconds for a 2048 × 1328 image, one column per program:
Step (time in seconds) | CPU | GPU (Naive) | GPU (Constant Memory) | GPU (Constant Memory + Tiling) | GPU (Pinned Memory) | GPU (Constant + Pinned Memory) | GPU (Constant + Pinned Memory + Tiling) |
---|---|---|---|---|---|---|---|
Allocating Memory | --- | 0.00044032 | 0.000191488 | 0.000313344 | 0.000217088 | 0.000176064 | 0.000154464 |
Moving input to Memory | --- | 0.0028009 | 0.00271984 | 0.00283443 | 0.00265677 | 0.00267555 | 0.0026567 |
Moving filter to Memory | --- | 8.736e-06 | 0.000128704 | 0.0002504 | 9.632e-06 | 0.000199776 | 0.000105152 |
Kernel execution | 0.0607285 | 5.20294e-05 | 5.16403e-05 | 5.53062e-05 | 4.50765e-05 | 4.3735e-05 | 5.37395e-05 |
Moving output to Memory | --- | 0.00601299 | 0.00601722 | 0.0065999 | 0.00249299 | 0.00250381 | 0.0024945 |
Total | 0.0607285 | 0.00931497 | 0.00910889 | 0.0100534 | 0.00542156 | 0.00559894 | 0.00546456 |
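The totals in the table can be turned into frame rates and speedups with a few lines of arithmetic. The sketch below is illustrative only; the helper names `fps`, `speedup`, and `kernel_share` are hypothetical and not taken from the repository.

```cpp
#include <cassert>

// Frames per second if one frame takes `total_seconds` end to end.
double fps(double total_seconds) {
    assert(total_seconds > 0.0);
    return 1.0 / total_seconds;
}

// How many times faster `seconds` is than `baseline_seconds`.
double speedup(double baseline_seconds, double seconds) {
    assert(seconds > 0.0);
    return baseline_seconds / seconds;
}

// Fraction of the total time spent in the kernel itself.
double kernel_share(double kernel_seconds, double total_seconds) {
    assert(total_seconds > 0.0);
    return kernel_seconds / total_seconds;
}
```

For example, `fps(0.00931497)` gives about 107.35 for the naive GPU run, and `speedup(0.0607285, 0.00542156)` gives roughly 11.2 for the pinned-memory run over the CPU, end to end. `kernel_share(5.20294e-05, 0.00931497)` is about 0.006, so in the naive GPU run the kernel accounts for under 1% of the total and memory transfers dominate. This is why pinned memory changes the totals far more than constant memory or tiling does.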
make 00_cpu_conv2d_benchmark.out
Loaded image with Width: 2048 and Height: 1328
Applying filter...
Time for kernel execution (seconds): 0.0607285
---------------------
Benchmarking details:
---------------------
FPS (total): 16.4667
GFLOPS (kernel): 1.2432
------------------------------------
make 01_gpu_conv2d_benchmark.out
Loaded image with Width: 2048 and Height: 1328
Allocating GPU memory...
Time for GPU memory allocation (seconds): 0.00044032
Moving input to GPU memory...
Time for input data transfer (seconds): 0.0028009
Moving filter to GPU memory...
Time for filter data transfer (seconds): 8.736e-06
Applying filter...
Time for kernel execution (seconds): 5.20294e-05
Moving result to CPU memory...
Time for output data transfer (seconds): 0.00601299
---------------------
Benchmarking details:
---------------------
Time (total): 0.00931497
FPS (total): 107.354
Time (kernel): 5.20294e-05
FPS (kernel): 19219.9
GFLOPS (kernel): 1451.05
------------------------------------
make 02_gpu_conv2d_constMem_benchmark.out
Loaded image with Width: 2048 and Height: 1328
Allocating GPU memory...
Time for GPU memory allocation (seconds): 0.000191488
Moving input to GPU memory...
Time for input data transfer (seconds): 0.00271984
Moving filter to GPU memory...
Time for filter data transfer (seconds): 0.000128704
Applying filter...
Time for kernel execution (seconds): 5.16403e-05
Moving result to CPU memory...
Time for output data transfer (seconds): 0.00601722
---------------------
Benchmarking details:
---------------------
Time (total): 0.00910889
FPS (total): 109.783
Time (kernel): 5.16403e-05
FPS (kernel): 19364.7
GFLOPS (kernel): 1461.99
------------------------------------
make 03_gpu_conv2d_tiled_benchmark.out
Loaded image with Width: 2048 and Height: 1328
Allocating GPU memory...
Time for GPU memory allocation (seconds): 0.000313344
Moving input to GPU memory...
Time for input data transfer (seconds): 0.00283443
Moving filter to GPU memory...
Time for filter data transfer (seconds): 0.0002504
Applying filter...
Time for kernel execution (seconds): 5.53062e-05
Moving result to CPU memory...
Time for output data transfer (seconds): 0.0065999
---------------------
Benchmarking details:
---------------------
Time (total): 0.0100534
FPS (total): 99.469
Time (kernel): 5.53062e-05
FPS (kernel): 18081.1
GFLOPS (kernel): 1365.08
------------------------------------
make 04_gpu_conv2d_pinnedMem_benchmark.out
Loaded image with Width: 2048 and Height: 1328
Allocating GPU memory...
Time for GPU memory allocation (seconds): 0.000217088
Moving input to GPU memory...
Time for input data transfer (seconds): 0.00265677
Moving filter to GPU memory...
Time for filter data transfer (seconds): 9.632e-06
Applying filter...
Time for kernel execution (seconds): 4.50765e-05
Moving result to CPU memory...
Time for output data transfer (seconds): 0.00249299
---------------------
Benchmarking details:
---------------------
Time (total): 0.00542156
FPS (total): 184.449
Time (kernel): 4.50765e-05
FPS (kernel): 22184.5
GFLOPS (kernel): 1674.88
------------------------------------
make 05_gpu_conv2d_pinnedConstMem_benchmark.out
Loaded image with Width: 2048 and Height: 1328
Allocating GPU memory...
Time for GPU memory allocation (seconds): 0.000176064
Moving input to GPU memory...
Time for input data transfer (seconds): 0.00267555
Moving filter to GPU memory...
Time for filter data transfer (seconds): 0.000199776
Applying filter...
Time for kernel execution (seconds): 4.3735e-05
Moving result to CPU memory...
Time for output data transfer (seconds): 0.00250381
---------------------
Benchmarking details:
---------------------
Time (total): 0.00559894
FPS (total): 178.605
Time (kernel): 4.3735e-05
FPS (kernel): 22865
GFLOPS (kernel): 1726.25
------------------------------------
make 06_gpu_conv2d_pinnedTiled_benchmark.out
Loaded image with Width: 2048 and Height: 1328
Allocating GPU memory...
Time for GPU memory allocation (seconds): 0.000154464
Moving input to GPU memory...
Time for input data transfer (seconds): 0.0026567
Moving filter to GPU memory...
Time for filter data transfer (seconds): 0.000105152
Applying filter...
Time for kernel execution (seconds): 5.37395e-05
Moving result to CPU memory...
Time for output data transfer (seconds): 0.0024945
---------------------
Benchmarking details:
---------------------
Time (total): 0.00546456
FPS (total): 182.997
Time (kernel): 5.37395e-05
FPS (kernel): 18608.3
GFLOPS (kernel): 1404.88
------------------------------------
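The logs report GFLOPS without showing how it is computed. Assuming the usual definition (floating-point operations divided by elapsed time), the relationship can be sketched as below. The helper names are hypothetical, and the repository's exact operation count is not shown here.

```cpp
#include <cassert>

// GFLOPS for a given floating-point operation count and elapsed time.
double gflops(double flop_count, double seconds) {
    assert(seconds > 0.0);
    return flop_count / seconds / 1e9;
}

// Operation count implied by a reported GFLOPS figure and its kernel time.
double implied_flops(double reported_gflops, double seconds) {
    return reported_gflops * 1e9 * seconds;
}
```

Applied to the logs above, `implied_flops(1.2432, 0.0607285)` and `implied_flops(1451.05, 5.20294e-05)` both come out at roughly 7.55e7 operations per frame. The CPU and GPU figures therefore appear to use the same operation count and differ only in kernel time.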
- Image loading and saving use the stb single-file public-domain libraries for C/C++. See lib for the specific source code.
- Example images in data: