Archive for Cuda

NVIDIA 2D Image and Signal Performance Primitives (NPP)

Based on the lack of examples and discussion in the forums, I assume the NPP are under-utilized and under-appreciated.  Since I discovered these, it has been a game changer for me in my image processing work. Since machine vision camera resolutions are now at 12 Mega-pixels and higher, its required to accelerate processing with a GPU. No longer do I need to create many of my own Cuda algorithms for 2D image processing – many of them already exist.

For example, resizing an image (x,y re-scale) is fully supported on any pixel data type and with multiple filter types, all accelerated with Cuda parallel operations (see my post and example project on an image resize implementation here).

The NVIDIA documentation is a bit sparse, the shear number of functions and sub-libraries are daunting. I suggest starting with this page.

https://docs.nvidia.com/cuda/npp/modules.html

Within this page, open the topics and drill down, I think you will be impressed with the number of Cuda functions available.

Comments off

Calling Cuda functions from C#

This is a demonstration of creating a C# wrapper for a Cuda function.

The example Cuda function is ‘invertImageCuda()’ and it is contained in a Cuda dll called ‘image_processor.dll’. This dll file must exist in the same directory as the C# exe or in the path.

The C# File

In a C# file, create a C# entry point called ‘Invert()’. This entry point is a standard C# function and can be passed in any complex C# object type.

    /// <summary>
    /// Takes an array of float values, assumed to be pixels ranging from 0,1. Applies 'pixel = 1 - pixel' to all pixels in parallel Cuda operations.
    /// Original array is un-changed, inverted image is returned in a new array. 
    /// </summary>
    /// <param name="SrcPixels"></param>
    /// <param name="srcWidth"></param>
    /// <param name="srcHeight"></param>
    /// <returns></returns>
    public static float[] Invert(float[] SrcPixels, int srcWidth, int srcHeight)
    {
        float[] DstPixels = new float[srcWidth * srcHeight];

        unsafe
        {
            GCHandle handleSrcImage = GCHandle.Alloc(SrcPixels, GCHandleType.Pinned);
            float* srcPtr = (float*)handleSrcImage.AddrOfPinnedObject();

            GCHandle handleDstImage = GCHandle.Alloc(DstPixels, GCHandleType.Pinned);
            float* dstPtr = (float*)handleDstImage.AddrOfPinnedObject();

       
            // call a local function that takes c style raw pointers
            // this local function will in turn call the Cuda function
            invert(srcPtr, dstPtr, srcWidth, srcHeight);

            handleSrcImage.Free();
            handleDstImage.Free();
            GC.Collect();
        }
        return DstPixels;
    }

The ‘unsafe’ block tells C# that we are intentionally using raw c-style pointers. In the Visual Studio project properties editor, we must also check the box that allows un-safe code.

The GCHandle.Alloc() call creates a pinned pointer to a float[] so that the garbage collector cannot move the memory while the Cuda program is accessing it. We need to create a pinned pointer (GCHandle) for both the source and destination arrays.

The AddrOfPinnedObject() returns the pinned pointer that was allocated in the Alloc() function. We need c-style raw pointers to pass into the Cuda program.

A local function, invert(), will be called passing in only simple objects of pointers and int-s.

In the same C# file, create the Cuda wrapper function:

        [DllImport("image_processor.dll")]
        unsafe static extern int invertImageCuda(float* src, float* dst, Int32 width, Int32 sheight); 
        unsafe static int invert(float* src, float* dst, Int32 width, Int32 height)
        {
            return invertImageCuda(src, dst, width, height);
        }

The DllImport() line must be immediately above the Cuda function extern declaration and tells the compiler to look for invertImageCuda() in the dll.

The ‘invert()’ function is a local static, unsafe, function that accepts raw c-style pointers and then calls into the Cuda function, returning the value returned from Cuda (which is a success/error int value). The dst pointer is used by the Cuda function as the location to write the output values.

The Cuda File

In a separate Cuda file, in Cuda dll project, create the entry point:

//invertimage.h
#ifndef INVERTIMAGE_H
#define INVERTIMAGE_H
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

// public


#ifdef __cplusplus
extern "C" {
#endif

#define CUDA_CLIB_EXPORTS
#ifdef CUDA_CLIB_EXPORTS
#define CUDA_CLIB_API __declspec(dllexport) 
#else
#define CUDA_CLIB_API __declspec(dllimport) 
#endif

    CUDA_CLIB_API cudaError_t invertImageCuda(float* src, float* dst, unsigned int width, unsigned int height);

#ifdef __cplusplus
}
#endif

//private

__global__ void invertImageKernel(float* src, float* dst, unsigned int width, unsigned int height);



#endif

This c header will not be read or used by the C# program, but rather, the C# compiler will rely on the invertImageCuda() matching declaration in the C# file. But this header with the CUDA_CLIB_API __declspec(dllexport) will tell the Cuda build to export this function as a public function. The CUDA_CLIB_EXPORTS preprocessor variable is defined locally because the cuda compiler of invert.cu will be the only compiler to see this code.

Comments off

Example implementation of nppiResize_32f_C1R_Ctx()

Project Source Code

The project source can be found here:

https://github.com/daviddrell/image_proc_samples

The project structure is Visual Studio 2019 with Cuda 11.7 installed. If you are using a different version of Cuda, I find the easiest was to solve this is to edit the Visual Studio project file in a text editor and change the version number there.

Overview

This is an example of re-scaling the size of the image in gray-scale floating point format accelerated using cuda on a GPU.

This example creates a simulated image of 2048×2048. In actual image processing applications you will have an image that comes from a jpeg or tiff file and must be decoded, often into an array of RGB bytes or directly into a gray-scale format. Many image processing operations occur on a gray-scale version of the image encoded as floating point, typically of values 0 to 1, or -1 to +1.

NVIDIA cuda comes with a library of basic image processing functions which are accelerated with parallel operations on the GPU, that run on top of the cuda library.

One of these functions is nppiResize_32f_C1R_Ctx(). The file resize.cpp implements all the memory operations necessary to resize an image using nppiResize_32f_C1R_Ctx().

The file resize.h provides a simple entry point for an image resize function which can be called from a c program with no knowledge of cuda programming.

Code Details

Refer to the gitlab project link. The sample entry point from a c programing perspective is given in main.cu. The example implementation of nppiResize_32f_C1R_Ctx() is given in resize.cpp.

Sample Results

In a machine learning application, I needed to analyze a biological image (cells growing into vessel structures imaged under a microscope). The scientist provided images that were sized at 5995 x 6207 pixels. This size is too extreme for the requirements of extracting the structures. Additionally, the AI models were trained on images typically in the range of 1000×1000 to 2000×2000 pixels. So I scale down the images using the resize_Cuda() function demonstrated in the example project.

Here is the original image that is too large:

Original Image at 5995 x 6207 pixels.

Here is the downsized image at 2000 x 2070 (the width was set to be 2000, our max AI trained size, the height was calculated to be 2070 to maintain the aspect ratio):

downsized image at 2000 x 2070 pixels

Here is the result of the analysis showing the branches and loops detected:

final analysis output

Comments off