Archive for May, 2023

NVIDIA 2D Image and Signal Performance Primitives (NPP)

Based on the lack of examples and discussion in the forums, I assume NPP is under-utilized and under-appreciated. Since I discovered these libraries, they have been a game changer for my image processing work. With machine vision camera resolutions now at 12 megapixels and higher, it's necessary to accelerate processing with a GPU. I no longer need to create many of my own Cuda algorithms for 2D image processing; many of them already exist.

For example, resizing an image (x,y re-scale) is fully supported for any pixel data type and with multiple filter types, all accelerated with Cuda parallel operations (see my post and example project on an image resize implementation here).
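
As a rough illustration (not the code from my linked example project), a resize of an 8-bit, 3-channel image that is already in GPU memory looks something like the sketch below. The nppiResize_8u_C3R function and its ROI-based signature come from recent NPP releases; the helper name, the tightly packed row strides, and the choice of linear interpolation are assumptions made for this example.

#include <npp.h>

// Sketch: resize an 8-bit, 3-channel image already resident in GPU memory.
// devSrc/devDst are device pointers; the row strides assume tightly packed rows (width * 3 bytes).
NppStatus resizeRgb8u(const Npp8u* devSrc, int srcWidth, int srcHeight,
                      Npp8u* devDst, int dstWidth, int dstHeight)
{
    NppiSize srcSize = { srcWidth, srcHeight };
    NppiRect srcRoi  = { 0, 0, srcWidth, srcHeight };
    NppiSize dstSize = { dstWidth, dstHeight };
    NppiRect dstRoi  = { 0, 0, dstWidth, dstHeight };

    // linear interpolation; other NppiInterpolationMode values select different filters
    return nppiResize_8u_C3R(devSrc, srcWidth * 3, srcSize, srcRoi,
                             devDst, dstWidth * 3, dstSize, dstRoi,
                             NPPI_INTER_LINEAR);
}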

The NVIDIA documentation is a bit sparse, and the sheer number of functions and sub-libraries is daunting. I suggest starting with this page:

https://docs.nvidia.com/cuda/npp/modules.html

Within this page, open the topics and drill down; I think you will be impressed by the number of Cuda functions available.

Comments off

Calling Cuda functions from C#

This is a demonstration of creating a C# wrapper for a Cuda function.

The example Cuda function is ‘invertImageCuda()’ and it is contained in a Cuda dll called ‘image_processor.dll’. This dll file must exist in the same directory as the C# exe or in the path.

The C# File

In a C# file, create a C# entry point called ‘Invert()’. This entry point is a standard C# function and can accept any complex C# object types as parameters.

    // requires: using System.Runtime.InteropServices; (for GCHandle and DllImport)
    /// <summary>
    /// Takes an array of float values, assumed to be pixels ranging from 0 to 1. Applies 'pixel = 1 - pixel' to all pixels in parallel Cuda operations.
    /// The original array is unchanged; the inverted image is returned in a new array.
    /// </summary>
    /// <param name="SrcPixels"></param>
    /// <param name="srcWidth"></param>
    /// <param name="srcHeight"></param>
    /// <returns></returns>
    public static float[] Invert(float[] SrcPixels, int srcWidth, int srcHeight)
    {
        float[] DstPixels = new float[srcWidth * srcHeight];

        unsafe
        {
            GCHandle handleSrcImage = GCHandle.Alloc(SrcPixels, GCHandleType.Pinned);
            float* srcPtr = (float*)handleSrcImage.AddrOfPinnedObject();

            GCHandle handleDstImage = GCHandle.Alloc(DstPixels, GCHandleType.Pinned);
            float* dstPtr = (float*)handleDstImage.AddrOfPinnedObject();

       
            // call a local function that takes c style raw pointers
            // this local function will in turn call the Cuda function
            invert(srcPtr, dstPtr, srcWidth, srcHeight);

            handleSrcImage.Free();
            handleDstImage.Free();
            GC.Collect();
        }
        return DstPixels;
    }

The ‘unsafe’ block tells C# that we are intentionally using raw C-style pointers. In the Visual Studio project properties, we must also check the ‘Allow unsafe code’ box.

The GCHandle.Alloc() call pins the float[] so that the garbage collector cannot move the memory while the Cuda code is accessing it. We need a pinned handle (GCHandle) for both the source and destination arrays.

AddrOfPinnedObject() returns the address of the array pinned by the Alloc() call. We need raw C-style pointers to pass into the Cuda code.

A local function, invert(), is then called, passing in only simple types: pointers and ints.

In the same C# file, create the Cuda wrapper function:

    [DllImport("image_processor.dll")]
    unsafe static extern int invertImageCuda(float* src, float* dst, Int32 width, Int32 height);

    unsafe static int invert(float* src, float* dst, Int32 width, Int32 height)
    {
        return invertImageCuda(src, dst, width, height);
    }

The DllImport attribute must be placed immediately above the extern declaration of the Cuda function; it tells the runtime to look for invertImageCuda() in the dll.

The ‘invert()’ function is a local static unsafe function that accepts raw C-style pointers and calls into the Cuda function, returning Cuda's return value (a success/error int). The dst pointer is where the Cuda function writes the output values.

The Cuda File

In a separate Cuda header file, in the Cuda dll project, declare the entry point:

//invertimage.h
#ifndef INVERTIMAGE_H
#define INVERTIMAGE_H
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

// public


#ifdef __cplusplus
extern "C" {
#endif

#define CUDA_CLIB_EXPORTS
#ifdef CUDA_CLIB_EXPORTS
#define CUDA_CLIB_API __declspec(dllexport) 
#else
#define CUDA_CLIB_API __declspec(dllimport) 
#endif

    CUDA_CLIB_API cudaError_t invertImageCuda(float* src, float* dst, unsigned int width, unsigned int height);

#ifdef __cplusplus
}
#endif

//private

__global__ void invertImageKernel(float* src, float* dst, unsigned int width, unsigned int height);



#endif

This C header is not read or used by the C# program; the C# compiler relies instead on the matching invertImageCuda() declaration in the C# file. However, the CUDA_CLIB_API __declspec(dllexport) in this header tells the Cuda build to export the function publicly. The CUDA_CLIB_EXPORTS preprocessor symbol is defined locally because the Cuda compiler building invert.cu is the only compiler that will see this code.
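
For completeness, here is a minimal sketch of what the invert.cu implementation could look like (only the header is shown above). It assumes invertImageCuda() receives host pointers (the pinned C# arrays) and performs the device allocation, host/device copies, and kernel launch itself; the block/grid dimensions and error handling are illustrative, not the exact production code.

// invert.cu -- minimal sketch of the dll-side implementation
#include "invertimage.h"

__global__ void invertImageKernel(float* src, float* dst, unsigned int width, unsigned int height)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
    {
        unsigned int i = y * width + x;
        dst[i] = 1.0f - src[i];   // pixels are assumed to be in the range 0..1
    }
}

cudaError_t invertImageCuda(float* src, float* dst, unsigned int width, unsigned int height)
{
    size_t bytes = (size_t)width * height * sizeof(float);
    float* devSrc = nullptr;
    float* devDst = nullptr;

    cudaError_t status = cudaMalloc(&devSrc, bytes);
    if (status != cudaSuccess) return status;
    status = cudaMalloc(&devDst, bytes);
    if (status != cudaSuccess) { cudaFree(devSrc); return status; }

    // copy the pinned host array to the device
    status = cudaMemcpy(devSrc, src, bytes, cudaMemcpyHostToDevice);
    if (status == cudaSuccess)
    {
        // 16x16 thread blocks; enough blocks to cover the whole image
        dim3 block(16, 16);
        dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
        invertImageKernel<<<grid, block>>>(devSrc, devDst, width, height);
        status = cudaGetLastError();
        if (status == cudaSuccess)
            status = cudaMemcpy(dst, devDst, bytes, cudaMemcpyDeviceToHost);
    }

    cudaFree(devSrc);
    cudaFree(devDst);
    return status;
}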

Comments off

Create a tensorflow dataset from a list of numpy images

It took me a while to figure out the most efficient way to do this, so I thought I would share it.

I was originally using tf.stack() to create a tensor from a python list of numpy images. This operation was taking 3.37 seconds to stack a list of 40 images of 256x256x3 of type uint8.

Then I found tf.convert_to_tensor(), which reduced the operation to 7 milliseconds.

    import cv2
    import numpy as np
    import tensorflow as tf

    # images_list, patch_processor, and model are assumed to be defined elsewhere
    for image_file in images_list:
        img = cv2.imread(image_file)
        height, width = img.shape[:2]

        # my model design is based on creating 256x256 patches from larger images
        patches = patch_processor.extract_patches(img)


        tensor_list=[]
        for patch in patches:
            # convert 8 bit RGB pixels to floating point in the range -1 to 1
            np_image_data = np.asarray(patch, dtype=np.float32)
            np_image_data = np_image_data / 127.5 - 1
            rgb_tensor = tf.convert_to_tensor(np_image_data, dtype=tf.float32)
            tensor_list.append(np.expand_dims(rgb_tensor, axis=0))
        
        # make one multi-dimensional array that contains all the tensor patches for batch prediction

        ## this was taking 3.37 seconds for 36 images of 256x256
        #patches_tensor = tf.stack(tensor_list)

        # convert the python list to a numpy array of image patches
        patches_tensor = np.array(tensor_list)

        # create a tensorflow dataset from the array of image patches
        dataset1 = tf.data.Dataset.from_tensor_slices(patches_tensor)

        # predict the patches
        predictions = model.predict(dataset1)

Comments off