Copying elements from an char array (pointer) to another in ocl?

Absalom

Gawd
Joined
Oct 3, 2007
Messages
711
The conditional in the for loop appears suspect:
Code:
for(unsigned long i=0;i<=*arraySize;i++) { ... }
Using `<=` means the kernel copies every byte in the buffer plus one byte past the end. I'm not sure whether this is directly related to the problem you're seeing, but at the very least it's a buffer overflow, and therefore undefined behavior.
 

Absalom

Gawd
Joined
Oct 3, 2007
Messages
711
Looked at your main a bit closer because I was trying to figure out why you were reserving a byte past the end of bufferSize. In your readfilec() function, curiously, you allocate a byte array of size [filesize + 1], but then later appear to append a null terminator at the end of the byte stream:
Code:
content = new unsigned char[*filesize + 1];

...
content[*filesize + 1] = '\0';
So I'm now on board with reserving the extra byte. The issue is that the null terminator assignment uses index [filesize + 1], which is one past the end of the array (valid indices run 0 through filesize). That is definitely undefined behavior; the terminator belongs at content[*filesize].

The weird artifact is that, despite you missing the mark, the freshly allocated memory often happens to be zeroed (note that new unsigned char[n] does not actually zero-initialize, so this is luck, not a guarantee), which is why the array appears null terminated anyway.

Edit: You also never free fileBytes, but that's pedantic.
 
Last edited:

Absalom

Gawd
Joined
Oct 3, 2007
Messages
711
Also, your kernel appears to have race conditions. Curious, are you new to parallel programming?
 

PontiacGTX

Gawd
Joined
Aug 9, 2013
Messages
732
Well, yes, but I found the issue: it was the buffer; casting it to (void*) caused the problem. Now I have a different issue. I need to use the char pointer from a vector<unsigned char> in a kernel, and also take a pointer to another vector, but of strings (maybe char*?). The thing is, I need to compare whether the vector elements at [vectorCharIndex + 0 .. nSize] equal an element from the string vector (though the first elements in the dictionary belong to the bytes from the first vector).

Given the complexity of C kernels and the lack of libraries:
do you think it's worth moving to OpenMP (using MinGW 9.1 on Windows 10) or ComputeCpp? If so, which one?
 

Absalom

Gawd
Joined
Oct 3, 2007
Messages
711
Casting to void* takes the type safety off, but sometimes you've got to do it. Since C isn't polymorphic, it's the next best thing. You just have to be extra careful when casting back and forth, especially with arrays, where the extra indirection might not be obvious (void** seems strange at first, but it's a pattern that shows up a lot in C APIs).

Regarding STL objects: you are playing with fire passing an STL object to/from a kernel. They are not guaranteed to have a contiguous layout in memory, which is required for device memory transfer. You pretty much need to make your own serializable version to ensure the object in memory can be treated as contiguous data. If arrays seem frustrating, try using POD (plain ol' data) structs containing simple types. They can be passed to/from a kernel, and they make the code much easier to read.

I'm not familiar with openmp or computecpp, but I'm very familiar with CUDA C. OpenCL is very similar to CUDA C. Strikingly so. However, CUDA C has the same limitations with passing data to/from kernels as it must also be contiguous data. I suspect the others you mentioned will have the same limitations. You're just going to have to work within those limitations.

In my experience, writing a kernel is all about divide and conquer. Kernels are supposed to be simple. Too much complexity and your performance will probably tank anyway, so doing the work exclusively on the host ends up being more appealing when things get out of hand. Most examples you find online will guide you to carve your work up based on the number of threads available per compute unit/block. Then you can pass an N value to your kernel that gives hints as to which thread gets to work on the tail end of the data (because it's rare for data to align perfectly with the number of threads per block). Once you get the hang of doing something that simple, you can move on to more complex techniques such as thread barriers and local memory.

I have to admit, the implicit size+1 in your code threw me off. I find that to be code smell. Might be best to rethink how you are iterating over and describing your array sizes. First and foremost, be consistent with whatever convention you arrive at. Like I mentioned earlier, use POD structs comprised of simple types (char, int, floats, arrays, etc.). They let you organize your data into a human readable manner and are easier to maintain as contiguous regions in memory, which is ultimately the requirement.
 

Absalom

Gawd
Joined
Oct 3, 2007
Messages
711
Here's what your CopyBytes kernel would look like with a simple divide and conquer approach:
Code:
__kernel void CopyBytes( __global unsigned char* fileBytes,
                        const unsigned long arraySize,
                        __global unsigned char* compressedBytes,
                        __global unsigned long* arrayCountCompressedBytes,
                        const unsigned int N,
                        const unsigned int bytesPerThread )
{
    // Get global thread id
    const int id = get_global_id(0);
   
    if ( id < N ) {
        const unsigned long start = id * bytesPerThread;
        unsigned long end;
        if ( id < ( N - 1 ) ) {
            end = start + bytesPerThread;
        } else {
            // last thread must compensate for unaligned data
            end = arraySize;
        }
       
        // Do actual work
        for ( unsigned long i = start; i < end; i++ ) {
            compressedBytes[i] = fileBytes[i];
            arrayCountCompressedBytes[i] = i;
        }
    }
}
(did not verify this actually works, but I imagine it's something close to working)

The trick here is that the host calculates the bytesPerThread necessary based on N threads. You can do something like bytesPerThread = ceil( totalBytes / (double)N ).

When the kernel is executed, only thread ids < N do any work and the last thread (N-1) has to deal with the odd remaining bytes. Since all threads are executing in parallel, and their calculations do not depend on each other, this is a rather straightforward kernel.
 

Absalom

Gawd
Joined
Oct 3, 2007
Messages
711
If writing kernels isn't your jam, you might want to look into C++17's parallel algorithms. They are a more canned approach, so fewer knobs and buttons for performance tweaking, but there is much less awkward code to deal with. A lot of mainstream languages are picking up parallel concepts (e.g. Fortran 2018, .NET 4.0, Java 8). Depending on your performance needs, they might be good enough to do the job.
 

PontiacGTX

Gawd
Joined
Aug 9, 2013
Messages
732
Casting to void* takes the type safety off, but sometimes you've got to do it. Since C isn't polymorphic, it's the next best thing. You just have to be extra careful when casting back and forth, especially with arrays, where the extra indirection might not be obvious (void** seems strange at first, but it's a pattern that shows up a lot in C APIs).
Well, some examples cast the object to void* for some reason, but it seems casting the buffer's source object was a bad choice.

Regarding STL objects: you are playing with fire passing an STL object to/from a kernel. They are not guaranteed to have a contiguous layout in memory, which is required for device memory transfer. You pretty much need to make your own serializable version to ensure the object in memory can be treated as contiguous data. If arrays seem frustrating, try using POD (plain ol' data) structs containing simple types. They can be passed to/from a kernel, and they make the code much easier to read.
Vectors are just arrays with some functions; they're the same concept as contiguous data. The difference is that the vector manages its own size and doubles its buffer when it's full (unless shrink_to_fit()/resize() has been called).

I'm not familiar with openmp or computecpp, but I'm very familiar with CUDA C. OpenCL is very similar to CUDA C. Strikingly so. However, CUDA C has the same limitations with passing data to/from kernels as it must also be contiguous data. I suspect the others you mentioned will have the same limitations. You're just going to have to work within those limitations.
OpenMP, it seems, can now only do single-threaded applications (GCC 9).

And ComputeCpp is an implementation of SYCL for Windows, and it seems to use OpenCL 1.2 for SYCL (https://www.khronos.org/assets/uploads/developers/library/2017-supercomputing/SYCL-Compute-Cpp-Overview_Nov17.pdf). It also uses C++ for the kernel source instead of plain C like OpenCL.

In my experience, writing a kernel is all about divide and conquer. Kernels are supposed to be simple. Too much complexity and your performance will probably tank anyway, so doing the work exclusively on the host ends up being more appealing when things get out of hand. Most examples you find online will guide you to carve your work up based on the number of threads available per compute unit/block. Then you can pass an N value to your kernel that gives hints as to which thread gets to work on the tail end of the data (because it's rare for data to align perfectly with the number of threads per block). Once you get the hang of doing something that simple, you can move on to more complex techniques such as thread barriers and local memory.
Well, I have seen some code examples that use the file's byte count as NDRange's global size argument, with a local size of 64.
I have to admit, the implicit size+1 in your code threw me off. I find that to be code smell. Might be best to rethink how you are iterating over and describing your array sizes. First and foremost, be consistent with whatever convention you arrive at. Like I mentioned earlier, use POD structs comprised of simple types (char, int, floats, arrays, etc.). They let you organize your data into a human readable manner and are easier to maintain as contiguous regions in memory, which is ultimately the requirement.
When you suggest an object composed of primitive types, do you suggest it for use with the API? Because I had a header with the basic objects I needed:
https://gist.github.com/PontiacGTX/c0ddb06320718b4343858a9c95512e39

I added an extra character as NULL just to be sure there was one beyond the limit of the file; that way the data would be null terminated, and once I started comparing elements, hitting a null would mean I had reached the limit of the array. I also did that because I was reading the file with C functions and wasn't sure whether they would add some character marking the end of the array.
 
Last edited:

PontiacGTX

Gawd
Joined
Aug 9, 2013
Messages
732
Here's what your CopyBytes kernel would look like with a simple divide and conquer approach:
Code:
__kernel void CopyBytes( __global unsigned char* fileBytes,
                        const unsigned long arraySize,
                        __global unsigned char* compressedBytes,
                        __global unsigned long* arrayCountCompressedBytes,
                        const unsigned int N,
                        const unsigned int bytesPerThread )
{
    // Get global thread id
    const int id = get_global_id(0);
 
    if ( id < N ) {
        const unsigned long start = id * bytesPerThread;
        unsigned long end;
        if ( id < ( N - 1 ) ) {
            end = start + bytesPerThread;
        } else {
            // last thread must compensate for unaligned data
            end = arraySize;
        }
     
        // Do actual work
        for ( unsigned long i = start; i < end; i++ ) {
            compressedBytes[i] = fileBytes[i];
            arrayCountCompressedBytes[i] = i;
        }
    }
}
(did not verify this actually works, but I imagine it's something close to working)

The trick here is that the host calculates the bytesPerThread necessary based on N threads. You can do something like bytesPerThread = ceil( totalBytes / (double)N ).

When the kernel is executed, only thread ids < N do any work and the last thread (N-1) has to deal with the odd remaining bytes. Since all threads are executing in parallel, and their calculations do not depend on each other, this is a rather straightforward kernel.
In case I wanted to use a string array (different strings in an array) on the host, what should I use in the kernel: const char** or const char*? I feel that in plain C there isn't a way to compare two strings without using the standard library (or making a really complicated kernel), or is there?
Writing a character to compare with the string array is easy, but determining where a string ends within a string array in C (using just the first pointer) is not.
 

Absalom

Gawd
Joined
Oct 3, 2007
Messages
711
In case I wanted to use a string array (different strings in an array) on the host, what should I use in the kernel: const char** or const char*? I feel that in plain C there isn't a way to compare two strings without using the standard library (or making a really complicated kernel), or is there?
Writing a character to compare with the string array is easy, but determining where a string ends within a string array in C (using just the first pointer) is not.
const char** seems the simplest and most natural to me, maybe because I like the easy route and am comfortable with C. The disadvantage is this is nothing more than an array of fixed sized char buffers. This could lead to inefficient device memory usage if the buffers are largely empty. And memory usage could explode out of control if dealing with lots of strings.

std::string is seductive because it dynamically allocates memory on the heap to fit the current underlying char array. This is fine on the host. The reason you can't directly use std::string on the device is that std::string's underlying char* points to memory on the host, and that memory simply does not exist when you copy the std::string object to the device. It's a classic shallow copy. So when std::string's methods try to dereference the underlying char* in your kernel, it will likely cause an out-of-bounds error because the pointer is no good. Objects that internally use pointers are just too dangerous to use. That pretty much rules out most of the STL. So no std::vector either.

If you are really determined to make std::string work, you'll need to go down the serializer path. In the case of std::string, a serializer would take the underlying char array and its size() and stuff them into a byte array. Then you transfer those bytes to the device and deserialize them back into a std::string, usually using std::string's constructor that accepts a char* and a size. But then you have the inefficiency of the std::string constructor copying the bytes instead of letting you wrap or take ownership of the char array. You could create your own custom std::allocator, but then you're already getting into a lot of work.

Then there's the issue of how you handle an array of these deserialized string objects that are variable in length. So now you need a std::vector serializer.

You're better off writing your own class that basically implements everything you need from a dynamic array full of objects to the string itself, with the intention of making the serialization/deserialization easy.

For the serializable string, the serialized form could begin the byte array with the size of the string while the remainder is the actual string contents. Implicit or explicit null terminator is left up to the implementor. Deserialization here could simply wrap the underlying buffer using a specialized constructor. Your custom string's index operators can then traverse the buffer doing all that pointer math you'd have to do if you just dealt with raw buffers. That's fast and efficient.

For the serializable dynamic array, since the assumption is that it holds serialized objects, it's effectively one big byte array containing the serialized objects stuffed in it at various indices. You'll have to work out the mapping of offsets within the array to where each serialized object starts. And then finally, you'll need to come up with a deserializer for the dynamic array class itself.

Just to overstate: each deserialized object should be, at minimum, a pointer to the first byte in some pre-allocated byte buffer plus a size describing its length.

TL;DR: if you're too lazy to do any of that, just use boost::serialization. :sneaky:
 
Last edited:

Absalom

Gawd
Joined
Oct 3, 2007
Messages
711
Going back to your char** question, it occurs to me that it's not good advice to accept my earlier answer at face value. What it boils down to is whether the object in question resides entirely in one contiguous piece of memory. You can't tell from char** alone whether that's the case, since it's just a pointer to an array of pointers. All of those pointers need to point into one contiguous piece of memory; if any one of them doesn't, all bets are off.

So these 2D array allocations are ok:
Code:
char Arr[256][2048];  // stack alloc, but technically legit

char (*pArr)[2048] = new char[256][2048];  // heap alloc
Because each is guaranteed to be allocated as one contiguous piece.

However, this is not:
Code:
char** ppArr = new char*[256];

for ( int i = 0; i < 256; ++i )
    ppArr[i] = new char[2048];
There's a subtle difference, but basically you have to avoid separate per-row allocations inside your object. Everything needs to be allocated at once.

You can't even pass a C++ class/struct willy nilly to the device because the vtable pointer won't carry over. Your class/struct needs to be a POD type. And even then, you can't really have pointer members or members that are classes that internally use pointers. That's why you can't rely on the STL, because most of the library is implementation defined. Odds are your favorite go-to class is neither POD nor contiguous.

So if you are careful and create your own POD struct/class using the basic types, int, char, float, etc. then you can pull it off without implementing some fancy serializer. You just directly cast the address of the struct to a void* or char* and then on the device cast it back. Because it's a POD type and contiguous, you can also rely on sizeof() to report the effective byte array length.

What you can't do is use pointers inside your struct. You can use arrays inside the struct, but they can't be dynamic. And the arrays must not be allocated using new or malloc. You pretty much are stuck using the char[256][2048] declaration form above.
 
Last edited:

PontiacGTX

Gawd
Joined
Aug 9, 2013
Messages
732
std::vector is just another way to use an array, basically. I could allocate a std::vector<cl_char*> compressedBytes(size); and pass compressedBytes.data() as the kernel's buffer; this would be a cl_char**.
 
Last edited:

Absalom

Gawd
Joined
Oct 3, 2007
Messages
711
std::vector is just another way to use an array, basically. I could allocate a std::vector<cl_char*> compressedBytes(size); and pass compressedBytes.data() as the kernel's buffer; this would be a cl_char**.
Sure, that would be a form of serialization of the data (you'll still need to copy the size() value). However, you still have the inefficiency of reassembling it back into a std::vector on the device side. std::vector doesn't offer a constructor that wraps an array, and it doesn't allow mutation of the internal pointer data() uses, so it's a copy operation. Given hundreds if not thousands of threads, that's something you don't want to do N times. This might be where __local memory comes in handy. You might be better off just using the raw array directly.

So, at best, std::vector would buy you some OO practices on the host.

If it were me, and I really wanted to use OO practices on both ends, I'd create my own std::vector (or std::array) with a constructor that wraps a raw byte array.
 
Last edited: