Canonical way to dice/carve/parcel a single malloc into multiple arrays with varying types/alignments?

Question

Background: I have a C99 routine that needs temporary storage of varying datatypes with varying alignment requirements. Currently I call posix_memalign multiple times which a) introduces lots of overhead and b) does not guarantee that my temporaries have good memory locality. I cannot pack the temporaries into a single struct as the size requirements are only known at runtime.

Question: I want to call malloc (or something similar) once with a sufficiently large size that I can dice up/carve up/parcel out individual pointers with the required alignments. Is there a canonical way to accomplish this task within C99?

A possible (but unlikely) answer for a concrete example: Say I want to allocate enough space for a char[3], a double[m] with 16-byte alignment, an int, and a float[n] with 16-byte alignment and that I require them in memory in that order. Please ignore that the order is stupid and the example contrived. My actual use case is a a control struct followed by several temporary arrays of mixed integer/numeric types with alignments allowing SSE operations.

Using the ideas from How to allocate aligned memory only using the standard library? one might do:

// Several unnecessary values (e.g. alignment_c) are holdovers
// from the macros generating this logic.

// ell, m, and n are sizes known only at runtime

const size_t    datasize_c  = 3*sizeof(char);
const size_t    alignment_c = __alignof__(char);
const size_t    pad_c       = alignment_c - 1;
const uintptr_t mask_c      = ~(uintptr_t)(alignment_c - 1);

const size_t    datasize_d  = ell*sizeof(double);
const size_t    alignment_d = __alignof__(double) > 16 ? __alignof__(double) : 16;
const size_t    pad_d       = alignment_d - 1;
const uintptr_t mask_d      = ~(uintptr_t)(alignment_d - 1);

const size_t    datasize_i  = m*sizeof(int);
const size_t    alignment_i = __alignof__(int);
const size_t    pad_i       = alignment_i - 1;
const uintptr_t mask_i      = ~(uintptr_t)(alignment_i - 1);

const size_t    datasize_f  = n*sizeof(float);
const size_t    alignment_f = __alignof__(float) > 16 ? __alignof__(float) : 16;
const size_t    pad_f       = alignment_f - 1;
const uintptr_t mask_f      = ~(uintptr_t)(alignment_f - 1);


const size_t p_parcel = (datasize_c + pad_c)
                      + (datasize_d + pad_d)
                      + (datasize_i + pad_i)
                      + (datasize_f + pad_f) ;

void   * const p = malloc(p_parcel) ;
char   *       c = (void *) (((uintptr_t)(p      ) + pad_c & mask_c));
double *       d = (void *) (((uintptr_t)(c + ell) + pad_d & mask_d));
int    *       i = (void *) (((uintptr_t)(d + m  ) + pad_i & mask_i));
float  *       f = (void *) (((uintptr_t)(i + n  ) + pad_f & mask_f));

// check if p is NULL, use (c, d, i, f), then free p

I believe this possibility is functionally correct but I'm wondering if anyone has a better, cleaner, shorter way?

I think approaches using the struct hack aren't feasible because I can only guarantee alignment of one array using a malloc of a single struct hack. I would still need three malloc calls for three separate struct hacks.

Lastly, I'd be happy to provide the macros that generate that mess if anyone wants them.

This looks like complete overkill to me, did you really check that allocation is a bottleneck for you? This can only be relevant, if your arrays are relatively small. So why don't you use VLA? This would allocate everything on the stack, as local as you may want. — Jens Gustedt, Jan 12 '12 at 19:10
I'd not considered the VLA route because the arrays may be either very small or very large, because I may return p to a caller when suitably wrapped in a control struct, and because I may wish to use different malloc-compatible implementations. — Rhys Ulerich, Jan 12 '12 at 19:40
You didn't reply to my first question, why is all of this necessary? Did you measure it? — Jens Gustedt, Jan 12 '12 at 19:51
Strictly speaking necessary? No. Measured yet? No. Anticipated impact to my cache-bound simulation code (which runs for weeks on hundreds of processors)? Roughly half a percentage point. But the answer does change the way I'll write new logic. And I believe the question to be inherently interesting. — Rhys Ulerich, Jan 12 '12 at 20:13

Daniel Fischer · Accepted Answer · 2012-01-12T20:25:26.333

struct malloc_wrapper {
    char *storage;
    size_t capacity;
    size_t first_free;
};

initialise it:

void init_malloc_wrapper(struct malloc_wrapper *mw, size_t cap) { // or int
    if (!mw) {
        perror("No struct malloc_wrapper passed\n");
        exit(EXIT_FAILURE);
    }
    mw->storage = malloc(cap);
    if (mw->storage == NULL) {
        perror("allocation failed\n");
        exit(EXIT_FAILURE); // or return 0;
    }
    mw->capacity = cap;
    mw->first_free = 0;
    // return 1;
}

void clear_malloc_wrapper(struct malloc_wrapper *mw) {
    if (!mw) {
        return;
    }
    if (mw->storage) {
        free(mw->storage);
    }
    mw->storage = NULL;
    mw->capacity = 0;
    mw->first_free = 0;
}

deal out a chunk of memory:

// First version assumed that `malloc` returns memory aligned suitable for all requests.
// That's probably true, but just in case, we have to be a bit more cautious.
void *deal_out(struct malloc_wrapper *mw, size_t align, size_t size, size_t nmemb) {
    if (!align || (align & (align-1))) {
        perror("invalid alignment\n");
        exit(EXIT_FAILURE);
    }
    if (!mw) {
       perror("No malloc_wrapper\n");
        exit(EXIT_FAILURE);
    }
    uintptr_t start = ((uintptr_t(mw->storage) + mw->first_free + align-1) & ~(align-1);
    size_t offset = start - (uintptr_t)mw->storage;
    if ((mw->capacity - offset) < size*nmemb) {
        return NULL;   // not enough space
    }
    mw->first_free = offset + size*nmemb;
    return (void *)start;
}

Depending on paranoia a few more checks may be needed. With that you can deal out suitably aligned chunks as needed. But if you didn't allocate enough storage on initialisation, it's somewhat complicated to treat further requests. And freeing is all or nothing. So it may not be an improvement.

Edit: To deal with the problem of the initial capacity, one can add a list of used storage chunks, either only the pointers, or storage, offset and remaining capacity. If a new request exceeding the available capacity arrives, add current storage chunk to the list and allocate a new one. If remaining capacity of the used chunks is remembered, one can for smallish requests walk the list of old chunks to see whether it fits in one of those.

The first version of deal_out didn't account for the possibility of alignment demands larger than the alignment of malloc returns, fixed now.

I like the approach. I'll see what else is suggested but this looks quite nice. Only downside (as you point out) is that one needs to know cap in advance for a minimal (or near minimal) footprint. Presumably some multiple of a system blocksize that malloc can service quickly in init_malloc_wrapper() should suffice. Thank you. — Rhys Ulerich, Jan 12 '12 at 19:44
If you don't know the capacity in advance, you can keep a list of old chunks and allocate a new if needed. Also I fixed an assumption about the alignment. — Daniel Fischer, Jan 12 '12 at 20:27
If you provided the ideal capacity to your init_malloc_wrapper(), made everything static inline, allow reasonable optimizations, and squint the codepaths in my sample and your version look nearly identical. Yours is much, much cleaner. For my situation having a way to compute the capacity in advance to avoid secondary allocations and extra error checking is ideal (though I'd failed to spell that out in the original question). Thanks again Daniel. — Rhys Ulerich, Jan 13 '12 at 15:35

Canonical way to dice/carve/parcel a single malloc into multiple arrays with varying types/alignments?

1 Answers1