6

I want to split a char *string based on a multi-character delimiter. I know that strtok() is used to split a string, but it treats each character of its delimiter argument as a separate single-character delimiter.

I want to split the char *string on a substring such as "abc" or any other substring. How can that be achieved?

Sourav Ghosh
Sadia Bashir
  • possible duplicate of [How to extract the string if we have have more than one delimiters?](http://stackoverflow.com/questions/22827998/how-to-extract-the-string-if-we-have-have-more-than-one-delimiters) – Jongware Apr 22 '15 at 06:35
  • I have one more query, how can I compare this str value in an if statement? for example if I have char *str = "abc" and I got a substring value from a long string and want to compare this substring value with *str: if(str == substr) – Sadia Bashir Apr 22 '15 at 07:40
  • got it, strcmp is used for this purpose! Thanks again everyone! – Sadia Bashir Apr 22 '15 at 07:46
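To illustrate the comparison discussed in the comments above: `==` on two `char *` values compares addresses, while `strcmp` compares the characters and returns 0 on a match. This is only a minimal sketch; `same_text` is a hypothetical helper name and does not come from the thread.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true when both strings hold the same characters. */
bool same_text(const char *a, const char *b) {
    return strcmp(a, b) == 0;
}

/*
 * Usage:
 *     char buf[] = "abc";          // its own storage, separate from the literal
 *     const char *str = "abc";
 *     buf == str                   // compares addresses, so this is false
 *     same_text(buf, str)          // compares contents, so this is true
 */
```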

5 Answers

8

Finding the point at which the desired sequence occurs is pretty easy: strstr supports that:

char str[] = "this is abc a big abc input string abc to split up";
char *pos = strstr(str, "abc");

So, at that point, pos points to the first location of abc in the larger string. Here's where things get a little ugly. strtok has a nasty design where it 1) modifies the original string, and 2) stores a pointer to the "current" location in the string internally.

If we didn't mind doing roughly the same, we could do something like this:

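#include <stdio.h>
#include <string.h>

/* Like strtok(), but the whole delimiter string must match.
   Modifies input and keeps its position in a static variable. */
char *multi_tok(char *input, const char *delimiter) {
    static char *string;
    if (input != NULL)
        string = input;

    if (string == NULL)
        return string;

    char *end = strstr(string, delimiter);
    if (end == NULL) {
        /* No more delimiters: return the rest and mark the input as exhausted. */
        char *temp = string;
        string = NULL;
        return temp;
    }

    char *temp = string;

    /* Terminate the token and step past the delimiter. */
    *end = '\0';
    string = end + strlen(delimiter);
    return temp;
}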

This does work. For example:

int main() {
    char input [] = "this is abc a big abc input string abc to split up";

    char *token = multi_tok(input, "abc");

    while (token != NULL) {
        printf("%s\n", token);
        token = multi_tok(NULL, "abc");
    }
}

produces roughly the expected output:

this is
 a big
 input string
 to split up

Nonetheless, it's clumsy, difficult to make thread-safe (you have to make its internal string variable thread-local) and generally just a crappy design. Using (for one example) an interface something like strtok_r, we can fix at least the thread-safety issue:

typedef char *multi_tok_t;

char *multi_tok(char *input, multi_tok_t *string, char *delimiter) {
    if (input != NULL)
        *string = input;

    if (*string == NULL)
        return *string;

    char *end = strstr(*string, delimiter);
    if (end == NULL) {
        char *temp = *string;
        *string = NULL;
        return temp;
    }

    char *temp = *string;

    *end = '\0';
    *string = end + strlen(delimiter);
    return temp;
}

multi_tok_t init() { return NULL; }

int main() {
    multi_tok_t s=init();

    char input [] = "this is abc a big abc input string abc to split up";

    char *token = multi_tok(input, &s, "abc");

    while (token != NULL) {
        printf("%s\n", token);
        token = multi_tok(NULL, &s, "abc");
    }
}

I guess I'll leave it at that for now though--to get a really clean interface, we really want to reinvent something like coroutines, and that's probably a bit much to post here.
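Short of coroutines, one way to avoid both the hidden state and the writes into the input is to hand back (pointer, length) slices and let the caller keep the cursor. The following is a minimal sketch under that assumption. `next_slice` and its parameters are hypothetical names, not part of the answer above, and the sketch assumes the delimiter is not empty.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical non-destructive tokenizer: the input is never modified.
 * *cursor is the caller-owned position (set it to the start of the input
 * before the first call); it becomes NULL after the last token.
 * Returns the start of the next token and stores its length in *len,
 * or returns NULL when no tokens remain. Assumes delim is not empty.
 */
const char *next_slice(const char **cursor, const char *delim, size_t *len) {
    const char *start = *cursor;
    if (start == NULL)
        return NULL;

    const char *end = strstr(start, delim);
    if (end == NULL) {
        *len = strlen(start);          /* last token runs to the terminator */
        *cursor = NULL;
    } else {
        *len = (size_t)(end - start);
        *cursor = end + strlen(delim);
    }
    return start;
}

/*
 * Usage:
 *     const char *cur = "a##b##c";
 *     size_t n;
 *     for (const char *t; (t = next_slice(&cur, "##", &n)) != NULL; )
 *         printf("%.*s\n", (int)n, t);
 */
```

Because the tokens are not null-terminated, the caller prints them with `%.*s` or copies them out; the trade-off is that the function also works on string literals and other read-only input.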

Jerry Coffin
  • How would one adapt this to do the same with an LPSTR string pointer? I know I can replace all the native string functions with their far pointer equivalents (_fstrlen etc.), but the input and output need to be LPSTR strings. – Tobias Timpe Feb 16 '21 at 22:12
  • @TobiasTimpe: `LPSTR` is just an alias for `char *`. – Jerry Coffin Feb 16 '21 at 22:18
  • Yes but I'm working on Win16 with LPSTR actually being a far pointer and I can't really convert the input to a normal char array because that would be too large. – Tobias Timpe Feb 16 '21 at 22:28
  • @TobiasTimpe: The equivalence mostly works in both directions, so if you change all the instances of `char *` in this code to `LPSTR`, you're probably pretty close to having it work (but I haven't had a Win16 SDK installed for years, so I can't test that). – Jerry Coffin Feb 16 '21 at 23:40
2

EDIT: I considered the suggestions from Alan and Sourav and wrote some basic code for this.

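#include <stdio.h>
#include <string.h>

int main (void)
{
  char str[] = "This is abc test abc string";

  char *in = str;
  const char *delim = "abc";
  char *token;

  do {
    /* Find the next occurrence of the whole delimiter. */
    token = strstr(in, delim);

    if (token)
      *token = '\0';

    printf("%s\n", in);

    /* Advance only when a delimiter was found; arithmetic on NULL is undefined. */
    if (token)
      in = token + strlen(delim);

  } while (token != NULL);

  return 0;
}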
Santhosh Pai
  • You're very right, but i think that's not OP wants. he wants to cosider `"abc"` as a _single_ delimiter.:-) – Sourav Ghosh Apr 22 '15 at 06:09
  • That's fine, but the first part of your answer may be misleading. Please consider removing that. :-) – Sourav Ghosh Apr 22 '15 at 06:24
  • Don't think the second solution will work either. `strsep` doesn't look just for "abc". It looks for any permutation of the characters in "abc". Try this string in your program as an example: "This is bac test ac string". Probably need to use `strstr` instead. – kaylum Apr 22 '15 at 06:24
  • @AlanAu [Just to add my two cents] ...and `strsep()` is not standard C, IMHO. – Sourav Ghosh Apr 22 '15 at 06:36
  • @AlanAu : Have implemented the logic using strstr, thanks for the inputs. – Santhosh Pai Apr 22 '15 at 07:09
1

You can easily write your own parser using strstr() to achieve the same. The basic algorithm may look like this:

  • use strstr() to find the first occurrence of the whole delimiter string
  • mark the index
  • copy from the start up to the marked index; that will be your expected token.
  • to parse the input for subsequent entries, advance the start of the string by token length + length of the delimiter string.
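The steps above could be sketched roughly as follows. `copy_next_token` and its parameter names are hypothetical, chosen only for this illustration. The sketch assumes a non-empty delimiter and an output buffer of at least one byte, and it truncates tokens that do not fit.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper following the listed steps. Copies the next token
 * from *rest into out (capacity out_size, truncating if needed) and
 * advances *rest past the delimiter. Returns 1 if a token was copied,
 * 0 when the input is exhausted. Assumes delim is not empty and
 * out_size > 0. The input string itself is never modified.
 */
int copy_next_token(const char **rest, const char *delim,
                    char *out, size_t out_size) {
    if (*rest == NULL)
        return 0;

    const char *hit = strstr(*rest, delim);        /* step 1: find delimiter */
    size_t len = hit ? (size_t)(hit - *rest)       /* step 2: mark the index */
                     : strlen(*rest);

    size_t n = len < out_size - 1 ? len : out_size - 1;
    memcpy(out, *rest, n);                         /* step 3: copy the token */
    out[n] = '\0';

    /* step 4: advance by token length + delimiter length, or stop */
    *rest = hit ? *rest + len + strlen(delim) : NULL;
    return 1;
}
```

A caller would point a `const char *` at the input, then call the helper in a loop until it returns 0.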
Sourav Ghosh
1

I wrote a simple implementation that is thread-safe:

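#include <stdlib.h>
#include <string.h>

struct split_string {
    int len;
    char** str;
};
typedef struct split_string splitstr;

/* Splits string on every occurrence of delimiter (assumed non-empty).
   Returns a heap-allocated result, or NULL if an allocation fails. */
splitstr* split(const char* string, const char* delimiter) {
    splitstr* ret = malloc(sizeof(splitstr));
    if (ret == NULL)
        return NULL;
    ret->str = NULL;
    ret->len = 0;
    const char* pos;
    const char* oldpos = string;
    size_t dlen = strlen(delimiter);
    do {
        pos = strstr(oldpos, delimiter);
        size_t newsize = pos ? (size_t)(pos - oldpos) : strlen(oldpos);
        char* newstr = malloc(newsize + 1);
        char** grown = realloc(ret->str, (size_t)(ret->len + 1) * sizeof(char*));
        if (grown != NULL)
            ret->str = grown;
        if (newstr == NULL || grown == NULL) {
            /* Clean up everything allocated so far. */
            free(newstr);
            for (int i = 0; i < ret->len; i++)
                free(ret->str[i]);
            free(ret->str);
            free(ret);
            return NULL;
        }
        memcpy(newstr, oldpos, newsize);
        newstr[newsize] = '\0';
        ret->str[ret->len++] = newstr;
        /* Advance only when a delimiter was found; pos + dlen on NULL is undefined. */
        if (pos != NULL)
            oldpos = pos + dlen;
    } while (pos != NULL);
    return ret;
}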

To use:

splitstr* ret = split(contents, "\n");
for (int i = 0; i < ret->len; i++) {
    printf("Element %d: %s\n", i, ret->str[i]);
}
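The result of split is heap-allocated, so the caller is responsible for releasing it once the tokens are no longer needed. A hypothetical `free_split` helper (not part of the answer above) might look like this. The struct definition is repeated only so that the block compiles on its own.

```c
#include <stdlib.h>

/* Same layout as the struct in the answer, repeated so this compiles alone. */
struct split_string {
    int len;
    char **str;
};
typedef struct split_string splitstr;

/* Hypothetical cleanup helper: frees each token, the pointer array,
   and finally the struct itself. Passing NULL is a no-op. */
void free_split(splitstr *s) {
    if (s == NULL)
        return;
    for (int i = 0; i < s->len; i++)
        free(s->str[i]);
    free(s->str);
    free(s);
}
```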
nift4
0

A modified strsep implementation that supports a multi-byte delimiter:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

/**
 * Split a string into tokens
 * 
 * @in: The string to be searched
 * @delim: The string to search for as a delimiter
 */
char *strsep_m(char **in, const char *delim) {
  char *token = *in;

  if (token == NULL)
    return NULL;
    
  char *end = strstr(token, delim);
  
  if (end) {
    *end = '\0';
    end += strlen(delim);
  }
  
  *in = end;
  return token;
}

int main() {
  char input[] = "a##b##c";
  char delim[] = "##";
  char *token = NULL;
  char *cin = (char*)input;
  while ((token = strsep_m(&cin, delim)) != NULL) {
    printf("%s\n", token);
  }
}
Charley Wu
  • Unlike `strtok()`, the code will produce 3 tokens for `"##foo##"`, which may or may not be expected. – chqrlie Mar 06 '22 at 13:35
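If the strtok()-style behaviour mentioned in the comment above is what you want, empty tokens can be skipped inside the loop. The following is only a sketch of that idea. `strsep_m_nonempty` is a hypothetical name, and the function assumes the delimiter is not empty.

```c
#include <string.h>

/*
 * Hypothetical variant that skips empty tokens, the way strtok() does.
 * *in is the caller-owned cursor; it is set to NULL when the input runs out.
 * The input is modified in place. Assumes delim is not empty.
 */
char *strsep_m_nonempty(char **in, const char *delim) {
    size_t dlen = strlen(delim);

    while (*in != NULL) {
        char *token = *in;
        char *end = strstr(token, delim);

        if (end != NULL) {
            *end = '\0';          /* terminate the token */
            *in = end + dlen;     /* step past the delimiter */
        } else {
            *in = NULL;           /* this was the last piece */
        }

        if (*token != '\0')       /* skip tokens of length zero */
            return token;
    }
    return NULL;
}
```

With this variant, leading, trailing, and repeated delimiters yield no tokens of their own.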