Removing Special Characters from C++ String (Except ' and - )

Question

I'm trying to remove special characters from a string using an isWordChar() method. However, I need to keep two special characters, " ' " and " - ", such as the apostrophe in "isn't" and the hyphens in mother-in-law. Here's what I'm trying to implement:

std::string WordCount::stripWord(std::string word) { 

   for(unsigned int i = 0; i < wrd.size(); ++i)
   {
      if( !isWordChar(wrd[i]) && (wrd[i]!=39 && wrd[i]!=45))
      {
         wrd.erase(wrd.begin()+i);
         --i;
      }
   }

   return wrd;
}

After adding the special cases in my boolean, I can't get seem to correctly add the exception. Any hints or advice? Thanks!

The problem isn't clear. Explain what you mean by "I can't get seem to correctly add the exception." The presented code looks like it behaves just as you described, so what is it that's wrong? — bames53, Nov 09 '15 at 00:37
Also, don't use numbers in place of character literals, and standard algorithms from `` should be preferred over manual loops. — bames53, Nov 09 '15 at 00:38

score 3 · Answer 1 · answered Nov 09 '15 at 00:46

I would use the remove/erase idiom:

word.erase(std::remove_if(word.begin(),
    word.end(),
    [](char c) {
        return !(isWordChar(c) || '-' == c || '\'' == c);
    }), word.end());

The way you're erasing characters has complexity of approximately O(N * M) (where N is the original length of the string and M is the number of characters you remove). This has a complexity of approximately O(N), so if you're removing very many characters (or the string is very long) it's likely to give a substantial speed improvement.

If you care about why it's so much faster, it's because it works somewhat differently. To be specific, when you erase an element from the middle of a string, the erase function immediately copies all the letters after that to fill the hole where you erased the character. If you do this M times, all those characters get copied one for each character you remove.

When you use remove_if, it does something more like this:

template <class Iter, class F>
Iter remove_if(Iter b, iter e, F f)
    auto dest = word.begin();

    for (auto src=word.begin(); src != word.end(); ++src)
        if (!f(*src))
            *dst++ = *src;
        ++src;
    }
    return dst;
}

This way, each character that's retained is only copied once, rather than being copied every time you remove one character from the string. Then when you do the final erase, it just removes characters from the end of the string, so it's basically just adjusting the length of the string downward.

Anthony Calandra · Answer 2 · 2015-11-09T00:48:12.683

1

Your logic is incorrect. It should be: !isWordChar(wrd[i]) && wrd[i] != 39 && wrd[i] != 45. Read as: If the character isn't a word character, and it's not an apostrophe, and it's not a hyphen, do whatever is in the if-statement.

edited Nov 09 '15 at 00:48

answered Nov 08 '15 at 23:53

Anthony Calandra

1,429
2
11
17

I tried using the && and the || operators for the second half of the boolean to look like this : if(!isWordChar(wrd[i])&&(wrd[i]!=39&&wrd[i]!=45)) all with no luck... @ AnthonyCalandra – ewok896 Nov 08 '15 at 23:58
@ewok896 I don't understand. Did you try what I gave above? – Anthony Calandra Nov 08 '15 at 23:59
Sorry for the poor response. Yes, I tried it with the && operator, and it didn't work out – ewok896 Nov 09 '15 at 00:01
Thank you for your help, but get ready. I was foolishly working in a duplicate copy of my cpp on my editor and running my tests using the old cpp, not the new edited copy cpp. So, it works. Thanks again, and sorry for kind of wasting your time. – ewok896 Nov 09 '15 at 00:37
The recommended condition `!isWordChar(wrd[i]) && wrd[i] != 39 && wrd[i] != 45` is functionally equivalent to the condition `!isWordChar(wrd[i]) && (wrd[i]!=39 && wrd[i]!=45)` in the question. The comments about `erase` are ignoring the fact that there are [several overloads of `erase`](http://en.cppreference.com/w/cpp/string/basic_string/erase), one of which is `erase(iterator itr)` which does exactly the same thing as `erase(i, 1)`. – bames53 Nov 09 '15 at 00:43
@AnthonyCalandra Ah, I see his change. On `erase`, he is passing an iterator, so that part of his code is correct (even in the original version of the question). – bames53 Nov 09 '15 at 00:47

Removing Special Characters from C++ String (Except ' and - )

2 Answers2