9

I'm writing a little class to read a list of key value pairs from a file and write to a Dictionary<string, string>. This file will have this format:

key1:value1
key2:value2
key3:value3
...

This should be pretty easy to do, but since a user is going to edit this file manually, how should I deal with whitespace, tabs, extra line breaks and stuff like that? I can probably use Replace to remove whitespace and tabs, but are there any other "invisible" characters I'm missing?

Or maybe I can remove all characters that are not alphanumeric, ":" or line breaks (since line breaks are what separate one pair from another), and then remove all extra line breaks. If I go that way, I don't know how to remove "all-except-some" characters.

Of course I can also check for errors like "key1:value1:somethingelse". But stuff like that doesn't really matter much because it's obviously the user's fault and I would just show an "Invalid format" message. I just want to deal with the basic stuff and then put all that in a try/catch block just in case anything else goes wrong.

Note: I do NOT need any whitespace at all, even inside a key or a value.

Juan
  • 1
    The most natural solution (there are obviously a lot of correct solutions for such a simple problem) depends on how you read the data from the file. Can you post a relevant code snippet? – Jon Mar 14 '11 at 19:18
  • 1
    What do you mean by "invisible"? Are http://www.fileformat.info/info/unicode/char/200c/index.htm (ZERO WIDTH [NON] JOINER, a General Punctuation) or perhaps http://www.fileformat.info/info/unicode/char/202a/index.htm (LEFT-TO-RIGHT EMBEDDING, another General Punctuation) invisible enough? :-) And the http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Mathematical_invisibles ... How could you live without them? :-) :-) There is a question about the visibility of Unicode chars here: http://stackoverflow.com/questions/304483/determining-if-a-unicode-character-is-visible – xanatos Mar 14 '11 at 19:22

7 Answers

18

I did this one recently when I finally got pissed off at all the undocumented garbage coming through in a feed and producing bad XML. It effectively strips anything that doesn't fall between a space and the ~ in the ASCII table:

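// Requires: using System.Text.RegularExpressions;
// Must be declared inside a static class to work as an extension method.
// Removes everything outside printable ASCII (space \x20 through ~ \x7E).
public static string StripControlChars(this string s)
{
    return Regex.Replace(s, @"[^\x20-\x7E]", "");
}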

Combined with the other RegEx examples already posted it should get you where you want to go.

Pete M
8

If you use Regex (Regular Expressions) you can filter out all of that with one function.

string newVariable = Regex.Replace(variable, @"\s", "");

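That will remove every whitespace character, including spaces, tabs, \n, and \r. It will not remove invisible characters that are not classified as whitespace, such as zero-width marks.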

Kyle Uithoven
  • This will remove spaces from the keys and values as well. You might just want to remove control characters like \t, \n, \r and double spaces. – Paul Alexander Mar 14 '11 at 19:20
  • I believe he specifically said that he would like to deal with whitespace, tabs, as well as invisible characters, which includes control characters. – Kyle Uithoven Mar 14 '11 at 19:23
  • Control characters, yes, but a space may be a valid character in the value portion of the key/value pair. The OP doesn't specify, that's why it's just a comment to point out alternatives. – Paul Alexander Mar 14 '11 at 19:27
  • I think that will work AFTER splitting each line. Can you give me a link to any documentation related to that particular regex? – Juan Mar 14 '11 at 19:39
  • 1
    Never mind already found it here: http://www.mikesdotnetting.com/Article/46/CSharp-Regular-Expressions-Cheat-Sheet. And yes that's what I needed. – Juan Mar 14 '11 at 19:40
  • It doesn't work with the "left-to-right mark" which is an invisible character http://www.fileformat.info/info/unicode/char/200e/index.htm – Benjamin Toueg Apr 04 '13 at 17:04
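
For the invisible characters mentioned in these comments, one possible approach is to filter by Unicode category instead of relying on \s alone. This is only a sketch: the class and method names (InvisibleCharStripper, StripInvisible) are made up for illustration, and it assumes that dropping every whitespace, control, and "format" (Cf) character is acceptable, which matches the "no whitespace at all" requirement in the question.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical helper, not part of any answer above.
public static class InvisibleCharStripper
{
    // Drops whitespace, control characters, and Unicode format characters
    // (category Cf), which includes U+200E LEFT-TO-RIGHT MARK and
    // U+200C ZERO WIDTH NON-JOINER. Everything else is kept as-is.
    public static string StripInvisible(string input)
    {
        return new string(input.Where(c =>
            !char.IsWhiteSpace(c) &&
            !char.IsControl(c) &&
            char.GetUnicodeCategory(c) != UnicodeCategory.Format).ToArray());
    }
}
```

Calling StripInvisible on a line before splitting it at ":" should leave only the visible key and value characters. It works on UTF-16 code units, so it makes no attempt to interpret surrogate pairs.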
4

One of the "white" spaces that regularly bites us is the non-breaking space. Also, our system must be compatible with MS-Dynamics, which is much more restrictive. First, I created a function that maps 8-bit characters to their approximate 7-bit counterparts, then I removed anything that was not in the x20 to x7f range, further limited by the Dynamics interface.

Regex.Replace(s, @"[^\x20-\x7F]", "")

should do that job.

Dave
2
var split = textLine.Split(':').Select(s => s.Trim()).ToArray();

The Trim() function will remove all the irrelevant whitespace. Note that this retains whitespace inside of a key or value, which you may want to consider separately.

Dan Bryant
2

You can use string.Trim() to remove white-space characters:

var results = lines
        .Select(line => {
            var pair = line.Split(new[] {':'}, 2);
            return new {
                Key = pair[0].Trim(),
                Value = pair[1].Trim(),
            };
        }).ToList();

However, if you want to remove all white-spaces, you can use regular expressions:

var whiteSpaceRegex = new Regex(@"\s+", RegexOptions.Compiled);
var results = lines
        .Select(line => {
            var pair = line.Split(new[] {':'}, 2);
            return new {
                Key = whiteSpaceRegex.Replace(pair[0], string.Empty),
                Value = whiteSpaceRegex.Replace(pair[1], string.Empty),
            };
        }).ToList();
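
Putting the pieces together, a minimal sketch of the whole read-into-a-dictionary step might look like the following. The names PairFileParser and Parse are hypothetical, and the choices to skip blank lines, reject anything that is not exactly key:value, and let a repeated key overwrite the earlier value are assumptions, not requirements stated in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class PairFileParser
{
    private static readonly Regex WhiteSpace = new Regex(@"\s+");

    public static Dictionary<string, string> Parse(IEnumerable<string> lines)
    {
        var result = new Dictionary<string, string>();
        foreach (var rawLine in lines)
        {
            // Remove all whitespace, even inside keys and values.
            var line = WhiteSpace.Replace(rawLine, string.Empty);
            if (line.Length == 0) continue; // skip blank lines

            // Exactly one ':' with a non-empty key and value is accepted.
            var pair = line.Split(':');
            if (pair.Length != 2 || pair[0].Length == 0 || pair[1].Length == 0)
                throw new FormatException("Invalid format: " + rawLine);

            // Assumption: a repeated key overwrites the earlier value.
            result[pair[0]] = pair[1];
        }
        return result;
    }
}
```

It could be fed from File.ReadLines with whatever path the file lives at, and the caller can catch FormatException to show the "Invalid format" message.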
mgronber
2

The requirements are too fuzzy. Consider:

"When is a space a value? key?"
"When is a delimiter a value? key?"
"When is a tab a value? key?"
"Where does a value end when a delimiter is used in the context of a value? key?"

These problems will result in code filled with one-offs and a poor user experience. This is why we have language rules/grammar.

Define a simple grammar and take out most of the guesswork.

"{key}":"{value}",

Here each key and value is contained within quotes, and pairs are separated by a delimiter (,). All extraneous characters can be ignored. You could use XML, but this may scare off less techy users.

Note, the quotes are arbitrary. Feel free to replace them with any container characters that will not need much escaping (just beware the complexity).

Personally, I would wrap this up in a simple UI and serialize the data out as XML. There are times not to do this, but you have given me no reason not to.

P.Brian.Mackey
  • Actually you are right. This would be my grammar: keys can be: "A-Za-z0-9", values can be: "A-Za-z0-9", key/value separator: ":", line separator: "\n". I think with that I can easily figure out some regular expressions to remove all unnecessary characters, perhaps by using the negation operator. – Juan Mar 14 '11 at 19:58
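
The grammar in this comment maps fairly directly onto a negated character class. A rough sketch, assuming the whole file has been read into a single string and that blank lines should be collapsed afterwards; GrammarCleaner and Clean are hypothetical names used only for illustration:

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class GrammarCleaner
{
    // Matches every character that is NOT a letter, digit, ':' or '\n'.
    private static readonly Regex Disallowed = new Regex(@"[^A-Za-z0-9:\n]");

    // Matches runs of two or more line breaks left behind by blank lines.
    private static readonly Regex BlankLines = new Regex(@"\n{2,}");

    public static string Clean(string text)
    {
        var stripped = Disallowed.Replace(text, string.Empty);
        // Collapse blank lines, then drop leading/trailing line breaks.
        return BlankLines.Replace(stripped, "\n").Trim('\n');
    }
}
```

Note that \r is not in the allowed set, so Windows line endings are reduced to plain \n. The result can then be split on '\n' and each line split on ':'.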
0

If it doesn't have to be fast, you could use LINQ:

string clean = new String(tainted.Where(c => 0 <= "ABCDabcd1234:\r\n".IndexOf(c)).ToArray());
Ben Voigt