0

EDIT: I solved the problem and added the answer below for reference.

I am having a hard time finding information regarding my problem because I can't seem to find the right term for it.

I am creating a C# WPF app to compare and find the differences between 2 .csv files. I can already compare data and find the matches and differences, assuming the data in both is in the same format.

However, I also need to support differing, but similar sets as well.

Eg. (the files provided that I need to support, ideally this would work for other scenarios as well) need to be compared correctly as well.

The problem is that one of these sets has varying ways of formatting the data, since they are maintained by different people. The other set is always using the same format.

The set that uses the same format for everything, displays numbers in the format: "31612345678" or "3197123456789".

The set with varying data can have this data formatted in varying ways eg. "0612345678" "612345678" "31612345678" "06 12 34 56 78" "097 123456789" "09 71 23 45 67 89" etc.

Are there standardized steps to 'clean up' messy data like that and end up with 2 sets using the same format?

  • I would google something like "csv diff" – Clemens Jan 17 '22 at 09:53
  • 1
    I think the term you are looking for is [normalization](https://en.wikipedia.org/wiki/Canonical_form#Computing) Briefly, it's the process of turning different data formats into a common form. – InDieTasten Jan 17 '22 at 09:54
  • I will see what I can find using those, thanks :) – ChibiPhoenix Jan 17 '22 at 09:58
  • If you know the values should be a integer, it should be fairly simple to [remove all non-numeric characters](https://stackoverflow.com/questions/3977497/stripping-out-non-numeric-characters-in-string). – JonasH Jan 17 '22 at 10:05
  • 1
    *Are there standardized steps to 'clean up' messy data like that* - no; you have to imagine them and then code to cater for them. Alas, the problem is as old as.. er.. computers, because humans are inaccurate and unreliable and they push buttons connected to computers, which are accurate and reliable. This is what we develop *standards* for; for example, look at something like EDI - a solution to that recognized problem of "A wants to send data to B, but they each use different formats, and the whole thing is a gong show". Either A gets strict in sending, or B gets liberal in receiving – Caius Jard Jan 17 '22 at 10:06
  • When you say "'clean up' messy data" what do you mean? Do you want to convert the formatted numbers to a integer (`int64` or `BigInt`) for comparison? – Corey Jan 17 '22 at 10:12
  • @CaiusJard So basically you'd have to create support for a given scenario? In my case, I'd have to check which ones don't match (probably using Regex or the like) and turn them into the right form. However, if another type of data would have to be repaired I'd have to create a solution for that one specifically as well...? – ChibiPhoenix Jan 17 '22 at 10:13
  • @Corey They are in string format. I would like to end up with all items following the format "31612345687". Eg. turn "06 78 65 98 11" into "31678659811" I think I could do this using Regex. – ChibiPhoenix Jan 17 '22 at 10:19
  • So... Dutch phone numbers? https://regexr.com/3aevr – Corey Jan 17 '22 at 10:22
  • @Corey Yep! And that link is VERY helpfull thanks! Quite new to SO... how can I mark your comment as the answer? – ChibiPhoenix Jan 17 '22 at 10:22
  • Yes; you have some data, a million numbers, and 900K are fine, 100K are a mess. So you take out spaces.. Then only 1000 are a mess, so you take out hyphens, and clean up `ext`, `xtn`, `(e)` or whatever else the human has garbaged in to signify the extension number, and only 100 are a mess.. Then you replace the leading 0 with 31 and only 10 are a mess.. then you.. – Caius Jard Jan 17 '22 at 10:27
  • *how can I mark your comment as the answer?* - comments cannot be marked as answers. As this operation is probably a lot more involved than "run a regex for dutch phone numbers" (which really is a comment, not an answer), you should do the work, then write your own answer and post it - this way others benefit from what you did to clean up a messy dutch phone number. Or Corey can get involved and write up the comment into a considered, and useful answer (it will be longer than one line) so it can be accepted. For that to happen more details on the bad input data is needed first, really.. – Caius Jard Jan 17 '22 at 10:28
  • @CaiusJard Exactly what I am thinking of now :) Going to use RegEx to find faulty values (eg. ^0 to find the ones starting with a zero) then replace it to "31" etc. – ChibiPhoenix Jan 17 '22 at 10:30
  • "Data cleansing" is such a mammoth and ever-present task, that companies hire people who make it their sole function - there's an entire sector of industry just to deal with it :) - welcome to the introduction! – Caius Jard Jan 17 '22 at 10:31
  • @CaiusJard I am going to work on it, now I know which direction to go it is a lot easier to come up with a game-plan! I will update the post with my answer when I have a working solution :) – ChibiPhoenix Jan 17 '22 at 10:31

1 Answers1

0

My solution involves the use of the DataTable class, since they display the .csv data well when using the WPF DataGrid.

What needs to happen to fix the data of a given column:

  1. Remove any empty entries
  2. Remove any duplicates
  3. Replacing invalid entries with valid entries
  4. Repeat steps 2 and 3 untill the data is in the format you want it to be.

I ended up creating an extension class for the DataTable to tackle the above steps:

public static class MyDataTableExtensions
{
    /// <summary>
    /// Replaces the values in all rows in columnName using RegEx, can optionally remove any duplicate values as well.
    /// </summary>
    /// <param name="dt"></param>
    /// <param name="columnName"></param>
    /// <param name="pattern"></param>
    /// <param name="replacement"></param>
    /// <param name="removeDuplicates"></param>
    /// <returns></returns>
    public static DataTable ReplaceDataTableRows(this DataTable dt, string columnName, string pattern, string replacement, bool removeDuplicates = false)
    {
        if (dt is null || dt.AsEnumerable().Any() is false) throw new ArgumentException($"The DataTable '{ nameof(dt) }' does not contain any elements. Please make sure that the provided DataTable is not null or empty.");
        if (string.IsNullOrEmpty(columnName)) throw new ArgumentNullException(nameof(columnName));
        if (string.IsNullOrEmpty(pattern)) throw new ArgumentNullException(nameof(pattern));

        DataTable output = dt.Copy();
        Regex regex = new Regex(pattern);
        string item;

        foreach (DataRow row in output.Rows)
        {
            foreach (DataColumn column in output.Columns)
            {
                if (column.ColumnName.ToLower().Equals(columnName.ToLower()))
                {
                    if (column.ReadOnly) column.ReadOnly = false;
                    item = regex.Replace(row.Field<object>(column).ToString(), replacement);

                    if (string.IsNullOrEmpty(item))
                    {
                        row.Delete();
                        continue;
                    }

                    row.SetField(column, item);
                }
            }
        }

        output.AcceptChanges();
        return removeDuplicates ? RemoveDuplicatesFromDataTableColumn(output, columnName) : output;
    }

    /// <summary>
    /// Removes any empty values found in the rows in the given column
    /// </summary>
    /// <param name="dt"></param>
    /// <param name="columnName"></param>
    /// <returns></returns>
    public static DataTable RemoveEmptyValuesFromDataTableColumn(this DataTable dt, string columnName)
    {
        DataTable output = dt.Copy();

        foreach (DataRow row in output.Rows)
        {
            foreach (DataColumn column in output.Columns)
            {
                if (column.ColumnName.Equals(columnName))
                {
                    if (column.ReadOnly) column.ReadOnly = false;

                    if (string.IsNullOrEmpty(row.Field<object>(column).ToString()))
                    {
                        row.Delete();
                    }
                }
            }
        }

        output.AcceptChanges();
        return output;
    }

    /// <summary>
    /// Removes any duplicate rows found in the column with 'columnName'
    /// </summary>
    /// <param name="dt"></param>
    /// <param name="columnName"></param>
    /// <returns></returns>
    public static DataTable RemoveDuplicatesFromDataTableColumn(this DataTable dt, string columnName)
    {
        if (dt is null || dt.AsEnumerable().Any() is false) throw new ArgumentException($"The DataTable '{ nameof(dt) }' does not contain any elements. Please make sure that the provided DataTable is not null or empty.");
        if (string.IsNullOrEmpty(columnName)) throw new ArgumentNullException(nameof(columnName));

        return dt.AsEnumerable().GroupBy(x => x.Field<object>(columnName))
                .Select(x => x.First()).CopyToDataTable();
    }

    /// <summary>
    /// Prepends the valueToPrepend to the row in columnName if the value in the row does NOT match the given pattern
    /// </summary>
    /// <param name="dt"></param>
    /// <param name="columnName"></param>
    /// <param name="pattern"></param>
    /// <param name="valueToPrepend"></param>
    /// <returns></returns>
    public static DataTable PrependValueToDataTableColumn(this DataTable dt, string columnName, string pattern, string valueToPrepend)
    {
        if (dt is null || dt.AsEnumerable().Any() is false) throw new ArgumentException($"The DataTable '{ nameof(dt) }' does not contain any elements. Please make sure that the provided DataTable is not null or empty.");
        if (string.IsNullOrEmpty(columnName)) throw new ArgumentNullException(nameof(columnName));
        if (string.IsNullOrEmpty(pattern)) throw new ArgumentNullException(nameof(pattern));
        if (string.IsNullOrEmpty(valueToPrepend)) throw new ArgumentNullException(nameof(valueToPrepend));

        DataTable output = dt.Copy();
        Regex regex = new Regex(pattern);
        string item;

        foreach (DataRow row in output.Rows)
        {
            foreach (DataColumn column in output.Columns)
            {
                if (column.ColumnName.ToLower().Equals(columnName.ToLower()))
                {
                    if (column.ReadOnly) column.ReadOnly = false;
                    item = row.Field<object>(column).ToString();

                    if (regex.IsMatch(item) == false)
                    {
                        row.SetField(column, $"{ valueToPrepend }{ item }");
                    }
                }
            }
        }

        output.AcceptChanges();
        return output;
    }
}

After that, for my data-set, the solution is actually relatively simple:

    private void Button_Click(object sender, RoutedEventArgs e)
    { 
        // eventually the column will be selectable through the use of a ComboBox
        var selection = cb.SelectedItem;
        string columnName = selection is not null ? cb.SelectedItem.ToString() : "Objectgebondennummer";

        if (!string.IsNullOrEmpty(columnName))
        {
            // solution:                
            var resultDt = dt.Copy();                
            resultDt = resultDt.RemoveEmptyValuesFromDataTableColumn(columnName) // step 1
                       .RemoveDuplicatesFromDataTableColumn(columnName) // step 2
                       .ReplaceDataTableRows(columnName, "[^0-9]", string.Empty, true) // at this point we just have rows with mixed values, remove any non-numeric characters from the rows
                       .ReplaceDataTableRows(columnName, "^0", "31", true) // now we just have numbers (yay), first I replaced all the values starting with '0' to start with '31' instead
                       .PrependValueToDataTableColumn(columnName, "^31", "31") // my data was luckily pretty clean at this point, all that remained were numbers in the format "612345678" and "97123456789", so I just had to prepend 31 to them
                       .RemoveDuplicatesFromDataTableColumn(columnName); // make sure we don't have any new duplicates

            dataGrid.ItemsSource = resultDt.AsDataView(); // finally, update the datagrid with the fixed data!
        }
    }