algorithm to compress bunch of (string pairs)

Question

I have bunch of strings which come in pairs For example, these are three pairs. First one has two string "a", "1"

a,1
a,2
b,1

In my project, i need to compress this data. For the above data, compressed data is. (compressed to back uncompressed is cartesian product)

{a} <=> {1,2}
{b} <=> {1}

What is the best algorithm to come up with the optimal compressed data. Optimal here is fewer number of mappings I could write a simple script which creates a hash table with the first string. Even though it gives optimal compressed data in some cases (above case) but it doesnt always. For example,

a,1
a,2
b,1
c,2

Hash table with first string as a key would give me below compressed data

{a} <=> {1,2}
{b} <=> {1}
{c} <=> {2}

But that's not optimal compressed data. I will get below compression if i used second string

{a,b} <=> {1}
{a,c} <=> {2}

What is the best algorithm to compress the data?

Both of your examples will not produce the same output when you search. For example: you search for {a} you will get {1,2} but in the second you have to search for {a,b} to get {1,2}. (since its hashed you cant just search for {a}). If you don't know what to search for this could lead to a problem. — sebastian, Jan 11 '16 at 13:37
@kadoga - sorry, i didnt understand. These are not used for search, this is just compression — SAN, Jan 11 '16 at 13:39
Why not make two compressed mappings in parallel, and then decide which is the better one? Or are there any memory constraints? — Lav, Jan 11 '16 at 13:40
@Lav, The optimal solution would often be a mix of groupings by LHS and groupings by RHS. — ikegami, Jan 11 '16 at 13:41
@Lav, when i started thinking more, i felt there could be multiple solutions to this. Optimal result could come from completely different set. For example a,1;a,2;b,3;c,3 will need both to get {a} <=> {1,2}; {b, c} <=> {3} — SAN, Jan 11 '16 at 13:42
It's easy enough to create the `{keys} => {values}` mapping, and the transformation to create the inverted index, `{values} => {keys}`, is from the original mapping is also simple. Just create them both and then pick the shortest one. — Jim Mischel, Jan 11 '16 at 15:12
@Jim Mischel, Lav and user1697062 already suggested that, I've already pointed out it's broken, and SAN gave an example where that doesn't work. This is actually a very hard problem. — ikegami, Jan 11 '16 at 15:39
@SAN, Obviously, the first step is to split the data into connected groups: `a,1;a,2;b,3;c,3` ⇒ `a,1;a,2` + `b,3;c,3`. For that, you can use [this](http://pastebin.com/Tz1mPCDv). But breaking down the problem this way is only an optimization. You still have to compress the resulting groups in the same way you would have had to compress the whole set. For example, `a1,b1,b2,c1,c2,d2,d3,d4,e3,e4,e5` doesn't reduce, and includes LHS groupings and RHS groupings. One minimal solution: `abc<=>1;bcd<=>2;d<=>34;e<=>345` — ikegami, Jan 11 '16 at 15:45

score 1 · Answer 1 · edited May 23 '17 at 12:04

1

This problem is equivalent to covering the 1's or 0's of a binary matrix with a minimum number of lines.

For instance, consider the example : a1,b1,b2,c1,c2,d2,d3,d4,e3,e4,e5. This can be put in a matrix as follows:

  1 2 3 4 5
a 1
b 1 1
c 1 1
d   1 1 1
e     1 1 1

A minimal solution is to cross the lines d and e to give:

d : {2, 3, 4}
e : {3, 4, 5}

and the columns 1 and 2

1 : {a, b, c}
2 : {b, c, d}

so as to reach a minimum number of entries in the compressed map of 4.

The solution of this problem is treated elsewhere on stackoverflow in Hungarian Algorithm: How to cover 0 elements with minimum lines?

edited May 23 '17 at 12:04

Community

1
1

answered Jan 11 '16 at 13:57

kernel_panic

31
5

@ikegami, Sorry I made a mess out of this answer. Now I think this matrix analogy leads to a minimal number of entries in the mapping and therefore answers the question. What's your opinion? – kernel_panic Jan 12 '16 at 11:20
If it can handle the example I gave and that you used, I'm pretty sure it can handle anything. I'll make sure to read the linked page when I can. – ikegami Jan 12 '16 at 12:32

score 0 · Answer 2 · answered Jan 12 '16 at 15:25

This is definitely not speed-optimized, but this problem really captured my attention. I wasn't really sure about the input format, but that should be trivial to change. Here is my idea, basically I think that you need a two-pass, because you can't know which compression is best until you reach the end of your data:

use strict;
use warnings;

use Data::Dumper;

sub compress_strings {
    my @strings = @_;
    my %keys_hash;
    my %values_hash;
    my %second_hash;
    my %third_hash;

    foreach my $string (@strings) {
        my ($key, $value) = split /,/, $string;
        push(@{$keys_hash{$key}}, $value);
        push(@{$values_hash{$value}}, $key);
    }

    foreach my $key (keys(%keys_hash)) {
        my $found = 0;
        foreach my $value (@{$keys_hash{$key}}) {
            if (@{$values_hash{$value}} > 1) {
                if (!grep {$value} @{$second_hash{join(',', @{$values_hash{$value}})}}) {
                    push(@{$second_hash{join(',', @{$values_hash{$value}})}}, $value);
                }
                $found = 1;
            }
        }
        if (!$found) {
            $second_hash{$key} = $keys_hash{$key};
        }
    }

    %third_hash = map { $_ => join(',',@{$second_hash{$_}}) } keys(%second_hash);

    return \%third_hash;
}

print Dumper compress_strings( 'a,1', 'a,2', 'b,1', 'c,2', 'd,4' );
print Dumper compress_strings( 'a,1', 'a,2', 'b,3', 'c,3' );
print Dumper compress_strings( 'a,1', 'b,1' ,'b,2' ,'c,1' ,'c,2' ,'d,2' ,'d,3' ,'d,4' ,'e,3' ,'e,4' ,'e,5' );

Output:

$VAR1 = {
          'a,c' => '2',
          'a,b' => '1',
          'd' => '4'
        };
$VAR1 = {
          'a' => '1,2',
          'b,c' => '3'
        };
$VAR1 = {
          'a,b,c' => '1',
          'b,c,d' => '2',
          'd,e' => '3'
        };

algorithm to compress bunch of (string pairs)

2 Answers2