
Many compression algorithms take advantage of the fact that there's redundancy/patterns in data. aaaaaaaaaabbbbbbbbbbbcccccccccccc could be compressed to 10'a'11'b'12'c', for example.
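
For example, a toy run-length encoder along these lines (a minimal Python sketch of the idea, not any particular library's API) produces exactly that kind of output:

    from itertools import groupby

    def rle_encode(data: str) -> str:
        # Collapse each run of identical characters into count + character.
        return "".join(f"{len(list(group))}'{char}'" for char, group in groupby(data))

    print(rle_encode("a" * 10 + "b" * 11 + "c" * 12))  # 10'a'11'b'12'c'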

But now that there's no more redundancy in my compressed data, I can't really compress it further. However, I can encrypt or encode it and turn it into a different string of bytes: xyzxyzxyzxyzxyz.

If the random bits just so happened to have a pattern in them, it seems that it'd be easy to take advantage of that: 5'xyz'

Here's what our flow looks like:

Original:         aaaaaaaaaabbbbbbbbbbbcccccccccccc
Compressed:       10'a'11'b'12'c'
Encrypted:        xyzxyzxyzxyzxyz
Compressed again: 5'xyz'
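
Here is a minimal sketch of that flow in Python, with zlib standing in for the compressor and a keyed XOR keystream standing in for real encryption (both are just stand-ins for illustration):

    import hashlib
    import zlib

    def xor_keystream(data: bytes, key: bytes) -> bytes:
        # Toy stream "cipher": XOR the data with a SHA-256-derived keystream.
        # A stand-in for real encryption, used only to scramble the bytes.
        stream = b""
        counter = 0
        while len(stream) < len(data):
            stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
            counter += 1
        return bytes(a ^ b for a, b in zip(data, stream))

    original = (b"a" * 10 + b"b" * 11 + b"c" * 12) * 1000   # highly redundant
    compressed = zlib.compress(original)                    # shrinks dramatically
    encrypted = xor_keystream(compressed, b"secret key")    # same size, scrambled
    compressed_again = zlib.compress(encrypted)             # second pass

    print(len(original), len(compressed), len(encrypted), len(compressed_again))
    # In practice the last number comes out slightly *larger* than len(encrypted).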

But the more data you have and the larger your file, the more effective many forms of compression will be. Huffman encoding, especially, seems like it'd work really well on random bits of data, particularly when the file gets large!
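
One way to test that intuition is to measure the byte-level entropy of the data, since Huffman coding only gains when some byte values occur more often than others. A rough sketch (os.urandom stands in for "random bits"):

    import math
    import os
    from collections import Counter

    def byte_entropy(data: bytes) -> float:
        # Shannon entropy in bits per byte; 8.0 means a Huffman-style coder gains nothing.
        counts = Counter(data)
        total = len(data)
        return -sum(c / total * math.log2(c / total) for c in counts.values())

    print(byte_entropy(b"aaaaaaaaaab" * 10_000))   # heavily skewed, well under 8
    print(byte_entropy(os.urandom(100_000)))       # very close to 8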

I imagine this would be atrocious when you need data fast, but I think it could have merits for storing archives, or other stuff like that. Maybe downloading a movie over a network would only take 1MB of bandwidth instead of 4MB. Then, you could unpack the movie as the download happened, getting the full 4MB file on your hard drive without destroying your network's bandwidth.

So I have a few questions:

  1. Do people ever encode data so it can be compressed better?

  2. Do people ever "double-compress" their data?

  3. Are there any well-known examples of "double" compression, where data is compressed, encrypted or encoded, then compressed again?

Ari Sweedler
  • Why do you think Huffman encoding would be efficient for random data? It relies on the probability of some pieces of data occurring being different than the probability of other pieces of data occurring. That won't be the case for large pieces of truly random data. – EdmCoff Jan 30 '18 at 23:19
  • You're right, if the data were truly random, Huffman encoding wouldn't work at all. Here's my logic: compression takes advantage of patterns in data, rendering it pattern-less. Encryption scrambles data, and since we're already pattern-less, there are only two possibilities: remaining pattern-less, or introducing some pattern to the data. – Ari Sweedler Jan 30 '18 at 23:39
  • @EdmCoff I was confused: once Huffman-encoded, a message kinda looks like a string of random bits. But a string of random bits isn't efficient to Huffman-encode. Worst-case, Huffman encoding doesn't add any bits to the message (other than the header table; is that large enough of an overhead to be relevant?). I suppose I find it tough to believe that EVERY time you encrypt/encode compressed data with EVERY algorithm, you'll be unable to find a pattern that you can exploit to compress with ANY compression algorithm. – Ari Sweedler Jan 30 '18 at 23:43

2 Answers


Good encryption results in high-quality random data, so it cannot be compressed. The probability that a compressible result "just so happens" to come out of an encryption is the same as it would be for any other random data source, which is to say, essentially never.
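
To put rough numbers on "never": by the standard counting argument, a lossless scheme can shrink at most a 2^-k fraction of all possible inputs by k bits, no matter how it works. A quick sketch with an arbitrary example size:

    # There are fewer than 2**(n-k) bit strings shorter than n-k bits, so no
    # lossless scheme can map more than a 2**-k fraction of all n-bit inputs
    # to outputs that are at least k bits shorter.
    def max_fraction_compressible(bits_saved: int) -> float:
        return 2.0 ** -bits_saved

    # Chance that a random (e.g. well-encrypted) file happens to be
    # compressible by even 100 bytes, under the most generous assumptions:
    print(max_fraction_compressible(8 * 100))   # ~1.5e-241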

Mark Adler
  • Yes, that makes sense for encryption. But .tar.gz's got me thinking about all of this, so that's why I included encryption. I suppose I should change my question to only mention encoding. So, encryption by definition won't help with this. But what about encoding? There's an infinite number of functions you can use to encode data, thus by the pigeonhole principle, one method of encoding would HAVE to reduce the randomness (and increase the compressibility) of your data, right? The only problem is... how expensive would it be to find that right encoding function. – Ari Sweedler Jan 31 '18 at 04:40
  • No. The problem is how many bits it will take to describe the encoding function. That has to be transmitted along with the data. Those bits will put you right back where you started. – Mark Adler Jan 31 '18 at 05:09
  • That's something that I was worried about, but if the file was REALLY really large, then my thoughts were that the fixed overhead (tens of kilobytes, per compression) would hopefully be offset. Another issue with compressing like this is that you'd really have to "mine" to find a good encoding function... if you could measure a loss function based on how unpatterned the data is, this would be an easier problem. But decompressing and decoding would be relatively cheap. This would really only be useful for mass distribution of large files to clients. Maybe something like Netflix could use this? – Ari Sweedler Jan 31 '18 at 05:49
  • There is nothing useful to use. This is not a problem of how long it takes to find something. Even with infinitely fast and powerful computers, there is nothing that can be found to enable the compression of random data. – Mark Adler Jan 31 '18 at 05:54
  • I know you're an authority on this stuff, and I respect that a lot. But if you can use the SHA-256 algorithm to make a hash ending with 30+ zeros, then it seems like you could use a random encoding function to make your data less random. Maybe salting (<-- adding more overhead...) your data would make it more possible, or using strange encoding functions, but, without a proof, I find it hard to believe that an infinitely fast and powerful computer couldn't twice-compress a massive and already once-compressed file. – Ari Sweedler Jan 31 '18 at 06:17
  • If there's an infinite number of encoding functions, and tens of valid compression functions, and many, many bytes of data in your file, all you'd need to gain a step is to encode your data to have 10 KB (the overhead of a compression) of compressible data, then you could repeat. It seems like a fallacy that you can just keep compressing data, but the limit will be that each time you encode/compress, you have to record the inverse of the encoding/compression function somewhere in your file. But you could compress those along with the file. – Ari Sweedler Jan 31 '18 at 06:20

Double compression is like perpetual motion. It is an oft-discussed idea, but it never works. If it worked, you could compress and compress and compress and get the file down to 1 bit... See How many times can a file be compressed?
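
A quick way to see this in practice, as a sketch with zlib (any general-purpose compressor behaves the same way):

    import zlib

    data = b"the quick brown fox jumps over the lazy dog. " * 5000
    print(f"original: {len(data)} bytes")
    for i in range(1, 5):
        data = zlib.compress(data)
        print(f"after pass {i}: {len(data)} bytes")
    # Pass 1 shrinks the data enormously; each later pass makes it slightly bigger.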

The fundamental problem is that most files are NOT compressible--random, encrypted files even less so.

To answer your questions:

1) Yes! See the Burrows-Wheeler transform (a toy sketch follows below).

2) No.

3) No.
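
For question 1, here is a toy sketch of the Burrows-Wheeler transform mentioned above (a naive version, not how bzip2 actually implements it). It reorders the data so that like characters cluster together, which makes a follow-up run-length or Huffman stage much more effective:

    def bwt(text: str) -> str:
        # Naive Burrows-Wheeler transform: sort every rotation of the string
        # (with a '$' sentinel appended) and read off the last column.
        assert "$" not in text
        s = text + "$"
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(rotation[-1] for rotation in rotations)

    print(bwt("abracadabra"))  # 'ard$rcaaaabb': like characters end up next to each other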

Ray