
I don't quite get the concept of block compression in Hadoop. Let's say I have 1 GB of data that I want to write as a block-compressed SequenceFile, and the default HDFS block size of 128 MB.

Does that mean my data gets split into 8 compressed blocks on HDFS, and that each of these blocks can later be decompressed independently?
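
For context, the way I plan to write the file is roughly like this (just a sketch with a made-up output path and key/value types, not my actual job):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;

public class SeqFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/data.seq");  // hypothetical output path

        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(LongWritable.class),
                    SequenceFile.Writer.valueClass(Text.class),
                    // BLOCK compression: groups of records are compressed together
                    SequenceFile.Writer.compression(
                            SequenceFile.CompressionType.BLOCK, new GzipCodec()));

            for (long i = 0; i < 1000; i++) {
                writer.append(new LongWritable(i), new Text("record-" + i));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
```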

Joha
  • This link may be helpful: https://stackoverflow.com/questions/34953756/how-gzip-file-gets-stored-in-hdfs – Sandeep Singh Jun 29 '17 at 09:24
  • I don't think block compression in a sequence file and HDFS blocks are related. Block compression is just one of the two types of compression provided by SeqFiles, the other being record compression. Record compression compresses one record at a time, and block compression compresses multiple records at a time. It doesn't have anything to do with your block size. I might be wrong. – philantrovert Jun 29 '17 at 10:00
  • That was exactly my question, thanks for pointing that out. Could someone give me a reference that makes clear whether block compression in a SequenceFile is related to HDFS blocks? – Joha Jun 29 '17 at 10:32
  • I've found the answer [here](https://www.safaribooksonline.com/library/view/hadoop-application-architectures/9781491910313/titlepage01.html). [...] "Also, the reference to block here is unrelated to the HDFS or filesystem block. A block in block compression refers to a group of records that are compressed together within a single HDFS block." – Joha Jun 29 '17 at 10:53

1 Answer


That all depends on whether the compression codec is splittable or not (gzip, for example, doesn't support splitting).

Splittable means that HDFS blocks can be decompressed in parallel and do not need to be co-located for SequenceFile decompression.

Also, if you are using block compression, a compressed record may span multiple HDFS blocks, which again requires co-location in order to decompress it.

So your blocks may or may not be decompressable independently.
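
Whether a codec is splittable can be checked against the Hadoop API: only codecs implementing SplittableCompressionCodec (bzip2, for instance) can be split; gzip cannot. A minimal sketch, assuming a Hadoop 2.x client on the classpath:

```java
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittableCheck {
    public static void main(String[] args) {
        // GzipCodec does NOT implement SplittableCompressionCodec,
        // so a plain .gz file has to be processed as a single split.
        System.out.println("gzip splittable: "
                + (new GzipCodec() instanceof SplittableCompressionCodec));   // false

        // BZip2Codec does implement it, so its output can be split.
        System.out.println("bzip2 splittable: "
                + (new BZip2Codec() instanceof SplittableCompressionCodec));  // true
    }
}
```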

Aakash Verma