
My map is currently inefficient when parsing one particular set of files (2 TB in total). I'd like to change the block size of files in the Hadoop DFS from 64 MB to 128 MB. I can't find anything in the documentation on how to do this for only one set of files rather than the entire cluster.

Which command changes the block size when I upload (such as when copying from local to the DFS)?

    Not sure if/when the parameter changed, but it is now called "dfs.block.size". –  Jan 26 '11 at 00:27
  • Why don't you change the split size of your MapReduce job? – ozw1z5rd Oct 05 '16 at 19:28
  • @ozw1z5rd AFAIK you can't change split size, or the number of splits. For MR2, it is dependent on your block size, and the number of splits is automatically computed on job submission. – ᐅdevrimbaris Dec 21 '16 at 18:55

5 Answers


In case anyone else finds this question later: I had to slightly change Bkkbrad's answer to get it to work with my setup, Hadoop 0.20 running on Ubuntu 10.10:

hadoop fs -D dfs.block.size=134217728 -put local_name remote_location

The setting for me is not fs.local.block.size but rather dfs.block.size.
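
If you want to double-check the block size that the uploaded file actually received, one option besides fsck is to read it back through the HDFS Java API. This is only a minimal sketch; "remote_location" is the same placeholder path as in the command above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckBlockSize {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // "remote_location" is the placeholder path from the -put command above
        FileStatus status = fs.getFileStatus(new Path("remote_location"));
        System.out.println("Block size in bytes: " + status.getBlockSize());
    }
}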

KWottrich
    note the new change in hadoop 2.0.4: dfs.blocksize (http://hadoop.apache.org/docs/r2.0.4-alpha/hadoop-project-dist/hadoop-common/DeprecatedProperties.html) – Kiran Jun 07 '13 at 06:13
    dfs.blocksize: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml – Shehaaz Jul 16 '13 at 20:42

I've changed my answer! You just need to set the fs.local.block.size configuration setting appropriately when you use the command line.

hadoop fs -D fs.local.block.size=134217728 -put local_name remote_location

Original Answer

You can programmatically specify the block size when you create a file with the Hadoop API. Unfortunately, you can't do this on the command line with the hadoop fs -put command. To do what you want, you'll have to write your own code to copy the local file to a remote location; it's not hard: just open a FileInputStream for the local file, create the remote OutputStream with FileSystem.create, and then use something like IOUtils.copy from Apache Commons IO to copy between the two streams.
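
For anyone who does want the programmatic route, here is a minimal sketch of the approach described above, not a definitive implementation; the class name, file names, buffer size, and replication factor are placeholders/assumptions:

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutWithBlockSize {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Local source file; "local_name" is a placeholder
        InputStream in = new FileInputStream("local_name");

        // create(path, overwrite, bufferSize, replication, blockSize)
        // the 4 KB buffer and replication factor of 3 are assumptions, not requirements
        FSDataOutputStream out = fs.create(new Path("remote_location"),
                true, 4096, (short) 3, 128L * 1024 * 1024);

        IOUtils.copy(in, out); // Apache Commons IO
        in.close();
        out.close();
    }
}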

Bkkbrad
  • After using the command mentioned above, I tried to check the blocks in the file with hdfs fsck hvr.out1 -files -blocks -locations (hvr.out1 is my file). It looks like it was not split using the block size I specified; instead it used the default block size. – ForeverLearner Aug 21 '17 at 08:44

In the conf/ folder we can change the value of dfs.block.size in the configuration file hdfs-site.xml. In Hadoop version 1.0 the default size is 64 MB, and in version 2.0 the default size is 128 MB.

<property>
    <name>dfs.block.size</name>
    <value>134217728</value>
    <description>Block size</description>
</property>
RamenChef
madhur

You can also modify the block size in your programs, like this:

Configuration conf = new Configuration();

// Configuration.set() expects a String value, so use setLong for a numeric size
conf.setLong("dfs.block.size", 128L * 1024 * 1024); // 134217728 bytes = 128 MB
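
A hedged follow-up sketch of how that Configuration might then be used: a FileSystem obtained with it should apply the configured block size to files it creates (the path and contents here are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithConfiguredBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("dfs.block.size", 128L * 1024 * 1024); // 128 MB

        // Files created through this FileSystem pick up the configured block size
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/tmp/example.dat")); // placeholder path
        out.writeUTF("hello");
        out.close();
    }
}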
Mubashar
  • How will this configuration setting affect data that is already stored at a particular default block size (Hadoop 2.5.2, 128 MB)? – Jeremy Hajek Nov 29 '16 at 05:29

We can change the block size using the property named dfs.block.size in the hdfs-site.xml file. Note: the size should be specified in bytes. For example, 134217728 bytes = 128 MB.

Rengasamy