2

I'm dealing with kind of a bizarre use case where I need to make sure that File A is local to Machine A, File B is local to Machine B, etc. When copying a file to HDFS, is there a way to control which machines that file will reside on? I know that any given file will be replicated across three machines, but I need to be able to say "File A will DEFINITELY exist on Machine A". I don't really care about the other two machines -- they could be any machines on my cluster.

Thank you.

sangfroid
  • 3,733
  • 11
  • 38
  • 42
  • Given this requirement, it is possible that your overall architecture on HDFS is not correct. Bear in mind that *files don't go to a node*, blocks do. You can, however, play with **rack awareness** and **replication factor** for a (very) small cluster, i.e. making sure every block goes to every machine. In any case I don't see much of an advantage, and describing your use case in a little more depth may give us better insight to help you. – xmar Nov 13 '17 at 07:51
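
A minimal sketch of the replication-factor idea from the comment above (not code from the thread): raising a file's replication factor to the number of DataNodes asks HDFS to keep a copy of every block on every node of a small cluster. The path `/data/fileA` and the cluster size of 3 are assumptions for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicateEverywhere {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // On a 3-DataNode cluster, one replica per node means every node ends up
        // holding every block of this file.
        short numDataNodes = 3;  // assumption: 3 DataNodes
        boolean accepted = fs.setReplication(new Path("/data/fileA"), numDataNodes);
        System.out.println("Replication change accepted: " + accepted);
    }
}
```

Note that this still does not pin a *specific* replica to a *specific* machine; it only guarantees that the machine you care about, like every other one, holds a copy.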

3 Answers

0

I don't think so, because in general, when a file is larger than the block size (64 MB by default), its blocks will be spread across multiple servers.

Sharvanath
  • 457
  • 4
  • 15
  • The block size can easily be modified in the configuration settings, so this is not an impediment (see the sketch after these comments). [change block size](http://stackoverflow.com/questions/2669800/changing-the-block-size-of-a-dfs-file-in-hadoop) – Engineiro Apr 09 '13 at 22:35
  • Also, these files are small, less than 1MB – sangfroid Apr 09 '13 at 23:15
  • I mean conceptually: if the data can reside on multiple servers, it's unlikely that one would not care about adding such an option. – Sharvanath Apr 11 '13 at 01:11
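
As a concrete illustration of the block-size point in the comments above (a hedged sketch, not code from the linked question): the block size can be chosen per file at write time through one of the `FileSystem.create()` overloads. The path, the 128 MB value, and the other parameters are assumptions for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithCustomBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        long blockSize = 128L * 1024 * 1024;  // 128 MB blocks for this file only
        short replication = 3;                // replication factor for this file
        int bufferSize = 4096;                // client-side write buffer

        // This overload of create() overrides the cluster's default block size
        // for this one file.
        try (FSDataOutputStream out = fs.create(
                new Path("/data/fileA"), true, bufferSize, replication, blockSize)) {
            out.writeUTF("small file, fits in a single block");
        }
    }
}
```
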
0

HDFS is a distributed file system: it is defined at the level of the cluster (one machine or many), and once a file is in HDFS you lose the concept of the individual machines underneath. That abstraction is what makes it useful. If the file is bigger than the block size, the file will be cut into blocks, and based on the replication factor those blocks will be copied to other machines in your cluster; where each block lands is decided by HDFS's block placement policy, not by you.

In your case, if you have a 3-node cluster (plus the NameNode), your source file is 1 MB, your block size is 64 MB, and your replication factor is 3, then the single block holding your 1 MB file will have 3 copies, one on each of the 3 nodes; from the HDFS perspective, however, you still have only 1 file. Once a file is copied to HDFS, you really don't think about individual machines, because at the machine level there are no files, only file blocks.

If for whatever reason you really want to make sure, what you can do is set the replication factor to 1 and run a 1-node cluster, which will guarantee your bizarre requirement.

Finally, you can always use the FSImage viewer tools in your Hadoop cluster to see where the file blocks are located. More details are located here.
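
Besides the offline image viewer, block locations can also be checked from a client with the `FileSystem` API. This is a hedged sketch (the path `/data/fileA` is an assumption), not part of the original answer.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/fileA"));

        // One BlockLocation per block; getHosts() lists the DataNodes holding its replicas.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("block at offset " + loc.getOffset()
                    + " -> " + String.join(", ", loc.getHosts()));
        }
    }
}
```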

AvkashChauhan
  • 20,495
  • 3
  • 34
  • 65
0

I found this recently, and it may address what you are looking to do: Controlling HDFS Block Placement
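
For orientation only (an assumption about what the linked post covers, not a quote from it): block placement is pluggable on the NameNode, and a custom policy is registered through a configuration key in hdfs-site.xml. The property name and default class below should be checked against your Hadoop version.

```java
import org.apache.hadoop.conf.Configuration;

public class ShowPlacementPolicyConfig {
    public static void main(String[] args) {
        // Loads hdfs-site.xml from the classpath, if present.
        Configuration conf = new Configuration();

        // On the NameNode, this key selects the BlockPlacementPolicy implementation;
        // overriding it with a custom class is how replica placement can be steered.
        System.out.println(conf.get(
                "dfs.block.replicator.classname",
                "org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault"));
    }
}
```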

Engineiro
  • 1,146
  • 7
  • 10