I am managing a Hadoop cluster that is shared among a number of users. We frequently run jobs with extremely slow mappers. For example, we might have a 32 GB file of sentences (one sentence per line) that we want to run through an NLP parser (at, say, 100 ms per sentence). With a 128 MB block size, that works out to 256 mappers. This fills our rather small cluster (9 nodes × 12 mapper slots per node = 108 concurrent mappers), and each mapper takes a very long time (hours) to complete.
The problem is that if the cluster is empty when such a job starts, the job takes every mapper slot on the cluster, and anyone who then wants to run a short job is blocked for hours. I know that newer versions of Hadoop support preemption in the Fair Scheduler (we are using the Capacity Scheduler), but the newer versions are also not yet stable (I'm anxiously awaiting the next release).
There used to be the option of specifying the number of map tasks via JobConf.setNumMapTasks(), but JobConf is now deprecated (strangely, it is not deprecated in 0.20.205). This would alleviate the problem: with more mappers, each map task would work on a smaller slice of the data and thus finish sooner.
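For reference, this is the knob I mean (a sketch from memory against the old mapred API; MapCountHint is a placeholder class name, and the Javadoc notes the value is only a hint, since the InputFormat still decides the actual split count):

    import org.apache.hadoop.mapred.JobConf;

    public class MapCountHint {
        public static JobConf configure() {
            JobConf conf = new JobConf(MapCountHint.class);
            // Only a hint to the framework: ask for ~1024 map tasks,
            // i.e. roughly 32 MB of the 32 GB input per task.
            conf.setNumMapTasks(1024);
            return conf;
        }
    }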
Is there any way around this problem in 0.20.203? Do I need to subclass my InputFormat (in this case TextInputFormat)? If so, what exactly do I need to specify?
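For concreteness, the effect I'm after is something like the following (an untested sketch against the new mapreduce API; SmallSplitsDriver and the 16 MB cap are illustrative, and I don't know whether 0.20.203 actually honors the max-split-size setting this way):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class SmallSplitsDriver {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "nlp-parse");
            job.setJarByClass(SmallSplitsDriver.class);
            job.setInputFormatClass(TextInputFormat.class);
            // Cap each split at 16 MB instead of the 128 MB block size,
            // so the 32 GB file yields ~2048 short map tasks instead of
            // 256 multi-hour ones.
            FileInputFormat.setMaxInputSplitSize(job, 16L * 1024 * 1024);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            // ... set mapper class and output path, then job.waitForCompletion(true)
        }
    }

If a setting like this already works, I wouldn't need a custom InputFormat at all; if it doesn't, what would the TextInputFormat subclass have to override?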