MapReduce: Filter out key-value pairs if value is not above threshold

Question

Using MapReduce, how do I modify the following word count code so that it outputs only words whose count meets a certain threshold? (That is, I want to add some kind of filtering of key-value pairs.)

Input:

ant bee cat
bee cat dog
cat dog

Output (say the count threshold is 2 or more):

bee 2
cat 3
dog 2

The following code is from: http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Source+Code

public static class Map1 extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}

public static class Reduce1 extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

EDIT: RE: about inputs/testcase

The input file ("example.dat") and a simple test case ("testcase") can be found here: https://github.com/csiu/tokens/tree/master/other/SO-26695749

EDIT:

The problem wasn't the code. It was due to some strange behavior in the org.apache.hadoop.mapred package. (See also: "Is it better to use the mapred or the mapreduce package to create a Hadoop Job?")

Point: use `org.apache.hadoop.mapreduce` instead

score 1 · Accepted Answer · answered Nov 02 '14 at 03:34

Try adding an if statement before collecting the output in reduce.

if(sum >= 2)
    output.collect(key, new IntWritable(sum));
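
To see what the filtered reduce step should produce without running a cluster, here is a minimal plain-Java sketch that imitates the map, shuffle, and reduce stages in memory. It is not Hadoop code: the class name `ThresholdWordCount` and the method `countAbove` are hypothetical, and a `TreeMap` merely stands in for the framework's grouping and key sorting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the word count job; no Hadoop dependency.
public class ThresholdWordCount {

    // Counts words across all lines and keeps only those with count >= threshold.
    static Map<String, Integer> countAbove(List<String> lines, int threshold) {
        // Map + shuffle stand-in: collect the 1s emitted for each word, grouped by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                grouped.computeIfAbsent(tokenizer.nextToken(), k -> new ArrayList<>()).add(1);
            }
        }

        // Reduce stand-in: sum each group, then emit only if the sum meets the threshold.
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            if (sum >= threshold) {
                output.put(entry.getKey(), sum);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("ant bee cat", "bee cat dog", "cat dog");
        countAbove(lines, 2).forEach((word, count) -> System.out.println(word + " " + count));
    }
}
```

On the three sample lines from the question with a threshold of 2, this prints `bee 2`, `cat 3`, and `dog 2`; "bee" occurs on the first two lines, so it clears the threshold as well, and only "ant" is dropped.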

irrelephant

When I do something like this, I'm missing roughly half my expected outputs. Is it reasonable for the Reducer to not collect/emit a key-value pair? – csiu Nov 02 '14 at 03:43
No, that shouldn't happen. Could you post more details to the question? – irrelephant Nov 02 '14 at 03:45
When I tried what you suggested (on the actual input 'example.dat' -- see link above), I expected a count of 594 for word "0". But a count for this value was not returned when I set the threshold to 590. – csiu Nov 02 '14 at 04:23
Could you please construct a simple, easily verifiable test case? – irrelephant Nov 02 '14 at 04:29
I have created a simple test at: https://github.com/csiu/other/tree/master/SO-26695749/testcase. It also includes the TestCase.java code that was used to run MapReduce. – csiu Nov 02 '14 at 04:53
Interesting. If you run MR without the threshold filter, what happens? – irrelephant Nov 02 '14 at 04:56
When I set the threshold to `>= 1` (i.e. no filtering), I get the expected 107 key-value pairs. – csiu Nov 02 '14 at 05:04
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/64105/discussion-between-irrelephant-and-csiu). – irrelephant Nov 02 '14 at 05:06

score 1 · Answer 2 · answered Nov 02 '14 at 03:34

You can just do the filtering in the Reduce1 class:

if (sum >= 2) {
    output.collect(key, new IntWritable(sum));
}

Andrew Luo

When I do something like this, I'm missing roughly half my expected outputs. Is it reasonable for the Reducer to not collect/emit a key-value pair? – csiu Nov 02 '14 at 03:43
Can you show some of the input lines that cause this problem? – Andrew Luo Nov 02 '14 at 03:55
The problem was discovered when I was doing a spot check, e.g. for the word "0": I expected a count of 594, but no count was returned when a threshold of 590 was set. – csiu Nov 02 '14 at 04:21