I have a simple MongoDB installation with one primary and two secondary nodes. When I run a mapReduce query on a 5 GB dataset, it takes the same time as it did on a standalone MongoDB installation on a single node. I am using the command line. Do I need a specific command to make mapReduce use the additional replica set members?

Thank you in advance.

  • There is no way to scale your operations using replica sets. Replica sets are for high availability and failover (plus redundancy of data), not for scaling. However, if you want to run mapReduce on a secondary, just connect to the secondary, specify rs.slaveOk(), and then run mapReduce. It has to return results inline, though, since you cannot write to a secondary. – Asya Kamsky Aug 18 '14 at 21:37
  • I want to speed up mapReduce (reduce the query time); running on a secondary is not my goal. Should I try both sharding and replication to achieve a speedup? How can I reduce my mapReduce time from 10 sec to, let's say, 1 sec, provided that I keep the database size constant? – Ali Anwar Aug 18 '14 at 23:30
  • Maybe you should query without mapReduce. mapReduce is inherently slow and should only be used when there is no other way to get the data you need. Can you do the same query with the aggregation framework, or change your schema so that you only need a simple find? – Asya Kamsky Aug 19 '14 at 00:16
  • Yes, I can get the same result using aggregation. I thought mapReduce would be faster in a distributed setup. How can I use aggregation efficiently with my setup? I get the same query time from the standalone as from the three-node replica set. – Ali Anwar Aug 19 '14 at 01:09
  • A replica set does not give you scaling. It's not supposed to. You won't get faster times from a replica set; that's expected. – Asya Kamsky Aug 19 '14 at 01:41
  • also see http://stackoverflow.com/questions/13908438/is-mongodb-aggregation-framework-faster-than-map-reduce/13912126#13912126 – Asya Kamsky Aug 19 '14 at 01:43
  • Do you have any suggestions for how I can reduce the query time of a single aggregation query, provided that I can add as many sharded servers or replicas as I want? – Ali Anwar Aug 19 '14 at 15:53
  • I have created a 10-node cluster: 3 config servers, 1 mongos server, a 3-node replica set as one shard, and another 3-node replica set as a second shard. I wrote a simple Python script that opens 12 connections to mongos and executes an aggregation query on each. Each query takes around 35 sec to complete. I also have a standalone MongoDB installation with the same database, and the same Python script takes 25 sec per query when the queries are executed in parallel. Conclusion: I see performance degradation instead of a speedup. Question: should I add more shards to get a speedup? – Ali Anwar Aug 19 '14 at 21:29
  • More shards will only help if your limiting factor is a lack of parallel processing (only some phases of an aggregation can be parallelized). Are you sure you shouldn't just be optimizing your schema or indexes? Anyway, the mongodb-user Google group is a much better venue for this discussion than SO, which is more of a Q&A forum. – Asya Kamsky Aug 21 '14 at 02:28
  • Thank you for the prompt replies. I figured out the solution: all I had to do was add shards with an appropriate shard key. – Ali Anwar Aug 26 '14 at 03:10

1 Answer

You can speed up your job if you can use the aggregation framework instead of mapReduce: the aggregation framework is a lot faster.

You can't really scale your operations using replica sets, because replica sets are for high availability and failover (plus redundancy of data), not for scaling. You can run mapReduce or aggregation on a secondary: connect to the secondary, specify rs.slaveOk(), and then run mapReduce/aggregate. However, you cannot output the results to a collection in that case, since you cannot write to a secondary, so the results have to be returned inline.
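As a rough illustration of the "inline" requirement, here is a minimal Python sketch that only builds the mapReduce command document. The collection name `events`, the field `category`, and the helper `inline_map_reduce_command` are made up for the example and are not from the question; sending the command to a secondary is left out because it needs a live replica set.

```python
def inline_map_reduce_command(collection, map_js, reduce_js):
    """Build a mapReduce command document that returns its results inline."""
    return {
        "mapReduce": collection,  # the command name goes first
        "map": map_js,            # JavaScript map function, as a string
        "reduce": reduce_js,      # JavaScript reduce function, as a string
        "out": {"inline": 1},     # no output collection, so no write is needed
    }


# Hypothetical collection and field names, for illustration only.
cmd = inline_map_reduce_command(
    "events",
    "function () { emit(this.category, 1); }",
    "function (key, values) { return Array.sum(values); }",
)
print(cmd["out"])
```

With a driver such as pymongo, a document like this could presumably be passed to the database's `command` method while reading from a secondary. Any other `out` value asks the server to write a collection, which a secondary cannot do.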

Running on a secondary moves the extra load off the primary, but it won't make the job faster per se. If you want to utilize multiple servers, you need to shard your database. Distributing the data over multiple shards/hosts automatically causes your mapReduce and/or aggregation queries to run across multiple servers. There is a small penalty for managing the results (they still have to be merged), but the time saved on the longest part of the job will likely more than offset that extra overhead.
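To make the "run on each shard, then merge" idea concrete, here is a small pure-Python simulation. It is only a sketch of the general scatter-gather pattern, not MongoDB's actual implementation; the documents, the `category` field, and the two-way split are invented for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partial_counts(shard_docs, key):
    """Per-shard step: count documents per key value (similar to a group-and-count)."""
    return Counter(doc[key] for doc in shard_docs)


def scatter_gather(shards, key):
    """Run the per-shard step on every shard in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda docs: partial_counts(docs, key), shards))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # merge step: the extra overhead that sharding adds
    return dict(merged)


# Hypothetical data set, split across two simulated shards.
docs = [{"category": c} for c in ["a", "b", "a", "c", "b", "a"]]
shards = [docs[0::2], docs[1::2]]

single_node = dict(partial_counts(docs, "category"))
sharded = scatter_gather(shards, "category")
print(sharded == single_node)
```

Each simulated shard only scans its own part of the data, so the expensive scan is divided among the shards, while the merge only touches the small per-shard summaries. How well this works in a real cluster depends on the shard key spreading the data evenly.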

Community
Asya Kamsky