
How can I sort and limit each group in MongoDB?

Consider below data:

Country:USA,name:xyz,rating:10,id:x
Country:USA,name:xyz,rating:10,id:y
Country:USA,name:xyz,rating:10,id:z
Country:USA,name:abc,rating:5,id:x
Country:India,name:xyz,rating:5,id:x
Country:India,name:xyz,rating:5,id:y
Country:India,name:abc,rating:10,id:z
Country:India,name:abc,rating:10,id:x

Now say I want to group by country, sort by rating, and limit each group to 2 documents.

So the answer would be:

Country:USA
name:xyz,rating:10,id:x
name:xyz,rating:10,id:y
Country:India
name:abc,rating:10,id:x
name:abc,rating:10,id:z

I want to accomplish this using the aggregation framework only.

I tried including a sort on rating in the aggregate, but the query simply returns no results after processing.

Blakes Seven

1 Answer


Your best option here is to run separate queries for each "Country" (ideally in parallel) and return the combined results. The queries are quite simple: each just returns the top 2 values after sorting on the rating value, and will execute quickly even if you need to perform multiple queries to obtain the complete result.

The aggregation framework is not a good fit for this, now and even in the near future. The problem is there is no such operator that "limits" the result of any grouping in any way. So in order to do this, you basically need to $push all content into an array and extract the "top n" values from that.

The current operations needed to do that are pretty horrible, and the core problem is results are likely to exceed the BSON limit of 16MB per document on most real data sources.

There is also a complexity cost that grows with n, because each additional item per group needs its own unwind, filter, and regroup stages. But just to demonstrate with 2 items:

db.collection.aggregate([
    // Sort content by country and rating
    { "$sort": { "Country": 1, "rating": -1 } },

    // Group by country and push all items, keeping first result
    { "$group": {
        "_id": "$Country",
        "results": {
            "$push": {
                "name": "$name", 
                "rating": "$rating",
                "id": "$id"
            }
        },
        "first": { 
            "$first": {
                "name": "$name", 
                "rating": "$rating",
                "id": "$id"
            }
        }
    }},

    // Unwind the array
    { "$unwind": "$results" },

    // Remove the seen result from the array
    { "$redact": {
        "$cond": {
            "if": { "$eq": [ "$results.id", "$first.id" ] },
            "then": "$$PRUNE",
            "else": "$$KEEP"
        }
    }},

    // Group to return the second result which is now first on stack
    { "$group": {
        "_id": "$_id",
        "first": { "$first": "$first" },
        "second": { 
            "$first": {
                "name": "$results.name", 
                "rating": "$results.rating",
                "id": "$results.id"
            }
        }
    }},

    // Optionally put these in an array format
    { "$project": {
        "results": { 
            "$map": {
                "input": ["A","B"],
                "as": "el",
                "in": {
                    "$cond": {
                        "if": { "$eq": [ "$$el", "A" ] },
                        "then": "$first",
                        "else": "$second"
                    }
                }
            }
        }
    }}
])

That gets the result, but it's not a great approach: it gets much more complex with the extra iterations needed for higher limits, or where some groupings may have fewer than n results to return.

The current development series (3.1.x) as of writing has a $slice operator that makes this a bit simpler, but it still has the same "size" pitfall:

db.collection.aggregate([
    // Sort content by country and rating
    { "$sort": { "Country": 1, "rating": -1 } },

    // Group by country and push all items
    { "$group": {
        "_id": "$Country",
        "results": {
            "$push": {
                "name": "$name", 
                "rating": "$rating",
                "id": "$id"
            }
        }
    }},

    // Keep only the first 2 items of each group's array
    { "$project": {
        "results": { "$slice": [ "$results", 2 ] }
    }}
])

But basically until the aggregation framework has some way to "limit" the number of items produced by $push or a similar grouping "limit" operator, then the aggregation framework is not really the optimal solution for this type of problem.
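
To make the "size" pitfall concrete, here is a minimal in-memory sketch in plain JavaScript. It is not MongoDB code. pushThenSlice mimics the $push followed by $slice pipeline above, while boundedPush is a hypothetical stand-in for the kind of grouping "limit" the aggregation framework lacks. Both function names, the helper sortByCountryAndRating, and the maxHeld counter are assumptions made up for illustration.

```javascript
// Sample documents from the question.
const docs = [
  { Country: "USA", name: "xyz", rating: 10, id: "x" },
  { Country: "USA", name: "xyz", rating: 10, id: "y" },
  { Country: "USA", name: "xyz", rating: 10, id: "z" },
  { Country: "USA", name: "abc", rating: 5, id: "x" },
  { Country: "India", name: "xyz", rating: 5, id: "x" },
  { Country: "India", name: "xyz", rating: 5, id: "y" },
  { Country: "India", name: "abc", rating: 10, id: "z" },
  { Country: "India", name: "abc", rating: 10, id: "x" },
];

// In-memory equivalent of { "$sort": { "Country": 1, "rating": -1 } }.
// Array.prototype.sort is stable, so ties keep their input order.
function sortByCountryAndRating(input) {
  return [...input].sort(
    (a, b) => a.Country.localeCompare(b.Country) || b.rating - a.rating
  );
}

// Mimics $push + $slice: every document of a group is held before trimming.
function pushThenSlice(input, n) {
  const groups = new Map();
  for (const { Country, ...rest } of sortByCountryAndRating(input)) {
    if (!groups.has(Country)) groups.set(Country, []);
    groups.get(Country).push(rest);
  }
  let maxHeld = 0; // largest intermediate array seen for any group
  const results = {};
  for (const [country, items] of groups) {
    maxHeld = Math.max(maxHeld, items.length);
    results[country] = items.slice(0, n);
  }
  return { results, maxHeld };
}

// Hypothetical "limited push": stop adding to a group once it holds n items.
function boundedPush(input, n) {
  const results = {};
  let maxHeld = 0;
  for (const { Country, ...rest } of sortByCountryAndRating(input)) {
    if (!results[Country]) results[Country] = [];
    const items = results[Country];
    if (items.length < n) items.push(rest);
    maxHeld = Math.max(maxHeld, items.length);
  }
  return { results, maxHeld };
}
```

With the sample data and n = 2, both functions return the same two items per country, but pushThenSlice holds up to 4 items for a group while boundedPush never holds more than 2. On a real collection the first number grows with the size of the largest group, which is where the 16MB document limit comes in.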

Simple queries like this:

db.collection.find({ "Country": "USA" }).sort({ "rating": -1 }).limit(2)

Running one such query for each distinct country, ideally in parallel via an event loop or threads, and combining the results is the most efficient approach right now. Each query fetches only what is needed, which is exactly what the aggregation framework cannot yet do for this kind of grouping.

So look for support to do this "combined query results" in the most optimal way for your chosen language instead, as it will be far less complex and much more performant than throwing this at the aggregation framework.
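
As one possible shape for this in JavaScript, the sketch below issues the per-country queries concurrently with Promise.all and combines them. It assumes `collection` is a Collection object from the official Node.js driver (using its distinct, find, sort, limit and toArray methods); the function name topPerCountry and the output layout are made up for illustration.

```javascript
// Sketch: one small query per country, issued concurrently, then combined.
// `collection` is assumed to be a Collection from the official Node.js driver.
async function topPerCountry(collection, n) {
  // Find the group keys first.
  const countries = await collection.distinct("Country");

  // Start every per-country query before awaiting any of them.
  const lists = await Promise.all(
    countries.map((country) =>
      collection
        .find({ Country: country })
        .sort({ rating: -1 })
        .limit(n)
        .toArray()
    )
  );

  // Pair each country with its own result list, in the same order.
  return countries.map((country, i) => ({ _id: country, results: lists[i] }));
}
```

Calling topPerCountry(db.collection("collection"), 2) would resolve to an array with one entry per country, each holding at most 2 documents. A compound index on { "Country": 1, "rating": -1 } should let each query be answered from the index order, though that is worth confirming with explain() on your own data.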

Blakes Seven
  • Thanks for your detailed answer. It seems I am going with the last option which is easy to implement and most optimal one. –  Nov 01 '15 at 04:29
  • @ViyatGandhi That is the idea. The example of how it can be done is really only to demonstrate the problems that can occur. Separate queries get this done without those problems. – Blakes Seven Nov 01 '15 at 05:29
  • Is this still relevant as of 2019? – HRK44 Mar 05 '19 at 15:33