0

I have a Comments collection in Mongoose, and a query that returns the most recent five (an arbitrary number) Comments.

Every Comment is associated with another document. What I would like to do is make a query that returns the most recent 5 comments, with comments associated with the same other document combined.

So instead of a list like this:

results = [
    { _id: 123, associated: 12 },
    { _id: 122, associated: 8 },
    { _id: 121, associated: 12 },
    { _id: 120, associated: 12 },
    { _id: 119, associated: 17 }
]

I'd like to return a list like this:

results = [
    { _id: 124, associated: 3 },
    { _id: 125, associated: 19 }, 
    [
        { _id: 123, associated: 12 },
        { _id: 121, associated: 12 },
        { _id: 120, associated: 12 },
    ],
    { _id: 122, associated: 8 },
    { _id: 119, associated: 17 }
]

Please don't worry too much about the data format: it's just a sketch to try to show the sort of thing I want. I want a result set of a specified size, but with some results grouped according to some criterion.

Obviously one way to do this would be to just make the query, crawl and modify the results, then recursively make the query again until the result set is as long as desired. That way seems awkward. Is there a better way to go about this? I'm having trouble phrasing it in a Google search in a way that gets me anywhere near anyone who might have insight.

Neil Lunn
  • 148,042
  • 36
  • 346
  • 317
Nate
  • 4,718
  • 2
  • 25
  • 26
  • The catch I see here is this. For your grouping on "12" here, where is the line that is drawn that says "stop listing results in this group here". The way I see it, from top to bottom at least one entry is within the "most recent 5", but what about the ones you are grouping. There needs to be a constraint such as no posts can be older than the oldest post. Otherwise the danger here is picking up lots of posts in any grouping. So what would be a rule? That suggestion perhaps or something else? – Neil Lunn Jan 10 '15 at 05:18
  • I'd rather not have a limit on the grouped items: imagine them as a collapsed drop down that could have any number of items in them without causing UI issues. – Nate Jan 10 '15 at 05:23
  • 1
    Trying to educate here. Think about it. If you sort by "associated user" and then by date descending ,then where do you stop before moving to the next user. If you sort by date and then by "associated user" then where do you stop again? After gathering how many possible groupings. There needs to be a line somewhere. People make lots of comments. – Neil Lunn Jan 10 '15 at 05:26
  • Okay: I don't object to some kind of limit. Given that, how could I approach this? – Nate Jan 10 '15 at 05:28

2 Answers2

2

Here's an aggregation pipeline query that will do what you are asking for:

db.comments.aggregate([
    { $group: { _id: "$associated", maxID: { $max: "$_id"}, cohorts: { $push: "$$ROOT"}}},
    { $sort: { "maxID": -1 } },
    { $limit: 5 }
])

Lacking any other fields from the sample data to sort by, I used $_id.

If you'd like results that are a little closer in structure to the sample result set you provided you could add a $project to the end:

db.comments.aggregate([
    { $group: { _id: "$associated", maxID: { $max: "$_id"}, cohorts: { $push: "$$ROOT"}}},
    { $sort: { "maxID": -1 } },
    { $limit: 5 },
    { $project: { _id: 0, cohorts: 1 }}
])

That will print only the result set. Note that even comments that do not share an association object will be in an array. It will be an array of 1 length.

If you are concerned about limiting the results in the grouping as Neil Lunn is suggesting, perhaps a $match in the beginning is a smart idea.

db.comments.aggregate([
    { $match: { createDate: { $gte: new Date(new Date() - 5 * 60000) } } },
    { $group: { _id: "$associated", maxID: { $max: "$_id"}, cohorts: { $push: "$$ROOT"}}},
    { $sort: { "maxID": -1 } },
    { $limit: 5 },
    { $project: { _id: 0, cohorts: 1 }}
])

That will only include comments made in the last 5 minutes assuming you have a createDate type field. If you do, you might also consider using that as the field to sort by instead of "_id". If you do not have a createDate type field, I'm not sure how best to limit the comments that are grouped as I do not know of a "current _id" in the way that there is a "current time".

NoOutlet
  • 1,949
  • 1
  • 14
  • 22
  • This is what I needed. My Mongo[oose] fu was too weak to recognize that aggregate could provide me with what I needed when I was scanning the documentation last night. Thanks for your help! – Nate Jan 10 '15 at 16:27
1

I honestly think you are asking a lot here and cannot really see the utility myself, but I'm always happy to have that explained to me if there is something useful I have missed.

Bottom line is you want comments from the last five distinct users by date, and then some sort of grouping of additional comments by those users. The last part is where I see difficulty in rules no matter how you want to attack this, but I'll try to keep this to the most brief form.

No way this happens in a single query of any sort. But there are things that can be done to make it an efficient server response:

var DataStore = require('nedb'),
    store = new DataStore();

async.waterfall(

    function(callback) {
        Comment.aggregate(
            [
                { "$match": { "postId": thisPostId } },
                { "$sort": { "associated": 1, "createdDate": -1 } },
                { "$group": {
                    "_id": "$associated",
                    "date": { "$first": "$createdDate" } 
                }},
                { "$sort": { "date": -1 } },
                { "$limit": 5 }
            ],
            callback);
    },

    function(docs,callback) {
        async.each(docs,function(doc,callback) {
            Comment.aggregate(
                [
                    { "$match": { "postId": thisPostId, "associated": doc._id } },
                    { "$sort": { "createdDate": -1 } },
                    { "$limit": 5 },
                    { "$group": { 
                        "_id": "$associated",
                        "docs": { 
                            "$push": {
                                "_id": "$_id", "createdDate": "$createdDate"
                            }
                        },
                        "firstDate": { "$first": "$createdDate" }
                    }}
                ],
                function(err,results) {
                    if (err) callback(err);
                    async.each(results,function(result,callback) {
                        store.insert( result, function(err, result) {
                            callback(err);
                        });
                    },function(err) {
                        callback(err);
                    });
                }
            );
        },
        callback);
    },

    function(err) {
        if (err) throw err;
        store.find({}).sort({ "firstDate": - 1 }).exec(function(err,docs) {
            if (err) throw err;
            console.log( JSON.stringify( docs, undefined, 4 ) );
        });
    }
);

Now I stuck more document properties in both the document and the array, but the simplified form based on your sample would then come out like this:

results = [
    { "_id": 3,  "docs": [124] },
    { "_id": 19, "docs": [125]  }, 
    { "_id": 12, "docs": [123,121,120] },
    { "_id": 8,  "docs": [122] },
    { "_id": 17, "docs": [119] }
]

So the essential idea is to first find your distinct "users" who where the last to comment by basically chopping off the last 5. Without filtering some kind of range here that would go over the entire collection to get those results, so it would be best to restrict this in some way, as in the last hour or last few hours or something sensible as required. Just add those conditions to the $match along with the current post that is associated with the comments.

Once you have those 5, then you want to get any possible "grouped" details for multiple comments by those users. Again, some sort of limit is generally advised for a timeframe, but as a general case this is just looking for the most recent comments by the user on the current post and restricting that to 5.

The execution here is done in parallel, which will use more resources but is fairly effective considering there are only 5 queries to run anyway. In contrast to your example output, the array here is inside the document result, and it contains the original document id values for each comment for reference. Any other content related to the document would be pushed into the array as well as required (ie The content of the comment).

The other little trick here is using nedb as a means for storing the output of each query in an "in memory" collection. This need only really be a standard hash data structure, but nedb gives you a way of doing that while maintaining the MongoDB statement form that you may be used to.

Once all results are obtained you just return them as your output, and sorted as shown to retain the order of who commented last. The actual comments are grouped in the array for each item and you can traverse this to output how you like.

Bottom line here is that you are asking for a compounded version of the "top N results problem", which is something often asked of MongoDB. I've written about ways to tackle this before to show how it's possible in a single aggregation pipeline stage, but it really is not practical for anything more than a relatively small result set.

If you really want to join in the insanity, then you can look at Mongodb aggregation $group, restrict length of array for one of the more detailed examples. But for my money, I would run on parallel queries any day. Node.js has the right sort of environment to support them, so you would be crazy to do it otherwise.

Community
  • 1
  • 1
Neil Lunn
  • 148,042
  • 36
  • 346
  • 317
  • I'll take a step back: I want to have a widget in my app that displays recent comments. But I don't want that list to be overrun by the last 30 comments a single user made on a single game. I want it to be more representative of general recent activity by different users. Anyway, I really appreciate your taking the time to try to explain the situation better and give me some context. Thank you! – Nate Jan 10 '15 at 16:27
  • @Nate So it still seems to me that you would have to take this approach to truly get that. Whether it's pushing multiple comments made by a single user into an array or just keeping their first or last comment. But to get an "outer" 30 you need a distinct list first, then query for the details on those distinct users. Otherwise it's just guessing. You may be able to live with a guess, but it's not the question you asked. – Neil Lunn Jan 11 '15 at 00:52