In a question on CouchDB I asked previously (Can you implement document joins using CouchDB 2.0 'Mango'?), the answer mentioned creating domain objects instead of storing relational data in Couch.

My use case, however, is not necessarily to store relational data in Couch but to flatten relational data. For example, I have an Invoice entity that I collect from several suppliers, so I have two different schemas for that entity.

So I might end up with 2 docs in Couch that look like this:

{
    "type": "Invoice",
    "subType": "supplier B",
    "total": 22.5,
    "date": "10 Jan 2017",
    "customerName": "me"
}

{
    "type": "Invoice",
    "subType": "supplier A",
    "InvoiceTotal": 10.2,
    "OrderDate": <some other date format>,
    "customerName": "me"
}

I also have a doc like this:

{
    "type": "Customer",
    "name": "me",
    "details": "etc..."
}

My intention then is to 'flatten' the Invoice entities and then join them in the reduce function. So the map function looks like this:

function(doc) {
    switch(doc.type) {
        case 'Customer':
            // Customer docs carry the name in doc.name, not doc.customerName
            emit(doc.name, { /* doc information ..., */ type: "Customer" });
            break;
        case 'Invoice':
            switch (doc.subType) {
                case 'supplier B':
                    emit(doc.customerName, { total: doc.total, date: doc.date, type: "Invoice" });
                    break;

                case 'supplier A':
                    emit(doc.customerName, { total: doc.InvoiceTotal, date: doc.OrderDate, type: "Invoice" });
                    break;
            }
            break;
    }
}

Then I would use the reduce function to compare docs with the same customerName (i.e. a join).
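For concreteness, the sort of reduce I have in mind would look something like this (a sketch only, ignoring the rereduce case):

function(keys, values, rereduce) {
    // naive 'join': collect the customer record and its invoices under one key
    var result = { customer: null, invoices: [] };
    values.forEach(function(v) {
        if (v.type === "Customer") {
            result.customer = v;
        } else {
            result.invoices.push(v);
        }
    });
    return result;
}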

Is this advisable using CouchDB? If not, why?

Zach Smith

2 Answers


First of all, apologies for getting back to you late. I thought I'd look at it directly, but I haven't been on SO since we exchanged the other day.

Reduce functions should only be used to reduce scalar values, not to aggregate documents. So you wouldn't use them to achieve things such as doing joins or removing duplicates, but you would, for example, use them to compute the number of invoices per customer - you see the idea. The reason is that you can only make weak assumptions about the calls made to your reduce functions (the order in which records are passed, the rereduce parameter, etc.), so you can easily end up with serious performance problems.

But this is by design, since the intended usage of reduce functions is to reduce scalar values. An easy way to think about it is that no filtering should ever happen in a reduce function; filtering and things such as checking keys should be done in map.
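For example, here is a reduce that only ever returns a scalar, counting invoices per customer (a minimal sketch; in practice the built-in _count does the same thing for free):

// map: one row per invoice
function(doc) {
    if (doc.type === "Invoice") {
        emit(doc.customerName, 1);
    }
}

// reduce: sum the emitted 1s; sum() is provided by CouchDB's JavaScript view server
function(keys, values, rereduce) {
    return sum(values); // works for both the reduce and rereduce steps
}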

If you just want to compare docs with the same customer name, you do not need a reduce function at all; you can query your view with the following parameters:

startkey=["customerName"]
endkey=["customerName", {}]

Otherwise you may want to create a separate view to filter on customers first and return their names, then use these names to query your view in bulk via the keys view parameter, as in the sketch below. Startkey/endkey is good if you only want to filter one customer at a time, and/or need to match complex keys in a partial way.
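For example, a bulk lookup with the keys parameter might look like this (a sketch only; the database, design doc, and view names are made up here, and it assumes the view keys are plain customer names):

// Hypothetical names: database 'mydb', design doc 'reports', view 'by_customer'.
// CouchDB accepts a POST to a view with a JSON body listing the exact keys to fetch.
fetch("http://localhost:5984/mydb/_design/reports/_view/by_customer", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ keys: ["name1", "name2"] }) // names from the customer view
})
    .then(function(res) { return res.json(); })
    .then(function(data) { console.log(data.rows); });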

If it's the numbers you are after, you may want to do:

if(doc.type == "Invoice") {
    emit([doc.customerName, doc.supplierName, doc.date], doc.amount)
}

And then use the _stats built-in reduce function to get statistics on the amount (sum, min, max, etc.).

So, to get the amount spent with a supplier, you'd just need to make a reduce query to your view with the parameter group_level=2 to aggregate by the first 2 elements of the key. You can combine this with startkey and endkey to filter on specific values of this key:

startkey=["name1", "supplierA"]
endkey=["name1", "supplierA", {}]

You can then build from this example to do things such as:

function(doc) {
    if(doc.type == "Invoice") {
        emit(["BY_DATE", doc.customerName, doc.date], doc.amount);
        emit(["BY_SUPPLIER", doc.customerName, doc.supplierName], doc.amount);
        emit(["BY_SUPPLIER_AND_DATE", doc.customerName, doc.supplierName, doc.date], doc.amount);
    }
}
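For instance, a per-customer daily spend report then becomes a simple reduce query (with _sum or _stats) against the BY_DATE rows; "name1" is a placeholder customer name:

group_level=3
startkey=["BY_DATE", "name1"]
endkey=["BY_DATE", "name1", {}]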

Hope this helps

tobiak777
  • 3,175
  • 1
  • 32
  • 44
  • Thanks. I'm not looking to query the database on a customer specifically; I'm looking to prepare BI 'eagerly'. So efficiency of reduce functions isn't really a primary concern - especially with CouchDB allowing for easy database sharding. – Zach Smith Apr 27 '17 at 13:55
  • @reddy is right - don't use reduce functions for functionality other than aggregating simple numbers. It's not only about efficiency, you'll run into a host of problems (storage, rereduce etc.) when you do more than simple aggregation. – Bernhard Gschwantner Apr 28 '17 at 15:43
  • So in other words, if I wanted to prepare derived datasets, I shouldn't use CouchDB - and should probably look at relational databases? – Zach Smith Apr 28 '17 at 16:29
  • Why not use reduce functions for joining documents? storage and rereduce are not valid problems – Zach Smith Apr 28 '17 at 16:35
  • If you are looking for valid reasons not to use reduce the way you suggest, try to think about the way your map/reduce functions are called. They are called every single time you make a change to a document, and the result of reduce is not cached in many scenarios, so you'd be in for the worst kind of trouble: a system working fine when you have a few dozen thousand documents, and requiring an emergency redesign just as it becomes seriously used and successful. – tobiak777 Apr 29 '17 at 06:22
  • The best reason is dogmatic, though: the core devs tell you not to use it that way. Even if there weren't technical reasons behind this recommendation, I wouldn't play with fire; getting out of the intended usage is not fun at all. – tobiak777 Apr 29 '17 at 06:23
  • Regarding the gist, thanks for clarifying what you are trying to do. Unfortunately, I don't think CouchDB is by default a natural fit for the middle of a relational pipeline (relational DB on the left, BI on the right). I understand that what you really are after are the replication features. You can make it work, but you would need to do most of your joins in memory in the application layer. But since, as you mention, it is to prepare for BI where performance is not an issue, it should be OK. – tobiak777 Apr 29 '17 at 06:35
  • (That being said, with Mango you should now be able to do most - if not all - of your aggregation more easily by querying the DB directly: http://docs.couchdb.org/en/2.0.0/api/database/find.html) – tobiak777 Apr 29 '17 at 06:59

It is totally OK to "normalize" your different schemas (or subTypes) via a view. You cannot create views on top of those normalized schemas, though, and in the long run it might be hard to manage the different schemas.

The better solution might be to normalize the documents before writing them to CouchDB. If you still need the documents in their original schema, you can add a sub-property original where you store the documents in their original form. This would make working with the data much easier:

{
  "type": "Invoice",
  "total": 22.5,
  "date": "2017-01-10T00:00:00.000Z",
  "customerName": "me",
  "original": {
    "supplier": "supplier B",
    "total": 22.5,
    "date": "10 Jan 2017",
    "customerName": "me"
  }
}

{
  "type": "Invoice",
  "total": 10.2,
  "date": "2017-01-12T00:00:00:00.000Z,
  "customerName": "me",
  "original": {
    "subType": "supplier A",
    "InvoiceTotal": 10.2,
    "OrderDate": <some other date format>,
    "customerName": "me"
  }
}
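A minimal normalization sketch along these lines (assuming the two supplier schemas shown in the question; parseOrderDate stands in for whatever parses supplier A's date format):

function normalizeInvoice(raw) {
    var normalized;
    if (raw.subType === "supplier A") {
        normalized = {
            type: "Invoice",
            total: raw.InvoiceTotal,
            // parseOrderDate is a hypothetical helper for supplier A's date format
            date: parseOrderDate(raw.OrderDate).toISOString(),
            customerName: raw.customerName
        };
    } else {
        // supplier B already uses the field names we want
        normalized = {
            type: "Invoice",
            total: raw.total,
            date: new Date(raw.date).toISOString(),
            customerName: raw.customerName
        };
    }
    normalized.original = raw; // keep the source document for reference
    return normalized;
}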

I'd also convert the date to ISO format because it parses well with new Date(), sorts correctly, and is human-readable. With that, you can easily emit invoices grouped by year, month, day, or whatever granularity you need.
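A sketch of such a map, assuming the ISO date field from the normalized documents above:

function(doc) {
    if (doc.type === "Invoice") {
        var d = doc.date; // e.g. "2017-01-10T00:00:00.000Z"
        // [year, month, day] compound key; query with _sum and group_level=1/2/3
        emit([d.slice(0, 4), d.slice(5, 7), d.slice(8, 10)], doc.total);
    }
}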

Preferably use reduce only with the built-in functions, because reduce results may have to be re-executed on queries, and executing JavaScript over many documents is a complex and time-intensive operation, even if the database has not changed at all. You can find more information about the reduce process in the CouchDB documentation. It makes more sense to preprocess the documents as much as you can before storing them in CouchDB.

Bernhard Gschwantner
  • Clearly CouchDB reduce is not only meant for built-in functions, since you can in fact write custom reduce functions. I'm aware that the rereduce case has to be taken into account. Basically I'm looking at the possibility of BI with a schema-less database, with efficiency of index generation and storage being tradeoffs I'm willing to make. – Zach Smith Apr 28 '17 at 16:36
  • The kind of answer I was hoping for was 'if you work with multiple large functions in your reduce, the index calculations will be intolerably slow BECAUSE... etc.' Or some other (valid) reason for not making CouchDB work hard with intensive reduce functions. – Zach Smith Apr 28 '17 at 16:38
  • Thanks for your suggestion – I added the explanation of why it is slow at the end. You can find more info in the CouchDB docs. I hope you like my answer better now! What do you think of my suggestion to preprocess the documents? – Bernhard Gschwantner May 04 '17 at 16:49