The _replicator database is not scalable or my design needs tweaking

Question

I think it is important that I elaborate on where I am coming from so that you can understand my use case, so please bear with me.

Background: I’m looking to migrate my app from CouchDB 1 to 2, and this migration is going to take a decent amount of work. I just want to double check that I’m not reinventing the wheel and that there isn’t a better design than the one I describe below, especially since CouchDB 2 appears to have some awesome new features.

Consider the following simplified use case for an app that allows students to submit quiz answers digitally. Each student should be able to submit her/his quiz answers, and the teacher should be able to view all the answers. This design needs to work with PouchDB, because PouchDB speaks directly to the DB, which saves us a lot of time: otherwise an elaborate set of APIs would need to be written.

My chosen design consists of one database per student and one database per teacher, i.e. a database per user. Only the owner of a database can edit it, and this is enforced via CouchDB roles. When a student submits an answer, it is synced with her/his database via PouchDB. The answers are then replicated to the teacher’s database. This in turn allows the students to quickly load their answers in the app and the teachers to load all the answers for all their students. Of course, there are views in the teacher databases that segment the answers by class, quiz, etc. so that the teacher doesn’t have to load the answers for all their students at once. If we didn’t have the teacher database, then a teacher would need access to all the students’ databases and would have to sync with all of their students’ databases.
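As a rough illustration of the per-user access rule, the sketch below builds the kind of `_security` document that could be PUT to `/{db}/_security` for each user database. The naming scheme (`student_<id>`, `teacher_<id>`) and the convention that each database has a role named after it are assumptions made for this example, not details taken from the actual app.

```python
import json


def user_db_name(kind: str, user_id: str) -> str:
    """Hypothetical naming scheme: 'student_<id>' or 'teacher_<id>'."""
    return f"{kind}_{user_id}"


def security_doc(owner_role: str) -> dict:
    """Build a _security document that admits only holders of owner_role.

    Assumption: every user is granted a role named after her/his own
    database, so membership in that role identifies the owner.
    """
    return {
        "admins": {"names": [], "roles": []},
        "members": {"names": [], "roles": [owner_role]},
    }


db = user_db_name("student", "alice")
# The resulting JSON would be the request body for PUT /student_alice/_security
print(db, json.dumps(security_doc(db)))
```

Server admins can still reach every database, so a server-side replication job can read the student databases without being listed as a member.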

At first glance, the _replicator database appears to be the obvious way to replicate the data from the student databases to a single teacher database. The big gotcha is that each continuous replication consumes a file handle and a database connection, which means that you can very quickly starve a database of its resources. For example, if we have, say, 10,000 students in our database, then we need 10,000 concurrent file handles and database connections just for the replications. This is pretty crazy considering that it is unlikely that even 100 of these 10,000 students would be using the app simultaneously.
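To make the scale concrete, here is a minimal sketch of the _replicator approach: one continuous replication document per student database. The base URL, the database names, and the single teacher database are placeholders assumed for illustration.

```python
BASE_URL = "http://localhost:5984"  # assumed CouchDB address


def continuous_replication_doc(student_db: str, teacher_db: str) -> dict:
    """Build a _replicator document for one student -> teacher replication."""
    return {
        "_id": f"{student_db}_to_{teacher_db}",
        "source": f"{BASE_URL}/{student_db}",
        "target": f"{BASE_URL}/{teacher_db}",
        "continuous": True,  # stays open, holding resources while idle
    }


# Hypothetical database names for 10,000 students of one teacher
students = [f"student_{i}" for i in range(10_000)]
docs = [continuous_replication_doc(s, "teacher_1") for s in students]

active_students = 100  # the rough upper bound on concurrent users given above
print(f"{len(docs)} open replications to serve ~{active_students} active students")
```

Every one of these documents keeps a replication running whether or not its student is active, which is where the resource cost comes from.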

Instead, I developed a service that listens to the _db_updates feed and then only replicates a database when there is a change to that specific database. With this method, we only worry about consuming resources when there are changes and as a result we end up with plenty of free file handles and database connections.
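The idea can be sketched roughly as follows. This is not the actual service, only an outline under stated assumptions: events have the shape emitted by `_db_updates` (`db_name` and `type`), student databases share a `student_` prefix, and a hypothetical `teacher_for` mapping says where each student database replicates to. The sketch only builds the request bodies that would be POSTed to `/_replicate`; it makes no network calls.

```python
BASE_URL = "http://localhost:5984"  # assumed CouchDB address


def dirty_databases(events, prefix="student_"):
    """Collapse _db_updates events into the ordered list of changed student DBs."""
    seen = set()
    ordered = []
    for event in events:
        name = event.get("db_name", "")
        if event.get("type") == "updated" and name.startswith(prefix) and name not in seen:
            seen.add(name)
            ordered.append(name)
    return ordered


def one_shot_replication(source_db: str, target_db: str) -> dict:
    """Body for POST /_replicate; non-continuous, so it ends when caught up."""
    return {
        "source": f"{BASE_URL}/{source_db}",
        "target": f"{BASE_URL}/{target_db}",
        "continuous": False,
    }


# Sample events, as they might arrive from the _db_updates feed
events = [
    {"db_name": "student_alice", "type": "updated"},
    {"db_name": "teacher_kim", "type": "updated"},
    {"db_name": "student_alice", "type": "updated"},
    {"db_name": "student_bob", "type": "updated"},
    {"db_name": "student_carol", "type": "created"},
]

# Hypothetical student -> teacher mapping
teacher_for = {"student_alice": "teacher_kim", "student_bob": "teacher_kim"}

requests_to_send = [one_shot_replication(db, teacher_for[db]) for db in dirty_databases(events)]
for body in requests_to_send:
    print(body)
```

Because repeated updates to the same database collapse into a single one-shot replication, the number of open replications tracks the number of recently active students rather than the total number of databases.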

I’ve briefly experimented with CouchDB 2 and it appears that the _replicator database is just as greedy with resources as it was in CouchDB 1.

Is this database-per-user design for both students and teachers the best solution or is there a better solution? If it is the best solution, is there a better way of replicating this data that doesn’t consume as many resources?

Are you primarily syncing from PouchDB to CouchDB? Or are you doing local Couch<->Couch syncs? — Jonathan Hall, Apr 22 '17 at 14:56
Well, the replication from student to teacher DBs is CouchDB to CouchDB. The syncing with the mobile app is CouchDB to PouchDB — redgeoff, Apr 22 '17 at 19:58
I would consider doing the Couch-Couch replications with a cron job or similar, rather than having thousands of long-running replications. But the only way to know if it's problematic is probably to try it. — Jonathan Hall, Apr 23 '17 at 09:09
Thanks. This was my initial design and unfortunately, it is very slow and therefore leads to huge delays in the data being synced. This is problematic because users expect to see real-time changes. This is why I developed a service that listens to _db_updates and only syncs when there are changes. You can sort of think of this service as a more efficient cron job that only syncs when needed. — redgeoff, Apr 23 '17 at 13:51
Is it working for you? Or do you have a specific problem you need solved? — Jonathan Hall, Apr 23 '17 at 14:00
It is working, but it requires a significant amount of custom code. I just want to see if anyone has a better design that uses native CouchDB constructs before I go off and port my solution to work with CouchDB 2. In order to make my solution scalable and properly release it as open source, it is going to take a good chunk of work. — redgeoff, Apr 23 '17 at 14:03
Great... it's a good question. Thanks for the clarification. — Jonathan Hall, Apr 23 '17 at 14:17
A near future version of CouchDB will address this very issue natively: https://issues.apache.org/jira/browse/COUCHDB-3324 — Jan Lehnardt, Apr 24 '17 at 09:14
https://issues.apache.org/jira/browse/COUCHDB-3324 is really cool and I hope it hits soon. In the meantime, I've posted a design to https://github.com/redgeoff/spiegel that takes my CouchDB 1 design and ports it for CouchDB 2 and true scalability. It also adds the concept of "change listening." I'd love any feedback! — redgeoff, Apr 25 '17 at 03:53

score 0 · Accepted Answer · answered Jan 02 '18 at 14:29

I've open sourced my solution, called Spiegel, which provides the missing piece: scalable CouchDB replication and change listening. Spiegel is currently being used in production with a db-per-user design and is efficiently handling the replication of over 10,000 databases for Quizster.
