
I have a MongoDB document with quite a large embedded array:

{
  name: "my-dataset",
  data: [
    { country: "A", province: "B", year: 1990, value: 200 },
    ... 150 000 more
  ]
}

Let us say I want to return data objects where country == "A".

  1. What is the proper way of doing this, for example via Node.js?

  2. Given 150 000 entries with 200 matches, how long should the query take approximately?

  3. Would it be better (performance/structure wise) to store data as documents and the name as a property of each document?

  4. Would it be more efficient to use MySQL for this?

– dani
  • Okay, taking a breath here: which question are you actually asking? Are you saying a "document" has 150,000 embedded array elements? Probably not a good design. Try to carefully explain what you are trying to do and phrase it as a single question. The solution is most likely a different structure, but proper answers will address what you need to do once you have explained it properly. BTW, just bolding text does not get you better answers. – Neil Lunn Sep 24 '14 at 09:03
  • 4. From my experience, when comparing performance with MySQL, results are mostly equivalent for simple queries like 'find one record by this ID', and somewhat better with MySQL for more complex SQL queries with grouping etc. (the aggregation framework in Mongo). That was for a rather large amount of data (0.5 TB). What seems more important is that MySQL required 3 times less disk space to hold exactly the same data, which may indirectly be why it is faster (less disk reading). – Volodymyr Metlyakov Sep 24 '14 at 10:09
  • `explain()` is the best way to analyze query performance in MongoDB. Find more details at http://docs.mongodb.org/manual/reference/method/cursor.explain/#cursor.explain – Nirdesh Sharma Sep 24 '14 at 13:23

2 Answers


A) Just find them with a query.
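
A minimal Node.js sketch of such a query, assuming the official mongodb driver (the connection string, database, and collection names are placeholders):

// Sketch only: URI, database and collection names are assumptions.
const { MongoClient } = require('mongodb');

async function findDataset() {
  const client = new MongoClient('mongodb://localhost:27017');
  try {
    await client.connect();
    const collection = client.db('test').collection('datasets');
    // Matches the dataset document that has at least one entry with
    // country "A". Note: this returns the whole document, embedded
    // array included, not just the matching elements (see the
    // aggregation pipeline in the second answer for that).
    return await collection.findOne({ name: 'my-dataset', 'data.country': 'A' });
  } finally {
    await client.close();
  }
}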

B) If the compound index { name: 1, "data.country": 1 } is built, matching the document should be fast. But since all the data is stored in one array, a $unwind stage has to be used to pull out the matching elements, so the query could still be slow.
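
For reference, a sketch of creating that multikey compound index in the shell (the collection name datasets is an assumption):

db.datasets.createIndex({ name: 1, "data.country": 1 })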

C) It would be better. If you store the data like this:

{country : "A", province: "B", year : 1990, value: 200, name:"my-dataset"}
{country : "B", province: "B", year : 1990, value: 200, name:"my-dataset"}
...

With a compound index { name: 1, country: 1 }, the query time should be under 10 ms.
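
A shell sketch of that layout, assuming a collection named entries:

db.entries.createIndex({ name: 1, country: 1 })
db.entries.find({ name: "my-dataset", country: "A" })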

D) See: MySQL vs MongoDB 1000 reads

– Mark_H

1. You can use the MongoDB aggregation framework:

db.collection.aggregate([
  {$match: {name: "my-dataset"}},
  {$unwind: "$data"},
  {$match: {"data.country": "A"}}
])

This will return one document per data entry where the country is "A". If you want to regroup the entries by dataset, add a $group stage:

db.collection.aggregate([
  {$match: {name: "my-dataset"}},
  {$unwind: "$data"},
  {$match: {"data.country": "A"}},
  {$group: {_id: "$_id", data: {$addToSet: "$data"}}}
])

(I didn't test this on a proper dataset, so it might be buggy.)
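
The same pipeline can be run from Node.js; a sketch assuming the official mongodb driver and placeholder connection/collection names:

const { MongoClient } = require('mongodb');

async function entriesForCountryA() {
  const client = new MongoClient('mongodb://localhost:27017');
  try {
    await client.connect();
    const collection = client.db('test').collection('datasets');
    // Unwind the embedded array and keep only the entries for country "A".
    return await collection.aggregate([
      { $match: { name: 'my-dataset' } },
      { $unwind: '$data' },
      { $match: { 'data.country': 'A' } }
    ]).toArray();
  } finally {
    await client.close();
  }
}

One design note: $addToSet in the $group stage discards duplicate entries; if identical subdocuments can occur and must be kept, $push is the safer accumulator.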

2. 150 000 subdocuments is still not a lot for MongoDB, so if you're only querying one dataset it should be pretty fast (on the order of milliseconds).

3. As long as you are sure that your document will stay below 16 MB (the maximum BSON document size, which is hard to guarantee here), it should be fine. But the queries would be simpler if you stored your data as individual documents with the dataset name as a property, which is generally better for performance.

– ploutch