Fine-tune MongoDB performance: how the DB relies on indexes, memory, and I/O

MongoDB is designed to deliver excellent read performance and is oriented toward read access as much as possible. The main mechanism behind this is indexing. Whenever you query data or run an aggregation, MongoDB tries hard to utilize indexes, whether it is a simple query or a complex aggregation pipeline (that’s why proper compound indexes can massively improve aggregation queries).

Let’s assume the following:

  1. We have a collection of posts with id, date, and content fields:
    {
        "id": …,
        "date": …,
        "content": …
    }
  2. At moment X, we have only the initial index based on "id".
  3. Also, let’s say that our collection is 10 GB and it runs on a server with 2 GB of RAM.
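
To check which indexes the collection has at this point, you can run the following in the mongo shell (output details vary slightly between versions):

db.Post.getIndexes()  // at moment X this lists only the initial "id"-based index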

Queries based on non-indexed fields

In that case, every request based on "date" or "content", e.g.

db.Post.find({"date": {"$gt": "X"}})

will force Mongo to scan every document in the collection, checking whether or not it satisfies the condition. As we have 10 GB of data and only 2 GB of RAM, Mongo will load 2 GB of data, check it, swap it out to disk, load the next 2 GB via a memory-mapped file, check it again, then swap, load, and so on until all the data has been checked. I hope you get the workflow.

Checking the data itself is quite a fast operation, but memory swapping is slow and expensive. In this scenario you will also see a lot of I/O operations (reading from disk and swapping back to disk). So this case is slow and I/O intensive, that’s clear.
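
You can confirm this behavior yourself with the standard explain() helper (field names below may differ slightly between MongoDB versions):

db.Post.find({"date": {"$gt": "X"}}).explain("executionStats")
// the winning plan shows "stage": "COLLSCAN", and "totalDocsExamined"
// equals the total number of documents in the collection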

Queries based on indexed fields

Let’s index all documents by "date". Now we have an additional index on the "date" field; let’s say its size is 200 MB.
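
In the mongo shell, creating it looks like this (on very old versions the same call was named ensureIndex):

db.Post.createIndex({"date": 1})  // 1 means ascending key order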

Let’s execute a "date" field request, e.g.

db.Post.find({"date": {"$gt": "X"}})

What will Mongo do in this case?

  1. Check whether "date" is an indexed field.
  2. Find the proper index.
  3. Load it into memory.
  4. Perform all the matching based on the index alone (already loaded in memory, no additional disk operations), and then return the documents that satisfy the condition.
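
Running the same explain() as before now tells a different story:

db.Post.find({"date": {"$gt": "X"}}).explain("executionStats")
// the winning plan now shows "stage": "IXSCAN" over the "date" index,
// and "totalDocsExamined" drops to only the matching documents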

On the second hit of queries like

db.Post.find({"date":{"$gt": "X"}})

it will work even faster, because the "date" index is already in memory, so there is no need to load it again.

Now you can see that with the same server configuration and the same data we get fast performance and few I/O operations.

So the short answer to the RAM vs. I/O dilemma is the following:

If you have proper indexes and enough memory to keep those indexes loaded, your system will be very fast and will need few I/O operations.
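
To check whether your indexes actually fit in RAM, compare their size against your available memory; the mongo shell exposes this directly:

db.Post.totalIndexSize()    // total size of all indexes on the collection, in bytes
db.Post.stats().indexSizes  // per-index breakdown, handy for spotting the heavy ones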

Moving back to the real world

But real life brings more complex problems. Besides simple queries, we have compound queries like

db.Post.find({"date": {"$gt": "X"}, "content": "MongoDB rock!" })

And even if we have indexes for most of the fields you query, that does not solve the problem of compound queries. In this case, in order to avoid I/O operations, you need compound indexes, e.g.

{"date": 1, "content": 1}

So let’s get back to our example and assume the following:

  1. We add an index on "content" and a compound index {"date": 1, "content": 1}.
  2. The size of the "content" index is, say, 4 GB.
  3. The size of the compound index is 4.5 GB.

Let’s see how it works now.

In the previous example we queried db.Post.find({"date": {"$gt": "X"}}), and we already know what is going on there: Mongo loads the index into memory and looks through it. Now let’s check what happens for this query:

db.Post.find({"date": {"$gt": "X"}, "content": "MongoDB rock!"})
  1. The operation needs the "date" and "content" fields, so we look for the corresponding compound index.
  2. The index is found, but its size is 4.5 GB, and we have only 1.8 GB of RAM left (200 MB is already spent on the "date" index).
  3. We load about 40% of the index, check it, swap it to disk, load the next 40%, check it, swap, load the rest, check it, and we are done.

So right now we have the "date" index in memory plus roughly the last 30% of the compound index, and we can already see I/O operations increasing. Then we hit a query like db.Post.find({"content": "MongoDB rock!"}). We need the "content" index for that, but we have no free memory, so we start by swapping out the previously loaded indexes, then load the "content" index, check it, and return the results. This case shows that different indexes compete for memory, and it gets worse as the number of indexes and collections grows. More collections => more indexes => more competition for resources => more disk I/O => a slower system.
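
One way to observe this competition in practice is to watch page faults; with the memory-mapped (MMAPv1) storage engine described above, the server reports them directly, and the faults column of mongostat shows the same metric per second:

db.serverStatus().extra_info.page_faults
// a steadily growing counter means Mongo keeps reading pages back from disk,
// i.e. indexes and data are fighting over the available RAM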

On top of everything mentioned above, you should not forget about memory fragmentation. If you constantly load many differently sized files into RAM, it becomes fragmented over time, and you may find that you have 1 GB of free RAM that MongoDB cannot use, because it is spread across the address space in small 20-30 MB chunks while Mongo needs 500 MB in a single chunk.

Conclusion

Taking all these cases into account, I have prepared some recommendations on how to improve MongoDB performance if you are weighing the RAM vs. I/O dilemma:

  1. From the hardware point of view: increasing either RAM or I/O will have a positive effect on the system, but it is difficult to say which upgrade will help more, as it depends heavily on your MongoDB usage pattern. That said, if you have a chance to choose between a couple more gigabytes of RAM and an SSD, the SSD will be more effective in most cases.

  2. Multiple servers. When spreading data across multiple servers, it is better to keep small and mid-sized collections on a single server (not sharded across multiple servers) and to shard only the big collections between multiple nodes. This reduces the number of indexes per server and the competition for resources.

  3. Keep nothing except MongoDB on the database server, to avoid additional competition for resources. Also, verify your index model and make sure it covers the most frequently used queries.

P.S. We are glad to hear from you. If you have noticed any mistakes in the article or have suggestions about it, please let us know.

