Archive for category MongoDB

MongoDB 3.x – How to get more details about a collection in MongoDB

In MongoDB shell

mongo> db.collection.stats()

{{ "ns" : "db.collection", 
"count" : 1474168, 
"size" : 1628955640, 
"avgObjSize" : 1105, 
"storageSize" : 62074880, 
"capped" : false, 
"wiredTiger" : 
     { "metadata" : { "formatVersion" : 1 }, 
     "creationString" : "access_pattern_hint=none,
          allocation_size=4KB,app_metadata=(formatVersion=1),
          block_allocation=best,block_compressor=zlib,cache_resident=false,
          checksum=on,colgroups=,collator=,columns=,dictionary=0,encryption=(keyid=,name=),
          exclusive=false,extractor=,format=btree,huffman_key=,huffman_value=,
          ignore_in_memory_cache_size=false,immutable=false,internal_item_max=0,
          internal_key_max=0,internal_key_truncate=true,internal_page_max=4KB,
          key_format=q,key_gap=10,leaf_item_max=0,leaf_key_max=0,leaf_page_max=32KB,
          leaf_value_max=64MB,log=(enabled=true),lsm=(auto_throttle=true,bloom=true,
          bloom_bit_count=16,bloom_config=,bloom_hash_count=8,bloom_oldest=false,
          chunk_count_limit=0,chunk_max=5GB,chunk_size=10MB,merge_max=15,merge_min=0),
          memory_page_max=10m,os_cache_dirty_max=0,os_cache_max=0,prefix_compression=false,
          prefix_compression_min=4,source=,split_deepen_min_child=0,split_deepen_per_child=0,
          split_pct=90,type=file,value_format=u",
"type" : "file", 
"uri" : "statistics:table:collection/0-8813564358223606299", 
"nindexes" : 3, 
"totalIndexSize" : 83034112, 
"indexSizes" : { "_id_" : 72835072, "st_1_ih_1_vt_1_ua_1" : 5107712, "ea_1" : 5091328 }, 
"ok" : 1}

From the output above, we can gather a few important things:

  1. There are 1474168 documents in the collection
  2. On disk, ignoring any filesystem compression, it is about 62074880 bytes
  3. It is using zlib compression
  4. On disk, the datafile will be $dbPath/collection/0-8813564358223606299.wt
  5. There are 3 indexes on this collection
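
If you only need those specific figures, you can pull them straight out of the stats document in the shell. A minimal sketch, using the same example namespace as above:

mongo> var s = db.collection.stats()
mongo> s.count          // number of documents
mongo> s.storageSize    // on-disk size in bytes
mongo> s.nindexes       // number of indexes
mongo> s.wiredTiger.creationString.match(/block_compressor=(\w+)/)[1]   // "zlib"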

Hope this helps!

Regards,
Wei Shan


Thoughts on PostgreSQL JSON Vs MongoDB JSON Performance

Recently, via a random Quora answer bashing MongoDB performance, I came across a three-year-old blog post from EnterpriseDB about benchmarking JSON performance in MongoDB vs PostgreSQL. I use both PostgreSQL and MongoDB on a daily basis in my current role. They are both great databases and good at what they were built to do. However, the benchmarking results were absolutely unbelievable. I couldn’t believe them and had to read the source code myself.

There were a couple of reasons why the results were highly inaccurate.

INSERTs Test

It uses bash to generate massive INSERT shell scripts and inserts into mongod via the mongo shell. This means that it is not benchmarking PostgreSQL against MongoDB; it is benchmarking PostgreSQL’s psql against MongoDB’s JavaScript shell (SpiderMonkey/V8). That’s not a very fair benchmark, is it?

SELECTs Test

Same issue here. It uses the mongo shell to run the SELECT queries. Effectively, the performance will be bound by the speed of the JavaScript engine. It was using V8 back in version 2.6. (Link)

Data Loading Test

This is the only fair comparison so far. It compares PostgreSQL’s COPY against MongoDB’s mongoimport. It was indeed true that PostgreSQL was faster than MongoDB in this test.
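
For context, the two loading paths being compared look roughly like this (database, table and file names are made up for illustration):

# PostgreSQL: bulk load a file of JSON documents with COPY
psql -d benchmark -c "\copy json_tab FROM 'data.json'"

# MongoDB: bulk load the same file with mongoimport
mongoimport --db benchmark --collection json_tab --file data.json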

Summary

I love using both PostgreSQL and MongoDB. However, this benchmarking test fails to prove that PostgreSQL is indeed faster than MongoDB in JSON performance. The post also did not share the postgresql.conf and mongod.conf configurations. MongoDB doesn’t work well with the ext4 filesystem, so if they were using the AWS Linux AMI, it would be on ext4 unless they took the effort to mount external EBS volumes.

Since the results are already 3 years old, I plan to benchmark PostgreSQL and MongoDB over the next few weeks, using PostgreSQL 9.6 against MongoDB 3.4 with the WiredTiger storage engine. Also, I don’t work for MongoDB or EnterpriseDB, so this will be an unbiased test 🙂

Regards,
Wei Shan


Top 3 Reasons to upgrade to MongoDB 3.4

MongoDB 3.4 was released on Nov 29, 2016 with tons of new features. In this post, I will be listing the top 3 reasons to upgrade to MongoDB 3.4.

Faster Balancing

In previous releases, it takes ages (4s per chunk, the last I saw, in 3.2.10) for the balancer to move or pre-split chunks. Imagine if you have 100,000 chunks to pre-split; it’s going to take days. There are two reasons for this:

  1. In a sharded cluster, there can only be 1 chunk migration at any point in time. It’s a single-threaded operation
  2. This.

In MongoDB 3.4, for a sharded cluster with n shards, MongoDB can perform at most n/2 (rounded down) simultaneous chunk migrations. So, if you have a 4-shard MongoDB cluster, it will migrate 2 chunks at a time! If you have fewer than 4 shards, the behaviour unfortunately stays the same. 😦

Linearizable Read Concern

In general, MongoDB has various readConcern options. These options allow developers to choose the readConcern suitable for their application requirements. Use readConcern:local for unimportant queries that don’t care if they read data that could be rolled back. Using readConcern:majority prevents reading data that can be rolled back. However, it does not prevent you from reading stale data in certain network partition edge cases.

In MongoDB 3.4, using readConcern:linearizable allows you to prevent stale reads too. This feature is not free and has the highest performance overhead. readConcern:local only requires 1 operation to return your results. readConcern:majority requires acknowledgement from a majority of the nodes in your replica set, which depends on your network latency. readConcern:linearizable requires an additional no-op majority write to the replica set to ensure the data being read is durable and not stale.

In essence, what readConcern:linearizable does is the following:

  1. Client sends query to primary node
  2. Node gets the results without errors
  3. Node sends a no-op majority write to the other nodes in the replica set to confirm that it is still the primary node
  4. Results get returned to the client
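
As a rough illustration, a linearizable read from the shell could look like this (collection and filter are made up; maxTimeMS is a good idea because the read can block if a majority of the replica set is unreachable):

db.runCommand({
    find: "orders",
    filter: { _id: 123 },
    readConcern: { level: "linearizable" },
    maxTimeMS: 10000
})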

Network Compression

I’m not sure why this wasn’t publicised widely as a new feature. I like it a lot because now all the traffic between mongos and mongod is compressed, so we save on bandwidth. This really saves you a lot of money if you are running on AWS or if you are constrained by network bandwidth.
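
If I remember correctly, the compressor used for intra-cluster traffic is controlled by the networkMessageCompressors option on both mongod and mongos; verify against the documentation for your exact version before relying on this sketch:

# enable snappy compression for intra-cluster traffic
mongod --networkMessageCompressors snappy
mongos --networkMessageCompressors snappy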

Summary

In general, I do agree with InfoWorld that MongoDB 3.4 is a no-brainer upgrade! We are currently waiting for it to be available on our OS of choice so we can start rolling it out!

Hope this helps! 🙂

Regards,
Wei Shan


Migrating Oracle Sample Schema to MongoDB

It is not an easy feat to migrate an application from Oracle, or any relational database, to MongoDB. You will need to remodel your schema to fit MongoDB. If you are migrating from one relational database to another, it is a lot easier (still not that easy though!). In this blog post, I try to migrate the famous Oracle database sample schema to MongoDB. I’m not saying that, from an application POV (think ACID!), this is a good idea; I’m illustrating the fact that you can’t just take a schema from a relational database and dump it onto MongoDB.

Let’s take a look at the ERD diagram of the sales history schema.

[Figure: Oracle Sales History (SH) sample schema ERD]

The sales history schema above represents the sales history for a company.

The sample company does a high volume of business, so it runs business statistics reports to aid in decision support. Many of these reports are time-based and nonvolatile; that is, they analyse past data trends. The company loads data into its data warehouse regularly to gather statistics for these reports. These reports include annual, quarterly, monthly, and weekly sales figures by product. The company also runs reports on the distribution channels through which its sales are delivered. When the company runs special promotions on its products, it analyses the impact of the promotions on sales. It also analyses sales by geographical area.

Based on my understanding above, there are a few assumptions to be made here:

  1. This is a data warehouse schema, not an OLTP workload
  2. The application queries the distribution channels through which sales are delivered
  3. The application queries the impact of promotions on sales
  4. The application queries sales history by geographical area

Rules to follow in MongoDB data modelling

MongoDB understands that people who are new to NoSQL might not know how to model data efficiently. They have an entire chapter of documentation dedicated to it! Some of the limitations and rules you should be aware of (the two that matter most here):

  • A single BSON document cannot exceed 16 MB
  • Documents can be nested to at most 100 levels

MongoDB “Sales History” Schema – Version 1

Based on the MongoDB official documentation in data modelling, the following would be good enough.

SALES

{
 "_id": {
 "PROD": {},
  "PROD_ID": {},
  "PROD_NAME": {},
  "PROD_DESC": {},
  "PROD_SUBCATEGORY": {},
  "PROD_SUBCATEGORY_ID": {},
  "PROD_SUBCATEGORY_DESC": {},
  "PROD_CATEGORY": {},
  "PROD_CATEGORY_ID": {},
  "PROD_CATEGORY_DESC": {},
  "PROD_WEIGHT_CLASS": {},
  "PROD_UNIT_OF_MEASURE": {},
  "PROD_PACK_SIZE": {},
  "PROD_SUPPLIER_ID": {},
  "PROD_STATUS": {},
  "PROD_LIST_PRICE": {},
  "PROD_MIN_PRICE": {},
  "PROD_TOTAL": {},
  "PROD_TOTAL_ID": {},
  "PROD_SRC_ID": {},
  "PROD_EFF_FROM": {},
  "PROD_EFF_TO": {},
  "PROD_VALID": {},
  "PROD_COST": {
  "TIME_DESC": {},
  "UNIT_COST": {},
  "UNIT_PRICE": {}
 }
 },
 "CUST": {
  "CUST_ID": {},
  "CUST_FIRST_NAME": {},
  "CUST_LAST_NAME": {},
  "CUST_GENDER": {},
  "CUST_YEAR_OF_BIRTH": {},
  "CUST_MARITAL_STATUS": {},
  "CUST_STREET_ADDRESS": {},
  "CUST_POSTAL_CODE": {},
  "CUST_CITY": {},
  "CUST_CITY_ID": {},
  "CUST_STATE_PROVINCE": {},
  "CUST_STATE_PROVINCE_ID": {},
  "COUNTRY": {
  "COUNTRY_ID": {},
   "COUNTRY_ISO_CODE": {},
   "COUNTRY_NAME": {},
   "COUNTRY_SUBREGION": {},
   "COUNTRY_SUBREGION_ID": {},
   "COUNTRY_REGION": {},
   "COUNTRY_REGION_ID": {},
   "COUNTRY_TOTAL": {},
   "COUNTRY_TOTAL_ID": {},
   "COUNTRY_NAME_HIST": {}
 },
  "CUST_MAIN_PHONE_NUMBER": {},
  "CUST_INCOME_LEVEL": {},
  "CUST_CREDIT_LIMIT": {},
  "CUST_EMAIL": {},
  "CUST_TOTAL": {},
  "CUST_TOTAL_ID": {},
  "CUST_SRC_ID": {},
  "CUST_EFF_FROM": {},
  "CUST_EFF_TO": {},
  "CUST_VALID": {}
 },
 "TIME": {
  "TIME_ID": {},
  "DAY_NAME": {},
  "DAY_NUMBER_IN_WEEK": {},
  "DAY_NUMBER_IN_MONTH": {},
  "CALENDAR_WEEK_NUMBER": {},
  "FISCAL_WEEK_NUMBER": {},
  "WEEK_ENDING_DAY": {},
  "WEEK_ENDING_DAY_ID": {},
  "CALENDAR_MONTH_NUMBER": {},
  "FISCAL_MONTH_NUMBER": {},
  "CALENDAR_MONTH_DESC": {},
  "CALENDAR_MONTH_ID": {},
  "FISCAL_MONTH_DESC": {},
  "FISCAL_MONTH_ID": {},
  "DAYS_IN_CAL_MONTH": {},
  "DAYS_IN_FIS_MONTH": {},
  "END_OF_CAL_MONTH": {},
  "END_OF_FIS_MONTH": {},
  "CALENDAR_MONTH_NAME": {},
  "FISCAL_MONTH_NAME": {},
  "CALENDAR_QUARTER_DESC": {},
  "CALENBDAR_QUARTER_ID": {},
  "FISCAL_QUARTER_DESC": {},
  "FISCAL_QUARTER_ID": {},
  "DAYS_IN_CAL_QUARTER": {},
  "DAYS_IN_FIS_QUARTER": {},
  "END_OF_CAL_QUARTER": {},
  "END_OF_FIS_QUARTER": {},
  "CALENDAR_QUARTER_NUMBER": {},
  "FISCAL_QUARTER_NUMBER": {},
  "CALENDAR_YEAR": {},
  "CALENDAR_YEAR_ID": {},
  "FISCAL_YEAR": {},
  "FISCAL_YEAR_ID": {},
  "DAYS_IN_CAL_YEAR": {},
  "DAYS_IN_FIS_YEAR": {},
  "END_OF_CAL_YEAR": {},
  "END_OF_FIS_YEAR": {}
 },
 "CHANNEL": {
  "CHANNEL_ID": {},
  "CHANNEL_DESC": {},
  "CHANNEL_CLASS": {},
  "CHANNEL_CLASS_ID": {},
  "CHANNEL_TOTAL": {},
  "CHANNEL_TOTAL_ID": {}
 },
 "PROMO": {
  "PROMO_ID": {},
  "PROMO_NAME": {},
  "PROMO_SUBCATEGORY": {},
  "PROMO_SUBCATEGORY_ID": {},
  "PROMO_CATEGORY": {},
  "PROMO_CATEGORY_ID": {},
  "PROMO_COST": {},
  "PROMO_BEGIN_DATE": {},
  "PROMO_END_DATE": {},
  "PROMO_TOTAL": {},
  "PROMO_TOTAL_ID": {}
 },
 "QUANTITY_SOLD": {},
 "AMOUNT_SOLD": {}
}

However, this design pattern results in one massive JSON document that is slow. It is also more likely to hit the MongoDB limitations listed earlier. What is a better approach then?

MongoDB “Sales History” Schema – Version 2

Let’s go with the easy bits first.

CUSTOMER

{
 "_id": {},
 "CUST_FIRST_NAME": {},
 "CUST_LAST_NAME": {},
 "CUST_GENDER": {},
 "CUST_YEAR_OF_BIRTH": {},
 "CUST_MARITAL_STATUS": {},
 "CUST_STREET_ADDRESS": {},
 "CUST_POSTAL_CODE": {},
 "CUST_CITY": {},
 "CUST_STATE_PROVINCE": {},
 "COUNTRY": {
  "COUNTRY_NAME": {},
  "COUNTRY_ISO_CODE": {},
  "COUNTRY_SUBREGION": {},
  "COUNTRY_REGION": {},
  "COUNTRY_TOTAL": {},
  "COUNTRY_NAME_HIST": {}
 },
 "CUST_MAIN_PHONE_NUMBER": {},
 "CUST_INCOME_LEVEL": {},
 "CUST_CREDIT_LIMIT": {},
 "CUST_EMAIL": {},
 "CUST_TOTAL": {},
 "CUST_SRC_ID": {},
 "CUST_EFF_FROM": {},
 "CUST_EFF_TO": {},
 "CUST_VALID": {}
}

Simply follow this one-to-one relationship design pattern for the CUSTOMER collection.

PRODUCT

{
 "_id": {},
 "PROD_NAME": {},
 "PROD_DESC": {},
 "PROD_SUBCATEGORY": {
  "PROD_SUBCATEGORY_ID": {},
  "PROD_SUBCATEGORY_DESC": {}
 },
 "PROD_CATEGORY": {
  "PROD_CATEGORY_ID": {},
  "PROD_CATEGORY_DESC": {}
 },
 "PROD_WEIGHT_CLASS": {},
 "PROD_UNIT_OF_MEASURE": {},
 "PROD_PACK_SIZE": {},
 "PROD_SUPPLIER_ID": {},
 "PROD_STATUS": {},
 "PROD_LIST_PRICE": {},
 "PROD_MIN_PRICE": {},
 "PROD_TOTAL": {},
 "PROD_TOTAL_ID": {},
 "PROD_SRC_ID": {},
 "PROD_EFF_FROM": {},
 "PROD_EFF_TO": {},
 "PROD_VALID": {},
 "COST": {
  "TIME_ID": {},
  "PROMO_ID": {},
  "CHANNEL_ID": {},
  "UNIT_COST": {},
  "UNIT_PRICE": {}
 }
}

PROMOTION

{
 "_id": {},
 "PROMO_NAME": {},
 "PROMO_SUBCATEGORY": {},
 "PROMO_CATEGORY": {},
 "PROMO_COST": {},
 "PROMO_BEGIN_DATE": {},
 "PROMO_END_DATE": {},
 "PROMO_TOTAL": {}
}

I have chosen to combine PRODUCTS and COSTS into a single PRODUCT collection. You may find this design pattern over here. I did not choose to put PROMOTION in the same collection. Instead, I decided to use references, because a product might have hundreds or thousands of promotions, and an embedded document here would not scale for the PRODUCT collection.

CHANNELS

{
 "_id": {},
 "CHANNEL_DESC": {},
 "CHANNEL_CLASS": {},
 "CHANNEL_CLASS_ID": {},
 "CHANNEL_TOTAL": {},
 "CHANNEL_TOTAL_ID": {}
}

TIMES

{
 "_id": {},
 "DAY_NAME": {},
 "DAY_NUMBER_IN_WEEK": {},
 "DAY_NUMBER_IN_MONTH": {},
 "CALENDAR_WEEK_NUMBER": {},
 "FISCAL_WEEK_NUMBER": {},
 "WEEK_ENDING_DAY": {},
 "WEEK_ENDING_DAY_ID": {},
 "CALENDAR_MONTH_NUMBER": {},
 "FISCAL_MONTH_NUMBER": {},
 "CALENDAR_MONTH_DESC": {},
 "CALENDAR_MONTH_ID": {},
 "FISCAL_MONTH_DESC": {},
 "FISCAL_MONTH_ID": {},
 "DAYS_IN_CAL_MONTH": {},
 "DAYS_IN_FIS_MONTH": {},
 "END_OF_CAL_MONTH": {},
 "END_OF_FIS_MONTH": {},
 "CALENDAR_MONTH_NAME": {},
 "FISCAL_MONTH_NAME": {},
 "CALENDAR_QUARTER_DESC": {},
 "CALENDAR_QUARTER_ID": {},
 "FISCAL_QUARTER_DESC": {},
 "FISCAL_QUARTER_ID": {},
 "DAYS_IN_CAL_QUARTER": {},
 "DAYS_IN_FIS_QUARTER": {},
 "END_OF_CAL_QUARTER": {},
 "END_OF_FIS_QUARTER": {},
 "CALENDAR_QUARTER_NUMBER": {},
 "FISCAL_QUARTER_NUMBER": {},
 "CALENDAR_YEAR": {},
 "CALENDAR_YEAR_ID": {},
 "FISCAL_YEAR": {},
 "FISCAL_YEAR_ID": {},
 "DAYS_IN_CAL_YEAR": {},
 "DAYS_IN_FIS_YEAR": {},
 "END_OF_CAL_YEAR": {},
 "END_OF_FIS_YEAR": {}
}

Honestly, I’m not too happy to keep the TIMES and CHANNELS collections separate. However, if I embedded them in another collection (COST/SALES), I would not be able to create a new TIMES or CHANNELS document without having a COST or SALES for it.

SALES

{
 "PROD_ID": {},
 "CUST_ID": {},
 "TIME_ID": {},
 "CHANNEL_ID": {},
 "PROMO_ID": {},
 "QUANTITY_SOLD": {},
 "AMOUNT_SOLD": {}
}

I would shard the SALES collection on TIME_ID, CHANNEL_ID and PROMO_ID. Based on the assumptions made previously, I would expect the majority of queries to filter by TIME_ID, PROMO_ID and CHANNEL_ID.
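
A rough sketch of that sharding setup from mongos (the database name "sh" is an assumption for illustration):

mongos> use sh
mongos> sh.enableSharding("sh")
mongos> db.SALES.createIndex({ TIME_ID: 1, CHANNEL_ID: 1, PROMO_ID: 1 })
mongos> sh.shardCollection("sh.SALES", { TIME_ID: 1, CHANNEL_ID: 1, PROMO_ID: 1 })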

Summary

In the example above, I assumed that most of the queries will revolve around the SALES collection. For the various collections, I tried to estimate how each would grow and decided whether to embed it or to use an ID reference instead. Obviously, if my assumptions are wrong, the queries will be sub-optimal.

As you might have noticed, just because MongoDB doesn’t enforce a schema, it doesn’t mean that you don’t need data modelling. In fact, it’s a paradigm shift in schema design between a relational database and MongoDB. In MongoDB, you need to design your schema based on the type of questions that your application will be asking.

Now, if you are a SQL developer, you will notice that some of the collections will need to be joined together for BI reports. If so, you will need to do that at the application layer. If you are looking for a single left outer join, use $lookup instead.
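
For example, joining SALES to CUSTOMER for a report could be sketched with an aggregation like this (field names follow the schema above; the output field name "customer" is made up):

db.SALES.aggregate([
    { $lookup: {
        from: "CUSTOMER",
        localField: "CUST_ID",
        foreignField: "_id",
        as: "customer"
    }}
])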

At the end of the day, MongoDB is not suitable for every application! Yes, you can now use $graphLookup to find relationships between objects. But you wouldn’t want to base your brand new Uber-for-Facebook (the friend-suggestion feed) on MongoDB. It is clearly not designed for that. In the era of NoSQL, use the right database for the right problem.


P.S.: Do let me know if you have any other ideas on modelling the sample schema. I would love to hear your thoughts on this!

Regards,
Wei Shan


Compound Index Vs Multiple Single Field Index in MongoDB 3.x – Part II

After the first blog post was published, it was suggested to me that I should run the test with w:1, j:true. That makes sense, since no one would, or should, be running workloads with w:1, j:false.

By definition:

With j: true, MongoDB returns only after the requested number of members, including the primary, have written to the journal. Previously j: true write concern in a replica set only requires the primary to write to the journal, regardless of the w: <value> write concern.

If you are using MMAPv1, then you should be using j:true or you risk data loss. If you are using WiredTiger, you will lose at most 60 seconds of data between checkpoints.
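
For reference, this is roughly how a journaled write is requested from the shell (collection and document are made up):

db.testCol.insert(
    { fld0: NumberLong(1), fld1: new Date() },
    { writeConcern: { w: 1, j: true } }
)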

Insert Performance

No Index

After 295 seconds, 5754368 new records inserted - collection has 5754368 in total
20017 inserts per second since last report 99.62 % in under 50 milliseconds
0 keyqueries per second since last report 100.00 % in under 50 milliseconds
0 updates per second since last report 100.00 % in under 50 milliseconds
0 rangequeries per second since last report 100.00 % in under 50 milliseconds

~19,763.2 inserts/second

Compound Index

After 295 seconds, 4497408 new records inserted - collection has 4497408 in total
14131 inserts per second since last report 98.63 % in under 50 milliseconds
0 keyqueries per second since last report 100.00 % in under 50 milliseconds
0 updates per second since last report 100.00 % in under 50 milliseconds
0 rangequeries per second since last report 100.00 % in under 50 milliseconds

~15,245.4 inserts/second (22% performance reduction)

Single-Field Index

After 295 seconds, 2438144 new records inserted - collection has 2438144 in total
6758 inserts per second since last report 35.55 % in under 50 milliseconds
0 keyqueries per second since last report 100.00 % in under 50 milliseconds
0 updates per second since last report 100.00 % in under 50 milliseconds
0 rangequeries per second since last report 100.00 % in under 50 milliseconds

~8,264.8 inserts/second (58% performance reduction)

Summary

The performance drop is smaller than it was with j:false. Previously, it was about a 40% drop with the compound index and a 78% drop with single-field indexes. It turns out the compound index isn’t too bad, considering that I created a 9-field compound index on the collection.

Regards,
Wei Shan


Compound Index Vs Multiple Single Field Index in MongoDB 3.x

MongoDB supports both compound indexes and single-field indexes. It’s common knowledge that indexes decrease your INSERT/UPDATE performance but increase your SELECT performance. However, if you need to index multiple fields, is it better to use a compound index or to create multiple single-field indexes? This post investigates which is the better option in terms of performance.

Insert Performance

Test Setup:

  • Single MongoDB 3.2 instance running on my Mac
  • POC Driver uses single thread to perform inserts only
  • Test run for 5 mins
  • Empty collection after every run
  • Default readConcern
  • Default writeConcern

Sample Document:

> db.testCol.findOne()
{
 "_id" : {
 "w" : 2,
 "i" : 0
 },
 "fld0" : NumberLong(530752),
 "fld1" : ISODate("2015-06-02T08:14:18.687Z"),
 "fld2" : "tempor invidunt ut labore et",
 "fld3" : "eirmod tempor invidunt ut",
 "fld4" : "Lorem ipsum dolor sit amet,",
 "fld5" : ISODate("2016-10-12T01:58:24.580Z"),
 "fld6" : NumberLong(1712708),
 "fld7" : "sit amet. Lorem ipsum dolor",
 "fld8" : "sit amet, consetetur",
 "fld9" : NumberLong(652363),
 "bin" : BinData(0,"")
}
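
For reference, the indexes tested in phases 2 and 3 below were created roughly like this (a sketch based on the field names above):

// Phase 2: one compound index across fld1 to fld9
db.testCol.createIndex({ fld1: 1, fld2: 1, fld3: 1, fld4: 1, fld5: 1,
                         fld6: 1, fld7: 1, fld8: 1, fld9: 1 })

// Phase 3: separate single-field indexes (after dropping the compound index)
db.testCol.createIndex({ fld2: 1 })
db.testCol.createIndex({ fld6: 1 })
db.testCol.createIndex({ fld8: 1 })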

Phase 1 (No indexes except on _id field)

After 295 seconds, 13340672 new records inserted - collection has 13340672 in total
46498 inserts per second since last report 99.97 % in under 50 milliseconds
0 keyqueries per second since last report 100.00 % in under 50 milliseconds
0 updates per second since last report 100.00 % in under 50 milliseconds
0 rangequeries per second since last report 100.00 % in under 50 milliseconds

~45,222 inserts/second

Phase 2 (Compound index on fld1 to fld9)

After 295 seconds, 7739904 new records inserted - collection has 7739904 in total
20830 inserts per second since last report 99.72 % in under 50 milliseconds
0 keyqueries per second since last report 100.00 % in under 50 milliseconds
0 updates per second since last report 100.00 % in under 50 milliseconds
0 rangequeries per second since last report 100.00 % in under 50 milliseconds

~26,236 inserts/second

Phase 3 (Single field indexes on each fld2, fld6 and fld8)

After 295 seconds, 2998272 new records inserted - collection has 2998272 in total
7012 inserts per second since last report 63.61 % in under 50 milliseconds
0 keyqueries per second since last report 100.00 % in under 50 milliseconds
0 updates per second since last report 100.00 % in under 50 milliseconds
0 rangequeries per second since last report 100.00 % in under 50 milliseconds

~10,163 inserts/second

Storage Utilisation

On the same collection with 2,817,024 documents:

Compound Index

> db.testCol.stats().indexSizes
{
 "_id_" : 75079680,
 "fld0_1_fld1_1_fld2_1_fld3_1_fld4_1_fld5_1_fld6_1_fl7_1_fld8_1_fld9_1" : 2617061376
}

~2.44GB

Single-Field Index

> db.testCol.stats().indexSizes
{
 "_id_" : 75079680,
 "fld0_1" : 65343488,
 "fld1_1" : 79781888,
 "fld2_1" : 35471360,
 "fld3_1" : 35475456,
 "fld4_1" : 35475456,
 "fld5_1" : 79781888,
 "fld6_1" : 65339392,
 "fld7_1" : 35475456,
 "fld8_1" : 35475456,
 "fld9_1" : 65347584,
}

~0.53GB

Index Selectivity Usage

For single-field indexes, there’s no limitation at all; the MongoDB optimiser will pick and choose whichever index it deems fit. For a compound index, it’s slightly different.

Queries like these will not use the compound index:

  • db.testCol.find({fld9:1,fld8:1,fld7:1})
  • db.testCol.find({fld9:1,fld3:1,fld6:1})

Queries like these will use the compound index:

  • db.testCol.find({fld0:1,fld1:1})
  • db.testCol.find({fld5:1,fld8:1,fld0:1}) (not very efficient..)
  • db.testCol.find({fld0:1,fld3:1,fld5:1})

The general idea is that MongoDB will use the compound index if the query matches the index prefix. If it doesn’t match exactly, MongoDB might still use the index if that is more efficient than performing a full collection scan. The only way to find out is to test the queries yourself!
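
A quick way to check is explain(). A minimal sketch:

> db.testCol.find({ fld0: 1, fld1: 1 }).explain("executionStats")
// look for IXSCAN (index used) vs COLLSCAN (full collection scan) in the winningPlan,
// and compare totalDocsExamined against nReturned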

Summary

A very short summary in key points below:

  1. No indexes gives the best insert performance (duh!!)
  2. A compound index is faster than multiple single-field indexes for insert performance
  3. A compound index uses more storage than multiple single-field indexes

Depending on the situation, choose your poison carefully 🙂


P.S.: The above test does not attempt to quantify absolute performance numbers, but rather to provide a good sense of the performance degradation ratio for indexes.

Regards,
Wei Shan


How to perform MongoDB Sharded Cluster Cross-DC Failover

Imagine the following MongoDB architecture: how do you perform a manual failover when the entire DC1 is unavailable? I couldn’t find any official MongoDB documentation on this, so I decided to write my own.

Note:

  1. MongoDB version 3.2
  2. Config Server Replica Set
  3. DC1 is assumed to have failed completely and to be unrecoverable

[Figure: MongoDB sharded cluster architecture spanning DC1 and DC2]

High Level Steps:

  1. Stop mongos
  2. Update mongos configuration to point to the new single node config server replica set
  3. Reconfigure the config server to be a single node replica set
  4. Update config database shards collection
  5. Reconfigure the DC2 shards to be single node replica set
  6. Clean up shard caches
  7. Start mongos

Detailed Steps

Stop mongos

systemctl stop mongos

Update mongos configuration file to point to the new single node config server replica set

From:
configDB: rs0/node1:27019,node2:27019,node3:27019

To:
configDB: rs0/node3:27019

Reconfigure the config server to be a single node replica set

cfg = rs.conf()
cfg.members = [cfg.members[2]]   // keep only the surviving DC2 member; adjust the index (node3 here)
rs.reconfig(cfg, {force : true})

Update config database shards collection

mongo localhost:27019
mongo> use config
mongo> db.shards.update(
 { _id: "<replSetName>" },
 { $set:
 {
 "host": "<replSetName>/<dc2_shard_hostname>:27017"
 }
 }
)

Reconfigure the DC2 shards to be single node replica set

cfg = rs.conf()
cfg.members = [cfg.members[2]]   // keep only the surviving DC2 member; adjust the index accordingly
rs.reconfig(cfg, {force : true})

Clean up shard caches (on every shard's new single node replica set)

Add the following to the mongod configuration file and restart the mongod process
setParameter:
   recoverShardingState:false
# systemctl restart mongod

Clean up the config database cache

mongo> use admin
mongo> db.system.version.remove(
   { _id: "minOpTimeRecovery" },
   { writeConcern: { w: "majority" } }
)

Remove recoverShardingState: false from the mongod configuration file and restart the mongod process

Start mongos

systemctl start mongos


Regards,
Wei Shan



Tips on using MongoDB Shell

I’m currently taking the MongoDB online course M202. It’s an advanced course that covers sharded clusters, performance tuning and other useful topics. I have learnt a couple of tricks about the mongo shell from the course so far.

show profile

On a database that has profiling enabled, the show profile helper in the mongo shell displays the 5 most recent operations that took at least 1 millisecond to execute. Issue show profile from the mongo shell, as follows:

show profile

This is particularly useful when you don’t want to go through the entire profiling collection. MongoDB’s query syntax isn’t the most intuitive language in the world, and it helps when you can do a quick analysis of the slow queries. If you need something more detailed, try this.
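
For something more detailed, you can also query the system.profile collection directly. A minimal sketch (the 100 ms threshold is just an example):

> db.system.profile.find({ millis: { $gt: 100 } }).sort({ ts: -1 }).limit(10).pretty()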

show log

There are many ways to view MongoDB log files:

  1. cat mongod.log
  2. via Kibana
  3. mongo shell, show log

I find that #3 is handy, especially during a severity 1 issue. It shows the log entries that are currently held in memory, which also means it will only show recent, relevant entries 🙂
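
For example, show logs lists the available in-memory log names (such as global and startupWarnings), and show log <name> prints one of them; global is the default:

> show logs
> show log global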

Hope this helps!

Regards,
Wei Shan
