Archive for category NoSQL

MongoDB 3.x – How to get more details about a collection in MongoDB


In the MongoDB shell:

mongo> db.collection.stats()

{{ "ns" : "db.collection", 
"count" : 1474168, 
"size" : 1628955640, 
"avgObjSize" : 1105, 
"storageSize" : 62074880, 
"capped" : false, 
"wiredTiger" : 
     { "metadata" : { "formatVersion" : 1 }, 
     "creationString" : "access_pattern_hint=none,
          allocation_size=4KB,app_metadata=(formatVersion=1),
          block_allocation=best,block_compressor=zlib,cache_resident=false,
          checksum=on,colgroups=,collator=,columns=,dictionary=0,encryption=(keyid=,name=),
          exclusive=false,extractor=,format=btree,huffman_key=,huffman_value=,
          ignore_in_memory_cache_size=false,immutable=false,internal_item_max=0,
          internal_key_max=0,internal_key_truncate=true,internal_page_max=4KB,
          key_format=q,key_gap=10,leaf_item_max=0,leaf_key_max=0,leaf_page_max=32KB,
          leaf_value_max=64MB,log=(enabled=true),lsm=(auto_throttle=true,bloom=true,
          bloom_bit_count=16,bloom_config=,bloom_hash_count=8,bloom_oldest=false,
          chunk_count_limit=0,chunk_max=5GB,chunk_size=10MB,merge_max=15,merge_min=0),
          memory_page_max=10m,os_cache_dirty_max=0,os_cache_max=0,prefix_compression=false,
          prefix_compression_min=4,source=,split_deepen_min_child=0,split_deepen_per_child=0,
          split_pct=90,type=file,value_format=u",
"type" : "file", 
"uri" : "statistics:table:collection/0-8813564358223606299", 
"nindexes" : 3, 
"totalIndexSize" : 83034112, 
"indexSizes" : { "_id_" : 72835072, "st_1_ih_1_vt_1_ua_1" : 5107712, "ea_1" : 5091328 }, 
"ok" : 1}

From the above output, we can gather a few important things (a quick way to pull these out directly in the shell follows the list):

  1. There are 1474168 documents in the collection
  2. On disk, ignoring any filesystem compression, it takes about 62074880 bytes (storageSize)
  3. It is using zlib compression
  4. On disk, the datafile will be $dbPath/collection/0-8813564358223606299.wt
  5. There are 3 indexes on this collection
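
If you only need a couple of these figures rather than the whole document, you can pull them out directly in the shell. A minimal sketch (the collection name is a placeholder; the scale argument is the documented way to report sizes in a different unit):

mongo> db.collection.stats(1024 * 1024).storageSize    // storage size in MB instead of bytes
mongo> var s = db.collection.stats()
mongo> print(s.count, s.storageSize, s.nindexes)       // just the fields discussed above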

Hope this helps!

Regards,
Wei Shan


Thoughts on PostgreSQL JSON Vs MongoDB JSON Performance


Recently, I came across a three-year-old blog post from EnterpriseDB about benchmarking JSON performance in MongoDB vs PostgreSQL, via a random Quora answer from someone bashing MongoDB performance. I use both PostgreSQL and MongoDB on a daily basis in my current role. They are both great databases and good at what they were built to do. However, the benchmarking results were absolutely unbelievable. I couldn’t believe them and had to read the source code myself.

There were a couple of reasons why the results were highly inaccurate.

INSERTs Test

It uses bash to generate massive INSERT shell scripts and runs the inserts against mongod via the mongo shell. This means that it is not benchmarking PostgreSQL against MongoDB. It is benchmarking PostgreSQL’s psql against MongoDB’s JavaScript shell (SpiderMonkey/V8). That’s not a very fair benchmark, is it?
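
For context, the INSERT test essentially pipes thousands of generated statements through the shell, something roughly like this (a sketch, not the actual benchmark script; json_tables is a hypothetical collection name):

// generated file, executed line by line by the mongo shell's JavaScript engine
db.json_tables.insert({ "a": 1, "b": "some text", "c": ISODate() })
db.json_tables.insert({ "a": 2, "b": "some text", "c": ISODate() })
// ...thousands more of these...

// a driver-level bulk insert (or even the shell's insertMany in 3.2+) removes most
// of the per-statement shell overhead being measured here:
db.json_tables.insertMany([ { "a": 1 }, { "a": 2 }, { "a": 3 } ])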

SELECTs Test

Same issue here. It uses the mongo shell to run the SELECT queries. Effectively, the performance will be bounded by the speed of the JavaScript engine. It was using V8 back in version 2.6. (Link)

Data Loading Test

This is the only fair comparison so far. It compares PostgreSQL’s COPY against MongoDB’s mongoimport. It was indeed true that PostgreSQL was faster than MongoDB in this test.

Summary

I love using both PostgreSQL and MongoDB. However, this benchmarking test fails to prove that PostgreSQL is indeed faster than MongoDB in JSON performance. Also, it did not share the configurations of postgresql.conf and mongod.conf. MongoDB doesn’t work well with the ext4 filesystem. So if they were using the AWS Linux AMI, it’s going to be on ext4 unless they took the effort to mount external EBS volumes.

Since the results are already 3 years old, I plan to benchmark PostgreSQL and MongoDB over the next few weeks, using PostgreSQL 9.6 against MongoDB 3.4 with the WiredTiger storage engine. Also, I don’t work for MongoDB nor EnterpriseDB, so this will be an unbiased test 🙂

Regards,
Wei Shan


Top 3 Reasons to upgrade to MongoDB 3.4


MongoDB 3.4 was released on Nov 29, 2016 with tons of new features. In this post, I will be listing the top 3 reasons to upgrade to MongoDB 3.4.

Faster Balancing

In the previous releases, it takes ages (4s per chunk, the last I saw in 3.2.10) for the balancer to move or pre-split chunks. Imagine if you have 100,000 chunks to pre-split; it’s going to take days. There are 2 reasons for this:

  1. In a sharded cluster, there can only be 1 chunk migration at any point in time. It’s a single-threaded operation
  2. This.

In MongoDB 3.4, for a sharded cluster with n shards, MongoDB can perform at most n/2 (rounded down) simultaneous chunk migrations. So, if you have a 4-shard MongoDB cluster, it will migrate 2 chunks at a time! If you have fewer than 4 shards, the behaviour will unfortunately stay the same. 😦

Linearizable Read Concern

In general, MongoDB has various readConcern options. These options allow developers to choose the readConcern suitable for their application requirements. Use readConcern:local for unimportant queries that don’t care if they read data that could be rolled back. Using readConcern:majority prevents reading data that can be rolled back. However, it does not prevent you from reading stale data in certain network-partition edge cases.

In MongoDB 3.4, using readConcern:linearizable allows you to prevent stale reads too. This feature is not free and has the highest performance overhead. readConcern:local only requires 1 operation to return your results. readConcern:majority requires acknowledgement from a majority of the nodes in your replica set, which depends on your network latency. readConcern:linearizable requires an additional no-op majority write to the replica set to ensure that the data being read is durable and not stale.

In essence, what readConcern:linearizable does is the following (a shell example follows the list):

  1. The client sends the query to the primary node
  2. The node gets the results without errors
  3. The node sends a no-op majority write to the other nodes in the replica set to confirm that it is still the primary node
  4. The results get returned to the client
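
Here’s a minimal sketch of issuing a linearizable read from the shell (the collection name and filter are placeholders; maxTimeMS is worth adding because the no-op write can block if a majority of the replica set is unreachable):

mongo> db.runCommand({
    find: "accounts",
    filter: { _id: 1 },
    readConcern: { level: "linearizable" },
    maxTimeMS: 10000
})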

Network Compression

I’m not sure why this wasn’t publicised more widely as a new feature. I like it a lot because now the traffic between mongos/mongod is compressed, so we save on bandwidth. This really saves you a lot of money if you are running on AWS or if you are constrained by network bandwidth.
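
If compression is not already on in your build, it can be switched on explicitly. A rough sketch, assuming the networkMessageCompressors option described in the 3.4 release notes (snappy is the only compressor available in 3.4):

# hypothetical example: enable wire-protocol compression on every mongod and mongos
mongod --config /etc/mongod.conf --networkMessageCompressors snappy
mongos --config /etc/mongos.conf --networkMessageCompressors snappy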

Summary

In general, I do agree with InfoWorld that MongoDB 3.4 is a no-brainer upgrade! We are currently waiting for it to be available on our O/S of choice so we can start rolling it out!

Hope this helps! 🙂

Regards,
Wei Shan


Migrating Oracle Sample Schema to MongoDB


It is not an easy feat to migrate your current application from Oracle, or any relational database, to MongoDB. You will need to remodel your schema to fit MongoDB. If you are migrating from a relational database to another relational database, it is a lot easier (still not that easy though!). In this blog post, I try to migrate the famous Oracle database sample schema to MongoDB. I’m not saying that, from an application POV (think ACID!), this is a good thing; I’m illustrating the fact that you can’t just take a schema from a relational database and dump it onto MongoDB.

Let’s take a look at the ERD diagram of the sales history schema.

(Figure: ERD of the Oracle SH sample schema)

The sales history schema above represents the sales history for a company.

The sample company does a high volume of business, so it runs business statistics reports to aid in decision support. Many of these reports are time-based and nonvolatile. That is, they analyse past data trends. The company loads data into its data warehouse regularly to gather statistics for these reports. These reports include annual, quarterly, monthly, and weekly sales figures by product. The company also runs reports on the distribution channels through which its sales are delivered. When the company runs special promotions on its products, it analyses the impact of the promotions on sales. It also analyses sales by geographical area.

Based on my understanding above, there are a few assumptions to be made here:

  1. This is a data warehouse schema, not an OLTP workload
  2. The application queries the distribution channels through which sales are delivered
  3. The application queries the impact of promotions on sales
  4. The application queries sales history by geographical area

Rules to follow in MongoDB data modelling

MongoDB understands that people who are new to NoSQL may not know how to model data efficiently. They have an entire chapter of documentation dedicated to it! Some of the limitations or rules you should be aware of:

  • A single BSON document cannot be larger than 16MB
  • A document supports no more than 100 levels of nesting

MongoDB “Sales History” Schema – Version 1

Based on the official MongoDB documentation on data modelling, the following would be good enough.

SALES

{
 "_id": {
 "PROD": {},
  "PROD_ID": {},
  "PROD_NAME": {},
  "PROD_DESC": {},
  "PROD_SUBCATEGORY": {},
  "PROD_SUBCATEGORY_ID": {},
  "PROD_SUBCATEGORY_DESC": {},
  "PROD_CATEGORY": {},
  "PROD_CATEGORY_ID": {},
  "PROD_CATEGORY_DESC": {},
  "PROD_WEIGHT_CLASS": {},
  "PROD_UNIT_OF_MEASURE": {},
  "PROD_PACK_SIZE": {},
  "PROD_SUPPLIER_ID": {},
  "PROD_STATUS": {},
  "PROD_LIST_PRICE": {},
  "PROD_MIN_PRICE": {},
  "PROD_TOTAL": {},
  "PROD_TOTAL_ID": {},
  "PROD_SRC_ID": {},
  "PROD_EFF_FROM": {},
  "PROD_EFF_TO": {},
  "PROD_VALID": {},
  "PROD_COST": {
  "TIME_DESC": {},
  "UNIT_COST": {},
  "UNIT_PRICE": {}
 }
 },
 "CUST": {
  "CUST_ID": {},
  "CUST_FIRST_NAME": {},
  "CUST_LAST_NAME": {},
  "CUST_GENDER": {},
  "CUST_YEAR_OF_BIRTH": {},
  "CUST_MARITAL_STATUS": {},
  "CUST_STREET_ADDRESS": {},
  "CUST_POSTAL_CODE": {},
  "CUST_CITY": {},
  "CUST_CITY_ID": {},
  "CUST_STATE_PROVINCE": {},
  "CUST_STATE_PROVINCE_ID": {},
  "COUNTRY": {
  "COUNTRY_ID": {},
   "COUNTRY_ISO_CODE": {},
   "COUNTRY_NAME": {},
   "COUNTRY_SUBREGION": {},
   "COUNTRY_SUBREGION_ID": {},
   "COUNTRY_REGION": {},
   "COUNTRY_REGION_ID": {},
   "COUNTRY_TOTAL": {},
   "COUNTRY_TOTAL_ID": {},
   "COUNTRY_NAME_HIST": {}
 },
  "CUST_MAIN_PHONE_NUMBER": {},
  "CUST_INCOME_LEVEL": {},
  "CUST_CREDIT_LIMIT": {},
  "CUST_EMAIL": {},
  "CUST_TOTAL": {},
  "CUST_TOTAL_ID": {},
  "CUST_SRC_ID": {},
  "CUST_EFF_FROM": {},
  "CUST_EFF_TO": {},
  "CUST_VALID": {}
 },
 "TIME": {
  "TIME_ID": {},
  "DAY_NAME": {},
  "DAY_NUMBER_IN_WEEK": {},
  "DAY_NUMBER_IN_MONTH": {},
  "CALENDAR_WEEK_NUMBER": {},
  "FISCAL_WEEK_NUMBER": {},
  "WEEK_ENDING_DAY": {},
  "WEEK_ENDING_DAY_ID": {},
  "CALENDAR_MONTH_NUMBER": {},
  "FISCAL_MONTH_NUMBER": {},
  "CALENDAR_MONTH_DESC": {},
  "CALENDAR_MONTH_ID": {},
  "FISCAL_MONTH_DESC": {},
  "FISCAL_MONTH_ID": {},
  "DAYS_IN_CAL_MONTH": {},
  "DAYS_IN_FIS_MONTH": {},
  "END_OF_CAL_MONTH": {},
  "END_OF_FIS_MONTH": {},
  "CALENDAR_MONTH_NAME": {},
  "FISCAL_MONTH_NAME": {},
  "CALENDAR_QUARTER_DESC": {},
  "CALENBDAR_QUARTER_ID": {},
  "FISCAL_QUARTER_DESC": {},
  "FISCAL_QUARTER_ID": {},
  "DAYS_IN_CAL_QUARTER": {},
  "DAYS_IN_FIS_QUARTER": {},
  "END_OF_CAL_QUARTER": {},
  "END_OF_FIS_QUARTER": {},
  "CALENDAR_QUARTER_NUMBER": {},
  "FISCAL_QUARTER_NUMBER": {},
  "CALENDAR_YEAR": {},
  "CALENDAR_YEAR_ID": {},
  "FISCAL_YEAR": {},
  "FISCAL_YEAR_ID": {},
  "DAYS_IN_CAL_YEAR": {},
  "DAYS_IN_FIS_YEAR": {},
  "END_OF_CAL_YEAR": {},
  "END_OF_FIS_YEAR": {}
 },
 "CHANNEL": {
  "CHANNEL_ID": {},
  "CHANNEL_DESC": {},
  "CHANNEL_CLASS": {},
  "CHANNEL_CLASS_ID": {},
  "CHANNEL_TOTAL": {},
  "CHANNEL_TOTAL_ID": {}
 },
 "PROMO": {
  "PROMO_ID": {},
  "PROMO_NAME": {},
  "PROMO_SUBCATEGORY": {},
  "PROMO_SUBCATEGORY_ID": {},
  "PROMO_CATEGORY": {},
  "PROMO_CATEGORY_ID": {},
  "PROMO_COST": {},
  "PROMO_BEGIN_DATE": {},
  "PROMO_END_DATE": {},
  "PROMO_TOTAL": {},
  "PROMO_TOTAL_ID": {}
 },
 "QUANTITY_SOLD": {},
 "AMOUNT_SOLD": {}
}

However, this design pattern will result in one massive JSON document that is slow. It is also more likely to hit the MongoDB limitations listed earlier. What is a better approach then?

MongoDB “Sales History” Schema – Version 2

Let’s go with the easy bits first.

CUSTOMER

{
 "_id": {},
 "CUST_FIRST_NAME": {},
 "CUST_LAST_NAME": {},
 "CUST_GENDER": {},
 "CUST_YEAR_OF_BIRTH": {},
 "CUST_MARITAL_STATUS": {},
 "CUST_STREET_ADDRESS": {},
 "CUST_POSTAL_CODE": {},
 "CUST_CITY": {},
 "CUST_STATE_PROVINCE": {},
 "COUNTRY": {
  "COUNTRY_NAME": {},
  "COUNTRY_ISO_CODE": {},
  "COUNTRY_SUBREGION": {},
  "COUNTRY_REGION": {},
  "COUNTRY_TOTAL": {},
  "COUNTRY_NAME_HIST": {}
 },
 "CUST_MAIN_PHONE_NUMBER": {},
 "CUST_INCOME_LEVEL": {},
 "CUST_CREDIT_LIMIT": {},
 "CUST_EMAIL": {},
 "CUST_TOTAL": {},
 "CUST_SRC_ID": {},
 "CUST_EFF_FROM": {},
 "CUST_EFF_TO": {},
 "CUST_VALID": {}
}

Simply follow this one-to-one relationship design pattern for the CUSTOMER collection.

PRODUCT

{
 "_id": {},
 "PROD_NAME": {},
 "PROD_DESC": {},
 "PROD_SUBCATEGORY": {
  "PROD_SUBCATEGORY_ID": {},
  "PROD_SUBCATEGORY_DESC": {}
 },
 "PROD_CATEGORY": {
  "PROD_CATEGORY_ID": {},
  "PROD_CATEGORY_DESC": {}
 },
 "PROD_WEIGHT_CLASS": {},
 "PROD_UNIT_OF_MEASURE": {},
 "PROD_PACK_SIZE": {},
 "PROD_SUPPLIER_ID": {},
 "PROD_STATUS": {},
 "PROD_LIST_PRICE": {},
 "PROD_MIN_PRICE": {},
 "PROD_TOTAL": {},
 "PROD_TOTAL_ID": {},
 "PROD_SRC_ID": {},
 "PROD_EFF_FROM": {},
 "PROD_EFF_TO": {},
 "PROD_VALID": {},
 "COST": {
  "TIME_ID": {},
  "PROMO_ID": {},
  "CHANNEL_ID": {},
  "UNIT_COST": {},
  "UNIT_PRICE": {}
 }
}

PROMOTION

{
 "_id": {},
 "PROMO_NAME": {},
 "PROMO_SUBCATEGORY": {},
 "PROMO_CATEGORY": {},
 "PROMO_COST": {},
 "PROMO_BEGIN_DATE": {},
 "PROMO_END_DATE": {},
 "PROMO_TOTAL": {}
}

I have chosen to combine PRODUCTS and COSTS into a single PRODUCT collection. You may find this design pattern over here. I did not choose to put PROMOTION in the same collection. Instead, I decided to use references. The reason is that a product might have hundreds or thousands of promotions, so an embedded document here would not be scalable for the PRODUCT collection.
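
As a rough sketch of how such a reference gets resolved at query time (the product name is made up; the field names follow the schema above), the application simply issues a second query:

mongo> var prod = db.PRODUCT.findOne({ PROD_NAME: "some product" })
mongo> db.PROMOTION.findOne({ _id: prod.COST.PROMO_ID })    // resolve the PROMO_ID reference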

CHANNELS

{
 "_id": {},
 "CHANNEL_DESC": {},
 "CHANNEL_CLASS": {},
 "CHANNEL_CLASS_ID": {},
 "CHANNEL_TOTAL": {},
 "CHANNEL_TOTAL_ID": {}
}

TIMES

{
 "_id": {},
 "DAY_NAME": {},
 "DAY_NUMBER_IN_WEEK": {},
 "DAY_NUMBER_IN_MONTH": {},
 "CALENDAR_WEEK_NUMBER": {},
 "FISCAL_WEEK_NUMBER": {},
 "WEEK_ENDING_DAY": {},
 "WEEK_ENDING_DAY_ID": {},
 "CALENDAR_MONTH_NUMBER": {},
 "FISCAL_MONTH_NUMBER": {},
 "CALENDAR_MONTH_DESC": {},
 "CALENDAR_MONTH_ID": {},
 "FISCAL_MONTH_DESC": {},
 "FISCAL_MONTH_ID": {},
 "DAYS_IN_CAL_MONTH": {},
 "DAYS_IN_FIS_MONTH": {},
 "END_OF_CAL_MONTH": {},
 "END_OF_FIS_MONTH": {},
 "CALENDAR_MONTH_NAME": {},
 "FISCAL_MONTH_NAME": {},
 "CALENDAR_QUARTER_DESC": {},
 "CALENDAR_QUARTER_ID": {},
 "FISCAL_QUARTER_DESC": {},
 "FISCAL_QUARTER_ID": {},
 "DAYS_IN_CAL_QUARTER": {},
 "DAYS_IN_FIS_QUARTER": {},
 "END_OF_CAL_QUARTER": {},
 "END_OF_FIS_QUARTER": {},
 "CALENDAR_QUARTER_NUMBER": {},
 "FISCAL_QUARTER_NUMBER": {},
 "CALENDAR_YEAR": {},
 "CALENDAR_YEAR_ID": {},
 "FISCAL_YEAR": {},
 "FISCAL_YEAR_ID": {},
 "DAYS_IN_CAL_YEAR": {},
 "DAYS_IN_FIS_YEAR": {},
 "END_OF_CAL_YEAR": {},
 "END_OF_FIS_YEAR": {}
}

Honestly, I’m not too happy to keep the TIMES and CHANNELS collections separate. However, if I tried to embed them in another collection (COST/SALES), I would not be able to create a new TIMES or CHANNELS document without having a COST or SALES for it.

SALES

{
 "PROD_ID": {},
 "CUST_ID": {},
 "TIME_ID": {},
 "CHANNEL_ID": {},
 "PROMO_ID": {},
 "QUANTITY_SOLD": {},
 "AMOUNT_SOLD": {}
}

I would shard the SALES collection on TIME_ID, CHANNEL_ID and PROMO_ID. Based on the assumptions previously, I would assume that the majority of the queries will filter by TIME_ID, PROMO_ID and CHANNEL_ID.
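
In the shell, that would look roughly like this (a sketch; the database name sh is an assumption, and the shard key could be refined after testing):

mongo> sh.enableSharding("sh")
mongo> sh.shardCollection("sh.SALES", { TIME_ID: 1, CHANNEL_ID: 1, PROMO_ID: 1 })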

Summary

In the example above, I assumed that most of the queries will revolve around the SALES collection. For the various collections above, I tried to estimate the collection growth and decide whether I should embed the collection or use an ID reference instead. Obviously, if my assumptions are wrong, the queries would be sub-optimal.

As you might have noticed, just because MongoDB doesn’t have a schema restriction, it doesn’t mean that you don’t need data modelling. In fact, it’s a paradigm shift in schema design between relational databases and MongoDB. In MongoDB, you need to design your schema based on the type of questions that your application will be asking.

Now, if you are a SQL developer, you will notice that there will be a need to join some of the collections together for BI reports. If so, you will need to do that at the application layer. If you are looking for a single left outer join, use $lookup instead.
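
For example, a single left outer join from SALES to CUSTOMER could be sketched like this with $lookup (field names follow the schema above; the $group stage is only there to illustrate a report-style query):

mongo> db.SALES.aggregate([
    { $lookup: { from: "CUSTOMER", localField: "CUST_ID", foreignField: "_id", as: "customer" } },
    { $unwind: "$customer" },
    { $group: { _id: "$customer.COUNTRY.COUNTRY_NAME", TOTAL_SOLD: { $sum: "$AMOUNT_SOLD" } } }
])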

At the end of the day, MongoDB is not suitable for every application! Yes, you can now use $graphLookup for finding relationships between objects. But you wouldn’t want to base your brand new Uber-for-Facebook (the friend suggestion feed) on MongoDB. It is clearly not designed to do that. In the era of NoSQL, use the right database for the right problem.


P.S: Do let me know if you have any other ideas on modelling the sample schema. I would love to hear your thoughts on this!

Regards,
Wei Shan


Compound Index Vs Multiple Single Field Index in MongoDB 3.x – Part II


After the first blog post was published, it was suggested to me that I should run the test with w:1, j:true. It makes sense, since no one would or should be running workloads with w:1, j:false.

By definition:

With j: true, MongoDB returns only after the requested number of members, including the primary, have written to the journal. Previously j: true write concern in a replica set only requires the primary to write to the journal, regardless of the w: <value> write concern.

If you are using MMAPv1, then you should be using j:true, else you risk data loss. If you are using WiredTiger, you will lose at most 60 seconds of data between checkpoints.
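
If you want to reproduce this from the shell, the write concern can be passed per operation like this (a sketch; testCol and the fields mirror the sample document from the first post):

mongo> db.testCol.insert(
    { fld0: NumberLong(1), fld1: ISODate() },
    { writeConcern: { w: 1, j: true } }
)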

Insert Performance

No Index

After 295 seconds, 5754368 new records inserted - collection has 5754368 in total
20017 inserts per second since last report 99.62 % in under 50 milliseconds
0 keyqueries per second since last report 100.00 % in under 50 milliseconds
0 updates per second since last report 100.00 % in under 50 milliseconds
0 rangequeries per second since last report 100.00 % in under 50 milliseconds

~19,763.2 inserts/second

Compound Index

After 295 seconds, 4497408 new records inserted - collection has 4497408 in total
14131 inserts per second since last report 98.63 % in under 50 milliseconds
0 keyqueries per second since last report 100.00 % in under 50 milliseconds
0 updates per second since last report 100.00 % in under 50 milliseconds
0 rangequeries per second since last report 100.00 % in under 50 milliseconds

~15,245.4 inserts/second (22% performance reduction)

Single-Field Index

After 295 seconds, 2438144 new records inserted - collection has 2438144 in total
6758 inserts per second since last report 35.55 % in under 50 milliseconds
0 keyqueries per second since last report 100.00 % in under 50 milliseconds
0 updates per second since last report 100.00 % in under 50 milliseconds
0 rangequeries per second since last report 100.00 % in under 50 milliseconds

~8,264.8 inserts/second (58% performance reduction)

Summary

The relative drop in performance is smaller than when j:false. Previously, it was about a 40% performance drop with the compound index and a 78% performance drop with single-field indexes. Turns out, the compound index isn’t too bad actually, taking into consideration that I created a 9-field compound index on the collection.

Regards,
Wei Shan


Compound Index Vs Multiple Single Field Index in MongoDB 3.x


MongoDB supports both compound indexes and single-field indexes. It’s common knowledge that indexes decrease your INSERT/UPDATE performance but increase your SELECT performance. However, if you need to index multiple fields, is it better to use a compound index or just create multiple single-field indexes? This post tries to investigate which is the better option in terms of performance.

Insert Performance

Test Setup:

  • Single MongoDB 3.2 instance running on my Mac
  • POC Driver uses single thread to perform inserts only
  • Test run for 5 mins
  • Empty collection after every run
  • Default readConcern
  • Default writeConcern

Sample Document:

> db.testCol.findOne()
{
 "_id" : {
 "w" : 2,
 "i" : 0
 },
 "fld0" : NumberLong(530752),
 "fld1" : ISODate("2015-06-02T08:14:18.687Z"),
 "fld2" : "tempor invidunt ut labore et",
 "fld3" : "eirmod tempor invidunt ut",
 "fld4" : "Lorem ipsum dolor sit amet,",
 "fld5" : ISODate("2016-10-12T01:58:24.580Z"),
 "fld6" : NumberLong(1712708),
 "fld7" : "sit amet. Lorem ipsum dolor",
 "fld8" : "sit amet, consetetur",
 "fld9" : NumberLong(652363),
 "bin" : BinData(0,"")
}

Phase 1 (No indexes except on _id field)

After 295 seconds, 13340672 new records inserted - collection has 13340672 in total
46498 inserts per second since last report 99.97 % in under 50 milliseconds
0 keyqueries per second since last report 100.00 % in under 50 milliseconds
0 updates per second since last report 100.00 % in under 50 milliseconds
0 rangequeries per second since last report 100.00 % in under 50 milliseconds

~45,222 inserts/second

Phase 2 (Compound index on fld1 to fld9)

After 295 seconds, 7739904 new records inserted - collection has 7739904 in total
20830 inserts per second since last report 99.72 % in under 50 milliseconds
0 keyqueries per second since last report 100.00 % in under 50 milliseconds
0 updates per second since last report 100.00 % in under 50 milliseconds
0 rangequeries per second since last report 100.00 % in under 50 milliseconds

~26,236 inserts/second

Phase 3 (Single field indexes on each fld2, fld6 and fld8)

After 295 seconds, 2998272 new records inserted - collection has 2998272 in total
7012 inserts per second since last report 63.61 % in under 50 milliseconds
0 keyqueries per second since last report 100.00 % in under 50 milliseconds
0 updates per second since last report 100.00 % in under 50 milliseconds
0 rangequeries per second since last report 100.00 % in under 50 milliseconds

~10,163 inserts/second
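
For reference, the indexes for Phases 2 and 3 would have been created roughly like this (a sketch; the exact field lists are my assumption based on the phase descriptions above):

mongo> db.testCol.createIndex({ fld1: 1, fld2: 1, fld3: 1, fld4: 1, fld5: 1, fld6: 1, fld7: 1, fld8: 1, fld9: 1 })  // Phase 2: compound index
mongo> db.testCol.createIndex({ fld2: 1 })  // Phase 3: single-field indexes
mongo> db.testCol.createIndex({ fld6: 1 })
mongo> db.testCol.createIndex({ fld8: 1 })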

Storage Utilisation

On the same collection with 2,817,024 documents:

Compound Index

> db.testCol.stats().indexSizes
{
 "_id_" : 75079680,
 "fld0_1_fld1_1_fld2_1_fld3_1_fld4_1_fld5_1_fld6_1_fl7_1_fld8_1_fld9_1" : 2617061376
}

~2.44GB

Single-Field Index

> db.testCol.stats().indexSizes
{
 "_id_" : 75079680,
 "fld0_1" : 65343488,
 "fld1_1" : 79781888,
 "fld2_1" : 35471360,
 "fld3_1" : 35475456,
 "fld4_1" : 35475456,
 "fld5_1" : 79781888,
 "fld6_1" : 65339392,
 "fld7_1" : 35475456,
 "fld8_1" : 35475456,
 "fld9_1" : 65347584,
}

~0.53GB

Index Selectivity Usage

For single-field indexes, there’s no limitation at all. The MongoDB optimiser will pick and choose whichever index it deems fit. For a compound index, it’s slightly different.

Queries like these will not use the compound index:

  • db.testCol.find({fld9:1,fld8:1,fld7:1})
  • db.testCol.find({fld9:1,fld3:1,fld6:1})

Queries like these will use the compound index:

  • db.testCol.find({fld0:1,fld1:1})
  • db.testCol.find({fld5:1,fld8:1,fld0:1}) (not very efficient..)
  • db.testCol.find({fld0:1,fld3:1,fld5:1})

The general idea is that MongoDB will use the compound index if the query matches the index prefix. If it doesn’t match exactly, MongoDB might still use the index if that is more efficient than performing a full collection scan. The only way to find out is to test the queries yourself!
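
The quickest sanity check is explain(); a minimal sketch (the queryPlanner output will show either an IXSCAN on the compound index or a COLLSCAN):

mongo> db.testCol.find({ fld0: 1, fld1: 1 }).explain("queryPlanner")   // matches the index prefix
mongo> db.testCol.find({ fld9: 1, fld8: 1 }).explain("queryPlanner")   // no prefix match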

Summary

A very short summary in key points below:

  1. No index has the best insert performance (duh!!)
  2. A compound index is faster than single-field indexes for insert performance
  3. A compound index uses more storage than single-field indexes

Depending on the situation, choose your poison carefully 🙂


P.S: The above test does not attempt to quantify absolute performance numbers but rather to provide a good indication of the performance degradation ratio for indexes

Regards,
Wei Shan


How to perform MongoDB Sharded Cluster Cross-DC Failover


Imagine the following MongoDB architecture: how do you perform a manual failover when the entire DC 1 is unavailable? I couldn’t find any official MongoDB documentation, so I decided to write my own.

Note:

  1. MongoDB version 3.2
  2. Config Server Replica Set
  3. Assumes DC1 has failed completely and is unrecoverable

(Architecture diagram: sharded cluster with replica set members and config servers spread across DC 1 and DC 2)

High Level Steps:

  1. Stop mongos
  2. Update mongos configuration to point to the new single node config server replica set
  3. Reconfigure the config server to be a single node replica set
  4. Update config database shards collection
  5. Reconfigure the DC2 shards to be single node replica set
  6. Clean up shard caches
  7. Start mongos

Detailed Steps

Stop mongos

systemctl stop mongos

Update mongos configuration file to point to the new single node config server replica set

From:
configDB: rs0/node1:27019,node2:27019,node3:27019

To:
configDB: rs0/node3:27019

Reconfigure the config server to be a single node replica set

cfg = rs.conf()
cfg.members = [cfg.members[2]]    // keep only the surviving config server (node3 here); adjust the index to match rs.conf()
rs.reconfig(cfg, {force : true})

Update config database shards collection

mongo localhost:27019
mongo> use config
mongo> db.shards.update(
 { _id: "<replSetName>" },
 { $set:
  {
   "host": "<replSetName>/<dc2_shard_hostname>:27017"
  }
 }
)

Reconfigure the DC2 shards to be single node replica set

cfg = rs.conf()
cfg.members = [cfg.members[2]]    // keep only the surviving DC2 member; adjust the index to match rs.conf()
rs.reconfig(cfg, {force : true})

Clean up shard caches (on all of the new single-node shard replica sets)

Add the following to the mongod config file and restart the mongod process:
setParameter:
   recoverShardingState: false
# systemctl restart mongod

Clean up the config database cache

mongo> use admin
mongo> db.system.version.remove(
   { _id: "minOpTimeRecovery" },
   { writeConcern: { w: "majority" } }
)

Remove recoverShardingState: false from the mongod config file and restart the mongod process

Start mongos

systemctl start mongos


Regards,
Wei Shan



A neutral perspective of the current database industry


According to db-engines, there are more than 317 database systems on the market right now. If you have a requirement to develop a new application, how do you know which database to use? When I started my career in 2011, the de facto standard was to use Oracle database. It was mature, reliable and able to support multiple types of workload. If you couldn’t afford Oracle database, you could use MySQL if you required high performance with HA capabilities. If you required a mature product with strict compliance with ANSI SQL standards, there was always PostgreSQL. However, Oracle was still the #1 choice for a lot of reasons. Also, Oracle DBAs were paid the most, compared to DBAs of other databases.

Five years have passed, and a lot has changed in the database industry. Even though Oracle is still #1 in the market, there are many other comparable choices now. Oracle supporters will claim that Oracle database is still the best and that no other database product comes close. They will also claim that NoSQL products are overhyped and will die down soon enough. Open-source supporters will spread hate on Oracle, claiming that it’s overpriced and that you don’t need Oracle database at all.

In my opinion, enterprise companies, AKA companies that are loaded with cash, will continue to pay 7 figures for Oracle support contracts. They continue to do so because they are not willing to migrate their legacy applications to something else. Also, it doesn’t make sense for them to use other database software when they have unlimited license usage for Oracle software. It’s also possible that the CIO isn’t willing to risk his/her neck by changing the database stack 🙂 For companies that aren’t as rich, usually their ERP software will still be stuck using Oracle database. That’s if they haven’t moved to a cloud ERP yet. For new applications, they are already looking at alternatives.

Oracle is still the #1 database in terms of technical superiority. No other database vendor has implemented anything like Oracle’s Real Application Clusters (RAC). It’s an extremely powerful feature that offers possibly zero downtime for node failure. However, how many real-life applications require that? Is it feasible to have that feature for an additional 23,000 USD/core? Is it acceptable to have 15s of downtime during node failure instead? That would decrease your licensing cost to about nothing.

If you have a multi-terabyte application, the cost of having that on an Oracle database is not going to be cheap. A four-core EE licensed server is going to cost you at least 200,000 USD already. If you run it on MongoDB or MySQL, it’s not going to cost you anything. Obviously, your application needs to be able to support that (think ACID compliance). If you help your company save that 200k USD, you might find a sweeter bonus at the end of the year!

Oracle database has got a lot of great technology behind it. The database optimisation engine is far more impressive than that of any other vendor I know. They had parallel query even before version 8i; PostgreSQL only got it recently. However, the Oracle database sales team has been trying to piss off their customers for a while now. I know that several banks in the ASEAN region are using PostgreSQL/MySQL for their new applications. Also, many companies are implicitly banning new Oracle database implementations unless there is no alternative. People didn’t have much choice when picking database software back then. It is vastly different now.

It’s a lot harder to pick the right database for new applications now. It used to be just Oracle and Oracle only. These days, you have to think about the nature and volume of the data/application. You also need to decide if your application requires ACID. If you are using NoSQL, there’s graph, key-value, document and columnar. You don’t really want to use the wrong database for your application. Using the wrong datastore will result in sub-optimal performance or ugly workarounds that you wouldn’t like. For example, you could use MongoDB for text search. However, a better database for a heavy text search use case would be Elasticsearch. Just because MongoDB has text indexes and is able to perform text search doesn’t mean it’s the right datastore for it. Heck, you could even use MySQL as a pure key-value store. Databases are all about polyglot persistence these days.

I believe that in the years to come, Oracle database will become the mainframe of database software: a highly profitable yet legacy piece of software that companies only use because they are too afraid to move to something else. For all my friends who still work actively on Oracle databases, perhaps it’s time to learn a new skill or move on to something else.

Regards,
Wei Shan
