Wei Shan

What does DevOps mean in the DBA World?

Since April 2015, I have been working as a database engineer in various DevOps-oriented teams. I firmly believe that we should all transform from being DBAs to being database engineers. We can no longer be just the administrators of our databases; we need to be part of the entire deployment lifecycle. Being an engineer means we need to be able to code and automate our database deployments. We need to be able to read application code and understand system internals. Gone are the days when we were part of a massive 25-man DBA team that cared only about databases, where anything above or below the database tier was not our problem.

With DevOps/microservices, we will see teams being structured around features/services. For example, a search-box feature might have 4 developers, 2 DevOps sysadmins and 1 DevOps DBA. With companies like Facebook and Google adopting such team structures, more companies should follow suit.

I have always been curious about how other DBAs adopt DevOps and have tried to learn from others at conferences and meet-ups. These are my observations from DBAs across London and Singapore:

  1. Chef/Puppet/Ansible are the most popular Configuration Management (CM) tools
  2. Most engineers use CM to manage database configuration files like postgresql.conf or my.cnf
  3. Very few engineers use CM to manage database schemas and objects
  4. Very few engineers use CM to manage database users and roles
  5. Very few engineers automate or semi-automate any form of database minor version upgrades or rolling restarts
  6. Even fewer engineers perform automated database recovery testing

Personally, I’m still on the journey to accomplish the last four points. They can be achieved with more logic in your CM tool, with Rundeck, or with a combination of both. However, all the points above cover only the technical aspects.

What I fully agree with Chris Dodds on is the change of mentality. Automation and processes should be good enough to prevent an engineer from bringing down a service. If an engineer is able to delete a production database directory accidentally, then the process has a problem, not the engineer. I have worked in or with large MNCs that have a very strong blame culture. No form of failure is accepted, and you will feel the wrath during post-mortem meetings.

This kind of culture has a saying in Singapore: the more you do, the more mistakes you make. Thus, people start to do the bare minimum. This is not ideal in DevOps; it promotes mediocrity.

It’s a paradigm shift for us in the new era of DevOps. Our job should not be creating database objects or managing users. We should be spending most of our time automating and improving processes, testing and benchmarking new database versions, or benchmarking different database technologies. There should be as little manual work as possible (Google calls it toil). Fault blaming should be thrown out of the window.

At the heart of DevOps, among microservices, Docker, containers, Kubernetes, Chef, Vagrant and hundreds of other technologies/buzzwords, it is the mindset change that matters most. Without a proper DevOps culture, the technology doesn't matter; it will only beautify your resume.

This blog post was inspired by this and this.

Regards,
Wei Shan

MongoDB 3.x – How to get more details about a collection in MongoDB

In MongoDB shell

mongo> db.collection.stats()

{{ "ns" : "db.collection", 
"count" : 1474168, 
"size" : 1628955640, 
"avgObjSize" : 1105, 
"storageSize" : 62074880, 
"capped" : false, 
"wiredTiger" : 
     { "metadata" : { "formatVersion" : 1 }, 
     "creationString" : "access_pattern_hint=none,
          allocation_size=4KB,app_metadata=(formatVersion=1),
          block_allocation=best,block_compressor=zlib,cache_resident=false,
          checksum=on,colgroups=,collator=,columns=,dictionary=0,encryption=(keyid=,name=),
          exclusive=false,extractor=,format=btree,huffman_key=,huffman_value=,
          ignore_in_memory_cache_size=false,immutable=false,internal_item_max=0,
          internal_key_max=0,internal_key_truncate=true,internal_page_max=4KB,
          key_format=q,key_gap=10,leaf_item_max=0,leaf_key_max=0,leaf_page_max=32KB,
          leaf_value_max=64MB,log=(enabled=true),lsm=(auto_throttle=true,bloom=true,
          bloom_bit_count=16,bloom_config=,bloom_hash_count=8,bloom_oldest=false,
          chunk_count_limit=0,chunk_max=5GB,chunk_size=10MB,merge_max=15,merge_min=0),
          memory_page_max=10m,os_cache_dirty_max=0,os_cache_max=0,prefix_compression=false,
          prefix_compression_min=4,source=,split_deepen_min_child=0,split_deepen_per_child=0,
          split_pct=90,type=file,value_format=u",
"type" : "file", 
"uri" : "statistics:table:collection/0-8813564358223606299", 
"nindexes" : 3, 
"totalIndexSize" : 83034112, 
"indexSizes" : { "_id_" : 72835072, "st_1_ih_1_vt_1_ua_1" : 5107712, "ea_1" : 5091328 }, 
"ok" : 1}

From the above, we can gather a few important things:

  1. There are 1474168 documents in the collection
  2. On disk, ignoring any filesystem compression, it is about 62074880 bytes
  3. It is using zlib compression
  4. On disk, the datafile will be $dbPath/collection/0-8813564358223606299.wt
  5. There are 3 indexes on this collection
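
If you only need one or two of these values, you can pull them out of the stats document directly, and the stats() helper also accepts a scale factor. A minimal sketch against the same collection:

mongo> db.collection.stats().count                      // number of documents only
mongo> db.collection.stats().wiredTiger.creationString  // full WiredTiger create options (compressor, page sizes, ...)
mongo> db.collection.stats(1024*1024).storageSize       // report sizes in MB instead of bytes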

Hope this helps!

Regards,
Wei Shan

Thoughts on PostgreSQL JSON Vs MongoDB JSON Performance

Recently, I came across a three-year-old blog post from EnterpriseDB about benchmarking JSON performance in MongoDB vs PostgreSQL, via a random Quora response from someone bashing MongoDB performance. I use both PostgreSQL and MongoDB on a daily basis in my current role. They are both great databases and good at what they were built to do. However, the benchmarking results were absolutely unbelievable. I couldn’t believe them and had to read the source code myself.

There were a couple of reasons why the results were highly inaccurate.

INSERTs Test

It uses bash to generate massive INSERT shell scripts and inserts them into mongod via the mongo shell. This means that it is not benchmarking PostgreSQL against MongoDB; it is benchmarking PostgreSQL's psql against MongoDB's JavaScript shell (SpiderMonkey/V8). That's not a very fair benchmark, is it?

SELECTs Test

Same issue here. It uses the mongo shell to run the SELECT queries. Effectively, the performance will be bounded by the speed of the JavaScript engine. MongoDB was still using V8 back in version 2.6. (Link)

Data Loading Test

This is the only fair comparison so far. It compares PostgreSQL's COPY against MongoDB's mongoimport. It was indeed true that PostgreSQL was faster than MongoDB in this test.

Summary

I love using both PostgreSQL and MongoDB. However, this benchmarking test fails to prove that PostgreSQL is indeed faster than MongoDB in JSON performance. Also, it did not share the postgresql.conf and mongod.conf configurations. MongoDB doesn't work well with the ext4 filesystem, so if they were using the AWS Linux AMI, it would be on ext4 unless they took the effort to mount external EBS volumes.

Since the results are already three years old, I plan to benchmark PostgreSQL and MongoDB over the next few weeks, using PostgreSQL 9.6 against MongoDB 3.4 with the WiredTiger storage engine. Also, I work for neither MongoDB nor EnterpriseDB, so this will be an unbiased test 🙂

Regards,
Wei Shan

Top 3 Reasons to upgrade to MongoDB 3.4

MongoDB 3.4 was released on Nov 29, 2016 with tons of new features. In this post, I will be listing the top 3 reasons to upgrade to MongoDB 3.4.

Faster Balancing

In previous releases, it took ages (4s per chunk, the last I saw in 3.2.10) for the balancer to move or pre-split chunks. Imagine if you have 100,000 chunks to pre-split; it's going to take days. There are two reasons for this:

  1. In a sharded cluster, there can only be one chunk migration at any point in time; it's a single-threaded operation
  2. This.

In MongoDB 3.4, for a sharded cluster with n shards, MongoDB can perform at most n/2 (rounded down) simultaneous chunk migrations. So, if you have a 4-shard MongoDB cluster, it will migrate 2 chunks at a time! If you have fewer than 4 shards, the behaviour will unfortunately stay the same. 😦
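
The parallel migrations happen automatically; there is nothing to enable. If you want to watch the balancer at work, the usual shell helpers still apply (run from a mongos):

mongos> sh.getBalancerState()     // true if the balancer is enabled
mongos> sh.isBalancerRunning()    // true if a balancing round is currently in progress
mongos> sh.status()               // chunk distribution across shards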

Linearizable Read Concern

In general, MongoDB has various readConcern options. These options allow developers to choose a readConcern suitable for their application requirements. Use readConcern:local for unimportant queries that don't care if they read data that could be rolled back. Using readConcern:majority prevents reading data that can be rolled back; however, it does not prevent you from reading stale data in certain network-partition edge cases.

In MongoDB 3.4, using readConcern:linearizable allows you to prevent stale reads too. This feature is not free and has the highest performance overhead. readConcern:local only requires a single operation to return your results. readConcern:majority requires acknowledgement from a majority of the nodes in your replica set, which depends on your network latency. readConcern:linearizable requires an additional no-op majority write to the replica set to ensure the data being read is durable and not stale.

In essence, what readConcern:linearizable does is the following:

  1. Client sends query to primary node
  2. Node gets the results without errors
  3. Node sends a no-op majority write to the other nodes in the replica set to confirm that it is still the primary node
  4. Results are returned to the client
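
From the shell, it looks roughly like the sketch below (the collection name and filter are placeholders). Linearizable reads must go to the primary and should uniquely identify a single document, and the documentation recommends pairing them with maxTimeMS in case a majority of nodes is unreachable:

mongo> db.orders.find({ _id: 42 }).readConcern("linearizable").maxTimeMS(10000)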

Network Compression

I’m not sure why this wasn't publicised more widely as a new feature. I like it a lot because now all the traffic between mongos/mongod instances is compressed, so we save on bandwidth. This can save you a lot of money if you are running on AWS or if you are constrained by network bandwidth.
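
Depending on your setup, you may need to switch the compressor on explicitly; the relevant option in 3.4 is networkMessageCompressors (snappy is the supported compressor), and both ends of a connection must have it enabled. A minimal sketch:

mongod --networkMessageCompressors snappy
mongos --networkMessageCompressors snappy

or, in the YAML configuration file:

net:
  compression:
    compressors: snappy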

Summary

In general, I do agree with InfoWorld that MongoDB 3.4 is a no-brainer upgrade! We are currently waiting for it to be available on our O/S of choice so we can start rolling it out!

Hope this helps! 🙂

Regards,
Wei Shan

Migrating Oracle Sample Schema to MongoDB

It is not an easy feat to migrate your current application from Oracle, or any relational database, to MongoDB. You will need to remodel your schema to fit MongoDB. If you are migrating from one relational database to another relational database, it is a lot easier (still not that easy, though!). In this blog post, I try to migrate the famous Oracle database sample schema to MongoDB. I'm not saying that, from an application POV (think ACID!), this is a good thing; I'm illustrating the fact that you can't just take a schema from a relational database and dump it onto MongoDB.

Let’s take a look at the ERD diagram of the sales history schema.

[ERD: Oracle Sales History (SH) sample schema]

The sales history schema above represents the sales history for a company.

The sample company does a high volume of business, so it runs business statistics reports to aid in decision support. Many of these reports are time-based and nonvolatile; that is, they analyse past data trends. The company loads data into its data warehouse regularly to gather statistics for these reports. These reports include annual, quarterly, monthly, and weekly sales figures by product. The company also runs reports on the distribution channels through which its sales are delivered. When the company runs special promotions on its products, it analyses the impact of the promotions on sales. It also analyses sales by geographical area.

Based on my understanding above, there are a few assumptions to be made here:

  1. This is a data warehouse schema, not an OLTP workload
  2. The application queries the distribution channels through which sales are delivered
  3. The application queries the impact of promotions on sales
  4. The application queries sales history by geographical area

Rules to follow in MongoDB data modelling

MongoDB understands that people who are new to NoSQL may not know how to model data efficiently; there is an entire chapter of documentation dedicated to it! Some of the limitations or rules you should be aware of:

  1. The maximum BSON document size is 16 MB
  2. Documents can be nested to at most 100 levels

MongoDB “Sales History” Schema – Version 1

Based on the official MongoDB documentation on data modelling, the following would be good enough.

SALES

{
 "_id": {
 "PROD": {},
  "PROD_ID": {},
  "PROD_NAME": {},
  "PROD_DESC": {},
  "PROD_SUBCATEGORY": {},
  "PROD_SUBCATEGORY_ID": {},
  "PROD_SUBCATEGORY_DESC": {},
  "PROD_CATEGORY": {},
  "PROD_CATEGORY_ID": {},
  "PROD_CATEGORY_DESC": {},
  "PROD_WEIGHT_CLASS": {},
  "PROD_UNIT_OF_MEASURE": {},
  "PROD_PACK_SIZE": {},
  "PROD_SUPPLIER_ID": {},
  "PROD_STATUS": {},
  "PROD_LIST_PRICE": {},
  "PROD_MIN_PRICE": {},
  "PROD_TOTAL": {},
  "PROD_TOTAL_ID": {},
  "PROD_SRC_ID": {},
  "PROD_EFF_FROM": {},
  "PROD_EFF_TO": {},
  "PROD_VALID": {},
  "PROD_COST": {
  "TIME_DESC": {},
  "UNIT_COST": {},
  "UNIT_PRICE": {}
 }
 },
 "CUST": {
  "CUST_ID": {},
  "CUST_FIRST_NAME": {},
  "CUST_LAST_NAME": {},
  "CUST_GENDER": {},
  "CUST_YEAR_OF_BIRTH": {},
  "CUST_MARITAL_STATUS": {},
  "CUST_STREET_ADDRESS": {},
  "CUST_POSTAL_CODE": {},
  "CUST_CITY": {},
  "CUST_CITY_ID": {},
  "CUST_STATE_PROVINCE": {},
  "CUST_STATE_PROVINCE_ID": {},
  "COUNTRY": {
  "COUNTRY_ID": {},
   "COUNTRY_ISO_CODE": {},
   "COUNTRY_NAME": {},
   "COUNTRY_SUBREGION": {},
   "COUNTRY_SUBREGION_ID": {},
   "COUNTRY_REGION": {},
   "COUNTRY_REGION_ID": {},
   "COUNTRY_TOTAL": {},
   "COUNTRY_TOTAL_ID": {},
   "COUNTRY_NAME_HIST": {}
 },
  "CUST_MAIN_PHONE_NUMBER": {},
  "CUST_INCOME_LEVEL": {},
  "CUST_CREDIT_LIMIT": {},
  "CUST_EMAIL": {},
  "CUST_TOTAL": {},
  "CUST_TOTAL_ID": {},
  "CUST_SRC_ID": {},
  "CUST_EFF_FROM": {},
  "CUST_EFF_TO": {},
  "CUST_VALID": {}
 },
 "TIME": {
  "TIME_ID": {},
  "DAY_NAME": {},
  "DAY_NUMBER_IN_WEEK": {},
  "DAY_NUMBER_IN_MONTH": {},
  "CALENDAR_WEEK_NUMBER": {},
  "FISCAL_WEEK_NUMBER": {},
  "WEEK_ENDING_DAY": {},
  "WEEK_ENDING_DAY_ID": {},
  "CALENDAR_MONTH_NUMBER": {},
  "FISCAL_MONTH_NUMBER": {},
  "CALENDAR_MONTH_DESC": {},
  "CALENDAR_MONTH_ID": {},
  "FISCAL_MONTH_DESC": {},
  "FISCAL_MONTH_ID": {},
  "DAYS_IN_CAL_MONTH": {},
  "DAYS_IN_FIS_MONTH": {},
  "END_OF_CAL_MONTH": {},
  "END_OF_FIS_MONTH": {},
  "CALENDAR_MONTH_NAME": {},
  "FISCAL_MONTH_NAME": {},
  "CALENDAR_QUARTER_DESC": {},
  "CALENBDAR_QUARTER_ID": {},
  "FISCAL_QUARTER_DESC": {},
  "FISCAL_QUARTER_ID": {},
  "DAYS_IN_CAL_QUARTER": {},
  "DAYS_IN_FIS_QUARTER": {},
  "END_OF_CAL_QUARTER": {},
  "END_OF_FIS_QUARTER": {},
  "CALENDAR_QUARTER_NUMBER": {},
  "FISCAL_QUARTER_NUMBER": {},
  "CALENDAR_YEAR": {},
  "CALENDAR_YEAR_ID": {},
  "FISCAL_YEAR": {},
  "FISCAL_YEAR_ID": {},
  "DAYS_IN_CAL_YEAR": {},
  "DAYS_IN_FIS_YEAR": {},
  "END_OF_CAL_YEAR": {},
  "END_OF_FIS_YEAR": {}
 },
 "CHANNEL": {
  "CHANNEL_ID": {},
  "CHANNEL_DESC": {},
  "CHANNEL_CLASS": {},
  "CHANNEL_CLASS_ID": {},
  "CHANNEL_TOTAL": {},
  "CHANNEL_TOTAL_ID": {}
 },
 "PROMO": {
  "PROMO_ID": {},
  "PROMO_NAME": {},
  "PROMO_SUBCATEGORY": {},
  "PROMO_SUBCATEGORY_ID": {},
  "PROMO_CATEGORY": {},
  "PROMO_CATEGORY_ID": {},
  "PROMO_COST": {},
  "PROMO_BEGIN_DATE": {},
  "PROMO_END_DATE": {},
  "PROMO_TOTAL": {},
  "PROMO_TOTAL_ID": {}
 },
 "QUANTITY_SOLD": {},
 "AMOUNT_SOLD": {}
}

However, this design pattern results in one big, massive JSON document that is slow to work with. It is also more likely to run into the MongoDB limitations listed earlier. What is a better approach, then?

MongoDB “Sales History” Schema – Version 2

Let’s go with the easy bits first.

CUSTOMER

{
 "_id": {},
 "CUST_FIRST_NAME": {},
 "CUST_LAST_NAME": {},
 "CUST_GENDER": {},
 "CUST_YEAR_OF_BIRTH": {},
 "CUST_MARITAL_STATUS": {},
 "CUST_STREET_ADDRESS": {},
 "CUST_POSTAL_CODE": {},
 "CUST_CITY": {},
 "CUST_STATE_PROVINCE": {},
 "COUNTRY": {
  "COUNTRY_NAME": {},
  "COUNTRY_ISO_CODE": {},
  "COUNTRY_SUBREGION": {},
  "COUNTRY_REGION": {},
  "COUNTRY_TOTAL": {},
  "COUNTRY_NAME_HIST": {}
 },
 "CUST_MAIN_PHONE_NUMBER": {},
 "CUST_INCOME_LEVEL": {},
 "CUST_CREDIT_LIMIT": {},
 "CUST_EMAIL": {},
 "CUST_TOTAL": {},
 "CUST_SRC_ID": {},
 "CUST_EFF_FROM": {},
 "CUST_EFF_TO": {},
 "CUST_VALID": {}
}

Simply follow the one-to-one relationship design pattern for the CUSTOMER collection.

PRODUCT

{
 "_id": {},
 "PROD_NAME": {},
 "PROD_DESC": {},
 "PROD_SUBCATEGORY": {
  "PROD_SUBCATEGORY_ID": {},
  "PROD_SUBCATEGORY_DESC": {}
 },
 "PROD_CATEGORY": {
  "PROD_CATEGORY_ID": {},
  "PROD_CATEGORY_DESC": {}
 },
 "PROD_WEIGHT_CLASS": {},
 "PROD_UNIT_OF_MEASURE": {},
 "PROD_PACK_SIZE": {},
 "PROD_SUPPLIER_ID": {},
 "PROD_STATUS": {},
 "PROD_LIST_PRICE": {},
 "PROD_MIN_PRICE": {},
 "PROD_TOTAL": {},
 "PROD_TOTAL_ID": {},
 "PROD_SRC_ID": {},
 "PROD_EFF_FROM": {},
 "PROD_EFF_TO": {},
 "PROD_VALID": {},
 "COST": {
  "TIME_ID": {},
  "PROMO_ID": {},
  "CHANNEL_ID": {},
  "UNIT_COST": {},
  "UNIT_PRICE": {}
 }
}

PROMOTION

{
 "_id": {},
 "PROMO_NAME": {},
 "PROMO_SUBCATEGORY": {},
 "PROMO_CATEGORY": {},
 "PROMO_COST": {},
 "PROMO_BEGIN_DATE": {},
 "PROMO_END_DATE": {},
 "PROMO_TOTAL": {}
}

I have chosen to combine PRODUCTS and COSTS into a single PRODUCT collection. You may find this design pattern over here. I did not put PROMOTION in the same collection; instead, I decided to use references. The reason is that a product might have hundreds or thousands of promotions, so an embedded document here would not be scalable for the PRODUCT collection.

CHANNELS

{
 "_id": {},
 "CHANNEL_DESC": {},
 "CHANNEL_CLASS": {},
 "CHANNEL_CLASS_ID": {},
 "CHANNEL_TOTAL": {},
 "CHANNEL_TOTAL_ID": {}
}

TIMES

{
 "_id": {},
 "DAY_NAME": {},
 "DAY_NUMBER_IN_WEEK": {},
 "DAY_NUMBER_IN_MONTH": {},
 "CALENDAR_WEEK_NUMBER": {},
 "FISCAL_WEEK_NUMBER": {},
 "WEEK_ENDING_DAY": {},
 "WEEK_ENDING_DAY_ID": {},
 "CALENDAR_MONTH_NUMBER": {},
 "FISCAL_MONTH_NUMBER": {},
 "CALENDAR_MONTH_DESC": {},
 "CALENDAR_MONTH_ID": {},
 "FISCAL_MONTH_DESC": {},
 "FISCAL_MONTH_ID": {},
 "DAYS_IN_CAL_MONTH": {},
 "DAYS_IN_FIS_MONTH": {},
 "END_OF_CAL_MONTH": {},
 "END_OF_FIS_MONTH": {},
 "CALENDAR_MONTH_NAME": {},
 "FISCAL_MONTH_NAME": {},
 "CALENDAR_QUARTER_DESC": {},
 "CALENDAR_QUARTER_ID": {},
 "FISCAL_QUARTER_DESC": {},
 "FISCAL_QUARTER_ID": {},
 "DAYS_IN_CAL_QUARTER": {},
 "DAYS_IN_FIS_QUARTER": {},
 "END_OF_CAL_QUARTER": {},
 "END_OF_FIS_QUARTER": {},
 "CALENDAR_QUARTER_NUMBER": {},
 "FISCAL_QUARTER_NUMBER": {},
 "CALENDAR_YEAR": {},
 "CALENDAR_YEAR_ID": {},
 "FISCAL_YEAR": {},
 "FISCAL_YEAR_ID": {},
 "DAYS_IN_CAL_YEAR": {},
 "DAYS_IN_FIS_YEAR": {},
 "END_OF_CAL_YEAR": {},
 "END_OF_FIS_YEAR": {}
}

Honestly, I’m not too happy about keeping the TIMES and CHANNELS collections separate. However, if I tried to embed them in another collection (COST/SALES), I would not be able to create a new TIMES or CHANNELS document without having a COST or SALES record for it.

SALES

{
 "PROD_ID": {},
 "CUST_ID": {},
 "TIME_ID": {},
 "CHANNEL_ID": {},
 "PROMO_ID": {},
 "QUANTITY_SOLD": {},
 "AMOUNT_SOLD": {}
}

I would shard the SALES collection on TIME_ID, CHANNEL_ID and PROMO_ID (see the sketch below). Based on the assumptions made earlier, I expect the majority of queries to filter by TIME_ID, PROMO_ID and CHANNEL_ID.
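
A minimal sketch of what that could look like from a mongos shell; the database name sh and the compound shard key are my assumptions, not something prescribed by the original schema:

mongos> sh.enableSharding("sh")
mongos> db.SALES.createIndex({ TIME_ID: 1, CHANNEL_ID: 1, PROMO_ID: 1 })   // supporting index for the shard key
mongos> sh.shardCollection("sh.SALES", { TIME_ID: 1, CHANNEL_ID: 1, PROMO_ID: 1 })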

Summary

In the example above, I assumed that most of the queries will revolve around the SALES collection. For the various collections above, I tried to estimate how each collection would grow and decided whether I should embed the data or use an ID reference instead. Obviously, if my assumptions are wrong, the queries will be sub-optimal.

As you might have noticed, just because MongoDB doesn't enforce a schema doesn't mean that you don't need data modelling. In fact, it's a paradigm shift in schema design between relational databases and MongoDB: in MongoDB, you need to design your schema based on the type of questions that your application will be asking.

Now, if you are a SQL developer, you will notice that some of the collections still need to be joined together for BI reports. If so, you will need to do that at the application layer. If you are only looking for a single left outer join, use $lookup instead (see the sketch below).
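
For example, a left outer join from SALES to CUSTOMER could look roughly like this; the field names follow the schemas above, and the exact join keys depend on how you store the references:

db.SALES.aggregate([
    { $lookup: {
        from: "CUSTOMER",        // collection to join
        localField: "CUST_ID",   // field in SALES
        foreignField: "_id",     // field in CUSTOMER
        as: "customer"           // output array field
    }}
])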

At the end of the day, MongoDB is not suitable for every application! Yes, you can now use $graphLookup to find relationships between objects, but you wouldn't want to base your brand-new Uber-for-Facebook (the friend-suggestion feed) on MongoDB; it is clearly not designed to do that. In the era of NoSQL, use the right database for the right problem.

P.S: Do let me know if you have any other ideas on modelling the sample schema. I would love to hear your thoughts on this!

Regards,
Wei Shan

Compound Index Vs Multiple Single Field Index in MongoDB 3.x – Part II

After the first blog post was published, it was suggested to me that I should run the test with w:1, j:true. It makes sense, since no one would, or should, be running workloads with w:1, j:false.

By definition:

With j: true, MongoDB returns only after the requested number of members, including the primary, have written to the journal. Previously, the j: true write concern in a replica set only required the primary to write to the journal, regardless of the w: <value> write concern.

If you are using MMAPv1, then you should be using j:true, or else you risk data loss. If you are using WiredTiger, you will lose at most 60 seconds of data between checkpoints.
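
For reference, this is roughly how w:1, j:true is specified per operation from the mongo shell; the collection name and fields below are just placeholders:

mongo> db.testCol.insert(
    { fld0: NumberLong(1), fld1: new Date() },
    { writeConcern: { w: 1, j: true } }
)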

Insert Performance

No Index

After 295 seconds, 5754368 new records inserted - collection has 5754368 in total
20017 inserts per second since last report 99.62 % in under 50 milliseconds
0 keyqueries per second since last report 100.00 % in under 50 milliseconds
0 updates per second since last report 100.00 % in under 50 milliseconds
0 rangequeries per second since last report 100.00 % in under 50 milliseconds

~19,763.2 inserts/second

Compound Index

After 295 seconds, 4497408 new records inserted - collection has 4497408 in total
14131 inserts per second since last report 98.63 % in under 50 milliseconds
0 keyqueries per second since last report 100.00 % in under 50 milliseconds
0 updates per second since last report 100.00 % in under 50 milliseconds
0 rangequeries per second since last report 100.00 % in under 50 milliseconds

~15,245.4 inserts/second (22% performance reduction)

Single-Field Index

After 295 seconds, 2438144 new records inserted - collection has 2438144 in total
6758 inserts per second since last report 35.55 % in under 50 milliseconds
0 keyqueries per second since last report 100.00 % in under 50 milliseconds
0 updates per second since last report 100.00 % in under 50 milliseconds
0 rangequeries per second since last report 100.00 % in under 50 milliseconds

~8,264.8 inserts/second (58% performance reduction)

Summary

The relative drop in performance is smaller than when j:false was used. Previously it was about a 40% performance drop with the compound index and a 78% performance drop with single-field indexes. It turns out the compound index isn't too bad after all, considering that I created a 9-field compound index on the collection.

Regards,
Wei Shan

Compound Index Vs Multiple Single Field Index in MongoDB 3.x

MongoDB supports both compound indexes and single-field indexes. It's common knowledge that indexes decrease your INSERT/UPDATE performance but increase your SELECT performance. However, if you need to index multiple fields, is it better to use a compound index or to create multiple single-field indexes? This post investigates which is the better option in terms of performance.

Insert Performance

Test Setup:

  • Single MongoDB 3.2 instance running on my Mac
  • The POC driver uses a single thread to perform inserts only
  • Each test runs for 5 minutes
  • The collection is emptied after every run
  • Default readConcern
  • Default writeConcern

Sample Document:

> db.testCol.findOne()
{
 "_id" : {
 "w" : 2,
 "i" : 0
 },
 "fld0" : NumberLong(530752),
 "fld1" : ISODate("2015-06-02T08:14:18.687Z"),
 "fld2" : "tempor invidunt ut labore et",
 "fld3" : "eirmod tempor invidunt ut",
 "fld4" : "Lorem ipsum dolor sit amet,",
 "fld5" : ISODate("2016-10-12T01:58:24.580Z"),
 "fld6" : NumberLong(1712708),
 "fld7" : "sit amet. Lorem ipsum dolor",
 "fld8" : "sit amet, consetetur",
 "fld9" : NumberLong(652363),
 "bin" : BinData(0,"")
}

Phase 1 (No indexes except on _id field)

After 295 seconds, 13340672 new records inserted - collection has 13340672 in total
46498 inserts per second since last report 99.97 % in under 50 milliseconds
0 keyqueries per second since last report 100.00 % in under 50 milliseconds
0 updates per second since last report 100.00 % in under 50 milliseconds
0 rangequeries per second since last report 100.00 % in under 50 milliseconds

~45,222 inserts/second

Phase 2 (Compound index on fld1 to fld9)

After 295 seconds, 7739904 new records inserted - collection has 7739904 in total
20830 inserts per second since last report 99.72 % in under 50 milliseconds
0 keyqueries per second since last report 100.00 % in under 50 milliseconds
0 updates per second since last report 100.00 % in under 50 milliseconds
0 rangequeries per second since last report 100.00 % in under 50 milliseconds

~26,236 inserts/second

Phase 3 (Single field indexes on each fld2, fld6 and fld8)

After 295 seconds, 2998272 new records inserted - collection has 2998272 in total
7012 inserts per second since last report 63.61 % in under 50 milliseconds
0 keyqueries per second since last report 100.00 % in under 50 milliseconds
0 updates per second since last report 100.00 % in under 50 milliseconds
0 rangequeries per second since last report 100.00 % in under 50 milliseconds

~10,163 inserts/second

Storage Utilisation

On the same collection with 2,817,024 documents:

Compound Index

db.testCol.stats().indexSizes
{
 "_id_" : 75079680,
 "fld0_1_fld1_1_fld2_1_fld3_1_fld4_1_fld5_1_fld6_1_fl7_1_fld8_1_fld9_1" : 2617061376
}

~2.44GB

Single-Field Index

> db.testCol.stats().indexSizes
{
 "_id_" : 75079680,
 "fld0_1" : 65343488,
 "fld1_1" : 79781888,
 "fld2_1" : 35471360,
 "fld3_1" : 35475456,
 "fld4_1" : 35475456,
 "fld5_1" : 79781888,
 "fld6_1" : 65339392,
 "fld7_1" : 35475456,
 "fld8_1" : 35475456,
 "fld9_1" : 65347584,
}

~0.53GB

Index Selectivity Usage

For single-field indexes, there's no limitation at all; the MongoDB optimiser will pick and choose whichever index it deems fit. For a compound index, it's slightly different.

Queries like these will not use the compound index:

  • db.testCol.find({fld9:1,fld8:1,fld7:1})
  • db.testCol.find({fld9:1,fld3:1,fld6:1})
  • db.testCol.find({fld9:1,fld8:1,fld7:1})

Queries like these will use the compound index:

  • db.testCol.find({fld0:1,fld1:1})
  • db.testCol.find({fld5:1,fld8:1,fld0:1}) (not very efficient..)
  • db.testCol.find({fld0:1,fld3:1,fld5:1})

The general idea is that MongoDB will use the compound index if the query matches the index prefix. If it doesn't match the prefix exactly, MongoDB might still use the index if that is more efficient than performing a full collection scan. The only way to find out is to test the queries yourself, as sketched below!
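
A minimal sketch of checking which plan a query uses, against the same test collection; the comments describe what I would expect based on the lists above:

mongo> db.testCol.find({ fld0: 1, fld1: 1 }).explain("queryPlanner").queryPlanner.winningPlan
// expected: an IXSCAN on the compound index (the query matches the fld0, fld1 prefix)

mongo> db.testCol.find({ fld9: 1, fld8: 1, fld7: 1 }).explain("queryPlanner").queryPlanner.winningPlan
// expected: a COLLSCAN, since the query does not include the fld0 prefix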

Summary

A very short summary in key points below:

  1. No index has the best insert performance (duh!!)
  2. A compound index is faster than multiple single-field indexes for insert performance
  3. A compound index uses more storage than multiple single-field indexes

Depending on the situation, choose your poison carefully 🙂

P.S: The above test does not attempt to quantify absolute performance numbers, but rather to provide a rough performance-degradation ratio for indexes.

Regards,
Wei Shan

PostgreSQL 9.6 on RHEL 7.2 - FATAL: semctl(SETVAL) failed: Invalid argument

The error below started occurring once we upgraded to PostgreSQL 9.6 on RHEL 7.2. It was strange, because all we did was have a user log in, check the status of pgsql and log out.

 Dec 7 22:50:01 hostname systemd: Created slice user-1004.slice.
 Dec 7 22:50:01 hostname systemd: Starting user-1004.slice.
 Dec 7 22:50:01 hostname systemd: Started Session 2243 of user postgres.
 Dec 7 22:50:01 hostname systemd: Starting Session 2243 of user postgres.
 Dec 7 22:50:01 hostname systemd: Created slice user-0.slice.
 Dec 7 22:50:01 hostname systemd: Starting user-0.slice.
 Dec 7 22:50:01 hostname systemd: Started Session 2244 of user root.
 Dec 7 22:50:01 hostname systemd: Starting Session 2244 of user root.
 Dec 7 22:50:01 hostname postgres[12064]: [432-1] [unknown]: [local]: 58481389.2f20: LOG: connection received: host=[local]
 Dec 7 22:50:01 hostname postgres[12064]: [433-1] postgres: [local]: 58481389.2f20: FATAL: semctl(153976863, 2, SETVAL, 0) failed: Invalid argument
 Dec 7 22:50:01 hostname postgres[21799]: [432-1] : : 58444abd.5527: LOG: server process (PID 12064) exited with exit code 1
 Dec 7 22:50:01 hostname postgres[21799]: [433-1] : : 58444abd.5527: LOG: terminating any other active server processes
 Dec 7 22:50:01 hostname postgres[11514]: [434-1] repmgr: <IP_ADDRESS>(41171): 58480edb.2cfa: WARNING: terminating connection because of crash of another server process
 Dec 7 22:50:01 hostname postgres[11514]: [434-2] repmgr: <IP_ADDRESS>(41171): 58480edb.2cfa: DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
 Dec 7 22:50:01 hostname postgres[11514]: [434-3] repmgr: <IP_ADDRESS>(41171): 58480edb.2cfa: HINT: In a moment you should be able to reconnect to the database and repeat your command.
 Dec 7 22:50:01 hostname postgres[12074]: [434-1] [unknown]: ::1(48077): 58481389.2f2a: LOG: connection received: host=::1 port=48077
 Dec 7 22:50:01 hostname postgres[12074]: [435-1] repmgr: ::1(48077): 58481389.2f2a: FATAL: the database system is in recovery mode
 Dec 7 22:50:01 hostname Backup: [INFO]DB check start.
 Dec 7 22:50:01 hostname postgres[21799]: [434-1] : : 58444abd.5527: LOG: all server processes terminated; reinitializing
 Dec 7 22:50:01 hostname postgres[21799]: [435-1] : : 58444abd.5527: LOG: could not remove shared memory segment "/PostgreSQL.1721076088": No such file or directory
 Dec 7 22:50:01 hostname postgres[21799]: [436-1] : : 58444abd.5527: LOG: semctl(152961024, 0, IPC_RMID, ...) failed: Invalid argument
 Dec 7 22:50:01 hostname postgres[21799]: [437-1] : : 58444abd.5527: LOG: semctl(152993793, 0, IPC_RMID, ...) failed: Invalid argument

After researching the issue, we found this bug :).

Resolution

1) Set RemoveIPC=no in /etc/systemd/logind.conf

2) Reboot the server or restart systemd-logind as follows:
 # systemctl daemon-reload
 # systemctl restart systemd-logind

Reference:

  1. https://bugzilla.redhat.com/show_bug.cgi?id=1287618
  2. https://bugzilla.redhat.com/show_bug.cgi?id=1264533

Note:

  • Please note that I have seen this RHEL 7 issue with the Oracle database too. It's a RHEL 7 bug, so it should apply to most databases that use IPC calls.

Hope this helps.

Regards,
Wei Shan
