A neutral perspective of the current database industry

According to DB-Engines, there are more than 317 database systems on the market right now. If you need to develop a new application, how do you know which database to use? When I started my career in 2011, the de facto standard was the Oracle database. It was mature, reliable and able to support multiple types of workload. If you couldn't afford Oracle, you could use MySQL if you required high performance with HA capabilities. If you required a mature product with strict compliance with ANSI SQL standards, there was always PostgreSQL. However, Oracle was still the #1 choice for a lot of reasons. Oracle DBAs were also the best paid, compared to DBAs for other databases.

Five years have passed, and a lot has changed in the database industry. Even though Oracle is still #1 in the market, there are many other comparable choices now. Oracle supporters will claim that the Oracle database is still the best and that no other database product comes close. They will also claim that NoSQL products are overhyped and will die down soon enough. Open-source supporters will spread hate on Oracle, claiming that it's overpriced and that you don't need the Oracle database at all.

In my opinion, enterprise companies, i.e. companies that are loaded with cash, will continue to pay 7 figures for Oracle support contracts. They continue to do so because they are not willing to migrate their legacy applications to something else. It also doesn't make sense for them to use other database software when they have unlimited license usage for Oracle products. It's also possible that the CIO isn't willing to risk his/her neck by changing the database stack 🙂 For companies that aren't as rich, their ERP software will usually still be stuck on the Oracle database. That is, if they haven't moved to a cloud ERP yet. For new applications, they are already looking at alternatives.

Oracle is still the #1 database in terms of technical superiority. No other database vendor has implemented anything like Oracle's Real Application Clusters (RAC). It's an extremely powerful feature that offers close to zero downtime on node failure. However, how many real-life applications require that? Is that feature worth an additional 23,000 USD per core? Is 15 seconds of downtime during a node failure acceptable instead? That would cut your licensing cost to almost nothing.

If you have a multi-terabyte application, hosting it on an Oracle database is not going to be cheap. A four-core EE-licensed server will already cost you at least 200,000 USD. If you run it on MongoDB or MySQL, the license costs you nothing. Obviously, your application needs to be able to work with that choice (for example, whether it truly needs ACID compliance). If you help your company save that 200k USD, you might find a sweeter bonus at the end of the year!

The Oracle database has a lot of great technology behind it. Its optimisation engine is far more impressive than that of any other vendor I know. Oracle had parallel query even before version 8i; PostgreSQL only got it recently. However, the Oracle database sales team has been pissing off their customers for a while now. I know that several banks in the ASEAN region are using PostgreSQL/MySQL for their new applications. Many companies are also implicitly banning new Oracle database implementations unless there is no alternative. People didn't have much choice of database software back then. It is vastly different now.

It's a lot harder to pick the right database for new applications now. It used to be Oracle and Oracle only. These days, you have to think about the nature and volume of the data and the application. You also need to decide whether your application requires ACID guarantees. If you are going NoSQL, there are graph, key-value, document and columnar stores. You don't want to use the wrong database for your application: the wrong datastore will result in sub-optimal performance or ugly workarounds that you won't like. For example, you could use MongoDB for text search. However, a better database for a heavy text-search use case would be Elasticsearch. Just because MongoDB has text indexes and can perform text searches doesn't mean it's the right datastore for them. Heck, you could even use MySQL as a pure key-value store. Databases are all about polyglot persistence these days.

I believe that in the years to come, the Oracle database will become the mainframe of database software: a highly profitable yet legacy piece of software that companies keep using only because they are too afraid to move to something else. For all my friends who still work actively on Oracle databases, perhaps it's time to learn a new skill or move on to something else.

Regards,
Wei Shan


Tips on using MongoDB Shell

I'm currently taking the MongoDB online course M202. It's an advanced course that covers sharded clusters, performance tuning and other useful topics. I have picked up a couple of mongo shell tricks from the course so far.

show profile

On a database that has profiling enabled, the show profile helper in the mongo shell displays the 5 most recent operations that took at least 1 millisecond to execute. Issue show profile from the mongo shell, as follows:

show profile

This is particularly useful when you don't want to go through the entire collection. MongoDB's query syntax isn't the most intuitive language in the world, and it helps when you can do a quick analysis of the slow queries. If you need something more detailed, you can query the system.profile collection directly.
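A minimal sketch, assuming profiling is enabled at level 1 with a 100 ms threshold (both values are just examples):

// Profile operations slower than 100ms (level 1 = slow operations only)
db.setProfilingLevel(1, 100)

// Query the profiler collection directly, e.g. the 5 slowest recent
// operations in the current database
db.system.profile.find().sort({ millis: -1 }).limit(5).pretty()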

show log

There are many ways to view MongoDB log files:

  1. cat mongod.log
  2. via Kibana
  3. mongo shell, show log

I find that #3 is handy, especially during a severity-1 issue. It shows all the log lines that are currently in memory, which also means it only shows recent, relevant logs 🙂
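For reference, two related helpers in the mongo shell:

// List the available in-memory logs (e.g. "global", "startupWarnings")
show logs

// "show log" without an argument defaults to the global log
show log global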

Hope this helps!

Regards,
Wei Shan


MongoDB 3.x – Maximum Oplog Size

The official MongoDB documentation says that the maximum oplog size is 50GB. With 5 years of Oracle database experience, I had developed the bad habit of trusting official documentation entirely. Following the procedure to resize the oplog on my laptop, I was able to resize it to 60GB instead.

To check the current oplog size:

my_rs_1:PRIMARY> rs.printReplicationInfo() 

configured oplog size:   61440MB
log length start to end: 1568secs (0.44hrs)
oplog first event time:  Sun Jan 29 2017 17:55:26 GMT+0000 (GMT)
oplog last event time:   Sun Jan 29 2017 18:21:34 GMT+0000 (GMT)
now:                     Sun Jan 29 2017 18:25:29 GMT+0000 (GMT)

I have resized it to 60GB instead! 🙂
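For reference, the oplog size is controlled by the oplogSizeMB setting (or --oplogSize on the command line). It only takes effect when the oplog is first created, which is why an existing replica set needs the separate resize procedure. A minimal mongod.conf snippet (the replica set name is from my test setup):

# mongod.conf -- only applies when the oplog is first created
replication:
  replSetName: my_rs_1
  oplogSizeMB: 61440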

Working with MongoDB for the last 3 months has made me realise how many things Oracle DBAs take for granted. Things that are supposed to work the way they are documented seem to be a bonus instead of a requirement.

P.S: In hindsight, I believe the 50GB upper bound only applies to the default sizing (5% of free disk space), not to the maximum configurable size.

Hope this helps!

Regards,
Wei Shan


MongoDB 3.2 Sharded Cluster – rs.reconfig() does not work

Following this document to reconfigure a sharded replica set does not work. The replica set seems to be confused: rs.status() shows the correct output, but db.serverStatus() shows otherwise.
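For context, the reconfiguration was along these lines (a sketch, not the exact commands; the member list depends on your topology):

// Fetch the current config, trim the member list and force the
// reconfiguration (force is required when a majority is unreachable)
cfg = rs.conf()
cfg.members = [ { _id: 1, host: "node1:27017" } ]
rs.reconfig(cfg, { force: true })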

rs:PRIMARY> rs.status()
{
    "set" : "rs",
    "date" : ISODate("2017-01-18T03:05:06.202Z"),
    "myState" : 1,
    "term" : NumberLong(20),
    "configsvr" : true,
    "heartbeatIntervalMillis" : NumberLong(2000),
    "members" : [
        {
            "_id" : 1,
            "name" : "node1:27017",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 439,
            "optime" : {
                "ts" : Timestamp(1484708697, 1),
                "t" : NumberLong(20)
            },
            "optimeDate" : ISODate("2017-01-18T03:04:57Z"),
            "infoMessage" : "could not find member to sync from",
            "electionTime" : Timestamp(1484708696, 1),
            "electionDate" : ISODate("2017-01-18T03:04:56Z"),
            "configVersion" : 83051,
            "self" : true
        }
    ],
    "ok" : 1
}
rs:PRIMARY> db.serverStatus()
{
    "host" : "node1:27017",
    "process" : "mongod",
    "repl" : {
        "hosts" : [
            "node1:27017"
        ],
        "setName" : "rs",
        "setVersion" : 83051,
        "ismaster" : false,
        "secondary" : true,
        "primary" : "node1:27017",
        "me" : "node1:27017",
        "electionId" : ObjectId("7fffffff0000000000000014"),
        "rbid" : 398723451

Error log on mongos:

2017-01-17T21:38:40.143Z W NETWORK [ReplicaSetMonitorWatcher] No primary detected for set rs0
2017-01-17T21:38:40.144Z W NETWORK [ReplicaSetMonitorWatcher] No primary detected for set rs1
2017-01-17T21:38:50.440Z W NETWORK [ReplicaSetMonitorWatcher] No primary detected for set rs0
2017-01-17T21:39:00.762Z W NETWORK [ReplicaSetMonitorWatcher] No primary detected for set rs1
2016-12-20T21:18:27.060Z W SHARDING [replSetDistLockPinger] pinging failed for distributed lock pinger :: caused by :: FailedToSatisfyReadPreference: could not find host matching read preference { mode: "primary" } for set rs0(config)
2016-12-20T21:18:29.187Z W NETWORK [mongosMain] Failed to connect to , reason: errno:65 No route to host
2016-12-20T21:18:29.188Z W NETWORK [mongosMain] Failed to connect to , reason: errno:65 No route to host
2016-12-20T21:19:44.217Z W NETWORK [mongosMain] Failed to connect to after 5000 milliseconds, giving up.
2016-12-20T21:19:44.218Z W NETWORK [mongosMain] No primary detected for set rs0(config)
2016-12-20T21:19:44.218Z I SHARDING [mongosMain] Error initializing sharding state, sleeping for 2 seconds and trying again :: caused by :: FailedToSatisfyReadPreference: could not get updated shard list from config server due to could not find host matching read preference { mode: "nearest" } for set rs0(config)

Error log on the single-node replica set mongod process:

2017-01-17T23:50:36.601Z I NETWORK [HostnameCanonicalizationWorker] Starting hostname canonicalization worker 
2017-01-17T23:50:36.602Z I NETWORK [initandlisten] waiting for connections on port 27017 ssl 
2017-01-17T23:50:36.602Z I REPL [ReplicationExecutor] New replica set config in use: {...} 
2017-01-17T23:50:36.603Z I REPL [ReplicationExecutor] This node is in the config 
2017-01-17T23:50:36.603Z I REPL [ReplicationExecutor] transition to STARTUP2 
2017-01-17T23:50:36.603Z I REPL [ReplicationExecutor] Starting replication applier threads 
2017-01-17T23:50:36.604Z I REPL [ReplicationExecutor] transition to RECOVERING 
2017-01-17T23:50:36.608Z I REPL [ReplicationExecutor] transition to SECONDARY 
2017-01-17T23:50:36.608Z I REPL [ReplicationExecutor] conducting a dry run election to see if we could be elected 
2017-01-17T23:50:36.608Z I REPL [ReplicationExecutor] dry election run succeeded, running for election 
2017-01-17T23:50:36.608Z I REPL [ReplicationExecutor] election succeeded, assuming primary role in term 38 
2017-01-17T23:50:36.608Z I REPL [ReplicationExecutor] transition to PRIMARY 
2017-01-17T23:50:37.668Z I SHARDING [rsSync] Sharding state recovery process found document { _id: "minOpTimeRecovery", configsvrConnectionString: "rs0/", shardName: "rs0", minOpTime: { ts: Timestamp 1481206005000|6, t: 15 }, minOpTimeUpdaters: 0 } 
2017-01-17T23:50:37.669Z I SHARDING [ShardingState initialization] first cluster operation detected, adding sharding hook to enable versioning and authentication to remote servers
2017-01-17T23:50:37.690Z I SHARDING [ShardingState initialization] Updating config server connection string to: rs0/ 
2017-01-17T23:50:37.690Z I NETWORK [ShardingState initialization] Starting new replica set monitor for rs0/ 
2017-01-17T23:50:37.690Z I NETWORK [ReplicaSetMonitorWatcher] starting 
2017-01-17T23:50:37.703Z I SHARDING [thread1] creating distributed lock ping thread for process rs0/ (sleeping for 30000ms) 
2017-01-17T23:51:32.666Z I ACCESS [conn2] Successfully authenticated as principal user on admin 
2017-01-17T23:51:52.712Z W NETWORK [replSetDistLockPinger] Failed to connect to after 5000 milliseconds, giving up. 
2017-01-17T23:51:52.712Z W NETWORK [ShardingState initialization] Failed to connect to after 5000 milliseconds, giving up. 
2017-01-17T23:51:52.712Z W NETWORK [ShardingState initialization] No primary detected for set rs0(config)
2017-01-17T23:51:52.712Z W SHARDING [replSetDistLockPinger] pinging failed for distributed lock pinger :: caused by :: FailedToSatisfyReadPreference: could not find host matching read preference { mode: "primary" } for set rs0(config) 
2017-01-17T23:51:52.712Z I SHARDING [ShardingState initialization] Error initializing sharding state, sleeping for 2 seconds and trying again :: caused by :: FailedToSatisfyReadPreference: could not get updated shard list from config server due to could not find host matching read preference { mode: "nearest" } for rs0(config)

Root Cause:

The config database information isn't updated on the shard members. We need to remove the cached data from db.system.version, as in step #6 of this document.
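A sketch of clearing that cached document, based on the minOpTimeRecovery document visible in the mongod log above (run against each affected shard's primary):

// Remove the cached config server connection string
use admin
db.system.version.remove(
    { _id: "minOpTimeRecovery" },
    { writeConcern: { w: "majority" } }
)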

Hope this helps.

Regards,
Wei Shan


MySQL 5.7 – Index Condition Pushdown Optimisation Bug

Index Condition Pushdown (ICP) is an optimisation for the case where MySQL retrieves rows from a table using an index. With ICP, the parts of the WHERE clause that reference indexed columns are pushed down to the storage engine, which uses them to filter rows before fetching the full row, instead of fetching every row and filtering afterwards.

Disable ICP if you are running queries with ORDER BY ... DESC against partitioned tables; it slows such queries down considerably. You will definitely hit this bug if you are running a Zabbix server on a MySQL database.

To disable ICP, use the following:

SET optimizer_switch='index_condition_pushdown=off';
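For context, a hypothetical query of the shape that triggers the slowdown: a partitioned Zabbix history table, a range condition on the indexed columns and an ORDER BY ... DESC (the values are illustrative):

SELECT clock, ns, value
FROM history_uint
WHERE itemid = 12345
  AND clock <= 1482192000
ORDER BY clock DESC
LIMIT 1;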

Hope this helps!

Regards,
Wei Shan


Using strace in Linux to troubleshoot database performance issues

Use Case:

Using top and iostat, we know that the database is slow due to IO constraints. What if we want to narrow it down to the exact file causing the performance issue? What if we want to know what the DB process is waiting on?

Find the PID of the process

[root@hostname ~]# ps -aef | grep mysqld
mysql 16013 15513 23 Dec08 ? 1-00:17:41 /usr/sbin/mysqld --basedir=/usr --datadir=/data/mysql --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --wsrep-provider=/usr/lib64/libgalera_smm.so --log-error=/var/log/mysql/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock --wsrep_start_position=a09ed4ce-79bc-11e6-9b2e-63d190df3dd7:42170895

Find out which system calls took the longest time

[root@hostname ~]# strace -cp 16013
Process 16013 attached
^CProcess 16013 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 98.63   28.601334      147430       194           poll
  1.33    0.387070      387070         1           restart_syscall
  0.03    0.008057          22       359           futex
  0.00    0.000891           5       195           accept
  0.00    0.000369           2       195           fcntl
------ ----------- ----------- --------- --------- ----------------
100.00   28.997721                   944           total

Attach strace to the PID to find out the file descriptor it is waiting for.

[root@hostname ~]# strace -p 16013
Process 16013 attached
restart_syscall(<... resuming interrupted call ...>) = 1
accept(34, {sa_family=AF_LOCAL, NULL}, [2]) = 128
fcntl(128, F_SETFD, FD_CLOEXEC) = 0
futex(0x1e937c4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x1e937c0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x1e93800, FUTEX_WAKE_PRIVATE, 1) = 1
poll([{fd=50, events=POLLIN}, {fd=34, events=POLLIN}], 2, 4294967295) = 1 ([{fd=34, revents=POLLIN}])
accept(34, {sa_family=AF_LOCAL, NULL}, [2]) = 128
fcntl(128, F_SETFD, FD_CLOEXEC) = 0
futex(0x1e937c4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x1e937c0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x1e93800, FUTEX_WAKE_PRIVATE, 1) = 1
poll([{fd=50, events=POLLIN}, {fd=34, events=POLLIN}], 2, 4294967295) = 1 ([{fd=34, revents=POLLIN}])
accept(34, {sa_family=AF_LOCAL, NULL}, [2]) = 128
fcntl(128, F_SETFD, FD_CLOEXEC) = 0

Find the file that the file descriptor is pointing to!

[root@hostname ~]# ls -altr /proc/16013/fd
lrwx------. 1 root root 64 Dec 12 15:46 50 -> /data/mysql/zabbix/history_uint#P#p201612160000.ibd
lrwx------. 1 root root 64 Dec 12 15:46 51 -> /data/mysql/zabbix/history_uint#P#p201612170000.ibd

Now we know that the mysqld process has been waiting on the partitioned data file history_uint#P#p201612160000.ibd (file descriptor 50 in the poll calls above)!
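As a cross-check, lsof gives a similar view without digging through /proc yourself (a sketch, assuming lsof is installed):

[root@hostname ~]# lsof -p 16013 | grep /data/mysql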

Hope this helps.

Regards,
Wei Shan


2016 Year End Thoughts

I began this blog more than 4 years ago, on 8th September 2012. My life has undergone a massive change since then. 4 years ago, I started my first job, working with Oracle databases on AIX/Solaris. Today, I work mainly with MongoDB/PostgreSQL/MySQL on FreeBSD/Linux. Within this short timespan, proprietary software's market share has shrunk massively. For every piece of proprietary software, there is an open-source version out there. It might not be as enterprise-ready as its paid counterpart, but hey, it's free! Also, databases are a lot cooler to work on now!

This year has been a life-changing experience for me. I did a lot of freelance database consulting work, relocated to a different continent, and worked in a team with a culture very different from mine. The tech industry in London is a lot more vibrant than Singapore's, and the average quality of engineers is a lot higher than back home. The best thing that happened to me professionally is that I realised I wasn't as good as I thought. I'm happy that I came to this realisation myself instead of having it shoved down my throat. (Think of China during the Qing dynasty!)

My professional motto has always been to be the dumbest guy in the room. Until now, I never was; fortunately, there were always a few guys just a tad dumber than me. In this new team, however, I do feel like the dumbest guy in the room. It's probably more fact than feeling. It is brutal and humbling, and definitely not a pleasant experience. It doesn't help emotionally that the working culture is forthright: people are more than willing to engage in point-blank technical conversations and be as blunt as they can be.

However, it does kindle a burning desire within me to strive to be a better engineer. I want to read as much as I can and catch up to their level. I will let my inner Asian mindset shine here 🙂 2016 has been an amazing ride so far; I can't wait for 2017 to happen. It's going to be so much more exciting!

I wish everyone an awesome year in 2017!

Regards,
Wei Shan


MongoDB 3.2 – Duplicating a collection with or without transformation

If you have a requirement to duplicate a collection, there are multiple ways to do it: mongoexport/mongoimport, db.collection.copyTo(), a hand-written JavaScript loop, or simply an aggregation pipeline. The first two methods lack the option to perform any transformation; the last two let you choose whether to transform the documents along the way.
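For reference, the export/import route looks roughly like this (a sketch; the database name mydb is hypothetical, the collection names match the example below):

# Dump the collection to JSON, then load it into the new collection
mongoexport -d mydb -c books -o books.json
mongoimport -d mydb -c books2 --file books.json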

Consider a books collection with the following document:

{
 "_id" : ObjectId("5836d55d500dsa1230f488ab0"),
 "a" : 1,
 "b" : "12345678",
 "c" : "12345678"
}

We want to duplicate the above, copy the value of _id into a new field, d, and write the result to another collection called books2. We could write a JavaScript loop (find().forEach) that builds an initializeUnorderedBulkOp job, transforms the fields and inserts them into the other collection, as sketched below.
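A sketch of that JavaScript approach (collection names are from this post; new _id values are generated on insert):

var bulk = db.books2.initializeUnorderedBulkOp();
db.books.find().forEach(function(doc) {
    doc.d = doc._id;   // copy _id into the new field "d"
    delete doc._id;    // let MongoDB generate a fresh _id
    bulk.insert(doc);
});
bulk.execute();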

We could also use MongoDB aggregation to do it.

db.books.aggregate( [ { $project : { _id: 0, a:1, b:1, c:1, d: "$_id"} }, { $match: {} }, { $out : "books2" } ] )

One line of code to do exactly what you need! Using my laptop (an MBP with 16GB of memory) as a test bed with a million records, the aggregation was around 12 times faster than the JavaScript approach. Unlike copyTo(), there's no database-level locking, so it's friendly to your production system (although a big collection will still mess up your cache).

Personally, I like aggregation a lot. It feels a lot like Hadoop/MapReduce, where the throughput is amazing but the latency is crap.

Regards,
Wei Shan

References:

  1. http://stackoverflow.com/questions/8933307/clone-a-collection-in-mongodb
  2. https://docs.mongodb.com/v3.2/reference/operator/aggregation/project/
