Migrating Oracle Sample Schema to MongoDB

Migrating Oracle Sample Schema to MongoDB

It is not an easy feat to migrate your current application using Oracle or any relational database to MongoDB. You will need to remodel your schema to fit into MongoDB. If you are migrating from a relational database to another relational database, it is a lot easier (Still not that easy though!). In this blog post, I try to migrate the famous Oracle database sample schema to MongoDB. I’m not saying that from an application POV(think ACID!), this is a good thing, I’m illustrating the fact that you can’t just take a schema from a relational database and dump it onto MongoDB.

Let’s take a look at the ERD diagram of sales history schema.

sh schema

The sales history schema above represents the sales history for a company.

The sample company does a high volume of business, so it runs business statistics reports to aid in decision support. Many of these reports are time-based and nonvolatile. That is, they analyse past data trends. The company loads data into its data warehouse regularly to gather statistics for these reports. These reports include annual, quarterly, monthly, and weekly sales figures by product. The company also runs reports on distribution channels through which its sales are delivered. When the company runs special promotions on its products, it analyse the impact of the promotions on sales. It also analyse sales by geographical area.

Based on my understanding above, there’s a few assumptions to be made here:

  1. This is a data warehouse schema, not OLTP workload
  2. The application query distribution channels through which its sales are delivered.
  3. The application query about promotion impact on sales
  4. The application query sales history by geographical area

Rules to follow in MongoDB data modelling

MongoDB understands that people who are new to NoSQL would not understand how to data model efficiently. They have an entire chapter of documentation dedicated to it! Some of the limitations or rules you should be aware of:

MongoDB “Sales History” Schema – Version 1

Based on the MongoDB official documentation in data modelling, the following would be good enough.

SALES

{
 "_id": {
 "PROD": {},
  "PROD_ID": {},
  "PROD_NAME": {},
  "PROD_DESC": {},
  "PROD_SUBCATEGORY": {},
  "PROD_SUBCATEGORY_ID": {},
  "PROD_SUBCATEGORY_DESC": {},
  "PROD_CATEGORY": {},
  "PROD_CATEGORY_ID": {},
  "PROD_CATEGORY_DESC": {},
  "PROD_WEIGHT_CLASS": {},
  "PROD_UNIT_OF_MEASURE": {},
  "PROD_PACK_SIZE": {},
  "PROD_SUPPLIER_ID": {},
  "PROD_STATUS": {},
  "PROD_LIST_PRICE": {},
  "PROD_MIN_PRICE": {},
  "PROD_TOTAL": {},
  "PROD_TOTAL_ID": {},
  "PROD_SRC_ID": {},
  "PROD_EFF_FROM": {},
  "PROD_EFF_TO": {},
  "PROD_VALID": {},
  "PROD_COST": {
  "TIME_DESC": {},
  "UNIT_COST": {},
  "UNIT_PRICE": {}
 }
 },
 "CUST": {
  "CUST_ID": {},
  "CUST_FIRST_NAME": {},
  "CUST_LAST_NAME": {},
  "CUST_GENDER": {},
  "CUST_YEAR_OF_BIRTH": {},
  "CUST_MARITAL_STATUS": {},
  "CUST_STREET_ADDRESS": {},
  "CUST_POSTAL_CODE": {},
  "CUST_CITY": {},
  "CUST_CITY_ID": {},
  "CUST_STATE_PROVINCE": {},
  "CUST_STATE_PROVINCE_ID": {},
  "COUNTRY": {
  "COUNTRY_ID": {},
   "COUNTRY_ISO_CODE": {},
   "COUNTRY_NAME": {},
   "COUNTRY_SUBREGION": {},
   "COUNTRY_SUBREGION_ID": {},
   "COUNTRY_REGION": {},
   "COUNTRY_REGION_ID": {},
   "COUNTRY_TOTAL": {},
   "COUNTRY_TOTAL_ID": {},
   "COUNTRY_NAME_HIST": {}
 },
  "CUST_MAIN_PHONE_NUMBER": {},
  "CUST_INCOME_LEVEL": {},
  "CUST_CREDIT_LIMIT": {},
  "CUST_EMAIL": {},
  "CUST_TOTAL": {},
  "CUST_TOTAL_ID": {},
  "CUST_SRC_ID": {},
  "CUST_EFF_FROM": {},
  "CUST_EFF_TO": {},
  "CUST_VALID": {}
 },
 "TIME": {
  "TIME_ID": {},
  "DAY_NAME": {},
  "DAY_NUMBER_IN_WEEK": {},
  "DAY_NUMBER_IN_MONTH": {},
  "CALENDAR_WEEK_NUMBER": {},
  "FISCAL_WEEK_NUMBER": {},
  "WEEK_ENDING_DAY": {},
  "WEEK_ENDING_DAY_ID": {},
  "CALENDAR_MONTH_NUMBER": {},
  "FISCAL_MONTH_NUMBER": {},
  "CALENDAR_MONTH_DESC": {},
  "CALENDAR_MONTH_ID": {},
  "FISCAL_MONTH_DESC": {},
  "FISCAL_MONTH_ID": {},
  "DAYS_IN_CAL_MONTH": {},
  "DAYS_IN_FIS_MONTH": {},
  "END_OF_CAL_MONTH": {},
  "END_OF_FIS_MONTH": {},
  "CALENDAR_MONTH_NAME": {},
  "FISCAL_MONTH_NAME": {},
  "CALENDAR_QUARTER_DESC": {},
  "CALENBDAR_QUARTER_ID": {},
  "FISCAL_QUARTER_DESC": {},
  "FISCAL_QUARTER_ID": {},
  "DAYS_IN_CAL_QUARTER": {},
  "DAYS_IN_FIS_QUARTER": {},
  "END_OF_CAL_QUARTER": {},
  "END_OF_FIS_QUARTER": {},
  "CALENDAR_QUARTER_NUMBER": {},
  "FISCAL_QUARTER_NUMBER": {},
  "CALENDAR_YEAR": {},
  "CALENDAR_YEAR_ID": {},
  "FISCAL_YEAR": {},
  "FISCAL_YEAR_ID": {},
  "DAYS_IN_CAL_YEAR": {},
  "DAYS_IN_FIS_YEAR": {},
  "END_OF_CAL_YEAR": {},
  "END_OF_FIS_YEAR": {}
 },
 "CHANNEL": {
  "CHANNEL_ID": {},
  "CHANNEL_DESC": {},
  "CHANNEL_CLASS": {},
  "CHANNEL_CLASS_ID": {},
  "CHANNEL_TOTAL": {},
  "CHANNEL_TOTAL_ID": {}
 },
 "PROMO": {
  "PROMO_ID": {},
  "PROMO_NAME": {},
  "PROMO_SUBCATEGORY": {},
  "PROMO_SUBCATEGORY_ID": {},
  "PROMO_CATEGORY": {},
  "PROMO_CATEGORY_ID": {},
  "PROMO_COST": {},
  "PROMO_BEGIN_DATE": {},
  "PROMO_END_DATE": {},
  "PROMO_TOTAL": {},
  "PROMO_TOTAL_ID": {}
 },
 "QUANTITY_SOLD": {},
 "AMOUNT_SOLD": {}
}

However, this design pattern will result into 1 big massive JSON document that is slow. It is also more likely to encounter the MongoDB limitations listed earlier. What is a better approach then?

MongoDB “Sales History” Schema – Version 2

Let’s go with the easy bits first.

CUSTOMER

{
 "_id": {},
 "CUST_FIRST_NAME": {},
 "CUST_LAST_NAME": {},
 "CUST_GENDER": {},
 "CUST_YEAR_OF_BIRTH": {},
 "CUST_MARITAL_STATUS": {},
 "CUST_STREET_ADDRESS": {},
 "CUST_POSTAL_CODE": {},
 "CUST_CITY": {},
 "CUST_STATE_PROVINCE": {},
 "COUNTRY": {
  "COUNTRY_NAME": {},
  "COUNTRY_ISO_CODE": {},
  "COUNTRY_SUBREGION": {},
  "COUNTRY_REGION": {},
  "COUNTRY_TOTAL": {},
  "COUNTRY_NAME_HIST": {}
 },
 "CUST_MAIN_PHONE_NUMBER": {},
 "CUST_INCOME_LEVEL": {},
 "CUST_CREDIT_LIMIT": {},
 "CUST_EMAIL": {},
 "CUST_TOTAL": {},
 "CUST_SRC_ID": {},
 "CUST_EFF_FROM": {},
 "CUST_EFF_TO": {},
 "CUST_VALID": {}
}

Simply follow this one to one relationship design pattern for the CUSTOMER collection.

PRODUCT

{
 "_id": {},
 "PROD_NAME": {},
 "PROD_DESC": {},
 "PROD_SUBCATEGORY": {
  "PROD_SUBCATEGORY_ID": {},
  "PROD_SUBCATEGORY_DESC": {}
 },
 "PROD_CATEGORY": {
  "PROD_CATEGORY_ID": {},
  "PROD_CATEGORY_DESC": {}
 },
 "PROD_WEIGHT_CLASS": {},
 "PROD_UNIT_OF_MEASURE": {},
 "PROD_PACK_SIZE": {},
 "PROD_SUPPLIER_ID": {},
 "PROD_STATUS": {},
 "PROD_LIST_PRICE": {},
 "PROD_MIN_PRICE": {},
 "PROD_TOTAL": {},
 "PROD_TOTAL_ID": {},
 "PROD_SRC_ID": {},
 "PROD_EFF_FROM": {},
 "PROD_EFF_TO": {},
 "PROD_VALID": {},
 "COST": {
  "TIME_ID": {},
  "PROMO_ID": {},
  "CHANNEL_ID": {},
  "UNIT_COST": {},
  "UNIT_PRICE": {}
 }
}

PROMOTION

{
 "_id": {},
 "PROMO_NAME": {},
 "PROMO_SUBCATEGORY": {},
 "PROMO_CATEGORY": {},
 "PROMO_COST": {},
 "PROMO_BEGIN_DATE": {},
 "PROMO_END_DATE": {},
 "PROMO_TOTAL": {}
}

I have chose to combine PRODUCTS and COSTS into a single PRODUCT collection. You may find this design pattern over here. I did not choose to have the PROMOTION in the same collection. Instead, I decided to use references. The reason is because a product might have hundreds or thousands of promotion, an embedded document here would not be scalable for the PRODUCT collection.

CHANNELS

{
 "_id": {},
 "CHANNEL_DESC": {},
 "CHANNEL_CLASS": {},
 "CHANNEL_CLASS_ID": {},
 "CHANNEL_TOTAL": {},
 "CHANNEL_TOTAL_ID": {}
}

TIMES

{
 "_id": {},
 "DAY_NAME": {},
 "DAY_NUMBER_IN_WEEK": {},
 "DAY_NUMBER_IN_MONTH": {},
 "CALENDAR_WEEK_NUMBER": {},
 "FISCAL_WEEK_NUMBER": {},
 "WEEK_ENDING_DAY": {},
 "WEEK_ENDING_DAY_ID": {},
 "CALENDAR_MONTH_NUMBER": {},
 "FISCAL_MONTH_NUMBER": {},
 "CALENDAR_MONTH_DESC": {},
 "CALENDAR_MONTH_ID": {},
 "FISCAL_MONTH_DESC": {},
 "FISCAL_MONTH_ID": {},
 "DAYS_IN_CAL_MONTH": {},
 "DAYS_IN_FIS_MONTH": {},
 "END_OF_CAL_MONTH": {},
 "END_OF_FIS_MONTH": {},
 "CALENDAR_MONTH_NAME": {},
 "FISCAL_MONTH_NAME": {},
 "CALENDAR_QUARTER_DESC": {},
 "CALENDAR_QUARTER_ID": {},
 "FISCAL_QUARTER_DESC": {},
 "FISCAL_QUARTER_ID": {},
 "DAYS_IN_CAL_QUARTER": {},
 "DAYS_IN_FIS_QUARTER": {},
 "END_OF_CAL_QUARTER": {},
 "END_OF_FIS_QUARTER": {},
 "CALENDAR_QUARTER_NUMBER": {},
 "FISCAL_QUARTER_NUMBER": {},
 "CALENDAR_YEAR": {},
 "CALENDAR_YEAR_ID": {},
 "FISCAL_YEAR": {},
 "FISCAL_YEAR_ID": {},
 "DAYS_IN_CAL_YEAR": {},
 "DAYS_IN_FIS_YEAR": {},
 "END_OF_CAL_YEAR": {},
 "END_OF_FIS_YEAR": {}
}

Honestly, I’m not too happy to keep TIMES and CHANNEL collection separate. However, if I try to embed them in other collection (COST/SALES), I would not be able to create a new TIMES or CHANNEL document without having a COST or SALES for it.

SALES

{
 "PROD_ID": {},
 "CUST_ID": {},
 "TIME_ID": {},
 "CHANNEL_ID": {},
 "PROMO_ID": {},
 "QUANTITY_SOLD": {},
 "AMOUNT_SOLD": {}
}

I would shard the SALES collection on TIME_ID, CHANNEL_ID and PROMO_ID. Based on the assumptions previously, I would assume that majority of the query will filter by TIME_ID, PROMO_ID and CHANNEL_ID.

Summary

On the example above, I assumed that most of the queries will be revolving around SALES collection. On various collections above, I tried to guess the collection growth size and decide if I should embed the collection or use ID reference instead. Obviously, if my assumptions are wrong, the queries would be sub-optimal.

As you might have noticed, just because MongoDB doesn’t have a schema restriction, it doesn’t mean that you don’t need data modelling. In fact, it’s a paradigm shift for schema design between relational database and MongoDB. In MongoDB, you need to design your schema based on the type of questions that your application will be asking.

Now, if you are SQL developer, you will notice that there will be a need to join some of the collections together for BI reports. If so, you will need to do that at the application layer. If you are looking for a single left outer join, use $lookup instead.

At the end of the day, MongoDB is not suitable for all application! Yes, you can now use $Graphlook for finding relationships between objects. But you wouldn’t want to based your brand new Uber-For-Facebook(the friend suggestion feed) on MongoDB. It is clearly not designed to do that. In the era of NoSQL, use the right database for the correct problem.

References:

P.S: Do let you know if you have any other ideas on modelling the sample schema. I would love to hear your thoughts on this!

Regards,
Wei Shan

Advertisements
  1. Leave a comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: