Is BI + NoSQL = NoGo?
Successful leaders stay ahead of the competition, by making educated decisions based on hard data. In some cases, organization's data is so vast (Big Data), that Microsoft Excel just won’t cut it. If that’s your situation, you will probably want to consider a Business Intelligence (hereafter BI) solution.
One important aspect when evaluating a BI tool, is to check if the BI tool can work well with all data sources in your organization (existing and planned). In this post, we will explore one prevalent data source, the NoSQL database. We will see how well BI tools and NoSQL databases work together. Spoiler alert: there is an Impedance Mismatch between the two.
We can start with a quick introduction to BI and NoSQL as it will help better understand the main problem. The amount of data available for business this days, is vast and growing rapidly year after year. So is the need to transform raw data, into meaningful and digestible information. One that businesses can act upon. So, if the data is available for for analysis, why rely on gut feelings when making business decisions? This is where BI comes into the picture, allowing stakeholders to make conscious decisions based on hard data. This is how I see BI fitting in to the broader picture:
* The data warehousing layer is optional but highly recommended in an enterprise environment.
Over the years, many small to large companies entered the BI market. To name a few: Microsoft (Deltallegro), IBM (Cognos), SAP (business objects), Oracle (Hyperion), MicroStratagy, SAS. For further reading, you can refer to one of the reputable sources of information about the business intelligence market: “Gartner BI Magic Quadrant”.
As for NoSQL databases, the key point here is that most (if not all) NoSQL databases are schema-less. Also, please note that in this post, I refer mainly to the ‘Document Store’ type of NoSQL databases (though, some will be applicable to other types like the Key-Value Store and Column-based Store). With that in mind, we can continue to describe the problem.
BI & NoSQL – Impedance Mismatch:
The crux of the matter is that in most cases, NoSQL databases are not easily consumable by BI solutions. On one hand, BI tools usually expect tabular data (like we get from relational databases). Those tools use semantic models for analysis, for which a schema is needed. On the other hand, NoSQL databases being schemaless, do not expose the data in such a way (SQL like interface).
Here is one example of the incompatibility of BI & NoSQL. The following is an excerpt taken from a thesis done with regards to Tibco Spotfire:
“While Spotfire includes tools analyzing and visualizing data, the data itself is supplied by the user. The different systems from which one can import data into Spotfire is a key competition point for the product. Today, Spotfire supports importing data from several RDBMSs, spreadsheets and CSV4 -like formats. However, there is currently no support for any NoSQL alternative.”
Source: Extracting Data from NoSQL Databases - A Step towards Interactive Visual Analysis of NoSQL Data
OK, so what can we do about it? There is no panacea, but some of the following options can serve as a workarounds.
Open Database Connectivity (ODBC) driver.
Using an ODBC driver on top of NoSQL database will help expose the relational interface that we need:
The ODBC driver will function as the adapter needed between the NoSQL database and the BI tool. Here are some advantages and disadvantage:
- the ODBC driver introduces an abstraction layer, so the NoSQL database and BI tool will be decoupled.
- the ODBC driver provides fine granularity in the schema definition. We can choose which metadata fields, from the JSON document, to expose to the BI tool. For example, you can expose employee demographic details and intentionally omit the compensation history (in the case this information is stored in one document).
- on the down side we get added complexity to the overall solution by adding this middleware.
That sounded like a good option, but there was a catch... I guess everything comes with a price. In order for us to use the ODBC driver, we must pre-configure the schema (regardless of the vendor). The problem with this initial step is that once the schema definition is created, it will not change on the fly. That is, a new schema we not be accounted for. That fact hinders one of the main perks NoSQL databases provides us. The elasticity of the ‘schema’. That is, the ability to store documents with different metadata structure (not all of the ‘Person’ objects will have a Fax number!). While this does not affect all applications, it will impose a severe limitation for applications that rely on dynamic schema.
As I understand, the way it works is that the driver will infer the schema from the connection (like what Simba did on this aspect). How you ask? The driver will take the first N rows as a representing set and the schema will based on the metadata extracted from that set (this is where the issue emerges).
We can look at following example to demonstrate the issue:
"name": "A green door",
"tags": ["home", "green"]
"name": "A blue door",
"tags": ["home", "blue"]
"name": "A yellow door",
"tags": ["home", "yellow"]
"name": "A white door",
"tags": ["home", "white"]
In this case, the “height” tag (in document N+1) will not be discovered by the driver and will be reflected in the schema. To generalize, any additional fields in the documents beyond the Nth document, will not be included in the schema.
To summarize, if the data model is fairly static in the NoSQL database, we can get away with this approach and use the ODBC driver.
Here are a few of the ODBC drivers available (none of which were certified by MongoDB):
- Simba technologies
- This one is an open source driver and at this point still immature: https://github.com/NYUITP/sp13_10g
This option is for MongoDB users:
The following option is similar to the previously mentioned ODBC drivers, but is it specific to MongoDB. On June 2 2015, “MongoDB announced a new connector for BI and visualization, which connects MongoDB to industry-standard business intelligence (BI) and data visualization tools. Designed to work with every SQL-compliant data analysis tool on the market, including Tableau, SAP Business Objects, Qlik and IBM Cognos Business Intelligence. The connector is currently in preview and expected to become generally available in the fourth quarter of 2015”. Kelly Stirman, VP of strategy at MongoDB explained: “The connector is compatible with any SQL-based ODBC based tool — pretty much everything out there”
- New MongoDB Connector Creates Direct Link To Data Visualization Tools
- Tableau & MongoDB: Visual Analytics on JSON at the Speed of Thought
- Opening modern application data to new generation visual analysis and traditional BI tools
I contacted the MongoDB support team to understand if and how this new connector will be different from any other ODBC driver (like Simba’s). From their reply, I gathered that MongoDB is aiming high and building a connector that will offer the best performance possible. Hopefully this will be the case, as it will eliminate the need for a third party ODBC connector.
A couple of important points to mention:
- This connector suffers from the same issue as the other ODBC drivers (static schema inference). See the ODBC section for more information.
- Even though this new connector will expose SQL like interface to BI tools, it is only highly optimized for Tableau (at least at the beginning). That can lead to less than optimal query performance for other BI tools.
- The connector will be ‘Read only’ (that is, one way). Other ODBC drivers like Simba’s, can write to the database, however it seems that there is no mainstream use case for writing to the DB when using BI tools.
Extract, Transform and Load (ETL) set of tools:
You can read more about it here. This set of tools usually goes hand in hand with Data Warehousing. It functions as the bridge between the different data sources and the data warehousing layer (see the pyramid below).
Some of the advantages of the ETL approach:
- The Extraction phase, consumes as little resources as possible from the data repositories and the data warehousing systems.
- A Transformation can be applied to the data before loading it to the target. For example, joining data from several sources.
- Load & Extract are somewhat mirrored phases. As in the Extract phase, the Load phase allows for efficient data population of the Data Warehousing layer.
Note that phases can overlap, that is, executed concurrently.
In the case that a BI client queries the data sources directly, we get the up-to-date information. That perk comes with a price, a performance hit to production (live) systems. To avoid this performance hit, we can use the ETL approach, where the BI client is querying the Data Warehousing layer. The downside is that we might get stale data. It relies on the frequency at which the ETL process is executed. This introduces a trade off between the performance and the ‘freshness’ of the data. Expanding on the previous pyramid:
Some of the ETL tools out there:
- Pentaho Kettle (a Hitachi data system company):
- Microsoft SQL Server Integration Services (SSIS)
- See WiKi for additional vendors
Export the NoSQL data to CSV files. While not a practical option for enterprise solutions, I thought I would mention it as it might be applicable in some situations. With this workaround, we lose the dynamic nature of the data in the database, even if we script the export with some reasonable frequency. However, if your data set is small you might get away with it.
Apache Hive (usually on top of Hadoop):
This will not work well if real-time reporting is essential. See here for further information “Cloudera Impala vs Spark/Shark vs Apache Drill”
I hope this helps.