The “Big” in Big Data

Over the last year, many industry analysts have tried to define Big Data. The most common dimensions used to define it are the 3 V's: Volume, Velocity and Variety (volume = multiple terabytes up to petabytes; variety = numbers, audio, video, text, streams, weblogs, social media, etc.; velocity = the speed at which the data is collected). Although the 3 V's work well as parameters for Big Data, there are other forces at play that need to be captured to understand its true nature. In short, to describe the data landscape more holistically, we need to step beyond the 3 V's. While the 3 V's are best classified as salient features of the data itself, the real drivers of Big Data are technology, economics and the tangible value that can be extracted from the data: in other words, the business insights!

Here I want to take a closer look at some of the drivers of Big Data.

Technology:

Big Data analysis requires processing huge volumes of non-relational, weakly schematized data sets at an extremely fast pace. This need sparked the rapid emergence of technologies like Hadoop that help pre-process unstructured data on the fly and perform quick exploratory analytics. This model breaks away from the traditional approach of using procedural code and state management to manage transactions.

Along with new preprocessing technologies, we have also seen the growth of alternative DBMS technologies like NoSQL and NewSQL that further help analyze large chunks of data in non-traditional structures (for example, using trees, graphs or key-value pairs instead of tables).
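To make the contrast concrete, here is a toy sketch (using plain Python dictionaries as a stand-in, not any specific NoSQL product) of the same customer data modeled relationally versus as a key-value document:

```python
# Illustrative sketch: the same customer data modeled two ways.
# The record contents here are hypothetical examples.

# Relational-style: rigid schema, rows in separate tables linked by keys.
customers_table = [
    {"id": 1, "name": "Alice", "city": "Austin"},
]
orders_table = [
    {"order_id": 100, "customer_id": 1, "item": "laptop"},
]

# Key-value / document-style: a weak schema, nested freely per record.
store = {
    "customer:1": {
        "name": "Alice",
        "city": "Austin",
        "orders": [{"order_id": 100, "item": "laptop"}],
        # New fields can be added per record without migrating a schema.
        "social": {"twitter": "@alice"},
    }
}

# One lookup retrieves the whole aggregate; no join is required.
record = store["customer:1"]
print(record["orders"][0]["item"])  # laptop
```

The key-value form trades the guarantees of a fixed schema for flexibility: each record can carry whatever structure the application needs.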

Other changes are happening on the infrastructure side. High-performance, highly scalable architectures have been emerging, including parallel processing, high-speed networking and fast I/O storage, which further help process large volumes of data at higher throughput.

In addition to the technological changes, we are also witnessing a fundamental paradigm shift in the way DBAs and data architects analyze data. For example, instead of enforcing ACID (atomicity, consistency, isolation, durability) compliance across all database transactions, we are seeing a more flexible approach: enforcing ACID only where necessary and otherwise designing an eventually consistent system in a more iterative fashion.
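A toy sketch can illustrate the relaxed-consistency idea (this is a deliberately simplified model for illustration, not how any particular system is implemented): writes are acknowledged by a primary and replicated asynchronously, so a replica read can be momentarily stale before the copies converge.

```python
# Toy model of eventual consistency: a primary store, a replica, and an
# asynchronous replication queue. All names here are hypothetical.
from collections import deque

primary = {}
replica = {}
replication_queue = deque()

def write(key, value):
    primary[key] = value                     # acknowledged immediately
    replication_queue.append((key, value))   # applied to the replica later

def replicate():
    # Drain the queue; after this, the replica has converged.
    while replication_queue:
        key, value = replication_queue.popleft()
        replica[key] = value

write("balance", 100)
stale = replica.get("balance")   # None: the replica has not caught up yet
replicate()                      # system quiesces; copies converge
fresh = replica.get("balance")   # 100: eventually consistent
print(stale, fresh)
```

The trade-off is exactly the one described above: the write path stays fast and available, at the cost of reads that are only guaranteed correct once replication has caught up.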

Economics:

The emergence of these new technologies is further fueled by the economics of providing highly scalable business analytics solutions at low cost. Hadoop comes to mind as the prime example. I found a valuable white paper that describes how to build a three-node Hadoop solution using a Dell OptiPlex desktop PC running Linux as the master machine. The solution was priced at under $5,000.

These kinds of economics are driving faster adoption of new technologies on off-the-shelf hardware, enabling even a research scientist or a college student to easily repurpose their hardware to try out new software frameworks.

Business Insights:

I cannot stress enough the importance of business insights, also highlighted in my previous blog post (Business Intelligence: The Big Picture). Even as enterprises get smarter at managing their data, they must realize that no matter how small or big a data set is, its true value is realized only once it has produced actionable information (insights)! With this in mind, we must view the implementation of a Big Data architecture as incomplete until the data has been analyzed and the actionable information reported out to its users. Some examples of successful business insights implementations include (but are not limited to):

  • Recommendation engines: increase average order size by recommending complementary products based on predictive analysis for cross-selling (commonly seen on Amazon, eBay and other online retail websites)
  • Social media intelligence: one of the most powerful use cases I have witnessed recently is the MicroStrategy Gateway application that lets enterprises combine their corporate view with a customer’s Facebook view
  • Customer loyalty programs: many prominent insurance companies have implemented these solutions to gather useful customer trends
  • Large-scale clickstream analytics: many ecommerce websites use clickstream analytics to correlate customer demographic information with their buying behavior.
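The first example above, a recommendation engine, can be sketched in a few lines using simple co-occurrence counting ("customers who bought X also bought Y"). The product names and order data below are hypothetical, and a production engine would use far more sophisticated predictive models:

```python
# Minimal co-occurrence recommender sketch: count how often pairs of
# products appear in the same order, then recommend the most frequent
# companions of a given product.
from collections import Counter
from itertools import permutations

orders = [
    {"camera", "sd_card", "tripod"},
    {"camera", "sd_card"},
    {"camera", "sd_card"},
    {"camera", "tripod"},
    {"laptop", "mouse"},
]

co_counts = Counter()
for basket in orders:
    for a, b in permutations(basket, 2):
        co_counts[(a, b)] += 1

def recommend(product, k=2):
    # Rank other products by how often they co-occur with `product`.
    scored = [(cnt, other) for (p, other), cnt in co_counts.items() if p == product]
    return [other for cnt, other in sorted(scored, reverse=True)[:k]]

print(recommend("camera"))  # ['sd_card', 'tripod']
```

Even this naive version captures the core idea: the "insight" (what to recommend) is derived from the raw transaction data, which is where the business value actually materializes.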

The takeaway here is that enterprises should remain focused on the value their data can provide in enabling them to make intelligent business decisions. So it is important to have a holistic view that does not emphasize certain parameters of Big Data to the detriment of others. In other words, businesses have to keep the big picture in mind. So how do you measure the impact of a Big Data implementation for your organization?


Business Intelligence: The Big Picture

There is a lot of buzz about Hadoop, NoSQL, NewSQL and columnar MPP databases. But where is the actual value for businesses? Businesses need actionable information derived from the data they collect on a regular basis. We know how to collect data and store it in databases of various kinds. We have seen the evolution of SQL databases over the last five decades, and databases have grown sophisticated at processing structured data. With the recent explosion of social media, and with it the proliferation of unstructured data, new technologies such as MapReduce have emerged. So now we have the data, but the real question is: where does the business value actually get delivered? The answer is simple. The value does not lie in the way data gets pulled into the database or in how the database is optimized to handle new varieties of data. While these steps are important, the value continues to be delivered at the analytics and end-user reporting layer, as illustrated in the Business Intelligence value pyramid.


[Image: Business Intelligence value pyramid. © Shree Dandekar, 12/6/2011]

Rewind << for a second:

The 1990s were all about capturing business-relevant data and storing it, using business constructs, in a database. Typical use cases involved performing OLTP (online transaction processing) workloads on that data. We then saw the evolution of data warehouses as enterprises started to seek more analytical insights from the data stored in the database, which gave rise to OLAP (online analytical processing) workloads. On its way into the data warehouse, the data was cleansed, filtered and augmented with business rules using traditional ETL (Extract, Transform, Load) or data integration tools, removing redundancies from the data and normalizing it. You would still have to run a Business Intelligence capability against this data to develop dashboards or reports and actually derive business insight from it. Enterprises could also decide to perform more detailed trend analysis and forecasting using advanced data mining tools.
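The ETL flow described above can be sketched in miniature. This is a toy illustration with hypothetical field names and business rules, not a real data integration pipeline:

```python
# Toy ETL sketch: extract raw operational records, transform them
# (cleanse, filter, deduplicate, apply a business rule), and produce
# rows ready to load into a warehouse table.

raw_rows = [  # extract: pulled from an operational (OLTP) source
    {"cust": " alice ", "amount": "100.0", "region": "tx"},
    {"cust": "BOB",     "amount": "not_a_number", "region": "ca"},
    {"cust": " alice ", "amount": "100.0", "region": "tx"},  # duplicate
]

def transform(rows):
    seen, clean = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])      # filter unparseable values
        except ValueError:
            continue
        cust = row["cust"].strip().title()     # cleanse / normalize
        key = (cust, amount, row["region"])
        if key in seen:                        # remove redundancies
            continue
        seen.add(key)
        # hypothetical business rule: flag large transactions for review
        clean.append({"cust": cust, "amount": amount,
                      "region": row["region"].upper(),
                      "review": amount > 1000})
    return clean

warehouse = transform(raw_rows)   # the load step would insert these rows
print(warehouse)
```

Each transform step maps directly to a phrase in the paragraph above: cleansing, filtering, augmenting with business rules, and removing redundancies, all before the data lands in the warehouse.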

Fast fwd >> to today:

As EDWs started getting bigger, IT soon realized that managing a monolithic data warehouse was cumbersome. Hence the birth of departmental and function-specific data marts. But that was not enough, since they did not address the core issues of scalability, performance, agility and the ability to handle high-volume transactions. Over the years, viable alternatives like database sharding have been used, but even those have had limited success in terms of scalability. It is also worth noting that some of these core issues stemmed from actual limitations of the underlying DBMSs, such as MySQL's inability to scale.

Hence, investigating alternative DBMS technologies to address these issues has been a focal point for IT managers. So we continue to see the emergence of new DBMS technologies like NoSQL and NewSQL. Similarly, we have seen the emergence of MapReduce (Hadoop) for handling unstructured data. The core use case for MapReduce remains its ability to store massive amounts of data, pre-process it and perform exploratory analytics.
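The MapReduce pattern itself is simple enough to sketch in-process with the canonical word-count example. Real Hadoop distributes these phases across a cluster of machines, but the map, shuffle and reduce flow is the same:

```python
# Minimal in-process sketch of the MapReduce pattern (word count).
from collections import defaultdict

documents = ["big data big insights", "data drives insights"]

# Map: emit (key, value) pairs from each input record.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all emitted values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: combine each key's values into a final result.
counts = {word: sum(values) for word, values in grouped.items()}
print(counts["big"], counts["data"], counts["insights"])  # 2 2 2
```

The power of the model is that the map and reduce functions are independent per key, which is what lets Hadoop spread the work across many commodity machines.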

The reality for enterprises is that there are now multiple types of data stores: EDWs, data marts, columnar MPP stores and MapReduce clusters. This ecosystem is commonly referred to by some industry analysts as a Data Lake.

So if you step back and look at the broader BI space, you will notice a lot of effort being spent on getting the plumbing right so that the data (structured as well as unstructured) is massaged and primed. While businesses continue to figure out the optimal data management solution, they should not do so without investing in the analytics and reporting capabilities needed to extract actionable insights.

I will expand on the reporting and analytics layer in my next post, specifically addressing self-service and new technology disruptions like in-memory analytics.

Also the value pyramid assumes an on-premise business intelligence architecture. I will address the Cloud intercept in subsequent posts.