Strata 2012: Data Analytics Transformation Continues

This week, I attended O’Reilly’s Strata 2012 conference on Big Data. The theme was ‘Making Data Work’, and more than 2,000 industry professionals came to the Santa Clara Convention Center to attend sessions and network. The day-one Jump Start sessions, called ‘The Missing MBA for Big Data’, covered topics ranging from what a data-driven CEO should look like to workforce instrumentation. These discussions were interesting from a philosophical perspective, but they lacked the specific examples that would have made the arguments more compelling. One good example of what I would have liked to see much more of was the use case presented in the data-driven CEO session: P&G’s investment in Predictive Analytics. P&G has built a big data analysis command center using Business Sphere, a visually immersive data environment that has transformed how the company makes decisions by letting it act on global data in real time.

The exhibition floor held some unique surprises, both brand-new technologies and next generations of traditional DB and BI technologies. I visited the Amazon Web Services booth and saw a demo of DynamoDB, a NoSQL database service hosted in the cloud with a highly available, distributed cluster architecture. One thing that struck me during the demo was the key difference between traditional database architectures and NoSQL – while NoSQL gives you the ability to scale and store massive amounts of data, it comes at the cost of giving up some traditional database features, such as enforced schemas and joins.
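To make that tradeoff concrete, here is a toy sketch (plain Python dicts, not the DynamoDB API; all table and field names are made up) contrasting the two styles: a relational design joins normalized tables at query time, while a key-value store typically denormalizes the data so a single lookup by key answers the same question, because joins are not available.

```python
# Relational style: two normalized "tables", joined at query time.
customers = {1: {"name": "Alice"}}
orders = [{"order_id": 101, "customer_id": 1, "total": 25.0}]

def orders_with_names():
    # Emulates: SELECT o.order_id, c.name FROM orders o JOIN customers c ...
    return [{"order_id": o["order_id"],
             "name": customers[o["customer_id"]]["name"]}
            for o in orders]

# Key-value style: one item per key, with the customer name duplicated
# (denormalized) so a single get by primary key resolves the query.
kv_store = {"order#101": {"order_id": 101, "name": "Alice", "total": 25.0}}

print(orders_with_names()[0]["name"])  # relational: resolved via a join
print(kv_store["order#101"]["name"])   # key-value: resolved via one lookup
```

The price of the second style is paid on writes: if the customer renames herself, every denormalized copy must be updated.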

As expected, Microsoft had a big presence at the expo. They featured everything from their new SQL Server 2012 to their cloud-ready analytics platform. A couple of interesting announcements from MS included its plans to make Hadoop data analyzable via both a JavaScript framework and MS Excel, as well as its partnership with Datameer supporting an Apache Hadoop service on Windows Azure. The MS Excel connector is particularly relevant because it bridges the gap between the developer-centric Hadoop environment and end-user-centric Excel use – essentially bringing Hadoop data to the end user!

Then there were newer companies showcasing their technologies. One of them was DataSift, a cloud platform that helps customers do sentiment analysis on social data (e.g. Twitter). It is a unique way of combining social and business insights. As more and more organizations embrace social media in their GTM strategies, technologies like these will help them bridge the gap between traditional data management solutions and modern-day social analytics.

VMware announced Hadoop support for their SpringSource framework. The first version will simply allow developers to create MapReduce jobs, including Hive and Pig connections as well as scheduling. I am looking forward to seeing their broader vision for the use of data analytics in their mainstream products.

My personal key takeaways from Strata 2012 are:

  1. Discussions on Big Data are much more valuable if founded on customer needs rather than technology capabilities.
  2. Let’s not forget about data federation! As the number of data sources keeps multiplying, enterprises should focus their energy on The Big Picture, i.e. how to use the data to make business decisions.
  3. This space is in a critical phase of development. There is huge demand for the right skills, as evidenced by the programmers flocking to these conferences.
  4. Many vendors at the conference (MarkLogic, MapR, Hadapt, Revolution Analytics, Datameer, Hortonworks, Karmasphere) claim that their technologies complement Hadoop, but there is no indication of when Hadoop will become mainstream in enterprise IT. However, VMware’s and MS’s announcements at the conference indicate an early effort to begin mainstreaming Hadoop.
  5. NoSQL databases are not getting major traction, yet.


The “Big” in Big Data

Over the last year, many industry analysts have tried to define Big Data. The most common dimensions used to define it are the 3 V’s: Volume, Velocity and Variety (volume = multiple terabytes to over a petabyte; variety = numbers, audio, video, text, streams, weblogs, social media etc.; velocity = the speed with which the data is collected). Although the 3 V’s work well as parameters for Big Data, there are other factors at play that need to be captured to understand its true nature. In short, to describe the data landscape more holistically, we need to step beyond the 3 V’s. While the 3 V’s are best classified as salient features of the data, the real drivers of Big Data are technology, economics and the tangible value that can be extracted from the data – in other words, the business insights!

Here I want to take a closer look at some of the drivers of Big Data.


Big Data analysis requires processing huge volumes of non-relational, weakly schematized data sets at an extremely fast pace. This need sparked the emergence of technologies like Hadoop that help pre-process unstructured data on the fly and perform quick exploratory analytics. This model breaks away from the traditional approach of using procedural code and state management to manage transactions.
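The programming model behind Hadoop can be sketched in a few lines. This is a minimal single-process illustration of MapReduce (not Hadoop itself; the log lines are invented): a map step emits key-value pairs from raw, weakly structured input, a shuffle groups the pairs by key, and a reduce step aggregates each group.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit (word, 1) for every token in the raw input.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values independently.
    return {key: sum(values) for key, values in groups.items()}

logs = ["error timeout", "error retry", "ok"]
counts = reduce_phase(shuffle(map_phase(logs)))
print(counts["error"])  # 2
```

Because each map call and each reduce call is independent, a framework like Hadoop can scatter them across a cluster – which is exactly what makes exploratory analytics over huge data sets feasible.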

Along with new preprocessing technologies we have also seen the growth of alternate DBMS technologies like NoSQL and NewSQL that further help to analyze large chunks of data in non-traditional structures (for example using trees, graphs, or key-value pairs instead of tables.)

Other changes are happening on the infrastructure side of things. High performance and highly scalable architectures have been emerging. They include parallel processing, high-speed networking and fast I/O storage, which further help to process large volumes of data at a higher MB/s rate.

In addition to the technological changes, we are also witnessing a fundamental paradigm shift in the way DBAs and data architects analyze data. For example, instead of enforcing ACID (atomicity, consistency, isolation, durability) compliance across all database transactions, we are seeing a more flexible approach: enforcing ACID only where necessary and designing an eventually consistent system in a more iterative fashion.
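A hypothetical sketch of that eventual-consistency idea (the replica layout and last-write-wins rule are illustrative assumptions, not any particular database's protocol): two replicas accept writes independently, and rather than one ACID transaction coordinating them, a later reconciliation pass keeps the value with the newest timestamp for each key, so both replicas converge.

```python
def reconcile(replica_a, replica_b):
    # Last-write-wins merge: for each key, keep the value whose
    # (logical) timestamp is newer. Order of arguments does not matter.
    merged = {}
    for replica in (replica_a, replica_b):
        for key, (value, ts) in replica.items():
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)
    return merged

# Concurrent writes landed on different replicas, tagged with
# logical timestamps (1 then 2).
a = {"cart": (["book"], 1)}
b = {"cart": (["book", "pen"], 2)}

merged = reconcile(a, b)
print(merged["cart"][0])  # the newer write wins on both replicas
```

The tradeoff is visible even in this toy: between the writes and the reconciliation, the two replicas disagree – acceptable for a shopping cart, not for a bank ledger, which is why the post's "enforce ACID where necessary" framing matters.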


The emergence of these new technologies is further fueled by the economics of providing highly scalable business analytics solutions at low cost. Hadoop comes to mind as the prime example. I found a valuable white paper that describes how to build a three-node Hadoop solution using a Dell OptiPlex desktop PC running Linux as the master machine. The solution was priced at under $5,000.

These kinds of economics are driving faster adoption of new technologies on off-the-shelf hardware, enabling even a research scientist or a college student to easily repurpose their hardware to try out new software frameworks.

Business Insights:

I cannot stress enough the importance of business insights, also highlighted in my previous blog post (Business Intelligence: The Big Picture). Even as enterprises get smarter at managing their data, they must realize that no matter how small or big a data set is, its true value is realized only when it has produced actionable information (insights)! With this in mind, we must view the implementation of a Big Data architecture as incomplete until the data has been analyzed and the actionable information reported out to its users. Some examples of successful business insights implementations include (but are not limited to):

  • Recommendation engines: increase average order size by recommending complementary products based on predictive analysis for cross-selling (commonly seen on Amazon, eBay and other online retail websites)
  • Social media intelligence: one of the most powerful use cases I have witnessed recently is the MicroStrategy Gateway application that lets enterprises combine their corporate view with a customer’s Facebook view
  • Customer loyalty programs: many prominent insurance companies have implemented these solutions to gather useful customer trends
  • Large-scale clickstream analytics: many ecommerce websites use clickstream analytics to correlate customer demographic information with their buying behavior.

The takeaway here is that enterprises should remain focused on the value their data can provide in enabling them to make intelligent business decisions. So it is important to have a holistic view that does not emphasize certain parameters of Big Data to the detriment of others. In other words, businesses have to keep in mind the big picture. So how do you measure the impact of a Big Data implementation for your organization?

Business Intelligence: The Big Picture

There is a lot of buzz about Hadoop, NoSQL, NewSQL and columnar MPP databases. But where is the actual value for businesses? Businesses need actionable information derived from the data they collect on a regular basis. We know how to collect data and store it in databases of various kinds. We have seen the evolution of SQL databases over the last five decades, and databases have become sophisticated at processing structured data. With the recent explosion of social media, and with it the proliferation of unstructured data, new technologies such as MapReduce have emerged. So now we have the data, but the real question is: where does the business value actually get delivered? The answer is simple. The value does not lie in the way data gets pulled into the database or how the database is optimized to handle new varieties of data. While these steps are important, the value continues to be delivered at the Analytics and end-user Reporting layers, as illustrated in the Business Intelligence value pyramid.


(Business Intelligence value pyramid – image © Shree Dandekar, 12/6/2011)

Rewind << for a second:

The 1990s were all about capturing business-relevant data and storing it, using business constructs, in a database. Typical use cases involved performing OLTP (Online Transaction Processing) workloads on that data. We then saw the evolution of data warehouses as enterprises sought more analytical insights from the data stored in the database, which gave rise to OLAP (Online Analytical Processing) workloads. Once in the data warehouse, the data was cleansed, filtered and augmented with business rules using traditional ETL (Extract, Transform, Load) or data integration tools, thus removing redundancies from the data and normalizing it. You would still have to run a Business Intelligence capability against this data to develop dashboards or reports in order to actually derive business insight from it. Enterprises could also perform detailed trend analysis and forecasting using advanced data mining tools.
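The ETL step described above can be sketched in miniature. This toy pipeline (record fields and the deduplication rule are illustrative assumptions, not from any real tool) extracts raw records, transforms them – dropping redundant rows and normalizing types and case, as the cleansing step does – and loads the result into a "warehouse" table.

```python
# Extract: raw records as they might arrive from an operational system.
raw = [
    {"id": 1, "amount": "10.5", "region": "west"},
    {"id": 1, "amount": "10.5", "region": "west"},   # redundant row
    {"id": 2, "amount": "7.25", "region": "EAST"},
]

def transform(records):
    seen, clean = set(), []
    for r in records:
        if r["id"] in seen:          # cleanse: drop duplicate rows
            continue
        seen.add(r["id"])
        clean.append({"id": r["id"],
                      "amount": float(r["amount"]),     # normalize types
                      "region": r["region"].lower()})   # normalize case
    return clean

# Load: write the cleaned rows into the warehouse table.
warehouse = transform(raw)
print(len(warehouse), warehouse[1]["region"])
```

Real ETL tools add scheduling, lineage and business-rule engines on top, but the extract–transform–load shape is the same.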

Fast fwd >> to today:

As EDWs grew in size, IT soon realized that managing a monolithic data warehouse was cumbersome. Hence the birth of departmental and function-specific data marts. But that was not enough, since they did not address the core issues of scalability, performance, agility and the ability to handle large-volume transactions. Over the years, viable alternatives like database sharding have been used, but even sharding has had limited success in terms of scalability. It is also worth noting that some of these core issues stemmed from the actual limitations of the underlying DBMSs, such as MySQL not being able to scale.
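For readers unfamiliar with sharding, here is a minimal sketch of the idea (in-memory dicts standing in for database nodes; the key scheme is invented): rows are routed to one of N shards by a stable hash of the shard key, spreading storage and load horizontally across machines.

```python
import hashlib

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]  # stand-ins for DB nodes

def shard_for(key):
    # Stable hash so the same key always routes to the same shard.
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(user_id, row):
    shards[shard_for(user_id)][user_id] = row

def get(user_id):
    return shards[shard_for(user_id)].get(user_id)

put(42, {"name": "Alice"})
print(get(42)["name"])
```

The sketch also hints at why sharding has had only limited success: any query that does not include the shard key (say, "all users in region X") must fan out to every shard, and cross-shard joins or transactions are harder still – the very limits that pushed IT toward the alternative DBMS technologies discussed next.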

Hence, investigating alternative DBMS technologies to address these issues has been a focal point for IT managers. So we continue to see the emergence of new DBMS technologies like NoSQL and NewSQL. Similarly, we have seen the emergence of MapReduce (Hadoop) for handling unstructured data. The core use case for MapReduce remains its ability to store massive amounts of data, pre-process it and perform exploratory analytics.

The reality for enterprises is that there are now multiple types of databases in the form of EDW, data marts, columnar MPP stores as well as MapReduce clusters. This ecosystem is being commonly referred to by some industry analysts as Data Lakes.

So if you step back and look at the broader BI space, you will notice a lot of effort being spent on getting the plumbing right so that the data (structured as well as unstructured) is massaged and primed. While businesses continue to figure out the optimal data management solution, they should not do so without investing in the analytics and reporting capabilities needed to extract actionable insights.

I will expand on the reporting and analytics layer in my next post, specifically trying to address self-service and new technology disruptions like in-memory analytics.

Also the value pyramid assumes an on-premise business intelligence architecture. I will address the Cloud intercept in subsequent posts.