Saturday, July 2, 2011

Big Data – Same Problems?

A recent (June 2011) IDC Digital Universe study found that the world's data is doubling every two years – faster than Moore's Law. It reckoned that 1.8 zettabytes (1.8 trillion gigabytes) of data will be created and replicated in 2011, that enterprises will manage 50 times more data, and that files will grow 75-fold over the next decade.
The “big data” phenomenon is driving transformational technological, scientific and economic change, while "information taming" technologies are driving down the cost of creating, capturing, managing and storing information.

We’ve all seen how organisations have an insatiable desire for more data as they believe that this information will radically change their businesses.

They are right – but only the effective exploitation of that data, turning it into genuinely useful information and then into knowledge & applied decision-making, will realise the true potential of this vast mountain of data.

Incidentally, do you have any idea how much data 1.8 zettabytes really is?  It’s roughly the amount you’d get if every person in the world sent twenty tweets an hour for the next 1,200 years!
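As a rough sanity check (the assumptions here are mine, not IDC’s: a world population of about 7 billion, and an average stored tweet of roughly 1.2 KB once metadata is counted, rather than just 140 characters of text), the arithmetic lands in the right ballpark:

    # Back-of-the-envelope check of the "twenty tweets an hour for 1,200 years" claim.
    # Assumptions (not from the IDC study): ~7 billion people and ~1.2 KB per stored
    # tweet once its metadata is counted, not just the 140 visible characters.
    world_population = 7_000_000_000
    tweets_per_hour = 20
    hours_per_year = 24 * 365
    years = 1_200
    bytes_per_tweet = 1_200

    total_tweets = world_population * tweets_per_hour * hours_per_year * years
    total_bytes = total_tweets * bytes_per_tweet

    print(f"{total_tweets:.2e} tweets ~ {total_bytes / 10**21:.1f} ZB")
    # prints: 1.47e+18 tweets ~ 1.8 ZB -- the same order of magnitude as the IDC figure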

Data by itself is useless; it has to be turned into useful information and then have effective business intelligence applied to realise its true potential.

The problem is that big data analytics push the limits of traditional data management.  Allied to this, the most complex big data problems start with huge volumes of highly volatile data held in disparate stores.  Big data problems aren’t just about volume, though; there’s also the volatility of the data sources & their rate of change, the variety of the data formats, and the complexity of the individual data types themselves.  So is pulling all this data into yet another location for analysis always the most appropriate route?

Unfortunately, many organisations are constrained by traditional data integration approaches that slow the adoption of big data analytics.

The approaches that win through will be those providing high-performance data integration to overcome data complexity & data silos.  They need to integrate the major types of “big data” into the enterprise.  Typical “big data” sources include:
  • Key/value Data Stores such as Cassandra,
  • Columnar/tabular NoSQL Data Stores such as HBase (on Hadoop) & Hypertable,
  • Massively Parallel Processing Appliances such as Greenplum & Netezza,  and
  • Document/XML Data Stores such as CouchDB (JSON) & MarkLogic (XML).
Fortunately, approaches such as Data Federation / Data Virtualisation are stepping up to meet this challenge; a rough sketch of the idea follows below.
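To make the idea concrete, here is a minimal illustrative sketch of a federated query – the source names, fields and join key are all invented for the example – in which each source is queried where it lives and the results are combined in a thin virtual layer rather than copied into yet another store:

    # Minimal data federation / virtualisation sketch: query each source in place
    # and join the results at query time in a thin virtual layer, instead of
    # copying everything into another store first.  All names here are invented.

    def query_crm_database(customer_ids):
        """Stand-in for a relational source holding customer master records."""
        return {cid: {"name": f"Customer {cid}", "segment": "retail"} for cid in customer_ids}

    def query_clickstream_store(customer_ids):
        """Stand-in for a NoSQL source holding recent web activity."""
        return {cid: {"page_views_last_7d": 42} for cid in customer_ids}

    def federated_customer_view(customer_ids):
        """A virtual 'view' joining both sources on customer id at query time."""
        crm = query_crm_database(customer_ids)
        clicks = query_clickstream_store(customer_ids)
        return [{"customer_id": cid, **crm[cid], **clicks.get(cid, {})}
                for cid in customer_ids]

    for row in federated_customer_view(["C001", "C002"]):
        print(row)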

Finally, and of utmost importance, is managing the quality of the data.  What’s the use of this vast resource if its quality and trustworthiness are questionable?  Driving your data quality capability up the maturity levels is therefore key.

Data Quality Maturity – 5 levels of maturity
Level 1 - Initial: Limited awareness within the enterprise of the importance of information quality.  Very few, if any, processes in place to measure the quality of information.  Data is often not trusted by business users.

Level 2 - Repeatable: The quality of a few data sources is measured in an ad hoc manner, using a number of different tools.  The activity is driven by individual projects or departments.  Limited understanding of good versus bad quality.  Identified issues are not consistently managed.

Level 3 - Defined: Quality measures have been defined for some key data sources.  Specific tools adopted to measure quality, with some standards in place.  The processes for measuring quality are applied at consistent intervals.  Data issues are addressed where critical.

Level 4 - Managed: Data quality is measured for all key data sources on a regular basis.  Quality metrics are published via dashboards etc.  Active management of data issues through the data ownership model means issues are usually resolved.  Quality considerations are baked into the SDLC.

Level 5 - Optimised: The measurement of data quality is embedded in many business processes across the enterprise.  Data quality issues are addressed through the data ownership model and fed back to be fixed at source.
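To make the middle of that ladder concrete, here is a small illustrative sketch of the kind of repeatable quality measurement that levels 3 and 4 imply – the field names, rules and the 95% threshold are assumptions for the example, not part of the maturity model itself:

    # Illustrative data quality checks of the sort a level 3-4 capability would
    # run on a schedule and publish to a dashboard.  The fields, rules and the
    # 95% threshold below are invented for the example only.

    records = [
        {"customer_id": "C001", "email": "a@example.com", "country": "GB"},
        {"customer_id": "C002", "email": "",              "country": "GB"},
        {"customer_id": None,   "email": "c@example.com", "country": "XX"},
    ]

    def completeness(rows, field):
        """Share of rows where the field is present and non-empty."""
        return sum(1 for r in rows if r.get(field)) / len(rows)

    def validity(rows, field, allowed):
        """Share of rows whose value for the field is in an allowed set."""
        return sum(1 for r in rows if r.get(field) in allowed) / len(rows)

    metrics = {
        "customer_id completeness": completeness(records, "customer_id"),
        "email completeness": completeness(records, "email"),
        "country validity": validity(records, "country", {"GB", "IE", "FR", "DE"}),
    }

    for name, score in metrics.items():
        status = "OK" if score >= 0.95 else "INVESTIGATE"
        print(f"{name}: {score:.0%} [{status}]")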