Wednesday, November 23, 2011

Data Virtualisation As An Approach To Data Integration

Many different approaches are now available for Data Integration, yet Extract, Transform and Load (ETL) remains far and away the most popular.
However, the pace of business change and the need for agility demand that organizations support multiple styles of data integration.

Three major styles of integration present themselves; let’s look at how they differ.

1.        Physical Movement and Consolidation

Probably the most commonly used approach is physical data movement.  This is used when you need to replicate data from one database to another.  There are two major genres of physical data movement, Extract Transform & Load (ETL) and Change Data Capture (CDC). 
ETL is typically run according to a schedule and is used for bulk data movement, usually in batch.  CDC is event driven and delivers real-time incremental replication.  Example products in these areas are Informatica (ETL) and GoldenGate (CDC).
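To make the difference between the two genres concrete, here is a minimal Python sketch. It is purely illustrative: the in-memory dictionaries, the upper-casing "transform" and the event tuple format are my own assumptions and have nothing to do with how Informatica or GoldenGate actually work.

```python
# Hypothetical in-memory "databases": primary key -> name.
source = {1: "Alice", 2: "Bob"}
target = {}

def etl_batch(source, target):
    """Scheduled bulk load: rebuild the target from a full extract."""
    target.clear()
    for key, name in source.items():
        target[key] = name.upper()      # illustrative "transform" step

def cdc_apply(target, event):
    """Event-driven replication: apply one captured change to the target."""
    op, key, name = event
    if op == "delete":
        target.pop(key, None)
    else:                               # "insert" or "update"
        target[key] = name.upper()

etl_batch(source, target)                   # e.g. the nightly batch run
source[3] = "Carol"                         # a change lands on the source...
cdc_apply(target, ("insert", 3, "Carol"))   # ...and is replicated straight away
print(target)
```

The batch job touches every row on every run, whereas the CDC handler only ever sees the rows that changed.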

 2.        Message based synchronization & propagation

Whilst ETL and CDC are database-to-database integration approaches, the next approach, message-based synchronisation and data propagation, is used for application-to-application integration.  Once again there are two main genres, Enterprise Application Integration (EAI) and Enterprise Service Bus (ESB), and both are used primarily for event-driven business process automation.  A leading product example in this area is the ESB from Tibco.
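The message-based style can be sketched as a toy publish/subscribe bus. This is an assumption-laden illustration in plain Python; the `MessageBus` class, the topic name and the handlers are all hypothetical and are not the API of Tibco or any other ESB product.

```python
from collections import defaultdict

class MessageBus:
    """Toy stand-in for an ESB: routes messages from publishers to subscribers."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Every application subscribed to the topic receives the event.
        for handler in self._subscribers[topic]:
            handler(message)

bus = MessageBus()
billing_log, shipping_log = [], []

# Two hypothetical applications react to the same business event.
bus.subscribe("order.created", lambda msg: billing_log.append(f"invoice {msg['order_id']}"))
bus.subscribe("order.created", lambda msg: shipping_log.append(f"ship {msg['order_id']}"))

bus.publish("order.created", {"order_id": 42})
print(billing_log, shipping_log)
```

The publishing application knows nothing about billing or shipping; it simply announces the event and the bus propagates it.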

 3.        Abstraction / Virtual Consolidation (aka Federation)

Thirdly, you have Data Virtualization (DV).  The key here is that the data source (usually a database) and the target or consuming application (usually a business application) are isolated from each other.  The information is delivered on demand to the business application when the user needs it.  The consuming business application can consume the data as though it were a database table, a star schema, an XML message or in many other forms.  The essential point with a DV approach is that the form of the underlying source data is hidden from the consuming application.  The rationale for Data Virtualization within an overall Data Integration strategy is to overcome complexity, increase agility and reduce cost.  A leading product example in this area is Composite Software.
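As a rough sketch of the idea (not of any vendor's product), the Python below combines two hypothetical sources on demand and serves the same virtual view either as table-like rows or as an XML message. The source structures, field names and function names are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sources held in different native shapes.
crm_rows = [(1, "Alice"), (2, "Bob")]                          # rows from a database
billing_docs = {1: {"balance": 120.0}, 2: {"balance": 0.0}}    # documents from an application

def customer_view():
    """Virtual view: combined at request time, nothing is copied or stored."""
    for cust_id, name in crm_rows:
        yield {"id": cust_id, "name": name,
               "balance": billing_docs[cust_id]["balance"]}

def as_table():
    """Serve the view as table-like tuples."""
    return [(r["id"], r["name"], r["balance"]) for r in customer_view()]

def as_xml():
    """Serve the same view as an XML message."""
    root = ET.Element("customers")
    for r in customer_view():
        ET.SubElement(root, "customer", id=str(r["id"]),
                      name=r["name"], balance=str(r["balance"]))
    return ET.tostring(root, encoding="unicode")

print(as_table())
print(as_xml())
```

Because the view is evaluated on each request, a change in either source shows up in the next call without any reload step.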

ETL or DV?
The suitability of Data Integration approaches needs to be considered for each case.  Here are 6 key considerations to ponder:

1. Will the data be replicated in both the DW and the Operational System?

      Will data need to be updated in one or both locations?
      If data is physically in two locations, beware of regulatory & compliance issues associated with having additional copies of the data (e.g. SoX, HIPAA, Basel II, FDA etc.)

2. Data Governance

      Is the data only to be managed in the originating Operational System?

      What is the certainty that a DW will be a reporting DW only (vs Operational DW)?

3. Currency of the data, i.e. Does it need to be up to the minute?

      How up to date are the data requirements of the DW?
      Is there a need to see the operational data?

4. Time to solution i.e. how quickly is the solution required?

      Immediate requirement?
      Confirmed users & usage?

5. What is the life expectancy of source system(s)?
      Are any of the source systems likely to be retired?
      Will new systems be commissioned?
      Are new sources of data likely to be required?

6. Need for historical / summary / aggregate data
      How much historical data is required in the DW solution?
      How much aggregated / summary data is required in the DW solution?

 Leading analyst firms like Gartner are recommending that data virtualization be added to your integration tool kit, and that you should use the right style of data integration for the job for optimal results. 
 Just like so many things in Information Management, there's more than one way to accomplish Data Integration; ETL is not the only way.  Data Virtualisation is well worth considering as a part of your overall strategy.

Saturday, July 2, 2011

Big Data – Same Problems?

A recent (June 2011) IDC Digital Universe study found that the world's data is doubling every two years, a rate faster than Moore's Law.  It reckoned that 1.8 zettabytes (1.8 trillion gigabytes) will be created and replicated in 2011, and that over the next decade enterprises will manage 50 times more data while the number of files grows 75-fold.
The “big data” phenomenon is driving transformational, technological, scientific, and economic changes, and "information taming" technologies are driving down the cost of creating, capturing, managing and storing information.

We’ve all seen how organisations have an insatiable desire for more data as they believe that this information will radically change their businesses.

They are right – but it’s only the effective exploitation of that data, turning it into really useful information and then into knowledge & applied decision making that will realise the true potential of this vast mountain of data.

Incidentally, do you have any idea how much data 1.8 zettabytes really is?  It’s about the same amount of data if every person in the world sent twenty tweets an hour for the next 1200 years!
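If you fancy sanity-checking that comparison, here is a back-of-envelope calculation in Python. The world population of 7 billion and the 365-day year are my assumptions for this sketch, not figures from the IDC study.

```python
ZETTABYTE = 10**21                  # decimal definition: 1 ZB = 10^21 bytes

total_bytes = 1.8 * ZETTABYTE
population = 7_000_000_000          # assumption: roughly the 2011 world population
tweets_per_hour = 20
hours = 1200 * 365 * 24             # 1200 years, ignoring leap days

total_tweets = population * tweets_per_hour * hours
bytes_per_tweet = total_bytes / total_tweets

print(f"Implied size per tweet: {bytes_per_tweet:.0f} bytes")
```

Under those assumptions the comparison implies a little over a kilobyte per tweet, which is plausible once you count the metadata stored alongside the 140 characters.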

Data by itself is useless; it has to be turned into useful information & then have effective business intelligence applied to realise its true potential.

The problem is that big data analytics push the limits of traditional data management.  Allied to this, the most complex big data problems start with huge volumes of data in disparate stores with high volatility of data.  Big data problems aren’t just about volume though; there’s also the volatility of the data sources & rate of change, the variety of the data formats and the complexity of the individual data types themselves.  So is it always the most appropriate route to pull all this data into yet another location for its analysis?

Unfortunately, though, many organisations are constrained by traditional data integration approaches that can slow adoption of big data analytics.

Approaches which can provide high performance data integration to overcome data complexity & data silos will be those which win through.  These need to integrate the major types of “big data” into the enterprise.  The typical “big data” sources include:
  • Key/value Data Stores such as Cassandra,
  • Columnar/tabular NoSQL Data Stores such as HBase (on Hadoop) & Hypertable,
  • Massively Parallel Processing Appliances such as Greenplum & Netezza,  and
  • Document Data Stores (XML or JSON) such as CouchDB & MarkLogic.
Fortunately approaches such as Data Federation / Data Virtualisation are stepping up to meet this challenge.

Finally & of utmost importance is managing the quality of the data.  What’s the use of this vast resource if its quality and trustworthiness are questionable?  Thus, driving your data quality capability up the maturity levels is key.

Data Quality Maturity – 5 levels of maturity
Level 1 - Initial: Limited awareness within the enterprise of the importance of information quality.  Very few, if any, processes in place to measure quality of information. Data is often not trusted by business users.
Level 2 - Repeatable: The quality of a few data sources is measured in an ad hoc manner. A number of different tools used to measure quality. The activity is driven by projects or departments.  Limited understanding of good versus bad quality.  Identified issues are not consistently managed.
Level 3 - Defined: Quality measures have been defined for some key data sources.  Specific tools adopted to measure quality with some standards in place. The processes for measuring quality are applied at consistent intervals.  Data issues are addressed where critical.
Level 4 - Managed: Data quality is measured for all key data sources on a regular basis. Quality metrics information is published via dashboards etc.  Active management of data issues through the data ownership model ensures issues are often resolved. Quality considerations baked into the SDLC.
Level 5 - Optimised: The measurement of data quality is embedded in many business processes across the enterprise. Data quality issues addressed through the data ownership model. Data quality issues fed back to be fixed at source.
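Moving up these levels depends on measuring quality in a defined, repeatable way. As a minimal sketch of what such a measure might look like, the Python below computes completeness and validity for a handful of hypothetical records. The record layout, the metric definitions and the crude e-mail rule are my own assumptions; a real programme would agree its measures with the data owners.

```python
# Hypothetical customer records; None marks a missing value.
records = [
    {"id": 1, "email": "alice@example.com", "country": "GB"},
    {"id": 2, "email": None,                "country": "GB"},
    {"id": 3, "email": "carol-at-example",  "country": None},
    {"id": 4, "email": "dave@example.com",  "country": "US"},
]

def completeness(rows, field):
    """Share of rows where the field is populated."""
    return sum(r[field] is not None for r in rows) / len(rows)

def validity(rows, field, is_valid):
    """Share of populated values that pass a rule."""
    values = [r[field] for r in rows if r[field] is not None]
    return sum(is_valid(v) for v in values) / len(values)

metrics = {
    "email completeness": completeness(records, "email"),
    "email validity": validity(records, "email", lambda v: "@" in v),  # deliberately crude rule
    "country completeness": completeness(records, "country"),
}

for name, score in metrics.items():
    print(f"{name}: {score:.0%}")
```

Publishing figures like these on a regular schedule is the sort of thing that separates the "Defined" and "Managed" levels from ad hoc checking.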

Tuesday, April 5, 2011

Data Virtualisation at EDW 2011

Just completed my presentation on The role of Data Virtualisation at Enterprise Data World 2011. Really lively discussion on the big drivers for DV.

Fundamentally it's to:
  • Mask Complexity;
  • Improve Agility;
  • Reduce Cost.
Check out my earlier post on DV here.

Friday, April 1, 2011

Now what kind of error caused this?

Early this morning, sighted in the River Avon, close to my office in my home town of Bath, was what appeared to be a stranded naval vessel.

Bath is home to several Royal Navy departments, including Submarine design; however, none of their designs have previously been seen outside the confines of the secure MoD establishments.

Curious onlookers gazed from Pulteney Bridge and wondered what kind of navigation data error caused this.

Or maybe it was a date er error ;)


Wednesday, March 30, 2011

I wish I'd said that

Have you ever had one of those moments when you thought "I wish I'd said that"?
Well, I was getting my head together in preparation for my conference presentation next week at Enterprise Data World in Chicago.  Although I've already submitted my slides for a talk on Data Virtualisation as a viable Data Integration approach, I thought I'd do some last minute research. From companies I've worked with I'm very aware of the benefits Data Virtualisation can bring, particularly for flexibility & rapid time to solution.  But I wanted to get some more quotes - so off to friendly Google I went.  Pretty quickly I came across a variety of finds including
“The difficulties in dealing with the ongoing data explosion and the proliferation of ever-more diverse data sources has resulted in companies being open to reevaluating their data integration strategies,”
Wow - just what I'm after.  A little bit more digging & I also found
“The availability of a new generation of data virtualization tools and business intelligence (BI) solutions which easily integrate with ERP systems has undoubtedly provided real benefit in reducing overall time to solution and a business opportunity for those organizations who best leverage those data assets,”  
Excellent - I'll use that last one in my presentation.  Now, who said it?  Well, apparently I did :)

Tuesday, March 22, 2011

National Australia MDM, Governance and Regulations

National Australia Group Europe, MDM, DG & Financial Regulations:  Tuesday 22nd March 2011, 10:05am  IRM MDM/DG Europe Conference.

Martin Campbell & Tim Franklin described Clydesdale Bank's (part of NAGE) approach to Customer MDM and Data Governance. Campbell described the Bank's challenges & the importance of executive buy-in. The FSCS regulatory issues were of utmost importance & the fixed deadlines for these had to be met.
Franklin outlined the IPL Information Architecture Framework (IAF) and how the governance component of the IAF was expanded to initially benchmark & then form the basis of the Bank's Data Governance approach.
Of particular interest was the importance of establishing principles & getting early buy in for these; the IPL IAF proved to be a useful jump start here.
Overall very interesting & practical.

MDM the next decade; Go early go governance

MDM the next decade; Go early go governance: Tuesday 22nd March 9AM at the MDM/DG 2011 Europe conference.

Aaron Zornes presented some interesting statistics and speculation regarding the future of MDM. I agree with the thought that the trend is towards pro-active DG for MDM.
Despite the European tag on the talk, both the spelling and the content were still very US-centric. Most surprising was the continual mention of ETL and SOA technologies to support DG - fine in themselves, but very surprising that nothing was mentioned on data federation/virtualisation.  This made me question just how up to date the thinking really is.
Overall I came away rather disappointed.

Monday, March 21, 2011

I'm presenting at Data Governance 2011 - London

Monday 21st - Wednesday 23rd March sees the 2011 IRM MDM/ DG Europe Conference in London. On Wednesday afternoon I'm presenting a case study with Colin Wood on Clinical Data Governance.  If you're in town at the conference, be sure to stop by and say hello.

Saturday, March 19, 2011

BA Air Miles - What's the point?

Massively frustrated at the unavailability of seats that you can use BA miles on.
As a loyal BA customer over the years I'm now seriously wondering just what is the point.

I've got lots of miles and Amex companion vouchers. Back around Christmas my family & I thought we'd like to do a mega holiday in July, August or September this year, particularly having had two family bereavements in 2010. We fancied San Francisco, Vancouver, Australia or New Zealand and have enough miles for all 4 of us to go First or Business Class. After several days of searching availability & then phoning BA we were told there are no available miles redemption seats - to any of those destinations.  This despite availability showing if you buy with cash. "What about buy with cash & upgrade with miles" I inquired. Can't do that either :(    What about nearby cities LA, Seattle?  No BA air miles seats available to those either!

Just this week I again tried to use some BA miles, this time for a run of the mill business trip to Chicago in mid April. I received the same story again. No availability of any miles redemption seats. Once again I tried to buy with cash & upgrade with miles and once again was told no go despite lots of availability showing if you buy with cash.

So I'm wondering, unless you book miles redemption seats a full year in advance (apparently that's when the paltry few actually get released) then just what is the point in being a loyal customer & collecting BA miles?

Friday, March 18, 2011

Data modelling as art

Have you come across Data Modellers who exhibit OCD type behaviour when it comes to laying out models?
This often manifests itself as obsessive behaviour to eliminate crossing lines (BTW I think you should strive to minimise crossing lines), or the addition of not very subtle layout and annotation.  Frequently this steers me to think they believe their Data Models are works of art.
But is it art……..
Well funnily enough when I was recently in Philadelphia I went to the Art Museum and in the modern & contemporary gallery I saw this picture.
Standing in front of it I was approached by the gallery curator who said “Interesting isn’t it?  What does it say to you?”
“It’s an unnamed entity,” I said.
“Wow – that’s deep, I’ve not heard that before,” she replied.
“Yes, and not only that, it’s in a one to one relationship with another entity,” I said.
By now, she seemed to think I was some art connoisseur and enquired “does it say anything else to you?”
I replied “well, it looks to me like it might be a subtype of some super entity”
By now, my colleagues (Nic & Inna) who also are fellow Information Management folks overheard what was going on and told me to stop winding up the curator.  Throughout the whole discussion she’d been taking notes in a little book on what I’d been saying to her on "my interpretation" of this masterpiece.
So you never know, maybe future visitors to the Modern & Contemporary Art Gallery will be told of an interesting interpretation by some crazy English guy of this picture.
Personally I don’t get art at all!

Confused by the name - surely not!

A Red Race Car

A Red Pick Up
I was pleased to read that common sense has prevailed and the Ford / Ferrari lawsuit has finally been amicably resolved.  However, when it first came up I wondered just who on earth could possibly be confused about an F150.  I had visions of disappointed customers lining up in Ford showrooms wanting to know why the engine didn't rev to 19,000 RPM.  I also had mental images of baseball-hatted, check-shirted buyers quizzing the Ferrari sales folks about why there was only one seat and where you fit the gun rack.  Who could possibly be confused?  A pickup buyer maybe?

Thursday, March 17, 2011

Now I've gone & done it!

Back around Christmas, in an unguarded moment, I was asked if I'd be willing to give someone a passenger ride around Castle Combe Circuit in my race prepped car.  Without really checking I agreed.  Well now, it turns out it was an auction of promises for a very worthy cause, to support the Peggy Dodd Centre which cares for people with Alzheimer's and other dementia illnesses.
The auction turned out to be a very high profile formal affair & to my horror my promise turned out to be one of the star lots auctioned on the night.
So now I really do have to make sure the car is fully prepared after its winter layover - last time it had an outing was for a race in October! 
Still, looking at the available dates that I'm actually allowed to take passengers on track, one of the nearby ones is April 29th - so maybe I'll get to avoid the Royal Wedding after all :)

Virtually Yours?

Most of us will be familiar with the challenge of providing a common view of a type of data from multiple heterogeneous systems. This could be for providing consolidated data for management reporting, or a 360 degree view of say customer data from several “MDM” sources, or even just getting data damn quick for that BI or legislative reporting requirement.

The traditional approach is Extract, Transform and Load (ETL) to another store (e.g. a Data Warehouse) and then report from there.

However, that’s not the only way. Enterprise class Data Virtualisation products such as Composite Software have now made the promise of Data Federation a realistic alternative for some use cases – let’s have a look at a few.

Data migration and take on ETL vs EII (or both?)
By now most of us will be familiar with the purpose of Extract, Transform and Load tools.  Less well known however are the capabilities of the Data Virtualisation or Enterprise Information Integration  tools such as Composite or MetaMatrix.
Broadly speaking these provide the capability to access data from a massively wide variety of sources without having to move it from the source system.  They have extremely rich caching and aggregation capabilities and in my experience have dramatically reduced the time to provide rich access to data.  I once heard them described as “views on steroids”.
Can EII / Data Virtualisation add value to Data Warehousing?
The use of EII technology in Enterprise Data Warehousing and for data take-on is something that demands serious consideration.  There are several ways in which EII can add value to DW solutions; here are just 3 to consider:
a)        Prototyping Data Warehouse Development
During DW development, the time taken for schema changes, adding new data sources and providing data federation are often considerable.  Using Data Virtualisation to prototype a development environment means you can rapidly build a virtual DW rather than a physical one.  Reports, dashboards and so on can be built on the virtual DW.  After prototyping the physical DW can be introduced.
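Here is one way the prototype-then-materialise pattern might look, sketched with Python's built-in sqlite3 module standing in for a real DV tool. The table, view and column names are hypothetical, and a plain SQL view is a far simpler thing than a commercial virtualisation layer.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'EMEA', 100.0), (2, 'EMEA', 50.0), (3, 'APAC', 70.0);

    -- Prototype: the "warehouse" table is only a view over the source.
    CREATE VIEW sales_by_region AS
        SELECT region, SUM(amount) AS total FROM orders GROUP BY region;
""")

# The report is written once, against the virtual DW.
REPORT_SQL = "SELECT region, total FROM sales_by_region ORDER BY region"
prototype_result = db.execute(REPORT_SQL).fetchall()

# Later: swap the view for a physical table of the same name.
db.executescript("""
    CREATE TABLE sales_by_region_physical AS SELECT * FROM sales_by_region;
    DROP VIEW sales_by_region;
    ALTER TABLE sales_by_region_physical RENAME TO sales_by_region;
""")
physical_result = db.execute(REPORT_SQL).fetchall()

print(prototype_result == physical_result)
```

The point of the sketch is that the report query never changes; only what sits behind the name `sales_by_region` does.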
b)        Enriching the ETL process
Frequently new data sources particularly from ERPs are required in the DW.  All too often the ETL lacks data access capabilities to complex sources.  Tight processing windows may require access, aggregation & federation activities to be performed prior to the ETL process.  The powerful data access capabilities of EII provide rich access and federation capabilities which can present virtual views to the ETL process which continues as though using a simpler data source.
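A small sketch of that idea, again using sqlite3 purely as a stand-in: a view does the joining and aggregating, so a deliberately naive ETL step can treat it like any simple table. The ERP-style table names (`erp_hdr`, `erp_itm`), the view and the `simple_etl` function are all invented for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Hypothetical ERP tables with an awkward header/item layout.
    CREATE TABLE erp_hdr (doc_no INTEGER, cust TEXT);
    CREATE TABLE erp_itm (doc_no INTEGER, qty INTEGER, unit_price REAL);
    INSERT INTO erp_hdr VALUES (10, 'ACME'), (11, 'Globex');
    INSERT INTO erp_itm VALUES (10, 2, 5.0), (10, 1, 20.0), (11, 3, 10.0);

    -- Virtual view: joins and aggregates before the ETL tool sees the data.
    CREATE VIEW v_order_totals AS
        SELECT h.doc_no AS order_id, h.cust AS customer,
               SUM(i.qty * i.unit_price) AS order_total
        FROM erp_hdr h JOIN erp_itm i ON i.doc_no = h.doc_no
        GROUP BY h.doc_no, h.cust;

    CREATE TABLE dw_orders (order_id INTEGER, customer TEXT, order_total REAL);
""")

def simple_etl(conn, source, target):
    """A deliberately naive ETL step: copy every row from source to target."""
    rows = conn.execute(f"SELECT * FROM {source}").fetchall()
    conn.executemany(f"INSERT INTO {target} VALUES (?, ?, ?)", rows)
    return len(rows)

loaded = simple_etl(db, "v_order_totals", "dw_orders")
print(loaded, db.execute("SELECT * FROM dw_orders ORDER BY order_id").fetchall())
```

The ETL step has no knowledge of the header/item structure; all of that complexity lives in the view.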
c)         Federating Data Warehouses
How many organisations have more than one DW?  Is the Information in each completely discrete?  I don’t think so.  Data Virtualisation provides powerful options to federate multiple DWs by creating an integrated view across them.  This has particular relevance in providing rapid cross-warehouse views following a merger or acquisition.
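To illustrate the federated view, the sketch below attaches two separate in-memory SQLite databases (standing in for two warehouses, say after a merger) and defines one view across both. The database aliases, tables and columns are hypothetical, and SQLite's ATTACH is only a toy analogue of what a DV product does across genuinely heterogeneous platforms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two separate in-memory databases stand in for two warehouses.
conn.execute("ATTACH DATABASE ':memory:' AS dw_a")
conn.execute("ATTACH DATABASE ':memory:' AS dw_b")
conn.executescript("""
    -- Each warehouse names things differently.
    CREATE TABLE dw_a.customers (cust_id INTEGER, name TEXT);
    CREATE TABLE dw_b.clients   (client_ref INTEGER, full_name TEXT);
    INSERT INTO dw_a.customers VALUES (1, 'Alice');
    INSERT INTO dw_b.clients   VALUES (7, 'Bob');

    -- In SQLite a TEMP view may span attached databases.
    CREATE TEMP VIEW all_customers AS
        SELECT 'dw_a' AS source, cust_id AS id, name FROM dw_a.customers
        UNION ALL
        SELECT 'dw_b', client_ref, full_name FROM dw_b.clients;
""")

federated = conn.execute(
    "SELECT source, id, name FROM all_customers ORDER BY source"
).fetchall()
print(federated)
```

Neither warehouse is copied or restructured; the integrated view simply maps both onto one common shape at query time.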

Data take on considerations ETL or EII?
When providing data into a DW, the use of ETL or EII (or both) needs care.  Some of the key considerations include:

Data replicated in DW and Operational System
      Update in one or both locations?
      If data is physically in two locations are there compliance issues (e.g. SoX, HIPAA etc.)?
Data Governance
      Is the data only managed in the originating Operational System?
Currency of the data
      How up to date are the data requirements of the DW?
      Is there a need to see the operational data?
Time to solution
      How rapidly is a solution required?
Life expectancy of source system(s)
      Are the source systems likely to be retired?
Need for historical / summary / aggregate data
      How much historical, aggregated data is required in the DW solution?

So whilst not applicable for every use case, the reality of having your data virtually served is well and truly there.

Fixing the flaws in Government IT

I recently had a look at the report here and got to thinking - what about Information & who guards the guards?
It's interesting when looking at what’s wrong with Government IT, the 6 authors are:
…. a Research Analyst at the Institute for Government.
….. a Senior Researcher at the Institute for Government.
…..a Senior Researcher at the Institute for Government …. previously worked in the Canadian civil service.
….. an Intern at the Institute for Government up until February 2011,
……..a Fellow at the Institute for Government;
…….a Senior Fellow at the Institute for Government

So naturally if we think there’s something wrong with Government IT (surely the whole premise behind commissioning the report) then a good place to start would be with exemplar organisations & practices that are “not wrong”.
So having got that rant over & actually believing that the authors are not best placed to provide objective criticism, here’s my 2p worth.
The focus is predominantly on technology.
The CIO in the vast majority of organisations is actually not an “Information” officer, but a “Technology” officer.  The few corporates that have successfully got to grips with how “ICT” can effectively serve the business are those who understand that whilst organisation / functional units change, personnel change, and data volumes increase, the fundamental definitions / concepts of business data (i.e. the conceptual / logical models) are relatively stable.
I say relatively because of course with wholesale mergers / acquisitions / divestments etc there can be larger change. 
Fundamentally, the information (and business process) models provide a good foundation upon which detailed technical processes (ie programs, packages, XML messages or whatever) can be built / implemented.  The unholy focus upon the “T” of IT witnessed especially in Government is analogous to spending lots of time & energy picking out the carpets, curtains & wallpaper because all that foundations & plumbing stuff is boring.
It’s about time government sat up & realised that Information across Government business areas / departments needs to be managed:  I was going to say … managed as well as within Government departments, but evidence shows that the discipline of true “Information Management” in most departments is woefully misunderstood, and the special competencies required are not present.  Not only that, the critical importance of information management as a professional discipline is not well understood - just how many “information management” professionals in Government IT have the Industry Data Management Qualifications?  Now compare that with say HR or Accounting professionals!
So why do we need a cross Government Information view?
Anti Money laundering
Illegal immigration
Homeland security
Counter terrorism
Organised crime
Benefit fraud
…… I could go on
So what’s to be done?
Create a Government “Information Management” officer & executive.
Establish cross government Information Management, Governance, Quality and Ownership responsibilities.
Think global – act local; i.e. establish the need / types / quality etc. for shared information but devolve the responsibility to a “lead” department.  After all, in the real world, corporate data governance programs establish data owners in the business to be responsible for the cross organisation stewardship of that type of data for the good of the whole company.
Key game changer is that Information must be thought of as a corporate (vs departmental) asset and its management must be for the good of the entire organisation – not just the silo I live in.
Until that happens, we’ll continue to have CIOs focusing on T who don’t give a D about I