Claudia Imhoff, PhD, President and Founder, Intelligent Solutions, Inc.
Colin White, Founder, BI Research
Robert Eve, VP Marketing, Composite Software
1
Gary Damiano
Hello ladies and gentlemen. I’d like to welcome you today to our webcast, “Caution, Bumps Ahead – Common Data Virtualization Mistakes to Bypass,” brought to you by Composite Software as part of our data virtualization leadership series of webcasts. Thank you for joining us today. We have an accomplished group of experts with us. We’d like to welcome distinguished analysts from the data management and BI industries, Dr. Claudia Imhoff, President and Founder of Intelligent Solutions, and Colin White, founder of BI Research, along with Robert Eve, Executive Vice President, Marketing, Composite Software.
2
Today’s topic is common data virtualization mistakes to bypass. Our speakers will be sharing the findings of newly published research on data virtualization usage and practices. You’ll be learning about where you can gain value from data virtualization by avoiding common mistakes. Our panel of experts will provide a perspective on ensuring successful data virtualization projects and will discuss real-world case studies involving best practices that evolved from common mistakes, and they will help you develop a plan to avoid those mistakes.
3
First, let me start by introducing our panel of speakers. Let’s start with Dr. Claudia Imhoff. A leader, visionary and practitioner in the rapidly growing field of business intelligence, Dr. Imhoff is a popular and dynamic speaker on business intelligence and the infrastructure that supports these initiatives. She’s the co-creator of the Corporate Information Factory architecture, also known as CIF. Dr. Imhoff has authored five highly regarded books on these subjects and writes monthly columns for technical and business magazines. Hello Claudia, it’s great to have you here.
Claudia Imhoff
Thank you, it’s wonderful to be here!
Gary Damiano
Next I’d like to introduce Colin White. Colin White is the founder of BI Research. As an analyst, educator and writer he is well known for his in-depth knowledge of business intelligence, data management, information integration technologies, and how they can be used for supporting smart and agile decision making. With 40 years of IT experience, he’s consulted for dozens of companies throughout the world and is a frequent speaker at IT events. Thanks for joining us today Colin.
Colin White
It’s my pleasure, thank you for inviting me.
Gary Damiano
And a few words about our third speaker, Robert Eve, Executive Vice President of Marketing, Composite Software. Bob Eve’s experience includes executive level marketing and business level roles at leading enterprise software companies such as Informatica, Mercury Interactive, PeopleSoft and Oracle. Bob is a prolific writer authoring dozens of white papers and blogs on BI, data integration and virtualization. Welcome Bob.
Robert Eve
Thanks Gary, glad to be here.
Gary Damiano
I’m Gary Damiano, Vice President Field Marketing, Composite Software, and I will be the moderator for today’s webcast. Our agenda and discussions today are based on a new white paper written by Claudia and Colin and published last week, looking at the common mistakes that businesses have made implementing data virtualization. We’ll provide you with information at the end of this webcast to help you find this informative and interesting research as well as resources that you’ll find very useful. Let’s start our discussion with Claudia.
4
Claudia Imhoff
Alright, well thanks very much Gary. Thanks very much Bob Eve, and of course many thanks to Composite and Colin White. It’s been a pleasure working with all of you on this. One of the things that I wanted to start off with of course is a little bit of a definitional type of thing: what is data virtualization? It’s also called data federation and many of you may know it by that term. It’s relatively new, of course. But then again it’s something that has made a tremendous impact in a lot of companies’ data management strategies and that’s a good thing.
What it’s all about is allowing you to quickly create virtual, not real but virtually integrated, sets of data for anything you can think of: for business intelligence, for customer relationship management, even master data management and so forth. The good news is that it combines data together without actually having to move the data from the original sources. And what that does is it greatly speeds up the process of data integration.
5
What we also know about data virtualization is that it has, as I mentioned, a significant role in data management strategies. Certainly ETL, extract, transform and load, is still a very popular data integration tool, and that’s a physical tool. But the ability to integrate data virtually really has had a great impact on our integration developers. However, and this is the whole point of this presentation today, it does have its limits. And that’s something that we need to understand: when to use data virtualization versus when to use that old data integration workhorse, data consolidation or ETL. Now with that, I’m going to turn it back over to you Gary, because I understand we have a poll coming up next.
Gary Damiano
Thanks Claudia. This is a great place to poll our audience on their perspectives towards data virtualization. So why don’t we bring up our first poll. So where are you in your adoption of data virtualization? I’ll give you a minute to respond to that.
Claudia Imhoff
I’ll be most interested to see how they do respond.
Gary Damiano
Well it’s good to see a few people currently using it because I think they’re going to be very interested in kind of the best practices aspect and learning from others’ successes and mistakes. I think for the folks that are considering it, this is a great opportunity to bypass a few of those speed bumps and make some real success quickly. Ok, let’s publish these results.
6
Claudia Imhoff
Wow, even Steven there for using it and considering using it. Over 62%, that’s wonderful. Yeah those that are considering it, I hope that this presentation will save you some pain. It is difficult sometimes to determine when to use it and when not to use it. Colin, comment from you?
Colin White
No, the results are interesting. I like the three way split, 66% of people are using it or are about to use it. Interesting.
7
Claudia Imhoff
Alright, well let me continue on then. Let’s talk about when to use it and when not to use it. The first one is the enterprise data warehouse. Everybody always asks these questions. Should I use data virtualization to create an enterprise data warehouse? And the answer categorically, at least as of today, is no, it should not be used to create a virtual EDW. What that would require is that our operational systems would have to undergo a pretty massive overhaul. They’d have to maintain the historical data for all of the queries. For example if you have queries that go back two years or three years or you want to look at trends in purchases and customer behaviors and buying behaviors over the last four or five years, well guess what? Your operational systems have to maintain that four or five years of history and that would really clog them up.
The second thing is the ability to support the mixed workload. These operational systems are set up, obviously, to do transaction processing: very quick, fast transactions, fast action and very fast response times. Again, if we start throwing at them these queries for business intelligence or analytics, mixing the longer-running queries with these short, fast transactional queries, you can imagine that the operational systems would kind of fall on their sword. So I would have to say that the current state of operational systems and the current state of data virtualization technologies are not suited for a virtual enterprise data warehouse, in that you still have to create your physical enterprise data warehouse using the traditional ETL or data consolidation type processes.
8
However, one of the very nice things that we can do with data virtualization is extend the EDW. In other words, bring in information from outside of our data warehouse, such as some current operational information, or perhaps, if you’re in the situation where you have multiple data warehouses, you can bring data together from the multiple data warehouses without having to physically move the data or combine the data into a single data warehouse. So you can see from the picture there that we can actually bring in operational data and have it reside right next to our analytic results. For example, it might be the current view of a customer’s order alongside some of their history or some of their information about segmentation or lifetime value, being combined together or being shown together through our data virtualization capabilities. Or, like I said, combining data from data warehouses or perhaps from data marts together into a single, unified view, a virtualized view of both of those things. So that’s one that we certainly can use data virtualization for.
The second one, after I said you can’t use it to build an EDW, the second thing it can be used for is prototyping. Prototyping in a BI project is certainly the cornerstone. We have to prototype because we’re not really sure of the requirements that our business users will like. So we create a physical prototype, gather the user input, pare down that prototype as we get that input and it changes, then rebuild it based on a second round of user feedback and start the examination process all over again. Obviously that can be extraordinarily time consuming; just building the prototype alone can be time consuming. So instead of going through this arduous process of creating a physical prototype, this is where data virtualization comes in. It makes a lot of sense that we should be able to rapidly bring together the data from the operational systems for a quick peek, not actually analyzing it, but perhaps looking at the data, making sure that the data is the appropriate data for our business intelligence project. And that way you also create much better documentation for doing the consolidation. If you see data quality problems in this virtual prototype, you know you’re going to have data quality problems when you try to do your ETL processing, and it can really help you understand the state of the data, understand the needs of the business users themselves, and really get a good handle on what you’re going to be facing as you extract that data, transform it and then load it up. So again, a very interesting environment, one that I do recommend, is this virtual prototype and the physical reality. Now with that, I’m going to turn it over to Bob. I think you have a case study for us?
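The idea of showing current operational data next to warehouse history, without copying either, can be sketched in a few lines. This is only a minimal illustration under stated assumptions: two in-memory SQLite databases stand in for a warehouse and an operational system, and the table names, column names and the `customer_view` function are hypothetical, not taken from any product discussed in the webcast.

```python
import sqlite3

# Two independent in-memory databases stand in for separate systems (assumption).
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE customer_history (customer_id INTEGER, segment TEXT, lifetime_value REAL)"
)
warehouse.execute("INSERT INTO customer_history VALUES (1, 'gold', 12500.0)")

operational = sqlite3.connect(":memory:")
operational.execute(
    "CREATE TABLE open_orders (customer_id INTEGER, order_id TEXT, status TEXT)"
)
operational.execute("INSERT INTO open_orders VALUES (1, 'A-100', 'shipped')")


def customer_view(customer_id):
    """Combine history and the current order at query time; no data is copied between stores."""
    history = warehouse.execute(
        "SELECT segment, lifetime_value FROM customer_history WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    order = operational.execute(
        "SELECT order_id, status FROM open_orders WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return {
        "customer_id": customer_id,
        "segment": history[0] if history else None,
        "lifetime_value": history[1] if history else None,
        "current_order": order[0] if order else None,
        "order_status": order[1] if order else None,
    }


print(customer_view(1))
```

Each call issues one query per source, which is why the load this places on the operational side matters as much as the convenience.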
9
Robert Eve
Yeah, thank you Claudia. I think one way to really think about data virtualization is in the mode of a complement to your enterprise data warehouse. And to show that point, and to bring a couple of those cases you just described to life, I’ll talk about one company, Pfizer, who has actually used us in a couple of ways of complementing their data warehouse. The first example is a case where they really ended up federating two data warehouses due to a merger. When Pfizer bought Wyeth, you now had duplicates of a lot of data warehouses, one of which was the research data warehouse. So they were able to very quickly, through a data virtualization layer, see the data across both systems and maintain the existing reporting as well as add new, combined reporting. So that was a very effective case and they were able to combine that data in just a few weeks, whereas in earlier acquisitions that project would have taken months.
10
Second case, also at Pfizer, is in the data warehouse prototyping area that you described, Claudia, as an important area. Turns out in the research part of the business, there’s a lot of change: clinical trial results, new discoveries, competitive activities. So there’s a lot of analysis going on to quickly understand the implications of those market and internal events. Now some of those analyses are very quick and can be done in just a few days in a virtual approach, just kind of as one-offs. With others of those analyses, they start to realize that this is something that we probably should be tracking for a longer time frame. So they use the virtualized approach initially, start to build towards a warehouse, and once they get it locked in over several months, then they’ll move from the virtual to the physical. What’s really been quite effective is that they can answer the questions quickly and say yes when the business asks, “Hey, can you help me with this?” Because if I have to build an entire warehouse, the answer is probably no, but if I can build something virtually very quickly, the answer is yes. And then they can fine tune it, fine tune it, fine tune it until they get to the exact need and they really deliver to the business. So I think those are a couple of good examples of how data virtualization can complement a data warehouse.
11
With that, why don’t we talk about the situation in operational reporting and I’ll swing that over to Colin.
Colin White
Ok, thanks Bob. It’s interesting that when data virtualization, or data federation, as Claudia mentioned is another name for it, started, it was very much used in the data warehousing space for accessing historical data, building analytics on it, and the other case studies that Claudia and Bob have been talking about. I think the other thing is that as data virtualization has evolved and matured, it now serves many other purposes. One is accessing live operational data, which means of course we can use it for operational reporting against that live operational data. But one thing we have to consider is that the reason we built data warehouses to hold historical data in the first place was performance. And Claudia mentioned this earlier: you start querying and analyzing data in parallel with updating it in an operational system, and that has a performance impact. As we’ll see later on in some of the discussion, we need to consider the performance impact of data virtualization on operational systems. But there’s no question that we can use data virtualization for operational reporting as long as we keep that performance in mind.
I think the other thing is, and this is a fairly common use case, and I think Bob will talk about this in a second, but I do see a lot of cases where people use virtualization for accessing both historical data in warehouses and also current data in operational systems. A good example there would be a call center that may actually use information about the customer like a lifetime value score, or any products that they’ve ordered recently, and if they want to get an up-to-date view of the customer, of course they would then need to go into the live operational data to take a look at that. So that kind of combination is quite important.
12
As we see on the next slide, there are other considerations here. If you think about operational reporting and analysis, it is a type of operational BI, which is one of the fastest growing areas of business intelligence. And I often get asked, “Can we use data virtualization for real time data analysis?” And the answer is no. Of course, I have to define the words real time here. By real time we’re talking about split second or a few seconds response time. And the reason is we have to recognize that virtualization in general is a pull type of technology. In other words we issue a query against data targets or data sources and we pull that data out. So if we wanted really split second analysis, or data that’s almost real time or within a few seconds, we would have to constantly issue queries to get that up-to-date data. Usually for that kind of processing, we’d use more of an event-driven architecture where the operational system is pushing data to the analytical systems for analysis. That’s much more efficient. I have actually seen people use virtualization to emulate event processing by polling an event source like a message queue. But in general, an event-driven architecture is much more efficient.
The bottom line here is we can use virtualization for operational BI, particularly reporting and some analysis, if we simply want to do that every hour, every two hours, or every thirty minutes, but we can’t really use it for real time. I’d like to go back to Bob for a second. I think he’s got some case studies here that talk about operational reporting at one of Composite’s customers.
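The pull-versus-push distinction above can be made concrete with a toy comparison. This is a sketch under stated assumptions, not a description of any product: the `Source` class, the tick loop and the event names are all hypothetical, and "cost" is reduced to a simple count of queries the source has to serve.

```python
class Source:
    """Hypothetical operational source that can be polled or subscribed to."""

    def __init__(self):
        self.events = []
        self.subscribers = []
        self.queries_served = 0

    def publish(self, event):
        self.events.append(event)
        for callback in self.subscribers:
            callback(event)  # push: the source notifies consumers itself

    def read_since(self, offset):
        self.queries_served += 1  # pull: every poll costs the source a query
        return self.events[offset:]


def poll(source, ticks, arrivals):
    """Pull model: query on every tick, whether or not anything changed."""
    seen = []
    for tick in range(ticks):
        if tick in arrivals:
            source.publish(arrivals[tick])
        seen.extend(source.read_since(len(seen)))
    return seen


arrivals = {3: "event-a", 8: "event-b"}  # made-up events arriving at ticks 3 and 8

pulled_source = Source()
pulled = poll(pulled_source, ticks=10, arrivals=arrivals)

pushed_source = Source()
pushed = []
pushed_source.subscribers.append(pushed.append)  # push model: register once
for tick in range(10):
    if tick in arrivals:
        pushed_source.publish(arrivals[tick])

print(pulled_source.queries_served, pushed_source.queries_served)  # 10 0
```

Both consumers end up with the same two events, but the polling consumer issued ten queries to get them. Shrinking the polling interval toward "split second" multiplies that query count, which is the inefficiency described above.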
13
Robert Eve
Thanks Colin. Here’s a case with an energy company, Aera Energy. They are the largest producer of oil in California. They’re the California operations of Shell and Exxon. I think it’s important to distinguish, as you say, how real time is real time. Really I think another way to think about it is up-to-the-minute data, or a snapshot of data, a current snapshot. So for example, in Aera’s case, they have a nightly data warehouse build of information from all the different 10,000 wells. That’s 1800 ETLs I think that they run every night. You know, surface data, subsurface data, a lot of financial and plant maintenance information. But during the middle of the day, if a well goes down, then you have a maintenance management question. Where’s the current equipment? Where are the current crews? How should we dispatch the current crews and equipment to the wells that need repair? And at that point you really want to take a snapshot and get an up-to-the-minute view of where all the equipment is and where all the crews are and what the flow rates are on the different wells, and kind of make a decision as to how to dispatch those resources.
So that’s what we’ve been able to do for Aera, is provide them that immediate snapshot. So instead of ten guys getting on the phone and talking about what they’re seeing and what they could work out, we have now one dispatcher able to make a real quick decision, and that gets the resources over to the well in need of repair quickly. They get the revenue up, you can avoid big problems, and that’s really proved to be a nice operational BI use case, sort of based on an event that is not a click stream type of event but a periodic event like a problem at a well. Let’s switch that back to you Colin to talk about non-relational data. I think a lot of times people just think about data virtualization as relational only, in this kind of traditional view.
14
Colin White
Well I think that’s what I said, that virtualization’s [inaudible 19:25] very much for data warehousing and structured data. I think from a use case point of view it’s moved beyond just analytical and reporting, BI-type processing to support other kinds of use cases. I think the same is true with data as well. Most of the examples that we’ve used so far have talked about structured data, and that’s certainly where the early products started, but as virtualization has grown, the existing products have evolved and become more mature, added additional capabilities and supported other data sources. We’ve also seen quite a few other products from other domains coming into the virtualization space. So you’ve got to be a little bit careful about what products you choose here. I think a key focus area for some of the mature products and some of the newer, evolving ones is to support other types of data. I think XML is a particularly strong focus. We’re now starting to see, even within the warehousing space, people bringing XML into data warehouses and analyzing that information. A lot of it’s coming from the web. It’s fairly easy in a web environment to extract XML data or convert the web data into XML format. So that’s quite a popular source. Another one is unstructured data, when people start to process weblogs for example.
And then in the other direction of course, in the operational world, not only do we build our own applications but we actually use application packages like ERP systems and CRM systems, and that’s another source that many of the virtualization products are supporting now.
Lastly, there are web services. Many companies are now starting to build service-oriented architectures where we’re actually encapsulating data as a web service, and again that’s an ideal source that virtualization can support.
15
But as always, like we say on the next slide, there are things we need to consider in using virtualization technology. There’s no question that the ability to access this much broader set of data sources supports a wider set of use cases. But the other thing is, there’s no magic here: if our data’s poorly structured or has data quality issues, then virtualization can’t make up for those structural problems. There are some things we can do, as Claudia will discuss in a minute, from a data quality viewpoint. But if you’ve got very poorly structured data or very poor data quality, it’s very difficult to clean that up dynamically as you query the data through virtualization. I think the other thing is, and I think this applies to virtualization in general but particularly to unstructured data, there’s no performance magic here. It’s very important to consider the performance architecture of a virtualization product. But if you’re joining multiple tables across multiple sites, or multiple data sources across multiple sites, be they structured or unstructured, and those data sources are very voluminous in nature, it takes a finite amount of time to retrieve that data. So again, you’ve got to take that into account when you use data virtualization with unstructured data as well as structured data.
But let’s go back to the data quality issue. Claudia, I think you’ve got some views about data quality complexity from a virtualization viewpoint.
16
Claudia Imhoff
Oh I do indeed Colin, thank you. Yeah you’ve hit on it quite nicely. Data virtualization happens at run time. It means that we’ve got a few seconds to integrate the data, much less clean it up and transform it. So certainly data virtualization can handle simple formatting differences: a data change or a format change of some sort, or some kind of easy data lookup and very straightforward translation of information. Perhaps it’s a currency translation or something like that. But if we get into the heavy duty stuff, the heavy duty data cleansing that may be required, doing multiple lookups trying to figure out if this piece of data matches this other piece of data, whether they should be brought together in the same, single record and so forth, then you may need to have specialized data quality services, if you’ve got multiple versions of the same customer for example, and those types of things. So as I said, data virtualization can certainly handle simple data transformations. For example, while integrating data from multiple ERP systems, if the data is virtually identical across the ERP systems, we’re not even doing formatting changes. It’s just basically merging it all together. If the data cleanup, the data transformation, does affect the performance of the operational systems, or it can’t be performed without a rather unacceptable bit of latency, then we need to switch our technologies from data virtualization back to data consolidation. I think that’s an important thing to do.
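The kind of simple, per-row transformation that fits inside a run-time query, a format change plus a currency translation, might look like the sketch below. Everything in it is an assumption made for illustration: the field names, the two date formats, and especially the exchange rates, which are made-up constants rather than real figures.

```python
from datetime import datetime

# Made-up conversion rates for illustration only (assumption).
RATES_TO_USD = {"USD": 1.0, "EUR": 1.25}


def normalize(row, date_format):
    """Apply light, per-row fixes that are cheap enough to run at query time."""
    amount_usd = round(row["amount"] * RATES_TO_USD[row["currency"]], 2)
    order_date = datetime.strptime(row["order_date"], date_format).date().isoformat()
    return {"order_id": row["order_id"], "order_date": order_date, "amount_usd": amount_usd}


def merged_orders(us_rows, eu_rows):
    """Merge rows from two hypothetical ERP systems into one common shape."""
    yield from (normalize(r, "%m/%d/%Y") for r in us_rows)
    yield from (normalize(r, "%d.%m.%Y") for r in eu_rows)


us = [{"order_id": "U1", "order_date": "03/31/2011", "amount": 100.0, "currency": "USD"}]
eu = [{"order_id": "E1", "order_date": "31.03.2011", "amount": 80.0, "currency": "EUR"}]
for order in merged_orders(us, eu):
    print(order)
```

Nothing here needs to compare one record against another. Matching multiple versions of the same customer does, and that is the heavier cleansing work that is better left to consolidation.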
17
That brings me to my next slide, which is we’ve got to monitor things. Obviously if I’m going to access the operational systems, the sources, without having to retool them, without doing anything to them, that is one of the great blessings of data virtualization. I can access the data in place; I don’t have to change my operational systems very much. However, that does mean that we are exposing our operational sources to a very different kind of activity than they were ever designed to handle. That can have a significant and very negative effect on the operational systems’ performance. And that’s something that we have to realize. Again, if we’re talking about some of the pitfalls that people fall into, this is one of them. So a mandatory component of a data virtualization project has to be some kind of activity monitoring capability. That allows you to measure the performance impact on the operational systems: are you beginning to degrade them, especially in real time or over a period of time, do you see them going slower and slower? If that’s the case, you better do something about it pretty fast. The last thing you want to do is bring your operational systems to their knees so that they can’t do what they’re supposed to be doing, which is the transaction processing. So it enables you to anticipate, especially if you’re monitoring over a long period of time and you see performance beginning to degrade, when it’s going to degrade to the level that is unacceptable. That means you can do all kinds of things at that point.
You can delay these particular analytics or this particular call or query to a more suitable time when the systems are not quite so busy, maybe during off hours or whatever. Or you may have to cancel them altogether and say, I’m sorry, but we are now seeing some problems with our operational systems and these types of queries are not acceptable. So those are the situations that I would say you need to look for. But the key to this whole thing, of course, is to be able to monitor the performance of the operational system in addition to doing your data virtualization. Now with that, I’m going to turn it back over to you Colin for that shared view of the data.
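The run, delay or cancel decision described above can be sketched as a small monitor that watches recent response times on an operational source. This is a minimal illustration, not a real monitoring product: the `SourceMonitor` class, the rolling window and both thresholds are hypothetical values chosen only to make the example work.

```python
from collections import deque
from statistics import mean


class SourceMonitor:
    """Tracks recent source latencies and suggests what to do with the next query."""

    def __init__(self, window=5, defer_above_ms=200.0, cancel_above_ms=500.0):
        # Thresholds are illustrative assumptions, not recommended values.
        self.samples = deque(maxlen=window)
        self.defer_above_ms = defer_above_ms
        self.cancel_above_ms = cancel_above_ms

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def decision(self):
        if not self.samples:
            return "run"
        average = mean(self.samples)
        if average > self.cancel_above_ms:
            return "cancel"  # source is struggling; reject this kind of query
        if average > self.defer_above_ms:
            return "defer"  # postpone to a quieter time, such as off hours
        return "run"


monitor = SourceMonitor()
for latency in (50, 60):
    monitor.record(latency)
print(monitor.decision())  # run
for latency in (400, 400, 400):
    monitor.record(latency)
print(monitor.decision())  # defer
```

Because the window keeps only recent samples, a sustained slowdown moves the decision from "run" to "defer" and eventually to "cancel", while a single slow query does not.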
18
Colin White
Okay Claudia, thanks. I think we’re starting to look at other benefits of data virtualization, possibly some less obvious benefits. The way I like to look at the shared view of data is if you think of [inaudible 27:16] today and the way they access data warehouses, often they have a common metadata layer, a business view in front of data warehouses or underlying data marts. So in fact the business users are isolated from any of the physical considerations or locations of that data warehouse data. And secondly, you can create a higher-level business view of that. We can do the same with data virtualization: we can actually build a shared view of the source data that’s flowing into a virtualized environment, and that source data can be operational data, unstructured data or even data warehousing data, or combinations thereof. So what some virtualization [inaudible 28:00] ask you to do is really build a shared view of this heterogeneous data. And the advantage of that is that people using this shared view eliminate inconsistencies in any information that may be flowing into a data warehouse. And particularly where the virtualized environment is delivering combined data to reporting programs and analytical programs, it means both the applications and the users of those programs access one view of that federated or virtualized data. And that eliminates the need for developers to create private views of individual pieces of source data, which reduces maintenance, speeds up application development and lowers its cost.
There’s also the possibility, of course, that you could create particular tailored views of the shared view for particular users or particular developers. The key thing is that all the data should flow through a shared view, and then you may tailor subsets of that shared view for particular purposes. That way you get consistent data, and it isolates the consumers of this data from the individual data sources. And that, in addition to reducing maintenance and speeding up development, also makes it easier to migrate data between systems. So if we are moving, say, from a legacy system to a more modern system, or we’re doing acquisitions or other migrations, we can often move that data between these different sources without affecting that shared view. Which means that when you do these migrations or change the data sources, you don’t have to modify the applications that access this data, and any kind of migration or movement is transparent to the user. And that is a significant benefit, and I think it’s very important with virtualization to think in terms of these shared views. And I believe, Bob, that you’re going to talk about a customer that actually did this. I think it was Wells Fargo Bank.
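The migration benefit of a shared view can be sketched as follows. This is a toy model under stated assumptions: the `SharedView` class, the two source functions and their column names are hypothetical, and a real virtualization layer would define the view in metadata rather than in application code.

```python
class SharedView:
    """Consumers call the view; only the view knows which source backs it."""

    def __init__(self, fetch_rows):
        self._fetch_rows = fetch_rows

    def rebind(self, fetch_rows):
        # Swap the underlying source, for example during a migration.
        self._fetch_rows = fetch_rows

    def customers(self):
        return sorted(self._fetch_rows(), key=lambda row: row["customer_id"])


def legacy_source():
    # Hypothetical legacy system with its own column names.
    raw = [{"CUST_NO": 2, "NAME": "Bravo"}, {"CUST_NO": 1, "NAME": "Acme"}]
    return [{"customer_id": r["CUST_NO"], "name": r["NAME"]} for r in raw]


def modern_source():
    # Hypothetical replacement system exposing the same data differently.
    raw = [{"id": 1, "display_name": "Acme"}, {"id": 2, "display_name": "Bravo"}]
    return [{"customer_id": r["id"], "name": r["display_name"]} for r in raw]


def report(view):
    """A consuming application: it depends only on the shared view's shape."""
    return [row["name"] for row in view.customers()]


view = SharedView(legacy_source)
before = report(view)
view.rebind(modern_source)  # migrate the source; report() is untouched
after = report(view)
print(before == after)  # True
```

The mapping from each source's column names to the shared shape lives in one place, so the consuming report never needs to change when the source does.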
19
Robert Eve
Yes, absolutely. Thank you Colin. I think there really is an opportunity, as you described, to bring a business view of data and a shared, cross-enterprise view. Wells Fargo has really done that in their investment banking area, where they’ve now got 200 sources that have been made available through the data layer. And that’s currently supporting over 25 applications; new applications and reporting are brought on with new requirements. They did have to do some data modeling and rationalization so that those schemas make sense. But what they find as they’ve done that work is that with each new project you get a lot of reuse. So they are able to support new requirements much more quickly, 50-60% faster. Now they’re starting to get 25% or so more reuse on some of the objects that they’ve built, some of the common views across the data models. That’s on the schema side. On the application side, and I think Claudia makes some good points about performance, they’ve had to think about their performance if they’re going to provide data in this way. It turns out quite a few of these use cases are kind of wide, shallow queries; they aren’t deep analytics necessarily. For the deep analytics you kind of have to do some advanced processing. But they have source data, they have operational data stores that they can get data from. We can use a variety of caching techniques. Oftentimes we’re getting the data from consolidated sources, so it’s already been consolidated and cleansed. So they’ve been able to really rationalize a very wide architecture that’s becoming ubiquitous across the bank and save a lot of money and get a lot of flexibility.
With that, maybe we should talk a little bit more about architecture. Before we do that, this seems like a good opportunity to jump into another poll. So why don’t we bring up our second polling question: what use cases should we consider for data virtualization? We had a lot of responses in the first poll from people who are considering using it or are currently using it. So why don’t you share with us the different types of use cases that you’re considering for data virtualization? And you can select multiples of these. There’s augmenting and extending the warehouse, which would be the Pfizer case we talked about of federating the two warehouses. There’s data warehouse prototyping, another research opportunity at Pfizer. There’s operational reporting, with the Aera Energy case, the kind of snapshot for maintenance management of what happens during the day. There’s accessing non-relational data, which Colin talked about. Then there’s this common shared business view, with the Wells Fargo example. So lots of different examples to refresh there where you might think about using this capability.
Colin White
Yeah, I’m anxious to see the results here.
Gary Damiano
I can pre-see it and it’s very interesting. Ok, why don’t we go ahead and publish those results.
21
Claudia Imhoff
Wow!
Colin White
That’s pretty interesting. The shared view one is interesting isn’t it?
Claudia Imhoff
Yeah, over half. I’m really somewhat surprised that the shared view is over half. I had my money bet on the extending or augmenting the data warehouse. I wasn’t far off.
Colin White
The other one of course, but even up operational reporting [inaudible 33:15], it does show the industry as a whole is going to appreciate the wider sort of use cases.
22
Colin White
Let’s talk about architecture a bit. Bob talked a little bit about architecture; I wanted to drill down to the next level. If we actually look at virtualization technology, there are a number of different ways it’s delivered by vendors. Sometimes it’s delivered as a part of another product, which has advantages and disadvantages. The advantage is you only buy one product; the disadvantage of course is you can only use virtualization within the confines of that product, and you’re possibly locking yourself into that vendor. I think the second thing is that virtualization appears at different levels of an application hierarchy. For example, with some BI tools, it’s actually integrated into the BI tool itself, whereas if we take more of a Composite approach it acts as a middle-tier server. And lastly it can be built into backend data systems, but again you’re locked into the platform of that database vendor. I think the interesting discussion is between embedding the virtualization in the front end and putting it in a middle-tier server. When it’s embedded in the front end, I think that works if you want to access just one or two data sources, and the processing isn’t very complicated or the volumes are lower. Obviously that’s a faster and cheaper solution. But if you’re very serious about data virtualization, where the data volumes are increasing or the number of sources used by different use cases is increasing, then I think you need to move it more towards the server environment and take advantage of the server product to get the performance you need. So I think when you start to look at virtualization you need to carefully evaluate the architecture and features and match those to the use cases. You have to understand the differences in performance and scalability between products.
And I think the other thing, and we talked about the shared view, is that there are also important characteristics of products to look at, to make sure that you’ve got the tools that enable you to actually develop these shared views, develop the metadata and also administer them.
23
And from an architecture viewpoint, if we move to the server environment, we need to look at multithreading and parallel processing so we can get the performance. Bob mentioned caching in passing, and as we start to combine data from multiple data sources, as we combine we [inaudible 35:54], and intelligently move the data around the network. This virtualization technology almost acts as a read-only distributed database, a heterogeneous sort of database. But as I said, it’s also important to have development tools. And then from an administrative viewpoint, as the usage of data virtualization increases, we need to be able to control the tasks that Claudia mentioned, for example, as they operate, and look at the performance impact on operational systems. We want to support error reporting and debugging and just maintain the configuration of the systems. So architecture becomes very important as you start to make more use of virtualization technology. And the other thing that we need to consider is infrastructure, as we’ll see on the next slide.
24
Not only do we need to look at the architecture of the virtualization technology, but we also need to look at the overall IT infrastructure and whether it can actually support virtualization. Going back to the balance Claudia mentioned at the beginning, it’s a question of balancing your use of data virtualization. Some organizations, as we’ve seen in Bob’s case studies, have made extensive use of virtualization and pushed the technology to its limits. Other companies are sort of putting their toe in the water and just trying it out. And some of that balance is affected by the infrastructure of the overall IT systems, because you’ve got to have the performance in place from both a server viewpoint and a network viewpoint. You’ve got to take a close look at that, all the way from the client through the middle tier and backend server configuration to the network, and then balance that against the benefit you’re getting from data virtualization. So we need to think about this.
25
And I think another thing, as we’ll see on the next slide, is that we also need to consider the application and architecture requirements of a virtualization environment. As I mentioned, we very much started with virtualization technology accessing structured data, then moved to supporting a certain amount of unstructured data. Now we’ve got virtualization products supporting application packages from companies like SAP and Oracle, and also supporting web services. As companies start to modernize their application and data architectures and put them into a service-oriented architecture, that works very well with virtualization, particularly where the products can actually consume web services. So if you’ve got data sources, particularly legacy data sources, that have been put into a service-oriented architecture, they work pretty well in a virtualized environment because of the restructuring that was necessary. I think Bob mentioned this in the Wells Fargo case, where they tidied up some of the architecture and data structures, and that makes it much easier to bring this into the virtualized environment. One thing you must consider is that, just as in the data consolidation ETL approach that Claudia mentioned, which is our traditional approach, the work we have to do in analyzing source data still has to be done in a virtualized environment. We still need to understand our data sources from a quality and structural viewpoint. The analysis we need to do up front on data sources in a virtualized environment is very similar to that in a data consolidation ETL environment.
26
It’s always tough actually when you’ve only got an hour to do things, but hopefully we’ve given you some things to think about in terms of use cases and things to watch out for. So let’s try and summarize where we are in virtualization. As we saw from the polls, it’s gaining in usage throughout the enterprise, and I think the number of use cases it can be applied to is increasing as IT organizations begin to understand where you can apply the technology. And again it’s a balance, as Claudia mentioned. I think another thing is that, because of the popularity of virtualization, a number of vendors, everything from database vendors to data integration vendors to even BI vendors, are starting to support virtualization technology, and we’ve got this trade-off. The disadvantage of course, if you buy it packaged into an existing solution, is that you’re tied to that platform. So there are always a number of advantages to using independent products, and using them in combination with that platform is of course what Composite’s approach is about. But Claudia, you may like to talk about additional resources where people can get more information about data virtualization and possibly drill down into some of the things that we’ve been discussing.
27
Claudia Imhoff
Absolutely. Fortunately, and I want to thank the Data Warehousing Institute for this, they gave you, Colin, and me the opportunity to write one of their Ten Mistakes booklets. It just came out this week, so it’s literally hot off the presses: Ten Mistakes to Avoid When Using Data Federation Technology. Again, data virtualization and data federation are just different terms for the same thing. We go into a bit more detail; there are 10 mistakes to avoid. You heard a few of them today, but I think there’s more in the booklet than we can talk about in an hour, as Colin mentioned. The second one is a terrific tool that Composite has come up with, and that’s the second bullet point there, the Data Virtualization Strategy Recommendation Tool. In other words, you put in your scenario, you give some details, and the tool can actually tell you whether it is appropriate or not to use data virtualization in that scenario. So truly a terrific tool, and I do encourage you to download that, as well as our booklet. It is a PDF that you can download from the Data Warehousing Institute’s website, and that is www.tdwi.org. So with that I’m going to turn it back over to Composite for the Q&A section of our presentation today.
28
Gary Damiano
Thanks Claudia. I know that many in our audience have questions, and we’ve certainly seen a flow of questions coming in during the session. We have some time, so at this time I’d like to throw some questions out to the panel for discussion. One question that caught my eye, Claudia, I think you actually responded to, probably directly to the person who asked it. The question was: what is the difference between virtualization and cloud computing? I know there’s a lot of confusion out in the marketplace over the terms and, to a certain extent, misuse of the terms. Let me throw that one up to the panel and let you talk about that.
Claudia Imhoff
Colin, why don’t you start since you’ve been studying cloud quite a bit.
Colin White
Sure, I mean there’s been a move towards cloud computing in various ways, both at the infrastructure level and the platform level, but also at the service level. So we are moving data into the cloud and doing BI and analytics there, and in some cases we’re even moving data warehouses into the cloud. The big challenge that comes up as we move data into the cloud is enterprise integration. In other words, particularly if we’re doing BI, we’ve got some data on premises and we’ve got some data in the cloud, and we need to bring that together, either through data movement, which is moving data backwards and forwards, or by creating a virtualized view of it. So in the same way that data virtualization can handle heterogeneous sources on premises, it can handle heterogeneous sources where some of the data is in the cloud and some of the data is on premises. The issue, as always, is performance: if you’ve got a wide area network involved, you’ve got to look at the performance of that network. But certainly, virtualization extends quite nicely to the cloud environment and is a part of the strategy you have to develop for enterprise integration when you’ve got some data in the cloud and some data on premises.
Claudia Imhoff
That’s a good way of putting it. The other thing that I would say, if this clarifies it for Michelle a little bit, is that the cloud is still a physical storage device. You are physically storing the data there, not virtually. And I did want to make that distinction. I’m not sure if that’s where your question was going. I think Colin has certainly given you the idea of how virtualization can help with yet another source of data, but the cloud is in fact a storage mechanism.
Colin White
It’s another physical data source.
Claudia Imhoff
Right.
Robert Eve
Yeah, I think it’s kind of interesting. We oftentimes think about the consuming applications, the BI reports, the portal and that sort of thing, and then the source applications. The cloud, in some ways, can be either. It can be a source: for example, it could be Salesforce, where we’re getting some sales information about a customer, and maybe we’re going to our SAP back office system on premises to look at the invoicing and shipment history, and maybe we’re going to our master data management warehouse from IBM and joining that data to get a kind of unified view of the customer. So one of those sources, in this case Salesforce, was in the cloud. Or maybe the analytic application is in the cloud and all the data sources are on premises, and in that case we’re providing the data to the cloud consuming application. So you can have the cloud either in the source or in the consumer, and I think we enable that pretty well. As Colin said, the way a product like ours, or any data virtualization solution, is architected, it tends to be architected to run well across the network, and therefore it extends across the internet, through the firewall, etc., with just some extra configuration activity.
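The unified customer view described here can be illustrated with a minimal sketch. It assumes three hypothetical source functions standing in for a cloud CRM, an on-premises back office system and a master data store; the function names, fields and sample rows are invented for illustration and do not represent any vendor's API. The point it shows is that the join happens at query time and nothing is copied into a new store.

```python
from typing import Dict, List

# Hypothetical stand-ins for three physical sources. In a real deployment
# each would be reached through its own driver or web service.
def crm_cloud_source() -> List[dict]:
    return [{"customer_id": 1, "open_opportunities": 2},
            {"customer_id": 2, "open_opportunities": 0}]

def erp_on_premises_source() -> List[dict]:
    return [{"customer_id": 1, "invoices": 5, "shipments": 4},
            {"customer_id": 2, "invoices": 1, "shipments": 1}]

def master_data_source() -> List[dict]:
    return [{"customer_id": 1, "name": "Acme"},
            {"customer_id": 2, "name": "Globex"}]

def unified_customer_view() -> List[dict]:
    """Join the three sources on customer_id each time the view is queried."""
    crm: Dict[int, dict] = {r["customer_id"]: r for r in crm_cloud_source()}
    erp: Dict[int, dict] = {r["customer_id"]: r for r in erp_on_premises_source()}
    view = []
    for master in master_data_source():
        cid = master["customer_id"]
        row = dict(master)            # start from the master record
        row.update(crm.get(cid, {}))  # add cloud sales information
        row.update(erp.get(cid, {}))  # add invoicing and shipment history
        view.append(row)
    return view

for row in unified_customer_view():
    print(row)
```

A consuming application, whether on premises or in the cloud, would only ever call the view and would not need to know where each field physically lives.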
Colin White
I think the other thing, Bob, is we often think of the consumer of virtualization as being a BI tool or user, but another consumer could be a data consolidation tool. An ETL tool can use a virtualized view as a source, which means we can move data from the cloud to on premises and back again, but that movement would be transparent to a data consolidation product. So that’s why I think we always think of virtualization technology as being part of a data integration strategy.
Robert Eve
Well I think it’s a great point, Colin. In a sense, as part of data integration technology, think of it as another tool in your toolbox. Maybe sometimes you use a consolidated approach, sometimes you use a virtualized approach and, as we talked about today, sometimes you use a combination. And I think one of the key points is don’t constrain yourself. Solve the problem using the best combination of technologies for solving the problem.
Gary Damiano
Yeah, you know, bringing it back to one of Claudia’s earlier points, one of the practices that she was calling out was people not using federation when they really should. This is probably another example of people who should be using federation, even though they’re in a cloud computing environment, and really leveraging what it can do.
Colin White
And then another thing I noticed, in another question that came up, is that we tend to talk about virtualizing data, but of course you can virtualize metadata as well. If you’ve got multiple metadata repositories, I’ve seen people consolidate metadata using virtualization technology. Particularly if you’ve got multiple ERP systems from the same vendor, there are ways of consolidating both data and metadata from those kinds of packages.
Gary Damiano
Ok, let’s take on another question. The question was: how does Composite handle query performance, especially in areas where it’s pulling from different sources?
Robert Eve
Well I think the first thing that happens, if you think about the process, is you sort of build a view that combines the data from the multiple sources, and then our optimizers automatically use cost-based optimization and other rule-based optimization techniques to determine the best way to optimize across those multiple sources. Now that’s automatic. You can then start to do things in addition to that, such as creating some caches and working on the query. And when we attach to those sources, we don’t just kind of go through ODBC and connect; we have a good understanding of the optimizers of the source systems. For example, if we’re getting data from Netezza or from Oracle, we fully understand those optimizers and use the best practices in each of them, and then in combination, for the specifics of a query.
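One general technique behind optimizing across multiple sources is pushing filters and aggregation down to each source, so that only reduced results cross the network. The sketch below illustrates that idea only; it is not Composite's optimizer. It assumes two in-memory SQLite databases as hypothetical sources, with invented table names and sample rows.

```python
import sqlite3

def make_source(ddl: str, insert_sql: str, rows: list) -> sqlite3.Connection:
    """Create an in-memory SQLite database standing in for one source system."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn

# Hypothetical source 1: order amounts.
orders_db = make_source(
    "CREATE TABLE orders (customer_id INTEGER, amount REAL)",
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 100.0), (1, 50.0), (2, 75.0), (3, 20.0)],
)

# Hypothetical source 2: customer attributes.
customers_db = make_source(
    "CREATE TABLE customers (customer_id INTEGER, region TEXT)",
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "west"), (2, "east"), (3, "west")],
)

def total_by_region(region: str) -> float:
    """Answer a cross-source query while moving as little data as possible."""
    # Step 1: push the region filter down to the customer source,
    # so only matching keys are returned.
    ids = [row[0] for row in customers_db.execute(
        "SELECT customer_id FROM customers WHERE region = ?", (region,))]
    if not ids:
        return 0.0
    # Step 2: push the aggregation down to the orders source, restricted
    # to those keys, so a single number comes back instead of every row.
    placeholders = ",".join("?" for _ in ids)
    result = orders_db.execute(
        f"SELECT COALESCE(SUM(amount), 0) FROM orders "
        f"WHERE customer_id IN ({placeholders})", ids).fetchone()
    return float(result[0])

print(total_by_region("west"))
```

A naive alternative would fetch both tables in full and join them in the middle tier; the pushdown version returns the same answer while transferring a list of keys and one sum.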
Gary Damiano
Ok, there’s another question. Is there any reason not to consider virtualization with metadata repositories?
Claudia Imhoff
I thought that was a great question. And I really thank the person from the audience for submitting it. The answer is no, as long as they’re open to being tapped into and accessed. I think it’s a terrific, it’s one that I hadn’t even thought about. I think it’s a terrific idea, using data virtualization to bring together the various components of metadata. Colin?
Colin White
And that’s sort of, I did see that question and I commented sort of in passing in the previous question, and I think people do that. And there are two ways of doing it. One is through the repository’s API; you can access the data dynamically. If not, you could actually unload the repositories into flat files and then access it that way, but certainly most repositories have an API, and of course the volume there is much lower. We’re not talking about huge amounts of data. So I have seen people do that for consolidating or creating virtual views of multiple metadata repositories. And if you’ve got, say, multiple copies of the same application package in different locations, not only can you create a single virtual view of that data, you can also create a single virtual view of the metadata as well.
Claudia Imhoff
Since there aren’t updates to it in a fast fashion, it wouldn’t affect its performance or anything else. I think it’s a terrific idea.
Colin White
Also it’s low volume and also it’s cleaner typically. You don’t have a lot of data quality and structural problems. The only thing you’ve got to watch of course is that some metadata repositories can have very complex data structures, so it’s a question of whether you can sort of flatten those hierarchies.
Robert Eve
Or perhaps maybe we want to create a view over just some of that instead of the entire metadata source, maybe just some of that source that we need because we want to compare and share that with others.
Colin White
Of course many repository vendors are starting to provide web service APIs to their metadata to hide the actual structural complexities of the metadata.
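The concern raised in this exchange about flattening complex metadata hierarchies can be sketched briefly. The example assumes a hypothetical nested metadata entry for a single table; the structure and field names are invented for illustration. It turns the hierarchy into flat rows of path and value, which is the shape a relational-style virtual view over a repository would need.

```python
from typing import Any, Dict, Iterator, Tuple

def flatten(node: Dict[str, Any],
            path: Tuple[str, ...] = ()) -> Iterator[Tuple[str, Any]]:
    """Yield (dotted_path, value) for every leaf value in a nested dict."""
    for key, value in node.items():
        child_path = path + (key,)
        if isinstance(value, dict):
            # Descend into the sub-hierarchy, carrying the path so far.
            yield from flatten(value, child_path)
        else:
            yield ".".join(child_path), value

# Hypothetical metadata entry for one table in one repository.
repository_entry = {
    "schema": "sales",
    "table": {
        "name": "orders",
        "columns": {
            "order_id": {"type": "integer", "nullable": False},
            "amount": {"type": "decimal", "nullable": True},
        },
    },
}

rows = list(flatten(repository_entry))
for dotted_path, value in rows:
    print(dotted_path, "=", value)
```

Entries flattened this way from several repositories share one simple row shape, so they can be combined or compared without each consumer having to understand every repository's native structure.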
Gary Damiano
We’ve got time for one more question so let me toss it out. Is it the right use case to have Composite in between the ETL tool and the target source systems during data integration with the intent of abstracting the underlying source and target?
Robert Eve
Well I think I’ll take the first shot at that. I think that Colin mentioned that there’s opportunity to use data virtualization to kind of create a virtual source that the ETL can use when it’s uploading to the data warehouse. But I wouldn’t put data virtualization in the slip stream between the ETL tool as it’s going from the source to the consumer. We can be a source, but I don’t want to intersect in between that because that’s what they really optimize and do such a great job on. Once you’ve integrated the data into a warehouse, you could then put data virtualization above that and put a nice virtualized view above that and combine it with other data much like the situation at Wells Fargo. So I think you can surround ETL but you don’t want to intersect ETL. Colin, Claudia?
Colin White
I think that says it very well. I agree.
Claudia Imhoff
Yeah.
Gary Damiano
I’d like to thank you for your participation in the Q&A, that’s about all the time we have here today. This has been a great session. We’ve talked about when to use federation, and then maybe when not to use it, we’ve talked about as in the case of Pfizer how they’re using it for prototyping. We talked about how Wells Fargo is using it for shared views of the data. We learned that you need to think about the amount of data, whether it’s too little or too much. And we also talked about data quality and not waiting for perfect data. So there’s a lot of interesting mistakes to avoid and best practices to follow that we’ve covered in this webcast. So I’d like to thank our panel of experts. I’d like to thank our audience for their questions and their participation in the polls and in attending this webinar.
Robert Eve
One quick point on the TDWI report: you need to be a member to get that report, unfortunately. Maybe we can figure out a way for Composite to get reprint rights. Some of you may not be able to immediately download the report, so just be careful on that. But oftentimes somebody in your company is a member, and I’d encourage you to become a member if this is a topic of interest. This is a great organization to belong to.
Claudia Imhoff
And it’s free.
Gary Damiano
You just sign up and it’s free membership with TDWI?
Claudia Imhoff
Yes.
Gary Damiano
We certainly recommend that everyone go and download that white paper because it really goes into more depth on the ten common mistakes of data virtualization. We have more data virtualization leadership webcasts coming your way plus we have a library of information available on our website of prior data virtualization leadership webcasts that we encourage you to take a look at and listen to. Please go to www.composite.com to register for future events or revisit some of our prior events. You could also follow us on Twitter, our handle is compositesw. This concludes today’s webcast on data virtualization mistakes to bypass. I’d like to extend a thank you to our presenters, Claudia, Colin and Bob for spending their time with us today. And a special thank you to you, our audience, for attending our webcast. Thank you for your participation and have a great day.