David Besemer, CTO at Composite Software

David Besemer
…on today’s information for you. We’ve heard a lot of different discussions today about different use cases, about different potential approaches to problems and as we have learned in dealing with customers over the last eight years, everything falls into a certain number of patterns. And I want to talk about these patterns, some of which are very mainstream today, some of which are futures.

So first thing is, you know the problem, data volumes are exploding everywhere. And this is actually a study from UC Berkeley a few years ago that shows data going exponential in about 2010, imagine that. And I’m sure that you see this all the time. What’s interesting about this is half of this problem is being exacerbated by the fact that we keep copying data. So even if we had our arms around all of our data, we keep copying it and data virtualization can certainly help with that part of the problem, but that’s not the whole problem.
The rest of the problem is that we are producing more data, our businesses are moving faster so we need to deal with more data. We have click streams we have all kinds of trade data as you saw in the NY Stock Exchange presentation and these volumes of data are just exploding and they’re getting harder and harder to deal with. I have good news for you though, we have lots of different strategies for dealing with storing this data and analyzing it.

There’s of course, the traditional warehouses and databases and these have been great for a long time and they’ve served us very well. I’m sure all of you have at least 20 or 30 of them, right?

And of course emerging we’ve got columnar and MPP data sources that are attempting to both reduce the cost per terabyte while also increasing the responsiveness of doing queries. So not as good for transactional, but really great for analysis and retrieval.
And then of course you’ve got kind of the wild west emerging which is all the NoSQL approaches. How many of you are doing anything with NoSQL today? Right, exactly. But you’re hearing a lot about it aren’t you? It remains to be seen exactly where this will end up in the business, these approaches all emerged of course from these large website services like Google, and Yahoo, and Twitter, Facebook, etc where they have unique data needs. Not every business has those unique data needs, but nonetheless a lot of these are available for you.
So that’s the good news, there’s lots of ways you can store and analyze your data. The bad news is you’re going to end up with data all over the place. You probably already have that, it’s just going to get worse. So you want to try to do two things. You want to try to slow that down and then you need to put in place mechanisms for dealing with the fact that you have data all over the place. There are reasons that you’ll want to deal with data in different ways so you will always have data in different places.

So what do you need? Federated query of course, we’ve talked a lot about that. In fact, sometimes data virtualization is known only as federated query. You’re going to need loose coupling. And in fact this is probably one of the strongest benefits. In the panel discussion you heard about agility, and that’s really what this comes down to. Location transparency, so that when you want to ask a question or do a query you don’t have to know where the data is: you’ve got a set of business canonicals, you query those canonicals, and the data comes from wherever it is. And of course semantic abstraction.
I want to define my business data in a way that everybody understands and everybody can use regardless of how it’s stored or how it was created. What does all that add up to? Well that’s what we’re trying to do with data virtualization. And as it’s emerged through the years the patterns have evolved into more and more complex use cases and they continue to do so.
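As a rough illustration of location transparency and business canonicals, here is a minimal Python sketch. It assumes two hypothetical sources (an in-memory SQLite table and a flat-file export) and a hypothetical catalog of canonical names. None of these names come from the talk or from any Composite product; they only show the shape of the idea.

```python
import csv
import io
import sqlite3

# Source 1 (hypothetical): a relational system holding customers.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE cust (cust_id INTEGER, cust_nm TEXT)")
crm.executemany("INSERT INTO cust VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])

# Source 2 (hypothetical): a flat-file export holding orders.
ORDERS_CSV = "customer,amount\n1,250.0\n1,100.0\n2,75.5\n"


def fetch_customers():
    """Map the CRM's physical column names onto canonical field names."""
    rows = crm.execute("SELECT cust_id, cust_nm FROM cust ORDER BY cust_id").fetchall()
    return [{"customer_id": cid, "name": name} for cid, name in rows]


def fetch_orders():
    """Parse the flat file and coerce its strings into canonical types."""
    reader = csv.DictReader(io.StringIO(ORDERS_CSV))
    return [
        {"customer_id": int(row["customer"]), "amount": float(row["amount"])}
        for row in reader
    ]


def customer_totals():
    """Canonical view: federate both sources by joining on customer_id."""
    totals = {}
    for order in fetch_orders():
        key = order["customer_id"]
        totals[key] = totals.get(key, 0.0) + order["amount"]
    return [
        {"customer_id": c["customer_id"], "name": c["name"],
         "total": totals.get(c["customer_id"], 0.0)}
        for c in fetch_customers()
    ]


# Consumers ask for a canonical name; the catalog decides where data comes from.
CATALOG = {"customer_totals": customer_totals}


def query(canonical_name):
    """Resolve a canonical name without exposing source locations."""
    return CATALOG[canonical_name]()
```

A consumer would call `query("customer_totals")` and never see that one half of the answer came from a database and the other half from a file. Swapping either source only changes the fetch function behind the catalog entry.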

So what are these patterns? There’s five of them. There’s something here for everybody in the room. If you’re brand new to data virtualization there’ll be some pretty good entry points for you here; how to get started. If you’ve been using data virtualization and you’re ready to expand its usage because you now actually trust it; there’s also some patterns here for you. And if you’ve been deep into data virtualization and you’re pushing the envelope the futures patterns will probably appeal to you. So these top three patterns – data federation, data warehouse extension, and enterprise data sharing are mainstream today and I’m going to talk briefly about each one.
The two patterns on the bottom – real time enterprise and cloud data integration - those are emerging and future. I’m sure that you’re all aware that cloud computing is doing a great job climbing up the left side of the exceeding expectations curve on the Gartner Hype Cycle. And eventually it will get around to the point and drop down in the trough of disillusionment, but it will come out the other side and when it does there will be real value in doing cloud computing on a grand scale and at Composite we’re already thinking about it. So I’m going to talk about these first three patterns somewhat briefly and that’s because they’re not about future they’re about what’s happening today, but it will again kind of put a bow on things for you today. And then I’ll spend a little bit more time on each of the bottom two.

Ok, so data federation. This is where it all started and it’s a very simple statement: “My application requires data from multiple incompatible sources.” We could probably all say this about some application or dashboard or report in our enterprise. And the whole idea behind data federation is we make that easy. Now what’s interesting about this is there’s a bunch of sub-use cases under each one of these. These are all covered and explored in depth on our website so you can go and get more information about each one of them, but the bottom line here is the idea that we marshal all of these differences in protocols, these differences in schemas, location, etc to bring that together so that your application or your dashboard or your report just doesn’t have to worry about it.
This is where a lot of our customers enter the world of data virtualization. Say you find a good use case for this, for example an executive dashboard: your executives want that report every Monday morning and that data’s got to come from 25 different systems. You can pretty much justify an investment in data virtualization just to make that easier for you. And by the way, the data federation use case has become ubiquitous, and this is why business intelligence vendors have started adding data federation options to their systems. Don’t go there, because as soon as you put it there you won’t be able to use it with the next thing, but it’s there if you already own it. And so this has been around, like I said, since we started the company; this is where it all started. The reason I mention that is because the second use case has really only come into its own in the last few years.

And the statement here again is: “My warehouse does not contain all of the data required for the reports we need to build.” I’ll bet everybody can say that. You get requests every day to put more stuff in the warehouse. And the idea here is, you heard this from Pfizer, that if you do that you’re in a constant state of churn. And a lot of what you’re churning you’re going to end up throwing away or regretting that you put in the warehouse and so if you can somehow use data virtualization to augment that and to give you some agility around making your data warehouse and data virtualization complement each other and get some synergy, then you’re going to go a long way. But this pattern didn’t emerge right away because a lot of people felt like if we did this we were really compromising the data warehouse. That’s just softened over the years, it’s not true.
The data warehouse is a very important piece of the architecture in most enterprises and some enterprises have many of them. It will always be an important piece of that architecture. Using data virtualization in conjunction with it is just a way to make it more flexible. Doesn’t make it go away, doesn’t invalidate all of the efforts to make the data warehouse something that’s important to your enterprise. Again, there’s a bunch of sub-patterns here. We heard about MDM today, Hub and Spoke, there’s some migration options here when you’ve got mergers and acquisitions going on; there are several different ways to leverage data virtualization with your warehouses. Obviously there’s some overlap here with data federation. Sometimes you look at data federation and data warehouse extension in a similar light, but usually it comes down to what problem are you trying to solve? Am I looking at it from a dashboard report and I’ve got to get some data from multiple places? Or am I looking at it from my queue of projects that I’ve got to get done for the data warehouse?

The third pattern, and this one has really emerged and come into its own over the last couple of years is about that abstraction layer; about creating some kind of semantic abstraction and sharing that data in multiple ways across a geographic location, across a department, across multiple consumers. And the statement here is: “I need to share the same data across multiple applications.” And the key here is it needs to be the same. We’ve heard about the fact that not everybody thinks a trade is the same thing or not everybody thinks that definition of an oil well is the same thing. So now you end up in the Monday morning meeting with reports that don’t agree. And so the idea behind enterprise data sharing is that you do create these business canonicals, these business definitions of what your data should be to everybody and they all share them.
In enterprise data sharing you have more of an outward view than a view towards the data sources. Of course, federation’s going to be involved. Of course, transformation’s going to be involved, query optimization and all of the great technologies that are built in here, but the real idea is I can create an abstraction of my business data I can then publish it in multiple ways for multiple consumers so that when you do that visualization of your business this person is getting the same data about what your sales were yesterday as this person. Now some customers jump right in here and they decide that they’re going to build a shared service abstraction across their business.
And that’s actually a great place to start but we always advise them to start with a few things that add value immediately. Don’t try to boil the ocean. And whenever customers do that, they get real value out of this very quickly and they can grow this very organically with a lot of reuse and a lot of value that really ends up paying for it pretty quickly. So these three patterns – data federation usually focused on a single use case or a single report or a single application, data warehouse extension to try to get more value out of your warehouse and to try to reduce the churn that you’re all experiencing, and enterprise data sharing; these are all mainstream today. We have many customers using all of these to get real work done and provide real value.
Audience Member Question
David also, and I remember Michael bringing it up earlier, but just rapid prototyping just to see if some idea somebody has is actually going to work where the more people used it ETL brings around to try to create models to try to produce something and find out yeah that’s not what we wanted and wasted weeks.
David Besemer
Yeah rapid prototype, one of the sub-cases under the warehouse extension instead of building yet another data mart is build a virtual data mart and do some prototyping around what’s the ideal mart supposed to be. And sometimes when you hand something to somebody that they’ve been asking for they immediately recognize why it wasn’t what they wanted. And so you immediately go back and you rev it, that’s great if you’ve done it virtually. If you’ve actually stood up an infrastructure and built the schema and ETL’d the data somebody’s going to be really mad.
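The "rev it virtually" point can be sketched with an ordinary SQL view standing in for a virtual data mart. This is only an analogy under stated assumptions: the `sales` table, the `sales_mart` view, and the `define_mart` helper are hypothetical, and SQLite is used purely because it ships with Python. A real data virtualization layer would define the view over remote sources.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Stand-in for existing source data; nothing is copied out of it below.
db.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
db.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("EMEA", "widget", 100.0), ("EMEA", "gadget", 50.0), ("APAC", "widget", 70.0)],
)


def define_mart(select_sql):
    """(Re)define the virtual mart; only the definition changes, no data moves."""
    db.execute("DROP VIEW IF EXISTS sales_mart")
    db.execute(f"CREATE VIEW sales_mart AS {select_sql}")


# First prototype: totals by region.
define_mart("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
first_try = db.execute("SELECT * FROM sales_mart ORDER BY region").fetchall()

# The requester looks at it and realizes they wanted totals by product.
# Revising the mart is a one-line redefinition, with no schema build or reload.
define_mart("SELECT product, SUM(amount) AS total FROM sales GROUP BY product")
second_try = db.execute("SELECT * FROM sales_mart ORDER BY product").fetchall()
```

Had `sales_mart` been a physical table loaded by an ETL job, the second iteration would have meant a new schema and a reload rather than a changed definition.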

So I want to spend the rest of the time though talking about the two patterns that are leading edge and in some cases futures; and again, a bunch of sub-cases under here. So real time enterprise, and by the way we’re still kind of playing with the name of this because real time is one of the reasons that it, one of the motivations for it, but you’ll see when I talk a little bit more about it, it’s not just about real time. And here the idea is that you need to deliver data to wherever it’s needed whenever it’s needed to support a global business. We are working with one of our global investment banks to create this for them. In the process we are doing a couple of things.
One is that we’re creating technology in our R&D lab that we hope to productize, and when I say hope it’s because we’re still figuring out exactly how it all fits together. And secondly, you heard Ted Freeman talk about diversifying how you’re using different data technologies. Things like change data capture and messaging and ETL and virtualization. This particular use case uses all of those in conjunction with one another and creates synergies by doing that.
So the idea here is that no matter where you are in this global enterprise you have the same view of data, an abstraction of the enterprise’s data, and you can query it and you get it when you need it. How it gets there, where it’s stored doesn’t matter. There’s really only one use case under here because this is the uber example of global data virtualization. As I’ve said, we’ve always had this in mind for our technology and we’re working with one of our large banks to actually implement it, it’s going really well and we’re really excited about it. And in doing so, we’re not only adding things to the product to make this more possible, but we’re also using it in conjunction with other technologies.

So here’s kind of a picture of what this looks like. Where you’ve got data stores in the middle and always on the bottom you’ve got the operational aspects of archive and backup.
But up at the top you’ve got a box called virtualization and distribution that is really a global single definition of all data in the enterprise. And then on top you’ve got applications that need the data, you’ve got reporting and you’ve got analytics. Notice by the way that both applications and analytics have a little carve out in the side that goes straight down to the data sources. We can talk a little bit about this offline, but there will always be reasons that you connect analytics straight to a warehouse. You don’t always have to send those through data virtualization. There will always be a tight coupling between a transactional application that creates and maintains data and its data source. So you don’t want to put data virtualization in there either.
But for everything else, it will all go through this single source. So there is a data virtualization layer that is a source for all data other than those carve outs I talked about. Each data item in the enterprise has a single owner and therefore governance comes along with this whole architecture. All access is through these well defined approved data sources. Everybody gets the same definition. And those definitions are distributed in London, Tokyo, New York, wherever makes sense and everybody has the same definitions. So you have this distributed layer, you have location transparency, it’s really exciting for us to see this in action. You’ve got edge caching going on, we’ve got change data capture in there, and we’ve got messaging so that we can actually distribute changes and notifications of changes and it’s pretty cool stuff. So this is a pattern that we say it’s future because we only have one large customer who’s working on it with us and there’s still some things that we’re building to make it all real, but it’s coming together and it looks really cool. Pretty exciting.
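To make the combination of edge caching, change data capture, and messaging a little more concrete, here is a small in-process Python sketch. It is an assumption-laden toy: `SourceOfRecord`, `EdgeCache`, and the callback-based "messaging" are hypothetical names invented for illustration, not the architecture being built with the bank. It shows one possible way a change notification can keep distributed caches consistent with a single owner of each data item.

```python
class SourceOfRecord:
    """Single owner of each data item; publishes a change event on every write."""

    def __init__(self):
        self._data = {}
        self._subscribers = []
        self.reads = 0  # counts trips back to the source, for illustration

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def write(self, key, value):
        self._data[key] = value
        # Stand-in for change data capture plus messaging: notify every site.
        for notify in self._subscribers:
            notify(key)

    def read(self, key):
        self.reads += 1
        return self._data[key]


class EdgeCache:
    """Per-site cache that drops an entry when told the source has changed."""

    def __init__(self, site, source):
        self.site = site
        self._source = source
        self._cache = {}
        source.subscribe(self._on_change)

    def _on_change(self, key):
        self._cache.pop(key, None)

    def get(self, key):
        if key not in self._cache:
            self._cache[key] = self._source.read(key)
        return self._cache[key]


source = SourceOfRecord()
london = EdgeCache("London", source)
tokyo = EdgeCache("Tokyo", source)

source.write("EURUSD", 1.10)
london_first = london.get("EURUSD")    # fetched from the source
london_second = london.get("EURUSD")   # served from the London cache
reads_before_change = source.reads

source.write("EURUSD", 1.12)           # change event invalidates both caches
london_after = london.get("EURUSD")
tokyo_after = tokyo.get("EURUSD")
```

The design choice illustrated here is invalidate-on-change rather than time-based expiry: each site keeps serving local reads until the owner of the data says it has changed.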

Ok, let’s talk a little bit about the cloud. How many people are doing cloud computing today? Ok a couple. What’s the key here? Well I need to integrate data that’s in the cloud with data that I still have in the enterprise. That’s always going to be the case. Yeah someday you might put your financials out in the cloud, but it’s not going to happen anytime soon. But you’d be happy to put some click stream data out there to maybe do some processing on it.
And so here it’s a pretty simple model, you have to extend data virtualization beyond the firewall and into these cloud data sources. And there’s actually about four main use cases. One of them is very real today and we have lots of customers who are doing it, and that is software as a service. Applications like Salesforce.com, and NetSuite is another one, we have a customer using that. And that integration is real today and so that part of the cloud of course is real. The other ones are a little bit more emerging and futures based.

So this is the SaaS example. I put Salesforce.com in the cloud there. Somebody said what are the top five SaaS applications? Well Salesforce is probably the top four and then everything drops off from there. But the idea is that you want to save money and time and you want to simplify your infrastructure so you put some Salesforce.com technology out there. We have an adapter for Salesforce.com today so that you can do this integration today.

The second one though is you may have some custom applications and you may decide to host those in a cloud, probably a private cloud today, but maybe eventually a public cloud. And the idea here is that you’ve created an application that you want to put on an elastic infrastructure so you can scale it at will. And you’ve got kind of two examples here, one is that that application is probably going to need some data from your enterprise. So we’re going to have to do some tunneling technology here and get through the firewall and that sort of thing.
And then vice versa, if that application is resident in a cloud, if it’s a public cloud it’s going to be outside your firewall, if it’s a private cloud it’s inside your firewall, but there’s going to be some data in there that you need to pull back and integrate with the data that’s still in your enterprise. So you’ve got kind of two directions here where you need to integrate data across the enterprise. Again, what are you doing? You’re trying to save money by outsourcing infrastructure and you’re trying to increase your scalability and elasticity by putting this in a cloud based infrastructure.

Ok, now you’ve got an emerging area of cloud hosted databases. Again, probably private cloud for quite a while, but eventually this could also be public cloud where you would buy data storage by the slice. Vertica and Greenplum are the two here, but there’s lots in this area and you’re going to hear from EMC and Microsoft and others on this as they start to offer cloud based data storage solutions. So again, you’re trying to save money, all of this is about simplifying your IT infrastructure, but more importantly, you’re trying to scale down data storage not by buying another Oracle license. Or not by putting in another (inaudible) box, but by saying you know what, I need three more terabytes. And actually I’m doing a big campaign so I need a few terabytes just for a few months. And so you can have some elasticity around data storage and you’re buying it by the slice instead of buying it by the infrastructure and the licensing model. Again, here the primary model will be that once you have data resident in the cloud, you’re going to want to be able to integrate that data with data that you don’t have in the cloud.

Ok last one that I just want to talk about briefly is the NoSQL data source models. This is kind of the next step beyond the columnar and MPP data sources. And you’ve got Hadoop/HBase, maybe you’ve got MongoDB with some XML documents in there. Again, lot of hype around this right now, it’s coming. The reason that they exist is because they solve certain problems. If you have those problems you may want to use them. If you don’t, why bother. But the point is if you have to use them then you will end up with data documents stored in these data sources, potentially in a private cloud, potentially in a public cloud and you’re going to want to pull that data and integrate it with data that you already have in your enterprise.
Now, Hadoop and HBase and MapReduce, we just heard from Prashant on this, they work at a different scale. Alright they work at a different speed. A lot of these are working on extremely large data sets, the reason MapReduce exists is so you can take advantage of that parallelism and reduce that data almost in a batch mode. That’s fine. Composite can then take the result of that and cache it and integrate it with other sources. So it’s a very natural fit for data virtualization just as Prashant was describing that they’re doing with their (inaudible). It just happens to be a very different model for processing the data.
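The flow described here (reduce a large data set in batch, cache the result, then integrate it with other sources) can be sketched in plain Python. This is a single-process toy and not Hadoop: the click data, the `PAGE_OWNERS` reference table, and every function name are hypothetical, and `functools.lru_cache` merely stands in for whatever caching the virtualization layer would provide.

```python
from collections import defaultdict
from functools import lru_cache

# Hypothetical click stream living in the batch-oriented store.
CLICKS = ["/home", "/pricing", "/home", "/docs", "/home", "/pricing"]

# Hypothetical reference data that stays in the enterprise.
PAGE_OWNERS = {"/home": "marketing", "/pricing": "sales", "/docs": "support"}


def map_phase(records):
    """Map: emit a (page, 1) pair for every click record."""
    for page in records:
        yield page, 1


def reduce_phase(pairs):
    """Shuffle pairs by key, then reduce each group by summing its values."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}


@lru_cache(maxsize=1)
def cached_click_counts():
    """Run the batch job once and keep its (small) result for later queries."""
    counts = reduce_phase(map_phase(CLICKS))
    return tuple(sorted(counts.items()))


def clicks_by_owner():
    """Integrate the cached batch result with enterprise reference data."""
    return [
        {"page": page, "owner": PAGE_OWNERS[page], "clicks": clicks}
        for page, clicks in cached_click_counts()
    ]
```

Calling `clicks_by_owner()` repeatedly re-runs only the cheap join; the expensive batch step behind `cached_click_counts()` executes once, which is the point of caching a batch result before federating it.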

So these are our five patterns. As I said, everything that you’ve heard today and most of the problems in your enterprise that you might think of either replicating data or creating derivative marts in some way or somehow doing some coding to integrate data will all fall into one of these patterns and could be considered a candidate for using data virtualization. The progression that we see here is always that you start somewhere you get comfortable with the technology, you start to gain confidence that the stuff actually works and that these people from Composite are actually pretty good people and they are pragmatic about helping you solve your problems and being successful.
And then you say well I could actually trust this stuff and use it in a wider set of use cases and so you go on and eventually you get to the point where some of the customers you’ve heard talk are talking about these fabrics across their enterprise for doing data virtualization for all of their use cases. It all boils down to characteristics that you just can’t do by replicating data physically. And it’s that agility that it provides that you’re going to need as you have these hundreds of data sources and silos all over your enterprise.
So I personally want to thank everybody for coming to our data virtualization days. I’m sure we’ll have some wrap up comments. I like to call this Composite world 1 and in a few years we’ll have thousands of people and hundreds of vendors displaying their wares because data virtualization has taken off all over the world. But in the meantime, we’re going to treat each use case one at a time and make sure that they’re successful and actually provide value.