Prashant Sarode, VP at Wells Fargo Wholesale Bank and Head of Solutions Design & Development / Wholesale Managed Services

Prashant Sarode
So good afternoon, ladies and gentlemen. I’m going to cover Data Virtualization on Steroids. But before I go deep, a word of caution: I tend to speak very fast, and I’m not on steroids. That can be attributed to my Indian heritage.
So, a quick introduction from a group perspective. I run the Solution Design & Development Group within Wells Fargo Wholesale Bank, and the group is a shared services organization. It owns infrastructure software. It covers a broad span; it is not just data virtualization, unlike some of the conversations we heard this morning, which were specifically on data. We have shared services around grid computing, we have shared services around internet-facing delivery channels, and so on and so forth. The group’s charter is beyond wholesale. We have some 70 lines of business within wholesale. In the new Wachovia it’s most a bank so that’s the overview in general.

This presentation is not a data virtualization overview. You guys have heard enough about it. Neither is this presentation about MapReduce. I tend to use the word MR pretty liberally, but this is not an introduction to MapReduce either. I’m assuming that our friends in the media, fixated on Google, have already talked enough about how to do search indexing using MapReduce, and you guys are already excited about it. This presentation is specifically about how we became a data virtualization customer, with much the same experience you guys have been hearing about this morning and yesterday, and how we merged in the MapReduce concepts. And what are the business and technology implications thereof? What does it mean, really, to the ETL space? What does it mean to analytics in particular, which is one of the hot areas within the merged banks? Time permitting, we’ll go into other detailed use cases of this.



So, quickly, from a data virtualization perspective, our typical footprint. We started in 2006. We have a wide range of use cases, from client master reference data to CRM data, to operational data, to risk data, going against Sybase, going against web services, and exposing it in the form of web services. And the play for Composite, for data virtualization within Wells Fargo, actually came from Wachovia Corporate Investment Banking, and we have had that footprint since 2006, as I said.
We have predominantly played with two flavors of data virtualization to, I would say, modernize your applications in the form of data services being exposed: having data services as a layer exposed in the form of web services as a first-class citizen, as well as the conventional way of exposing using CABT sources. That’s where we started from. It used to be a Wachovia [CABT] thing deployed predominantly in the East, but now we are trying to socialize virtualization in the larger Wells West, and trying to orient people who are predominantly ETL oriented toward putting it into their road map: before you commit to the ETL project, consider the virtualization dimension of your ETL initiative. So that’s where we are from the Wells Fargo perspective.

Why does this whole integration of data virtualization and MapReduce matter? We have heard enough from several folks that Composite, or the virtualization philosophy, is great at breadth: breadth when it comes to data sources, to steal your terminology. But it’s not great at depth. It rather delegates the depth to some other high-performance technology. So if you say, “Hey, I’m not willing to wait for four hours because the stored procedure is taking four hours,” it becomes a batch-oriented, ETL-oriented problem. That’s where ETL is great: why should virtualization wait for a long time? So that was its specialty. If you look on the left hand side there’s little warning going on, getting bigger and bigger. The culmination, or tipping point, is what you see here, and what we have done is take MapReduce and look to move that tipping point upwards, toward more real time and toward larger data sets. So that’s the opportunity we are trying to explore and make second nature across wholesale and eventually the wider enterprise. The idea of DVMR is data virtualization, which is DV, and MapReduce, which is MR. When you combine them, the output is bigger than the sum of the parts. So that is our terminology.

The MR aspect of the DVMR presentation that we are going to talk about today is fundamentally built on QGrid. QGrid is our internal grid, which we built so people can run their grid workloads on a shared infrastructure. We can do MapReduce on top of QGrid; however, the MR could equally be your Hadoop, or something like QGrid built using GridGain or anything else that you could think of. You could use simply [inaudible 0:06:17] if you want to.

But QGrid is our internal infra framework on the shared services infrastructure. Briefly, the Q in QGrid stands for Query. The whole idea was: how do I create a specialized MPP kind of multi-core layer which is neither [inaudible 0:06:40] nor layer, and offer it to people so we could do some high-end processing? When we were designing QGrid about a year ago, I’ll be honest, I didn’t know what MapReduce was. The whole philosophy was that we came from the grid side of the investment banking world and we were very used to grids. So we were trying to build a computational grid which could be co-located with the data grid. We didn’t have that solution.
So if at all you are trying to create that, I think there is another company in the market called GigaSpaces. They probably have something similar, where they have a computational grid co-located within [inaudible 07:28]. And then the whole objective was we wanted it to be elastic. And again, if you just have a grid but you don’t have data, then what do you do? Your grid is fast; your bottleneck is pushed down to the data. So that’s the problem we were trying to solve. And somebody came and said, “Hey, looking at that, you could do MapReduce on it.” And I said, “What is MapReduce? I don’t know about it.” And we started looking at Hadoop. We evaluated Hadoop, and Hadoop is great.

But fundamentally, in Hadoop we had to move the data, ETL it into a Hadoop file system, and that’s an ETL step, and that’s what we wanted to avoid in QGrid.
So architecturally it’s not a very complicated idea as such. Towards the bottom we have an elastic memory cloud service, which we already had. Again, the concept is shared commodity hardware. On top of it we had this memory cloud established. So for us it was fairly simple to build a notion of a Q Broker, a notion of Q Nodes, and a notion of P Nodes. All the Q Nodes and P Nodes are actually grid applications, and they talk to each other in a custom protocol that we designed.
And on this side we are talking about data-intensive applications. I hate to get into what is BI, what is reporting, what is analytics, so I kind of said, I don’t care. It could be a trading application; it could be a high-frequency trading application, HFT, trying to do backtesting; it could be a risk application; it could be a fraud forensics application: any application which needs, either temporarily or for a reasonably long period of time, a shared infrastructure of the kind you could typically get on a public cloud like Amazon, except that because of your corporate secrecy policy you can’t take your data there and process it. So then we have QGrid do that.
So that was the original intent. As you see, we are going towards a model of no ETL. What we are saying is, instead of bothering [inaudible 0:09:34] or the operational store with a complicated query spanning 10 million rows, what if I make the query less complicated and fire it against smaller row sets? Take the 10 million rows and fire queries of roughly 1 million rows each, which might be more efficient. Then you bother the operational store only with that small query, which it should be able to take; the result goes into the memory cloud and is then processed using the P Nodes.
So that’s the final item, and then we could have any number of Q Nodes based on how elastic you want to go and how big your data set is. Similarly, you could have different Q Node types, say for different data, and you could have different P Node types doing different sorts of things. So that’s the fundamental idea.
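As an illustration of the query-splitting idea described above (many small range queries instead of one 10-million-row query), here is a minimal Python sketch. It is not QGrid code: the table name `audit_rows`, the key column `row_id`, and both helper functions are hypothetical, and it assumes the source table has a dense integer key that can be range-partitioned.

```python
def split_key_range(min_key, max_key, chunk_rows):
    """Split an inclusive integer key range into inclusive (low, high) chunks."""
    if chunk_rows <= 0:
        raise ValueError("chunk_rows must be positive")
    ranges = []
    low = min_key
    while low <= max_key:
        high = min(low + chunk_rows - 1, max_key)
        ranges.append((low, high))
        low = high + 1
    return ranges


def build_chunk_queries(table, key_column, ranges):
    """Build one small range query per chunk (names are illustrative only)."""
    return [
        f"SELECT * FROM {table} WHERE {key_column} BETWEEN {low} AND {high}"
        for low, high in ranges
    ]


# 10 million keys in chunks of 1 million rows gives 10 small queries.
ranges = split_key_range(1, 10_000_000, 1_000_000)
queries = build_chunk_queries("audit_rows", "row_id", ranges)
```

Each generated query could then be handed to a separate fetcher (the role the talk assigns to a Q Node), so the operational store only ever sees small range scans.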

Again, a shared services organization is going to see political battles: my infrastructure, your infrastructure. The model is that we support both. So we can have four quad-core azuls, four quad-core infrastructure, and you could have a URL here: QGrid.wellsfargo.com/treasury.
Those guys are doing something interesting over there. They said, “Hey, I need more power and I don’t have that many machines to support it.” So in their QGrid they could add incrementally only what they need, and we can guarantee that their boxes will be used only for their processing; nobody else can go there. And that’s the provisioning part of QGrid.

The value proposition: there are multiple use cases. When you hear about MapReduce, you are talking about 20 terabytes of data processing. We have several use cases around that, but we are also talking about high-frequency trading application metrics; we are talking about risk engines and risk analytics; fraud forensics is another example. Or just trying to exit it and make the back office, right? I got an email just this morning: our FX team is trying to solve a problem in their end-of-day processing. They are valuing an FX options portfolio some 300 times. And [inaudible 0:11:55] just for that end-of-day processing, to make it fast. So should they go in and make a case for 10 cyber-base solution, or should they ask to spend more on solutions for [inaudible 0:12:08?] That’s the model that we are taking them to.

From a technical perspective, many people get the idea of MR thanks to the media, but then it’s not that easy. I’m not trying to shy away from being modest, but it’s like this: people have gotten used to the idea that, oh, you do multithreaded programming, but then they ask, why do you care about multithreading? Just write page [inaudible 0:12:41] and specify the number of instances and you’re done. Now getting those folks to unlearn that multithreaded programming and say [inaudible 0:12:51] If I have four Nodes, each Node having two sockets, each socket having four cores in it, how do I write code if I’m [inaudible 0:13:02], how do I write code? [inaudible 0:13:07] hardware? So that was missing. So QGrid has that value proposition, that variant people can reduce [inaudible 0:13:16] and simply implement certain [inaudible 0:13:21] and all the sharing infrastructure at the time and the size they want.

Why did we decide to integrate Composite on this trans-serve? Honest answer? QGrid didn’t have a SQL interface. That was the biggest problem I saw. I didn’t have money to go and buy something like an Aster Data, which we are eventually now experimenting with. The greatest value of an MPP [inaudible 0:13:50] all our databases are becoming MapReduce-enabled. So you could say, “Hey, my model of the data is SQL, it’s partitioned across the commodity hardware, and now I can do a select star and then have a table function, and the table function is operating on 10 million [inaudible 0:14:11] onto maybe 10-20 Nodes using some kind of sharding logic.”
So I didn’t have that; QGrid was not [inaudible 0:14:21]. People didn’t have big enough problems to go and [inaudible 0:14:26] MapReduce or make a case for Hadoop. We had QGrid, but some of the folks weren’t using it because it didn’t have a SQL [inaudible 0:14:35] interface; we had kind of a streaming interface. So we said, okay, if I extend.
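The idea of a table function operating on rows partitioned across 10 to 20 nodes with "some kind of sharding logic" can be sketched roughly as follows. This is a generic hash-sharding illustration in Python, not the API of any MPP database or of QGrid; `shard_for`, `partition`, and `apply_table_function` are hypothetical helper names, and the "nodes" here are just in-memory lists.

```python
import zlib


def shard_for(key, num_nodes):
    """Map a key to a node index with a stable hash (CRC32 of its string form)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def partition(rows, key_fn, num_nodes):
    """Distribute rows into one list per node according to the shard key."""
    shards = [[] for _ in range(num_nodes)]
    for row in rows:
        shards[shard_for(key_fn(row), num_nodes)].append(row)
    return shards


def apply_table_function(shards, table_fn):
    """Run the same function independently on each shard, one result per node."""
    return [table_fn(shard) for shard in shards]


# Illustrative data: 1,000 rows spread over 50 accounts, sharded onto 10 nodes.
rows = [{"account": f"A{i % 50}", "amount": i} for i in range(1000)]
shards = partition(rows, lambda r: r["account"], 10)
partial_sums = apply_table_function(
    shards, lambda shard: sum(r["amount"] for r in shard)
)
total = sum(partial_sums)
```

Because the shard is derived from the key alone, all rows for one account land on the same node, which is what lets a per-node function compute per-account results without cross-node traffic.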

So finally what we did was, we said, now data virtualization is a grid. I can quickly work things into a view or a VC, or I can take the same view, [inaudible 0:14:53] gateway programming. At the same time I can have the same view publishers already to see. I’m kind of addressing a wide variety of the community if I just address that. So make QGrid SQL compliant with [inaudible 0:15:08] connected, and the way we connected it was we took the APIs of Composite and implemented a custom data source. The custom data source is simply querying against the QGrid.

So on our Composite infrastructure, when somebody asks the question, “What do you do when you have a very complicated transformation? Do we do it on Composite [inaudible 0:15:28]?” No, we don’t do it. When we encounter situations where we have humongous data and very complicated for us, I think, where you go beyond your SQL capabilities and your programmer description, we said, “Hey, let’s not bother the Composite layer and infrastructure; let’s create a layer which is neither data base.” Otherwise, what do you do? You write a smart stored procedure. Or you write a smart stored procedure and scale your infrastructure. And depending on [inaudible 0:15:59] your scale-out ratios are going to be a $150,000 machine and 5000, depending upon what infrastructure you have.
So that’s the model you’re wearing. What we said is, let’s create a layer which is neither app server nor Composite infrastructure nor database layer, but a commodity layer. And a group like ours, which has infrastructure, has [inaudible 0:16:23] infrastructure, can support those things for ETL kinds of duties, for some questions which you need to ask but cannot ask of Composite because of the data size, because of the complexity, and because you need to process that much data in much faster time.
So now the validity of Composite in terms of being wide is still there, but the depth which was missing is brought to you by this particular MR infrastructure layer. So what was a 20-minute fetch and process is now a 19-second fetch and process. So you’re getting the ultimate end result. Would you care to mix it with another data source in Composite? So that’s the whole idea over here.
And we are planning this for analytics, and analytics is the hottest area within the bank, in several dimensions. From a size perspective, Wachovia before the merger was a 150-120,000 people company. Wells Fargo was a middle size, another 250,000 people company. And there are so many systems, so many data sources. It’s just a lot of unfortunately to cross a lot of unfortunately to figure out where [inaudible 0:17:47] is happening. And so those are two grid which are very analytics focused.
So we said, “Hey, what are you guys doing, really?” People are trying to do data discovery and profiling. Now again, the word discovery has nothing to do with Composite discovery, and the word profiling has nothing to do with implementing a profile into it. These are simply the general steps. What happens in step 1? You find the data sources, find the variation in the data, and then you go into negotiation with the data owner. And no matter who says what, I will enforce the packages. The person or the group who has the data is the king. And they can create enough bottlenecks for you.
So: negotiation with the data owner, cleaning and reformatting, bringing the data into Excel or Microsoft Access, and then repeating the process for [inaudible 0:18:37]. That’s how the analyst’s life cycle starts, and if you really look at it, step 1 is of lowest value, but it’s the most difficult step. Then you work on step 2. Well, I have figured out the data looks good; I’ve figured out, oh, zip codes are not in the correct format; and in Excel I’m going to do some kind of modeling. But, again, in Excel the limit is around 65,000 rows. I think the new version allows probably more than a million or something. But they’re trying to do it on their local desktop. So we said, okay, how can we help you do what you do in Excel inefficiently, when you actually chunk your 1 million records into 65,000-row pieces and run some [inaudible 0:19:22] extract out of it? How do we help you with that?

And then finally they are going to publish the model. So we said, okay, your level of difficulty, as I said, is highest in step 1, somewhat high in step 2, and then step 3 is pretty well known at this point. So we are proposing and socializing data virtualization. We can help you in steps 1 and 2 using Composite, normal Composite, not the later discovery. We can accelerate the discovery process by very quickly building the data views and then publishing them for you to access in Excel, [inaudible 0:20:00] or what have you.
And for all the data merging aspects, again, we can bring in DVMR. Data view creation has a nice GUI and it’s a two to three day effort for people in my group who know QGrid and the infrastructure. If they explain what it is that they want, it’s a three to four day effort for us to give the analysts what they want as a SQL-available function. So now we are connecting data virtualization and MR and giving them DVMR to solve their step 2. Step 3 is published through your run time, so if you’re more [inaudible 0:20:45] we have QGrid over there too.
So, for business implications: why are you even trying to do this? What are the key raw materials for your dollar-worthy analytics? We said, okay, you really need smart quants. You can’t write a smart model unless you have smart quants. Then you need good data from diverse sources. That’s the second important thing. Then you need a good sample size. Sometimes you may have a model which looks very promising, but the moment you run that model against some other data, the model starts to fall apart. Our housing market is a good example.
Cost-effective technology infrastructure: at the end of the day you may have a question whose answer is unknown to you. What is in back of the answer? You may have $25,000 of value in that answer, in the form of revenue generation or cost savings. Or you may have a $5 million cost-saving or revenue question. You just don’t know what the answer is. So we say, okay, fine, how does DVMR help? Well, we make the life of LOB quants easier. Allow them to focus on just looking at the data; just be expert at your data engineering skill. Hire the quants, hire the guys who understand key analysis, hire the guys who can do cluster analysis and all sorts of stuff.
You heard already about the IT part. You said, okay, how did you [inaudible 0:22:14] data sources, faster time to market, right? We can accelerate the discovery process and then we can [inaudible 0:22:21] by giving a virtualization option and saying, hey, the ETL amount is probably cut down. So getting good data from diverse sources is faster. And then, at the end of the day, you are talking about sample size. We’re providing them with the technology to [inaudible 0:22:39] process terabytes or petabytes of data; or sometimes the data is not terabytes or petabytes, it simply takes longer to process. So we are giving them the ability to process the data much faster. So they are able to deal with data where the time series is not in minutes but probably in microseconds.
And finally we are talking about a cost-effective [inaudible 0:23:02] shared services, so the data virtualization service that we have established within the group is also a shared service. The dollar billing is done by somebody else, and that’s something I’m still trying to reconcile, what it’s all about, but the whole idea is that the virtualization aspect as well as the [inaudible 0:23:24] aspects enable you to share cloud infrastructure, private cloud infrastructure, right? So (A) you are incrementing in increments of $1,000, $10,000 or $20,000 in terms of infrastructure costs, and (B) the people who are using it may not necessarily have to own it. So that’s the value prop for the DVMR part of it. How am I doing on time? Am I speaking too fast? Anybody asleep?


So here is a use case, just as an example. This is a use case where I keep telling people that MR, or MapReduce, is not just about big data. Sometimes it could be about making big time smaller time. So here is an example. We have an enterprise audit database which is sitting at the portal, at the front of the [inaudible 0:24:25]. And it’s getting trades. It records from microfilm system, some [inaudible 0:24:32]. It has seven columns, but the column of most interest is the approps column. That column is, I think, a CLOB or a [inaudible 0:24:42]. And I saw it was defined to be 2 GB. So the data structure itself is stored in a column which is a binary kind of column, and its actual size is unknown to me. It’s been specified to be 2 GB.
Now what we were trying to do is take the trades which are coming in from a particular system and figure out how to organize those trades [inaudible 0:25:12], how to find some kind of statistics around buys and sells: what is the average buy value, the sell value, that kind of stuff. To do such a thing just in SQL, I don’t think you will be able to do it without writing some complicated SQL and then implementing some kind of temp table. And we get into this: how do you do a GROUP BY on a CLOB kind of thing? How do you do a LIKE very easily? This was a problem where we had 2.4 million rows.

We just really can’t do it in SQL; that’s more to do. Just to fetch that much data in one shot: 20 minutes. So what we did was apply QGrid to it. Instead of 20 minutes to source the data, we just put it on the QGrid and say, okay, we will break it into smaller transactions by size; each Q Node simply brings its data and starts putting it into the memory cloud, the P Nodes start processing, and you finally get the streaming results at the back.
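The fetch-in-chunks, process-in-parallel, combine-at-the-end flow described above can be sketched in a few lines of Python. This is only an analogy under stated assumptions: `fetch_chunk` stands in for a Q Node, `map_chunk` for a P Node, an in-memory list stands in for the audit database, and a thread pool stands in for the grid. None of these names come from QGrid, and the (side, value) trade records are invented for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def fetch_chunk(source, low, high):
    """Stand-in for a Q Node: fetch one slice of rows from the source."""
    return source[low:high]


def map_chunk(rows):
    """Stand-in for a P Node: partial [sum, count] per trade side."""
    partial = defaultdict(lambda: [0.0, 0])
    for side, value in rows:
        partial[side][0] += value
        partial[side][1] += 1
    return dict(partial)


def reduce_partials(partials):
    """Combine the partial sums and counts into one average per side."""
    totals = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for side, (value_sum, count) in partial.items():
            totals[side][0] += value_sum
            totals[side][1] += count
    return {side: value_sum / count for side, (value_sum, count) in totals.items()}


def average_by_side(source, chunk_rows, workers=4):
    """Split the source into chunks, map them in parallel, then reduce."""
    bounds = [(low, min(low + chunk_rows, len(source)))
              for low in range(0, len(source), chunk_rows)]

    def fetch_and_map(bound):
        return map_chunk(fetch_chunk(source, *bound))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(fetch_and_map, bounds))
    return reduce_partials(partials)


# Illustrative data only: 3,000 synthetic (side, value) trade records.
trades = [("buy", 10.0), ("sell", 20.0), ("buy", 30.0)] * 1000
averages = average_by_side(trades, chunk_rows=250)
```

Carrying a sum and a count per chunk, rather than a per-chunk average, is what makes the final reduce step correct when chunks contain different numbers of buys and sells.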

So for that part, now we have the results over here, right? I mean, it’s an interesting initiative when the size of the rows are less. But as you go towards the high end and try to process them, it comes out each time at 19 seconds. Nineteen seconds is a worthwhile time to wait, depending upon the nature of your problem, to get data into Composite which is already processed, and now you can mix it with other sources or simply not mix it and just have it as a view.

So that’s the power that we have now, which we didn’t have earlier. Earlier, the moment we were asked for 2.4 million records on that particular source, we’d say: we won’t do it. Let ETL take care of it. Let ETL or something else process this and put it into a permanent view kind of thing.

So I’m personally seeing a trend where MapReduce, I hate to use the word “killing,” but MapReduce is gradually invading the ETL la-la land. And it’s making inroads in a couple of different ways. One of the ways which I see and hear is: why care about the data model when it takes too long to build a data model and you don’t even know if it is the final data model? So why don’t we just try to solve the problem quickly?
Now, to solve the problem quickly and without trying to [inaudible 0:27:56] and get the best possible answer, it may not be worthwhile. So you make the ETL project really, really important and say, hey, thou shalt make a good data model. Thou shalt follow data cleaning. It becomes a very long process. What we are saying here is, whatever it is, let’s try to ask a pointed question, let’s just load the data in a highly denormalized way, try to process the data, and not worry about the data model. Get your answer. If you get an answer which is worthy enough, great. If you don’t get the answer, at least you haven’t lost much money, which is what you couldn’t do with your conventional ETL style.
So I see, I don’t know how [inaudible 0:28:39] I wasn’t sitting in the presentation. They are going gradually towards MapReduce kind of [inaudible 0:28:47] They don’t use ETL MapReduce, but they are going towards that. So to me, MapReduce has to become a first-class citizen for virtualization. That’s where I’m going. There are several business benefits and there are several technology benefits, I think. And I think it’s reconciliation time for most people who spend a lot of money on ETL, to be challenged with: I can virtualize the data. So you can virtualize the data, but you cannot go deep. But then somebody says, hey, I can solve that depth problem by doing MR, and by the way, you don’t even care about a data model. Google was doing MR; now they do BigTable. They do big table; they don’t do big database. It’s one table, a denormalized table. So those are the trends that we are trying to stay ahead of. It does not apply to all the use cases, but to certain use cases. Thank you.

Moderator
So a question for Prashant?
Audience Member
Just a quick one on some of your MapReduce deployments in that workflow. Are those, are you persisting those very long or are you just creating, doing the analysis and then getting it into Composite to get it out to QGrid and delivered and not necessarily retaining all the data in the MapReduce store? Or are you mixing that? Sometimes you’re keeping it around; sometimes you’re not.
Prashant Sarode
I think you asked two questions, right? One is, if you take a combination of MapReduce and Hadoop, you have a first ETL [inaudible 0:30:41] system. We don’t do that, because we have this global caching infrastructure that brings all the data and puts it into memory. We can control how long it should live over there. That’s under our control, so it can stay over there. But when it’s in the cloud memory it loses the initial aspects of it and becomes a SQL and becomes purely a list of objects. Right?
When it comes to, once I expose the data, storing it into a cache: yeah, we could save it into cache. We tend to sell caching, the caching capability of Composite, in different ways. If you don’t want us to come and hit you again and again, then basically let us build up your cache and we’ll figure out when to hit you next time based on the life of the data. So we use caching for that purpose. Once you bring QGrid into the picture [inaudible 0:31:40] that data, then your option is to put that data, processed with another view, into a Composite cache, or you could fire it again and again. I don’t think it’s a big order either way. Depends on the use case.
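The caching policy described in this answer (serve from cache, and go back to the source only once the data has outlived its useful life) can be illustrated with a small time-to-live cache. This is a generic Python sketch and makes no claim about how Composite's caching is implemented or configured; `TTLCache`, its `loader` callback, and the 60-second lifetime are all assumptions made for the example.

```python
import time


class TTLCache:
    """Cache loader results per key and reload only after ttl_seconds."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader      # hypothetical callback that hits the source
        self._ttl = ttl_seconds    # assumed "life of the data"
        self._clock = clock        # injectable clock, handy for testing
        self._entries = {}         # key -> (value, time loaded)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]        # still fresh: do not hit the source
        value = self._loader(key)  # missing or expired: hit the source once
        self._entries[key] = (value, now)
        return value
```

A caller would wrap the expensive fetch, for example `cache = TTLCache(run_grid_query, ttl_seconds=60)` with a hypothetical `run_grid_query` function, and then call `cache.get(...)` as often as needed; the alternative mentioned in the answer, firing the query again each time, corresponds to a lifetime of zero.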
Moderator
Other questions?
Audience Member
How much time cost [inaudible 0:32:04]?
Prashant Sarode
It really depends on what sort of issue you are having. As I said, my team already had a grid pedigree from being in the investment banking grid application space. So we have the right software system, we own our infrastructure ourselves, and we know the grid very well. So then it becomes a matter of what the data is; it becomes a data engineering activity, per se, which you have to do anyway. Even if you’re writing a stored procedure, you’ve got to learn the data, you’ve got to learn the business context of where you are trying to move the data, and then figure out how to do it.
So that time, for those of us who can work your requirements into QGrid, is typically not in months; it is typically perhaps one, one and a half, two weeks to actually code the Q Node and deploy it. I’m not in a product business. I work for a bank. So I haven’t built an eclipse way for QGrid. But what we are selling QGrid as is a stepping stone for you if you want to do Hadoop. You turn to Hadoop and say, hey, I want to do this UMI function. My office is saying [inaudible 0:33:25] engineering. How the data looks like and how to put it into Hadoop and now you worry about [inaudible 0:33:37] into it.
So eventually, when the Hadoop ecosystem develops, I was talking to [inaudible 0:33:37] have a studio which is similar to Composite, so you give the order. They can drag and drop ETL logic and deploy it into Hadoop. So that will get to that level. So eventually I think building MapReduce functions is going to be a faster activity; or, not necessarily, it will be as slow as writing a stored procedure and deploying it into production. Did I answer your question?
Audience Member
Yes.
Prashant Sarode
So when you write a stored procedure, you’ve got to understand what you are trying to do. To me, writing a MapReduce function is equivalent to writing a stored procedure, except it’s outside of the database; you’re trying to do something different.
Audience Member
On this diagram here, ELS is your data source, right? That’s what you’re querying against? So is it physically three, in this case, copies of ELS? Is it a single copy that’s shared? I see that’s how it’s depicted, but how is it physically instantiated?
Prashant Sarode
Physically it’s one database. It’s shown [inaudible 0:35:01] but it’s one database. But we could go against any number of data sources. Each data source gets its own dedicated Q Node grid and a P Node grid. So now, say you have one data source and it has 20 million records. You can simply go into the grid virtualization aspects of QGrid and say, hey, I probably need 20 engines. Again, you can’t [inaudible 0:35:27] memory and CPU power to process that. It’s all memory based.
[END OF TAPE]