Emile Werr, VP Global Data Services at New York Stock Exchange

Emile Werr
Good afternoon everyone. I want to thank Composite before I start. So today you heard a lot about ROI, and a lot of the information that you received earlier is very valuable in terms of what ROI is and how to achieve ROI through data virtualization. So maybe I’ll take a little different angle and talk about performance-based ROI. You know, the Exchange is huge in terms of multiple markets. We support over 14 markets and we have different products. We also have a technologies company which sells solutions and shared services. We’re in the regulation business, compliance business, risk management business, and a lot of people probably don’t know this about the NY Stock Exchange. So we deal with a large volume of data.
Just to give you an example, on an average day we deal with anywhere from 1.5 to 2 terabytes of data; that’s our footprint. And on a peak day that equates to about 8 billion transactions per day. So today I’m going to talk to you not necessarily about the reference data integration story, because a lot of people have mentioned it; I’m going to talk about how we deal with transactional data and how we keep data in place, because obviously it’s not practical for us to move it around. We have many sophisticated ETL capabilities, but we do some clever things in terms of integration, so I’ll mention some of that. And that’s really part of ROI, it’s part of the value.

So this is just sort of a background in terms of the volume. Today I look at my BlackBerry and every day I get a market share report, and it’s interesting because this market share report now has to cover, like I said, NYSE Classic, NYSE Arca, NYSE Options, SmartPool, NYSE Liffe and so forth. That’s in total about 14 different markets and there’s a lot of transactional data, and executive management and the outside world are interested in what kind of market share the NY Stock Exchange, NYSE (inaudible), has, and getting those types of numbers is pretty difficult. So, even this BlackBerry notification that I get every day still has specific buckets for each market, and what we want to do is move forward. We want to basically be visionary and forward-thinking about how we integrate this information and how we get information to the executives and to our customers. So a lot of the stuff that we’re doing really is about dealing with massive amounts of data and basically coming up with standards and how we adapt. Because the Exchange’s vision is really to continue moving forward in terms of being kind of the forerunner in terms of buying exchanges and basically providing value-added services like market data feeds, for example Super-Feed. So there’s a lot of stuff that’s going on.

Some background: my background is really Head of Enterprise Data Architecture for NYSE, so I report right into the Chief Data Officer, just to give you sort of an organizational perspective. We take that very seriously at the Exchange; data management is a huge cost to the firm, so obviously we want to do a very good job. And on reference architecture, Ted mentioned that nice slide about shared data services, and we’re definitely about building global data services, shared data services, coming up with common infrastructure, provisioning computing infrastructure; all the things that you’re hearing now, we’re starting to do. And we’re looking at different areas within that, and we have different centers of excellence around each one of those areas, and then we have an architectural council that sits over that.
And we have a data management or sort of data governance council that eventually is going to sit on top of that. We don’t have the data governance yet, but we’re going towards that. So we know that’s all part of the practice. My background has been in a financial services space, that’s all I’ve done in terms of data management within many top tier companies. I founded a company called Expheria, which really focuses on high performance and that’s the thing I enjoy about this part of the job is that I combine solutions and innovative patterns to solve very complex transactional data problems.
Team responsibilities – I have a team that I’m running in NY and I’ve got folks also in Chicago and in UK. Basically we focus on all these various verticals within the IT domain.
Business intelligence is important to us. We’re very immature, I have to say. Our users like to basically see the physical model, then develop SQL and execute SQL through pass-through query tools; we’re trying to move away from that. When you have that much transactional data you cannot go down that path, because one bad query actually impacts everyone. So we’re now taking a step back and asking how we provide information capabilities and better integration capabilities to the end user, so we become more business intelligence aware.
Common data services – absolutely. We do a lot of work with our customers, our customers being either internal or external. We are now moving quickly towards building data services. A lot of the stuff that we do is very complex. We have a whole grid environment, as you see on the bottom, which is really where we process most of our transactional data. For example, one big thing that we do every day is market reconstruction. To do compliance and regulation you absolutely don’t want to do market reconstruction out of the database; you have to create all the data point integration right outside. You almost have to create your book and your depth right as part of your ETL process, and then just put it into your data warehouse and make that accessible through either basic SQL or BI tools.
Data architecture is key.
I mentioned something about one of the business use cases which is the market reconstruction, but data architecture is absolutely a must if you really want to use data virtualization, they go hand in hand. Data virtualization is about joining information, even if you use it just for a single instance, even within an instance whether it’s an Oracle or DB2 or whatever you use, you still have complex relationships that sometimes you have to resolve. And you still need to know that primary and foreign key relationships actually are valid. And a lot of thought process needs to go into that as you start working with sourcing the upstream systems.
Coming up with a common business nomenclature is important because our world is huge. We have many different acquisitions and mergers that have happened over time, everyone’s got their own style, their own naming conventions. We’re dealing with multinational companies in different languages, there’s different cultures so we need to come up with sort of even a virtual data dictionary, if you will, that one that caters to the French folks, one that caters to the UK and the US. So we’re actually moving from all different directions.
Data access integration – a lot of people talk about data architecture and they talk about ETL and they talk about all these great things that you need to do to manage data, but I think one big area that’s always missed is data access architecture and that’s where Composite is sort of kind of beginning to fulfill that particular data point or technology as part of our stack in our reference architecture.
We attempted, by the way, to do a lot of this data access ourselves. We have very innovative technology folks, smart people; obviously when you deal with that much volume you need to do heavy-duty programming. We tried to build some of this middle-tier data access middleware and it’s pretty complex. As a matter of fact, if you look at the industry, you see a lot of vendors out there that provide ETL capabilities and a lot of vendors that provide BI solutions. Those are very predictable because you have clear mappings; you pretty much have a finite number of predictable things or events or functions that you perform. In the world of data virtualization it’s very difficult to guess how your data aligns (we talked about data quality), or how much volume you’re dealing with, or the types of distribution that you have in your environment. So there are a lot of unpredictable things, which makes data access integration much more complex. And there’s also the organizational culture and politics that go along with that.
Data warehousing – we’re a big data warehousing shop, obviously. We use many different technologies and I’ll point some of them out later, in terms of what we evaluated, how we evaluated Composite and what we’re planning to do. And then ETL and ELT architecture design is really key. ELT is big, especially now with the whole MPP space. We can leverage the parallel capabilities of these technologies so we can reuse the database for many different purposes.

So I mentioned that a lot of our complexity really comes from mergers and acquisitions. In the last five years we grew so much. Our market share today, just in the US equity market, is about 28%. We were losing market share; we’re regaining market share because we’re making the right acquisitions in terms of the various types of products that we support. We support all products: equities, options, derivatives, swaps, bonds, convertibles; you name it, we’re involved in that type of business now. SmartPool, so dark pools, everything that you can imagine in terms of complex trading environments, we have. The other thing that makes it really difficult for us is the amount of data that we need to keep online, because we were self-regulated; obviously we need to make sure that we have a fair market, that’s the whole point of an exchange. So the market surveillance algorithms that have to run every day are extremely complex; they’re scanning everything.
They’re looking for patterns in what happened yesterday and comparing them to patterns that have been happening over three months. We’re now becoming more integrated with FINRA, as FINRA has taken over some of those responsibilities for regulation. The SEC, FINRA, NYSE and NASDAQ are really spearheading a lot of the things that are going on in the market regulation space, especially after the flash crash and all of the things that you’ve been hearing. So there have been some shake-ups in terms of technology and the stringent requirements that are needed to get this right and ultimately to protect the customers that are outside.
Multiple lines of business – this just gives you a flavor of some of the different areas that we support. And we try to do this, again, smartly, because for example on NYSE Classic, which is the Big Board basically, we produce about 2.5 billion transactions per day. We don’t want to be moving that data around; we want to be able to support the same set of functionality whether you’re doing capacity planning for NYSE Liffe or capacity planning for NYSE Classic, so you want to be able to virtualize that, obviously. So we have smarts in our data architecture. Where we need to aggregate, we’ll aggregate, but in most cases we want to get at the transaction-level detail in place when we can.
Research is a key part of the business because they’re always fine tuning their fees and commissions and models in terms of how the business is going to make money. Member administration is important to our, the people that do the trading, the specialists and the market makers and all the folks that are basically the agents that interact with the trading venues.
Listings – basically those are the issues, the products that are being traded. So there’s a lot of reference data in that space. Finance wants to see the numbers. They want to see volumes, they want to see revenue, they want to see it across the board; it’s very hard. Again, with this report today, we’re not able to do that. There are even some cultural challenges and legal challenges in doing that. Getting information from Europe or from France over to New York has some legal ramifications, so we also have to take that into account in our data virtualization model. Web systems and communication – obviously NYSE is a big brand, so we have to do a good job of getting very useful information out in a timely fashion from an external-facing perspective.

I mentioned the daily volumes; to give you an idea, on average NYSE Classic is pretty much 1.5. On the day that we had the flash crash we did 4 billion transactions. So that gives you an idea of the elasticity of the architecture; it has to be there. When we design our systems we always try to design for at least two times what we have today. In the case of the flash crash, we didn’t break and we didn’t have any outages, because we designed it properly, and that was one of those buckets I mentioned earlier about capacity planning. We’re always calibrating; we’re always identifying the throughput requirements and the database requirements and the operational requirements every day. Because the Exchange absolutely cannot come down. You don’t see in the news that the NY Stock Exchange goes down even for a minute.
We have multiple database systems, as you can see. We’re starting to sort of grasp MPP and appliances. Four years ago, if you had come to me as a data architect, I would basically have said no way, I would never use an appliance, it’s too proprietary. But we realized that with the kind of mass that we’re dealing with, and the push of appliances integrating with third party tools, and the standards around SQL interfaces, it’s really not so bad. The number of moving parts is being reduced. So we’re leveraging the MPP systems we have in house, (inaudible) for example. And we also have Greenplum, and we’re making good use of these technologies because we’re trying to spend less time on performance optimization, which is what a traditional database architecture group does, and more time on business processes and integration and ETL-like functionality.
Replication – replication is fine when you talk about master reference data, and that’s very valid. In our case it is valid because a lot of our trading engines integrate directly with those ODSs, and we don’t want reports to run against them, so we replicate. So we have this whole prescription of what we do and when we do it, and part of that is based on the SLAs all the way upstream to the matching engine. So we make those decisions; we say this is the right technology, this is the use case that we need to use it for. Replication is not valid for transactional data; we don’t move it around.
So knowing that in advance, as we start to kind of collaborate with all the leads from all these various areas of the business, we identify the common intersections of all our transactional systems and we ensure that as part of the standardization that we focus on those, if you will, primary and foreign keys across those federations. And not so much worry about the details of what goes into an option and details of what goes into equity instruments, but more on the fact that a security or an (inaudible 0:14:36) or whatever, those common pieces of information have to exist in all these systems.

So that’s the picture, borrowed from Composite, thank you. It probably looks pretty familiar to a lot of organizations, and clearly we’re still in some of this state and we’re starting to unwind out of it. The key point here is that complexity is extreme, both volume and data complexity. We probably have over 4000 different data elements, so that’s pretty complex, with very sophisticated relationships. And if you look at the history of how exchanges, or trading or matching engines, got created, the whole notion was about ultra-low latency. So the design was not to cater to the reporting world or the business users; it was more to cater to small message sizes and network traffic and ultra-low-latency speed. So that’s really a problem for a data architect; it has certainly been a problem for me. So how do we unwind this? For example, Acapella does some really sophisticated stuff in encoding their data. And encoding your data is a form of compression. A lot of software companies do it, and we do it too in our own way. In order for a user to understand what type of order it is, a cancel, all sorts of stuff, you have to decode the information. Writing that in SQL is not so easy, so you can’t really implement BI solutions on top of that. So that gives you an idea of the kind of complexity, and now we look for opportunities to leverage things like Composite and data virtualization where we can abstract some of that. Even if we’re not federating, there’s a lot of value in abstracting the complexity of the data.
Completeness and latency kind of go together in terms of the reference data story. These operational data stores are updated constantly. We need a mechanism that allows us to augment the transactional data. We do have end-of-day processes, like most organizations. We have intraday processes in terms of pushing the data from an ETL perspective, but now we’re looking more towards streaming. How do we get this data from the operational data stores into the data warehouse where the SLAs require it?

Locality – we’re all over the world. This is sort of, it’s an old picture but it’s still very relevant. When I started with The Exchange, which was about six years ago, I basically put this picture together and I said this acronym called TORCA, you know they had complex data models, they used files for market surveillance, every single programmer had to sort of reverse engineer the business rules, understand the complexity of the business, I said that’s kind of insane; there’s a lot of cost in that.
Part of ROI is reducing your footprint for amount of knowledge transfer that’s required to get the business right. So I said why don’t we just come up with TORCA? Which is really the main entities no matter what exchange you have, no matter what type of trading venue we’re going to buy, there’s always trades and orders and reports that are the sides if you will, and there’s always tick data, which is the quotes and there’s some administrative messages that are going back and forth; that’s pretty typical. So let’s come up with this model and then let’s start to figure out how to apply this across to the various venues.
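As an editorial illustration of the idea described here (a small set of common entities that every venue maps onto), the following is a minimal Python sketch. The talk does not spell out each letter of TORCA, so the entity names below are taken only from the ones mentioned (trades, orders, reports, quotes, administrative messages). The venue names, message codes and field names are hypothetical and are not NYSE's actual model.

```python
from dataclasses import dataclass
from enum import Enum


class EntityKind(Enum):
    """Common entity kinds named in the talk (assumed labels)."""
    TRADE = "trade"
    ORDER = "order"
    REPORT = "report"
    QUOTE = "quote"
    ADMIN = "admin"


@dataclass(frozen=True)
class CommonRecord:
    """Venue-neutral envelope: only the shared keys are standardized."""
    kind: EntityKind
    venue: str
    symbol: str
    payload: dict  # venue-specific details are carried through untouched


# Hypothetical mapping from (venue, native message type) to a common kind.
VENUE_TYPE_MAP = {
    ("venue_a", "EXEC"): EntityKind.TRADE,
    ("venue_a", "NEW"): EntityKind.ORDER,
    ("venue_b", "T"): EntityKind.TRADE,
    ("venue_b", "Q"): EntityKind.QUOTE,
}


def normalize(venue: str, raw: dict) -> CommonRecord:
    """Map a venue-specific message onto the common model."""
    kind = VENUE_TYPE_MAP[(venue, raw["type"])]
    payload = {k: v for k, v in raw.items() if k not in ("type", "sym")}
    return CommonRecord(kind=kind, venue=venue, symbol=raw["sym"], payload=payload)


record = normalize("venue_b", {"type": "T", "sym": "XYZ", "qty": 100})
print(record.kind.value, record.symbol, record.payload)
```

The design choice in this sketch is that only the shared keys (kind, venue, symbol) are standardized, while venue-specific detail stays in the payload, which mirrors the point about focusing on the common intersections rather than every attribute.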
And the vision here, and this is what we’re building out, is to build a corporate information factory, which is what CIF stands for, and this is in the virtual sense. So you have these different data sets: consolidated trades, and the audit trail, which you’ve probably heard of if you’re in this business, and your orders database information, and the DLE which is coming from NYSE Classic, which is the display book where the matching happens electronically. All of these complex data systems store information in their own way, but we are focused more on optimizing how to get the data into those systems and then making sure that the common data points, again, are standardized and consistent across all the models.
And then now we’re starting to focus in the middle here where we have the BI solutions inside of this, where we have some ELT capabilities inside of that, where we have data virtualization capabilities inside of that. So we’re building kind of a tool kit if you will. And then the bottom is kind of your views, your composite, how people want to see the data. One thing that’s important, I mentioned 4000 attributes it’s probably more if I actually go and do an inventory of all the different markets, you cannot present that from a BI solution or even from a Composite solution or anybody to make it really usable; so it’s not information. So you have to create subject areas. You’ve got to really focus and you’ve got to understand the business, you have to work hand in hand with business analysts who are technical, and that’s what we started kind of incorporating into our organization.
Hiring people who understand the business, who understand financial services, who are also technical, whom we can train so they can become data architects within our organization and liaise with the end users and the business communities. We’re doing that and we find it’s been a successful formula. As a matter of fact, it’s been so successful that they stole three of those guys from my group, and they’re part of the market regulation group now because they’re good at what they do. So this is important for the bottom tier, where they’re really doing the information architecture and they’re basically making it simpler to access this information. And then you can imagine that the middle tier now starts to evolve with the data services tier, so it really becomes service oriented.

So we’re very junior when it comes to Composite; I’m not going to say that we’re experts, but we really are testing it. Performance is our problem, so we’re focusing on seeing what we can do working collaboratively with Composite. They’ve been very good; their engineering staff is excellent. We’ve given them some ideas and they’ve come back, and hopefully we’re going to help them evolve their product and help ourselves at the same time. In our evaluation we pushed some complex use cases. We didn’t go after traditional use cases, because we know we can deal with them in different ways, like smaller finance-type models or basic metrics aggregation. We went after things where we know we have certain order information here and certain tick data there, and how do we align some of that data. And the performance is challenging no matter what sort of technology you have, so that was one of the key things that we put into our evaluation plan.
So we came up with a checklist; I’ll show you in the next slide some of the bullet points and some of the technology objectives. We started with a small team, because I think small teams are more agile and more nimble; you have better chemistry and you’re in sync in terms of what needs to get done. We had three people and a part-time person, so that’s why I say three to four actually did the evaluation. We also had the help of Michael D’Angelo, who was great in the whole process. So we were, again, new to Composite; we didn’t know all the various capabilities that the product could give us.
We compared, obviously; as part of a reference architecture you basically have a matrix and you compare where this particular problem should fit. So we’re always looking at BI, ETL, replication and federation in combination. A lot of BI tools have federation built into them. We’ve looked at them too, and we believe that’s really tight coupling if we go down that road. Instead we want to have more of a connector-based technology where it’s standard, it’s generic, and then users can choose whatever BI solutions they want. We don’t want to be locked in, so we decided against that.
Obviously the goals are important to the ROI: reuse, agility (a lot of this has been discussed), reducing the code and programming footprint, third party integration so we can keep moving, and licensing. We can reduce a lot of the licensing if we do things right. There are so many different technologies at the Exchange. From a database perspective we have SQL Server, we have OneTick, we have TimesTen, we have KDB, we have Oracle, we have different variations of Oracle, we have Netezza, we have Greenplum, we have MySQL, and I could go on, and there have been custom solutions too. So it’s complex. I mean, how many different drivers do I have to deploy to a desktop if I want to do federation? How about versioning of those drivers? You know, Oracle comes out with a new release and now I have to deploy a new driver. There’s a lot of value in even just that abstraction and in reducing the footprint of your code base, and obviously licensing comes along with that.
Short term business value – in the end the business is really interested in productivity and being able to adapt to all the changes around them. Data availability is key to that. Simplifying access to the data: we get users who know nothing about European markets, but they want to do volume and market share reporting. So we should have a common business semantic layer that gives them what they need from a volume reporting perspective. They shouldn’t have to worry that the systems out there are using SQL Server or other technologies.
Identify pilot project – we decided that the best place to start is to go after our transactional data. We’ll start small, but we’ll go after a large section of the data, and we focused on two areas: capacity planning, which I talked about earlier and which is all about optimizing our business, and we did a little bit of work in the regulatory space. The idea is that we would draw up these use cases and get some confirmation from the business on whether these are actually valuable to them if we implement them, and on the success criteria around them. And we tried to do this in a four to six week time frame, with the training obviously included.

So the checklist. Standards based – yes, we have too many different variables, so we need standards. Metadata driven – very key; we want to put the power in the business analyst’s hands, a technical business analyst. We don’t want to have a long development cycle where you always need IT to generate reports. We don’t want that hand-holding; I’d rather have the IT folks focus on integration and volume, dealing with the big challenges of moving data around. Support for multiple platforms, and tools that increase productivity. Composite comes with a designer tool, so it’s perfect from a data architect’s perspective. People are very familiar with mapping and modeling using ERwin-like technology, so extending that capability makes a lot of sense. We also have a lot of programmers in our group who build common services, so an IDE like Eclipse makes sense; those technologies come as part of Composite and we like that.
Flexible security – because we have so many different databases and we don’t have a single sign on right now and that’s very difficult for many different reasons, we need to figure out a way to move faster so people have different log-ins and different passwords, we want to push some of that security into the data virtualization tier. And we also want to have accountability; we want to know what people are doing. So we want to centralize all the metadata that becomes operational metadata for us, it becomes a way to fine tune our architecture.
Scaling – adding nodes is important; we’re very grid oriented in the way we think. We don’t want to build monolithic systems, we want to build very horizontally scalable systems, but we also think about vertically partitioning our data. So we have like a three-dimensional data architecture, if you will. So the software that we choose has to be able to scale. Data caching is key; it’s for part of the reference information that we may want to put into it. Monitoring is also important from a timeliness perspective, so we are proactive, not reactive, in terms of outages and quality of service. Intelligent query engine – we want smarts. We don’t just want to suck out the data from here and suck out the data from there and then put it in memory; we know that’s going to blow up in our world. We want to have some multi-pass capabilities. Semi-join does that in Composite and we like that feature, and we want to continue to see Composite involved with some of those optimizations. Hopefully we’ll continue to work together on some of that.
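To make the multi-pass idea concrete, here is a small editorial sketch of a semi-join across two separate sources. It is not Composite's implementation. It uses two in-memory SQLite databases as stand-ins for a small reference source and a large transactional source, and the table names, columns and sample rows are all hypothetical. Pass one collects the join keys from the small side; pass two pushes those keys down to the large side as a filter, so that only matching rows ever leave the big store.

```python
import sqlite3


def make_source(ddl, insert_sql, rows):
    """Create an in-memory database standing in for one remote source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn


# Small reference source (hypothetical listings data).
reference = make_source(
    "CREATE TABLE listings (symbol TEXT, sector TEXT)",
    "INSERT INTO listings VALUES (?, ?)",
    [("AAA", "tech"), ("BBB", "energy"), ("CCC", "tech")],
)

# Large transactional source (hypothetical trades data).
transactions = make_source(
    "CREATE TABLE trades (symbol TEXT, qty INTEGER)",
    "INSERT INTO trades VALUES (?, ?)",
    [("AAA", 100), ("BBB", 200), ("AAA", 50), ("DDD", 10)],
)


def semi_join_trades(sector):
    """Return trades for symbols in a sector without pulling all trades."""
    # Pass 1: fetch only the join keys from the small side.
    keys = [row[0] for row in reference.execute(
        "SELECT symbol FROM listings WHERE sector = ?", (sector,))]
    if not keys:
        return []
    # Pass 2: push the keys down to the large side as an IN filter.
    placeholders = ",".join("?" * len(keys))
    sql = (f"SELECT symbol, qty FROM trades "
           f"WHERE symbol IN ({placeholders}) ORDER BY symbol, qty")
    return transactions.execute(sql, keys).fetchall()


print(semi_join_trades("tech"))
```

A naive federation would instead fetch every row from both sources and join them in the middle tier's memory, which is the pattern the speaker says would blow up at these volumes.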
In terms of requirements: minimum impact on performance. We don’t want to add too much latency unless it’s a very complex federation. If we’re accessing a single data store we expect the result to be almost equivalent. And actually we may even see improvement, because with the Composite middle tier you can implement some connection pooling capabilities, which gives you some value. Where today you have custom applications that have different methods for accessing data, now you’ve got consistent methods that are tuned and optimized, with connection pooling, so you can actually see improvements even without federation at all.
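The connection pooling point can be illustrated with a minimal editorial sketch. This is a generic pool, not Composite's API: the class name, the pool size and the use of SQLite as a stand-in data store are all assumptions made for the example. The idea is that a middle tier opens a fixed number of connections once and lends them out, so individual applications stop paying the connection setup cost on every request.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """A fixed-size pool that lends out pre-opened connections."""

    def __init__(self, size, factory):
        self._idle = queue.Queue(maxsize=size)
        self.created = 0
        for _ in range(size):
            self._idle.put(factory())
            self.created += 1

    @contextmanager
    def connection(self, timeout=5.0):
        # Blocks until a connection is free; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the query failed.
            self._idle.put(conn)


# Hypothetical stand-in for a real data store connection.
pool = ConnectionPool(
    size=2,
    factory=lambda: sqlite3.connect(":memory:", check_same_thread=False),
)


def run_query(sql):
    """Run a query on a borrowed connection instead of opening a new one."""
    with pool.connection() as conn:
        return conn.execute(sql).fetchall()


for _ in range(10):
    run_query("SELECT 1")
print(pool.created)  # ten queries were served by two connections
```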

Cost allocation – once you record everything you can start to figure out how people are using the system and eventually charge back if you need to do that. Impact analysis – schemas change a lot. Our business changes so much; we introduce new ways to make money. New business events translate into new transactions, and our data model has to support that. On the NYSE Classic matching engine we probably have changes about every two weeks. That’s not common in a lot of organizations, but it is common in ours and we live with that. We live with that change, so we have to architect around it. Usage, building blocks – I’m a big framework person. I believe you build from the ground up. Start from your data, start with your computing environment, build these common services, then integrate these services, and then find the right technologies that you can leverage where you don’t have to build. And then make sure that everything fits, and version control.
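The cost allocation point at the start of this passage (record usage centrally, then charge back) can be sketched as follows. This is an editorial illustration only: the log format, the department names and the choice of query seconds as the allocation basis are assumptions, since the talk does not say how a chargeback would be computed.

```python
from collections import defaultdict


def allocate_cost(query_log, total_cost):
    """Split a shared platform cost in proportion to recorded query time."""
    usage = defaultdict(float)
    for entry in query_log:
        usage[entry["department"]] += entry["seconds"]
    total_seconds = sum(usage.values())
    if total_seconds == 0:
        return {}
    return {dept: total_cost * secs / total_seconds
            for dept, secs in usage.items()}


# Hypothetical operational metadata captured by the middle tier.
log = [
    {"department": "research", "seconds": 30.0},
    {"department": "finance", "seconds": 10.0},
    {"department": "research", "seconds": 60.0},
]
print(allocate_cost(log, 1000.0))
```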
So just to give you some samples of the evaluation use cases that we did. One of the use cases is that we have expensive technologies, but they do their job; they’re very fast. In these technologies we have a certain amount of data that’s relevant, and other data that might not be as relevant is sitting in different environments. We want to union that data; we want it to look like it sits in one single place, so as part of our evaluation we did vertical partitioning using federation. It doesn’t come out of the box, but with the help of Michael we were able to build a custom procedure that applies a business calendar to each one of these databases, which allowed us to do a Semi-join against this union and do pruning of the federation. So by inspecting the trade date that comes through the query we were able to access data very quickly without having to move it. So we’re partitioning the data this way, kind of vertically.
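The pruning described here can be sketched in a few lines. This is an editorial illustration, not the custom procedure that was actually built: the source names, date ranges and query text are hypothetical, and the "business calendar" is reduced to a list of date ranges, one per database. Given the trade dates in a query, only the databases whose range overlaps those dates are included in the union.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Partition:
    """One physical database and the trade dates it holds (assumed layout)."""
    name: str
    first_day: date
    last_day: date


# Hypothetical calendar: recent data on the fast store, history elsewhere.
CALENDAR = [
    Partition("fast_recent_store", date(2010, 7, 1), date(2010, 9, 30)),
    Partition("history_store", date(2008, 1, 1), date(2010, 6, 30)),
]


def prune(partitions, start, end):
    """Keep only the partitions whose date range overlaps [start, end]."""
    return [p for p in partitions
            if p.first_day <= end and p.last_day >= start]


def build_union_query(partitions, start, end):
    """Build a UNION ALL over the surviving partitions only."""
    selects = [
        f"SELECT * FROM {p.name}.trades "
        f"WHERE trade_date BETWEEN '{start}' AND '{end}'"
        for p in partitions
    ]
    return "\nUNION ALL\n".join(selects)


hits = prune(CALENDAR, date(2010, 8, 2), date(2010, 8, 6))
print([p.name for p in hits])
print(build_union_query(hits, date(2010, 8, 2), date(2010, 8, 6)))
```

A query for one week in August touches only the recent store in this sketch; a query spanning June and July would touch both, and the union makes them look like a single table to the user.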
Summarization – there are some key places in our ETL process where, as we’re streaming data, we’re capturing key statistics about the environment, and then we’re storing that. So users have a starting point from a business intelligence perspective; they can start there and then navigate down to the actual transaction-level detail. In other cases, where we need to recreate or use different filtering strategies, we’re using MPP systems for those aggregate types of reporting. Another key one is the hub and spoke using master reference data. We have many dimensions; in this case it’s the reverse, our hub happens to be the transactional data and the spokes are all the various reference data systems: corporate actions, member administration, listings administration. We need different data from different systems, so we bring that data in, and we’re now attempting to start building virtual views of those systems and bringing it in in real time.

So this is a very simplified model, but the goal for us really is integration, standardization and simplification. For applications we want to have uniform connectors, JDBC/ODBC; pretty much most of our applications use one or the other. Obviously there’s a whole web services or web-based paradigm, but this is more of the thick client, and even some thin client, approach. Composite sits in the middle as the integration server. The metadata repository provides all the mappings, all the useful information that allows us to evolve very quickly, and we build federated views, but over very granular subject areas specific to solving specific business problems; we don’t try to map everything. And the bullet points here are really about reducing data mart proliferation, obviously; simplifying deployment of the software, because we have so many tools and technologies and versions of those tools and technologies; centralized security; the data services tier, which is where we want to go and which we’re starting to build out; and caching where we need to.

Here are some of the test cases in terms of what we tried. We did successfully integrate Netezza with Greenplum. In this case we actually did what we call latency. We have all the customer gateways where the trade traffic, or the order traffic, comes through; we actually have hardware that basically puts a timestamp on every message, and those messages get sent over to a Greenplum environment where they get trickle loaded every five minutes. That information is required for us to do capacity planning and kind of proactive alert notification when we hit certain throughput thresholds. And it’s very complex because the volume that we have to deal with is incredible. But not all of the data is in Greenplum, which is the latency system. A lot of the data, in terms of the transactional and the historical stuff, is sitting in Netezza. So in some cases we need to basically join some of that information and union some of that information, and that was that use case example.
Another example is just trying out traditional databases. We have a lot of applications being developed that kind of use MySQL as a cache store, and it’s possible that we need to integrate some of that cached data with Oracle, which is our traditional relational database technology. Files and database technologies integrated, Netezza with XML, and Netezza with Netezza, which is the case of the vertical partitioning that I mentioned earlier. So a lot of different flavors, and what we want to prove out is that we can actually support a heterogeneous environment, with different types of data sources and different types of databases using different drivers.
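To make the "union across two stores" use case concrete, here is a small sketch. It assumes two in-memory SQLite databases standing in for the recent latency store and the historical store; the table and column names are hypothetical, and it says nothing about how Composite itself defines its views.

```python
import sqlite3

# One connection plays the role of the integration tier.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS latency_db")   # stand-in: recent latency store
conn.execute("ATTACH DATABASE ':memory:' AS history_db")   # stand-in: historical store

# Hypothetical, identically shaped tables in each source.
ddl = "CREATE TABLE {db}.msg_latency (gateway TEXT, trade_date TEXT, micros INTEGER)"
conn.execute(ddl.format(db="latency_db"))
conn.execute(ddl.format(db="history_db"))

conn.executemany(
    "INSERT INTO latency_db.msg_latency VALUES (?, ?, ?)",
    [("GW1", "2010-06-02", 410), ("GW2", "2010-06-02", 380)],
)
conn.executemany(
    "INSERT INTO history_db.msg_latency VALUES (?, ?, ?)",
    [("GW1", "2010-06-01", 450), ("GW2", "2010-06-01", 395)],
)

# A single virtual view that unions both sources; the data stays where it is.
conn.execute(
    """
    CREATE TEMP VIEW all_latency AS
    SELECT gateway, trade_date, micros, 'recent' AS source FROM latency_db.msg_latency
    UNION ALL
    SELECT gateway, trade_date, micros, 'history' AS source FROM history_db.msg_latency
    """
)

rows = conn.execute(
    "SELECT gateway, COUNT(*), MAX(micros) FROM all_latency "
    "GROUP BY gateway ORDER BY gateway"
).fetchall()
```

A client querying `all_latency` does not need to know which store holds which rows, which is the point of the federated view.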

The common business vocabulary and being metadata driven is obviously important to ROI; it’s important to being agile so that we can empower the business. You know, it just takes too long every time you have to sort of transfer knowledge from the business community to the business analyst to the developer and so forth. So we’re moving more quickly towards these types of tools. They were very immature before, but now they’re becoming a lot more mature, so we’re seeing a lot of value and we’re trying to build sort of a portfolio around these types of technologies that can allow for that. For example ETL, where you (inaudible), we have custom ETL solutions, grid-enabled ETL solutions, but we’re moving towards a metadata-oriented ETL solution.
User access – I mentioned that earlier. It’s important for us to sort of know and see what people are doing, because a lot of times, from a data architect perspective, people say this is really important, it’s very relevant, but then you realize from the number of hits, or the percentage of use for particular data points, or how they use it, that it’s not really valid. So we want to validate what’s really going on. We want to basically be able to go back and look at the information and say, you know what, these objects, these columns, these relationships are really being utilized very frequently. And we can basically evolve and tune our data architecture or distribute our data architecture. And that’s part of the statistics and centralized logging, so we can better troubleshoot our systems.

These are just some of the Composite things that we looked at and obviously tested. Procedures, and as an example of procedures, I mentioned how ARCA data is encoded: we basically wrote a function that exposed all those bits as a series of fields, and now users can very simply write a filter, give me this and this or that, and it made it much easier. Now you can add a BI tool on top of that, where before it was not possible for us to move forward with that kind of strategy.
Administration capabilities – we’re at the bottom over there. Certainly we’ve tested, and even in a four to six week period we actually were able to test a simple pass-through query tool, Aqua Studio, which you now have to purchase, but it’s a very good pass-through SQL tool. We deployed a connector to a bunch of users that had no knowledge of Composite; we showed them one kind of Composite view and they’re able to run their particular queries, if you will. Aqua Studio sees the Composite repository as databases, objects and columns, so it looks like real metadata coming back from a database, which is very valuable. Business Objects we’ve integrated, and OBIEE, which is the Oracle business intelligence enterprise reporting tool. And then we have our own internally developed Java-based query tool, which we successfully integrated. The goal really was to show that we can integrate all these various flavors of tools, with different patterns of access, using the same connectors.
And then some CIS features, the Composite Integration Server. The ability to throttle is key to us because we’re dealing with huge volumes, and managing CPU efficiently is key, the queuing aspects are key, and putting intelligence into how it does connection pooling is very important. How it does the join, whether it’s hash, sort, semi-join, all that’s very critical because we’re dealing with such heavy duty data.
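A function that exposes encoded bits as named fields might look something like the sketch below. The flag names and bit positions are invented for illustration; the talk does not give the real ARCA encoding, and the function the team wrote ran inside Composite rather than in Python.

```python
# Hypothetical flag layout: name -> bit position. Not the real ARCA encoding.
FLAG_BITS = {
    "is_short_sale": 0,
    "is_iso": 1,
    "is_odd_lot": 2,
    "is_hidden": 3,
}


def expand_flags(encoded):
    """Turn one packed integer into a dict of named boolean fields."""
    return {name: bool((encoded >> bit) & 1) for name, bit in FLAG_BITS.items()}


# Hypothetical orders carrying a packed flags column.
orders = [
    {"order_id": 1, "flags": 0b0101},
    {"order_id": 2, "flags": 0b0010},
    {"order_id": 3, "flags": 0b0111},
]

# Each order now has plain columns that a user or BI tool can filter on.
expanded = [{**o, **expand_flags(o["flags"])} for o in orders]

# "Give me this and this": short sales that are also odd lots.
matches = [o["order_id"] for o in expanded if o["is_short_sale"] and o["is_odd_lot"]]
```

Once the bits are ordinary fields, the filter reads like any other predicate and no user has to know the packing scheme.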
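The semi-join strategy mentioned here can be illustrated with a short sketch: collect the join keys from the small side first, then ask the large side only for rows with those keys, so that little data crosses the wire. The function names, the `order_id` key, and the in-memory "large table" are all assumptions for the example; this is not Composite's implementation.

```python
def join_with_semi_join_filter(small_rows, key, fetch_large):
    """Join small_rows to a large source, fetching only rows whose key matches."""
    keys = sorted({row[key] for row in small_rows})
    if not keys:
        return []
    # The large source is asked for matching keys only (think: WHERE key IN (...)).
    by_key = {}
    for row in fetch_large(keys):
        by_key.setdefault(row[key], []).append(row)
    return [{**s, **l} for s in small_rows for l in by_key.get(s[key], [])]


# Hypothetical large source: 1,000 child orders held in another database.
LARGE_TABLE = [{"order_id": i, "qty": i * 10} for i in range(1, 1001)]
fetched_counts = []


def fetch_child_orders(keys):
    """Stand-in for a remote query filtered by the supplied keys."""
    wanted = set(keys)
    rows = [r for r in LARGE_TABLE if r["order_id"] in wanted]
    fetched_counts.append(len(rows))  # record how many rows left the source
    return rows


parents = [{"order_id": 7, "parent": "P1"}, {"order_id": 42, "parent": "P2"}]
joined = join_with_semi_join_filter(parents, "order_id", fetch_child_orders)
```

Only two of the thousand rows leave the large source, which is why the choice of join strategy matters so much at high volume.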

And those are just some key points in the intelligent query engine evaluation that we pointed out. The ability to push down, because we are heavily invested in MPP systems: where it makes sense, we want to push some of that work down to the database, since we’ve already invested the money there. And then in other cases do it in the middle tier. And data streaming is something that we’re going to keep building on. A lot of our ETL capabilities are based on data streaming and single-pass type of processing.

And there are really two main paradigms of access that we focus on at the highest level, which is either web services oriented or relational oriented and that’s how sort of we start from the top.

And security, obviously we want to move towards LDAP, but it’s a good sort of alternative if you have a middle tier that could provide you some of that sort of centralized metadata management around security and entitlement.

And just a picture of the tools that I mentioned. From a productivity perspective these tools are intuitive, they’re pretty straightforward to use. We were able to leverage them very quickly; with a three to four man team we were able to build out the first two systems that we deployed, which are a special report that we create for the SEC called the Dash5, and the latency capacity planning type of stuff.

So our biggest barrier from an ROI perspective is really our technical challenges around the size and complexity of change in our environment. So that was key for us in choosing technologies, the combination of technologies and how they integrate. The architecture basically had to deal with the mergers, the compliance need for large windows of data, the huge looking-for-a-needle-in-a-haystack type of problem, and being able to support multiple lines of business using a shared data services environment. What we expect from an ROI and benefit perspective is we want to leverage existing IT investments, so we’ve got a bunch of things already built, we just want to formalize the integration and standards around them. We want to take advantage of the current data skill sets that come in. Obviously when we acquired NYC Life some smart people came with that, they understand their business processes, we don’t want to take over that function, but we want to basically standardize and integrate with those folks. We want to come up with sort of a reference architecture, a governance on top of that, and sort of have a kind of PMO that manages projects from a virtual perspective, but leave each architecture lead and each development team to do their thing, but more in line with the strategy.
And quick reaction to business needs: you may actually have to do things tactically, but at the same time, in parallel, we want to basically start utilizing and evolving all these different patterns and architectures and software that I spoke about. And fast time to the solution, obviously, so we want to choose solutions where, you know, we’re not going to try it for a year, we don’t have the tolerance to do that. We want to make sure it’s mature enough and that we can quickly handle some of these complex use cases from day two. And that’s it.
Let’s make time for a couple of questions since we ran over last time.
Audience Member
So one of the things I was looking to get a comfort level on today was around how Composite scales with increasing data. So can you talk a little bit about how much data you’re passing through your existing implementation and how you think it scales, your experience with that?
Emile Werr
That’s a loaded question. Composite scales if you have the proper relationships built, and it really depends on the filters that you use in your query. For example, we have a certain use case that we did, where Netezza has all this information and Oracle has parent-child orders. Our matching engine basically bypasses the child orders in one case, but they make their way through another system and they get stored in a different database. Our parent orders are really where the matching is occurring, because things are being netted on the display book. Users want to see the relationship. Even though that relationship actually exists only a small fraction of the time, less than one percent, the data is in multiple places. We actually did that use case and it worked out very well.
We had to iteratively go back, and it was kind of an architectural tuning exercise working with Composite, but using semi-join and using some tricks in terms of, you know, tweaking maybe the data architecture side of it and building some views to sort of be clever, we were able to actually do that. So you can do that, and the volume, to give you an example, we passed - Vitale, you were involved in that, how much data in that one query? Yeah. But I mean again, think about it from a reference architecture perspective, we wouldn’t want to make that sort of a commonality. Obviously the way data is looked at in our world, transactional data basically is more of an appending type of reporting need as opposed to just augmenting reference data. Reference data with transactional data is fine, but we’re not going to be in the business of trying to take transactional and transactional and trying to kind of join them, even though there are multiple join operations and algorithms in Composite, which we really liked. And we saw many other products and none of them had all those capabilities, but we feel that, you know, we tried the hard case and we were able to do that.
Audience Member
One more question. A lot of information delivered quite quickly, so from one fast talker to another, one of the things I heard early on in your presentation was that replication for large volumes of transactional data doesn’t work. But it sounded like later, when you were in the extended data warehouse discussion, you still need something from the transactional system, right? And so is that the concept in the extended data warehouse, that I have most of the information I need, especially the stuff that’s static or aggregated or rolled up, but for those few pieces I do go back to the transactional system? And then how do you mitigate that you might be doing something unwanted to the transactional system?
Emile Werr
Yeah. So the extended data warehouse is basically two main approaches. One is the reference information that we need to augment; we’re going to get it in place, hopefully. The only decision that’s going to change that course or direction is if there are data quality issues, and obviously we’ve talked a lot about that. If there are quality issues and we need to do some scrubbing of reference data, we have no choice but to ETL that. If the quality is good, we’re going to get it in place. In terms of the transactional data, we’re not moving transactional data around. Instead we’ll do two things: one is aggregate some data and ship that to each node, and that kind of gives you a dimension, an entry point from a reporting perspective. And the other thing is just align from a logical perspective, make sure that we have an alignment of the types of information across these transactional systems, and append the data vertically.
Audience Member
You said something like there’s a lot of compressed data with which you use CIS to expand it and make it more user friendly. So how would you filter on the user friendly data, do you have something which translates it back into the compressed format or?
Emile Werr
Yeah, that was a custom function that we built, and we wanted to prove out the fact that it is a rapid application development environment. So out of the box drag and drop works great, but in our world we have a lot of complexity where drag and drop is just not going to cut it. We need to do specific stuff, and the one thing that’s good about Composite is extensibility. You can add features to it in a short time frame, so it’s very agile. And we worked with Michael D’Angelo on that, where he built us a function in Java into which we put a business calendar, I mentioned a business calendar, and we were able to say, for this particular range of data, by joining to the business calendar and doing an outer join, semi-join operation, we can prune, drop off some of those partitions. Those partitions for us become connections, so that’s how we did some of that. So some of this stuff can be done out of the box with a little training, some of the stuff has to be customized, but it’s an enabler. And we still use the same connectors, that’s what’s important.
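The calendar-based pruning described in this answer can be sketched as follows. The original function was written in Java inside Composite; this Python version only illustrates the idea, and the holiday list and the `trades_YYYYMMDD` partition naming are assumptions made up for the example.

```python
from datetime import date, timedelta

# Hypothetical business calendar: weekends plus this holiday list are non-trading days.
HOLIDAYS = {date(2010, 7, 5)}


def business_days(start, end):
    """Return the trading days in the inclusive range [start, end]."""
    days = []
    day = start
    while day <= end:
        if day.weekday() < 5 and day not in HOLIDAYS:
            days.append(day)
        day += timedelta(days=1)
    return days


def partitions_to_query(start, end):
    """Map a requested date range to only the daily partitions that can hold data."""
    return [f"trades_{d:%Y%m%d}" for d in business_days(start, end)]


# Friday 2 July through Wednesday 7 July 2010: six calendar days requested.
parts = partitions_to_query(date(2010, 7, 2), date(2010, 7, 7))
```

Here six calendar days reduce to three partitions, and since each partition corresponds to a connection, every pruned partition is a connection that never has to be opened.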