In the IT sense, “integration” is a funny word. At EC Wise, we often claim to have experience undertaking “integration” projects, but when I think about it, it’s one of those things where there is “no there there.” What we call integration is not something organizations do for its own sake; they do it in the context of some larger project goal. We have undertaken data integration projects to support goals like building data warehouses to drive analytics and bringing data from partner or customer systems into a client system for processing.
One of the challenges of integration is that the data you need is often controlled by proprietary systems licensed from vendors who are not eager to support extracting stored information for delivery to a system they do not also control. In such cases, schema information about the database is not available, nor are there public APIs that provide access to the systems or their data. Sometimes extraction of data from such systems becomes so daunting that even we have to throw up our hands in despair. Often, however, when the data is stored in a database or other amenable form, it’s possible to interrogate it externally, figure out what is there, and develop extraction strategies that actually work.
The starting point in making this work is to perform a small number of actions in the product of interest (after obtaining an account in it, of course), and then to analyze the impact of those actions, or better, transactions, in the product’s database. In one such engagement, the product we wanted data from used Microsoft SQL Server as its database. Fortunately for us, SQL Server is fairly transparent and provides robust “Change Data Capture” (CDC) functionality.
CDC functionality, when enabled in a database, records insert, update, and delete activity and exposes it through tables. To see what happens in the database when we create some transactions, we enable CDC on the tables of interest and query the CDC change data after running each transaction. Doing this for a while, across a fairly wide variety of transactions, lets us develop a relatively in-depth understanding of the database schema and of how various transactions are represented in it. And by the way, while CDC is nicely built into Microsoft SQL Server, we have our ways of getting CDC-style updates from the other databases we work with.
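As a rough sketch of that discovery loop in T-SQL (the table name `dbo.Orders` and its default capture instance `dbo_Orders` are invented for illustration), you enable capture on a suspect table, perform a transaction in the product, and read back what was recorded:

```sql
-- Enable CDC at the database level, then on the table we want to observe.
-- The source table name here is hypothetical.
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Orders',
    @role_name     = NULL;   -- no gating role needed for this experiment

-- ... now perform a transaction in the product's UI ...

-- Read everything captured so far. The TVF name embeds the capture
-- instance, which defaults to schema_table (here, dbo_Orders).
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Orders');
DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

SELECT __$operation,   -- 1=delete, 2=insert, 3=update (before), 4=update (after)
       *
FROM cdc.fn_cdc_get_all_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');
```

The `__$operation` column is what makes this useful for reverse engineering: one user-visible action in the product often fans out into inserts and updates across several tables, and the change rows show exactly which ones.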
Once we know what data we need, setting up automated CDC on an ongoing basis is pretty straightforward. I’m not going to go into a lot of detail (which you can find in this Database Journal article, and on the Microsoft Developer Network), but in SQL Server it means enabling CDC on the database and on each table you want to capture changes from. When you do that, SQL Server creates a change table for each tracked table, in which changes and metadata about them are recorded; under the covers, it populates these tables directly from the transaction log. To expose this information, it also generates a set of “table-valued functions” (TVFs) you can call from your code, which we either program to run periodically from a job scheduler or poll with something like Apache Camel.
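The periodic pull that a scheduler would run can be sketched like this (again with the hypothetical `dbo_Orders` capture instance): map a wall-clock window to log sequence numbers, then call the generated TVF:

```sql
-- Translate a datetime window into LSNs, which is what the TVFs accept.
DECLARE @begin_time datetime = DATEADD(hour, -1, GETDATE());
DECLARE @end_time   datetime = GETDATE();

DECLARE @from_lsn binary(10) =
    sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', @begin_time);
DECLARE @to_lsn   binary(10) =
    sys.fn_cdc_map_time_to_lsn('largest less than or equal', @end_time);

-- Net changes collapse multiple operations on a row into one final state;
-- this TVF exists only if the table was enabled with @supports_net_changes = 1.
SELECT *
FROM cdc.fn_cdc_get_net_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');
```

Whether you want all changes or net changes depends on the downstream target: for a staging-table load like ours, net changes per window keep the volume down.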
In our scenario, we wrote a fairly thin API over the generated TVFs that let code pull the set of changes to a table for any given time frame. Our actual objective was to get the data into an Oracle data warehouse, so we wrote those changes to flat files and then loaded them into staging tables in the Oracle database. Additional jobs run on a schedule to refresh the warehouse fact and dimension tables from the change tables in staging.
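That staging-to-warehouse refresh can be sketched as an Oracle `MERGE` plus a delete pass, with all table and column names invented for illustration and the CDC operation codes carried through from SQL Server (2 = insert, 4 = update-after, 1 = delete):

```sql
-- Upsert staged inserts and updates into a dimension table.
MERGE INTO dw_customer_dim d
USING (SELECT customer_id, name, city
         FROM stg_customer_changes
        WHERE cdc_operation IN (2, 4)) s
   ON (d.customer_id = s.customer_id)
WHEN MATCHED THEN
  UPDATE SET d.name = s.name,
             d.city = s.city
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, city)
  VALUES (s.customer_id, s.name, s.city);

-- Apply staged deletes in a separate pass.
DELETE FROM dw_customer_dim
 WHERE customer_id IN (SELECT customer_id
                         FROM stg_customer_changes
                        WHERE cdc_operation = 1);
```

A real warehouse refresh would add surrogate keys and slowly-changing-dimension handling, but the shape is the same: the operation code staged alongside each row decides how it is applied.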
To do integration projects effectively, you have to be creative, and that’s something the folks on our team have going for them. We’re all about getting things to work together, even when they were not originally meant to. Feel free to contact me for more information about our integration capabilities.
— Tom Spitzer