Integration isn’t a simple topic to talk about. Over the years, many methodologies and technologies have come to market to ease a task that is complex by nature: making systems talk to each other.
While REST APIs have established themselves as the “most appropriate way” to integrate applications, there are scenarios (legacy applications, for instance) where moving data around is the more efficient choice.
The idea of moving masses of data between systems to feed internal engines, rather than responding to a given HTTP request with portions of data, is currently known in the market as “Data Integration”.
In practical terms, Data Integration (often referred to as DI) is a structure made of several large building blocks, such as data ingestion, pipelines for cleansing, processing and transformation, ETL mapping, and so forth.
The most important outcome of Data Integration is the ability to deliver a “unified and meaningful view” of what are usually different types of data sitting in various data sources. For that reason, having a definition (contract) that describes what the data should look like once it arrives at the destination is key to the success of the entire process, because it standardizes the integration point.
This standard, contract, manifesto, you name it, has been defined by the Open Data Initiative (of which Microsoft is an important part) as the Common Data Model (or simply, CDM). This article aims to give a comprehensive view of CDM itself, where to apply it, and which Azure services are suitable for this kind of scenario.
What is a Common Data Model?
Common Data Model is, therefore, a shared data language for business and analytical applications to use. The Common Data Model metadata system makes it possible for data and its meaning to be shared across applications and business processes.
Natively, CDM brings a set of standardized and extensible data schemas: a collection of predefined schemas that includes entities, attributes, semantic metadata, and relationships.
Extensible in this context means that you have the ability to enrich the standard definition of CDM, either with your own definitions or with additional existing definitions made available in the market by somebody else, like the ones defined by NIEM in the US, and many others.
The figure below presents an interesting schema that describes the general idea of why you would need to have a CDM in place.
You can easily see the main goal here, right? This is a “common data model” receiving data from different systems (data sources) and, somehow, serving data with the same structure to systems with completely different purposes. That sounds promising, doesn’t it?
In Microsoft’s technology stack, CDM is the foundation for what the company calls Common Data Service (or simply, CDS). CDS is part of the Power Platform suite. It’s a built-in service that allows you not only to store data coming from many sources but, mainly, to share that data with other applications within Microsoft’s ecosystem, like Dynamics, O365 and others.
Picking up a real-life-like scenario
Americas University (AU) is a big university with a presence in multiple countries. The institution runs multiple applications for both technical and business purposes. On the business side of the house, one of the solutions they currently use is Office 365, mainly (but not only) for employee collaboration. Also, as part of their current setup, they have just adopted Power Platform to start creating internal business apps to serve different departments within the institution.
Figure 4 presents Americas University’s website.
As a first step towards creating a set of specialized internal apps for departments, Americas University’s marketing department would like a closer view of people’s sentiment about the university’s content, courses, events and operations.
The institution has several data sources in place that could be leveraged to provide the insights the marketing team is looking for, but they decided to start small and picked only two of them: Twitter and the comments on Americas University’s blog.
In practical terms, this is what the IT team will need to implement to deliver the expected value to the marketing department:
- Collect data from AU’s Twitter account;
- Collect data from the comments on AU’s blog posts;
- Standardize the data so that different applications within the company can read it from a single data source, regardless of where and how it comes in;
- Apply machine learning to surface sentiment automatically in a custom app.
Understanding the role of CDM
In the scenario above, CDM is going to be key. Why? We’re not talking about an ad hoc communication model (which is often covered by regular API calls). As the scenario describes, data analysis is performed on top of data coming in from different sources, and a unified version of that data will then feed other internal systems, which is what characterizes a data integration process.
CDM fulfills the need to unify data: every instance of data coming in (whether it is a tweet or a blog post comment) will need to obey a well-defined structure, the Common Data Model.
From the moment AU’s team decides to have a CDM in place, any new data source (a new social network, for instance) will be able to provide data following the previously defined pattern, and the consumer on the other side won’t have to change anything in its configuration to start seeing the new data, just as described by Figure 2.
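To make the idea more tangible, here is a minimal Python sketch of that unification step. It is not a real CDM definition: the `Feedback` record, its field names and the shape of the incoming tweet and comment payloads are all assumptions I made up for illustration.

```python
from dataclasses import dataclass, asdict
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class Feedback:
    """Hypothetical common shape shared by every data source."""
    source: str
    author: str
    text: str
    created_at: str  # ISO 8601 timestamp, kept as a string for simplicity


def from_tweet(raw: Dict[str, Any]) -> Feedback:
    # Assumed tweet payload: {"user": ..., "full_text": ..., "created_at": ...}
    return Feedback("twitter", raw["user"], raw["full_text"], raw["created_at"])


def from_blog_comment(raw: Dict[str, Any]) -> Feedback:
    # Assumed comment payload: {"name": ..., "comment": ..., "posted": ...}
    return Feedback("blog", raw["name"], raw["comment"], raw["posted"])


# One adapter per source; a new source only adds an entry here.
ADAPTERS: Dict[str, Callable[[Dict[str, Any]], Feedback]] = {
    "twitter": from_tweet,
    "blog": from_blog_comment,
}


def unify(source: str, raw: Dict[str, Any]) -> Feedback:
    """Convert a raw record from a known source into the common shape."""
    return ADAPTERS[source](raw)


tweet = {"user": "ana", "full_text": "Loved the open day!",
         "created_at": "2020-03-01T10:00:00Z"}
comment = {"name": "joe", "comment": "Great article.",
           "posted": "2020-03-02T08:30:00Z"}

records = [unify("twitter", tweet), unify("blog", comment)]
for record in records:
    print(asdict(record))
```

The consumer only ever sees `Feedback` records, so adding a new social network means writing one more adapter function and nothing else changes downstream.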
Know before you go
Earlier, I mentioned that CDM is a shared language to describe data, right? But how does it land? What does it mean in the real world? I think it is important, at this point, to understand a couple of concepts before we start building something.
Common Data Model carries five important ideas. Together, they describe technically what a CDM is composed of and give us an idea of what is needed not only to create something new but, mainly, to extend existing CDM structures.
- A way to describe the location and shape of data records that are stored in files. This is all about CDM’s internal organization. Basically, there are two aspects here: 1) how data is distributed under a given CDM (also known as CDM folders); 2) a “manifest” file that organizes and links everything together.
- A collection of reference entities that have been published on GitHub to represent the most common shapes of data that customers may find in the business application ecosystem. This is all about the pre-existing entities published by ODI in CDM’s GitHub repository. You will be able to extend entities from there.
- The metadata used to describe the logical concepts, compositions, and semantic meanings for, and relationships between, standard published entities or a customer’s private standards or ad-hoc compositions. As we are working with data, there must be a legible way for us to describe relationships, their meaning, and so on. The idea here is to provide a way to describe those aspects.
- An object model library. Relying on the previous definitions, it manages (creates, updates and reads) CDM folders and their content.
- The ecosystem of applications and services. The ecosystem makes use of some or all of the four capabilities listed above to help app developers work cooperatively on standardized entity shapes or share metadata.
Beyond the concepts mentioned earlier, there are several other aspects worth mentioning.
CDM structures are defined in JSON format. The set of rules that validates CDM’s JSON files is what they call “CDM’s language”. The example below gives an idea of what we’re talking about.
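As a rough illustration, the Python sketch below builds a small entity definition document and prints it as JSON. The key names follow the general layout of the published CDM entity files as I understand it, so please check them against the official schema; the `Feedback` entity and its attributes are hypothetical and exist only for AU’s scenario.

```python
import json

# Hypothetical entity definition document. Key names are my reading of the
# public CDM entity files and should be verified against the official schema.
feedback_document = {
    "jsonSchemaSemanticVersion": "1.0.0",
    "imports": [{"corpusPath": "cdm:/foundations.cdm.json"}],
    "definitions": [
        {
            "entityName": "Feedback",  # hypothetical entity for AU's scenario
            "hasAttributes": [
                {"name": "source", "dataType": "string"},
                {"name": "author", "dataType": "string"},
                {"name": "text", "dataType": "string"},
                {"name": "createdAt", "dataType": "dateTime"},
            ],
        }
    ],
}


def attribute_names(document: dict, entity_name: str) -> list:
    """Return the attribute names declared for one entity in the document."""
    for definition in document["definitions"]:
        if definition.get("entityName") == entity_name:
            return [attr["name"] for attr in definition["hasAttributes"]]
    raise KeyError(entity_name)


print(json.dumps(feedback_document, indent=2))
print(attribute_names(feedback_document, "Feedback"))
```

The point is less the exact keys and more the shape: a document imports other documents, then lists definitions, and each entity definition declares its attributes.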
Conceptually, the corpus is the top level of a CDM object structure. It contains all the folders and underlying objects representing that given CDM object, as Figure 5 illustrates.
Because CDM objects can reference documents both from within the same corpus and from outside it, there must be a way for us to easily specify the document we want to bring in, and there is. It is called the corpus path, and it uses the following path structure.
In the path structure above, the “storage” label deserves a comment. In short, it identifies the place where the CDM document/object you’re referring to sits and also automatically maps the respective adapter to that location.
- cdm: is reserved for the storage adapter that accesses the standard documents from the root of Microsoft’s public GitHub site (remember when we talked about the “extension” aspect of CDM?).
- local: is often used to access the documents under a particular folder on a local hard drive.
- adls: might point to a folder in a storage account.
Figure 6 (retrieved from Microsoft’s documentation) gives us a good visual reference to understand the relationship between the different “corpuses” under a given CDM.
As you can clearly see in the image above, there is a mapping in place for each storage label. In this case, respectively:
- local: points to
- cdm: points to
- adls: points to
It is important to note that any relative corpus paths are assumed to be relative to the document in which they’re found, and are therefore assumed to come from the same adapter source.
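The relative-path rule can be sketched in a few lines of Python. I’m assuming absolute corpus paths look like `label:/folder/document`, and the file names used here are hypothetical; this is my own illustration, not how the CDM object model library implements it.

```python
import posixpath
from typing import Tuple


def split_corpus_path(path: str) -> Tuple[str, str]:
    """Split 'label:/folder/doc' into (label, '/folder/doc').

    A path without a storage label is returned with an empty label.
    """
    label, sep, rest = path.partition(":")
    if sep and rest.startswith("/"):
        return label, rest
    return "", path


def resolve(path: str, containing_document: str) -> str:
    """Resolve a possibly relative corpus path against its containing document."""
    label, rest = split_corpus_path(path)
    if label:
        return path  # already carries its own storage label
    # No label: reuse the label (and folder) of the document that holds the path.
    doc_label, doc_path = split_corpus_path(containing_document)
    folder = posixpath.dirname(doc_path)
    return f"{doc_label}:{posixpath.normpath(posixpath.join(folder, rest))}"


# Hypothetical documents, for illustration only.
print(resolve("Comment.cdm.json", "local:/feedback/Tweet.cdm.json"))
print(resolve("cdm:/foundations.cdm.json", "local:/feedback/Tweet.cdm.json"))
```

A path that carries its own label is left untouched, while a relative one inherits both the folder and the storage label of the document it was found in.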
Other “must have” concepts
To fully understand all the concepts and technical aspects tied to Common Data Models (and I strongly recommend you do so), like version information, standard definitions, imports, and more, please refer to this article in Microsoft’s documentation.
You should also look into the following articles to fully understand CDM’s fundamentals:
From now on, I will be focusing on the implementation of the Common Data Model for our scenario, both conceptually (defining the structure) and practically (by landing it in one of the options currently available in Microsoft’s cloud platform: Common Data Service, or CDS).
However, I won’t do this here, in this article. There is a lot of information for us to digest before getting our hands dirty with the implementation, so let’s take a break.
In the next article, together, we’ll build from the ground up the CDM that will support AU’s demand, described earlier, so stay tuned!