A couple of months ago I started a new series of posts here about the Common Data Model (CDM). The idea was to explore this technology, which Microsoft has brought to market through its data services in Azure, and to understand where its benefits lie.
The first post, which introduces the concept of CDM and proposes a fictitious scenario to help us ground those concepts over time, can be found here.
Today’s post is all about putting those concepts into practice with CDM (taking into consideration the scenario described in the first post). We’re going to:
- Understand what is currently in place for the data we need to normalize.
- Define a pattern
Remembering the big goal
As stated in the previous post, Americas University’s (AU) marketing department wants to apply many different kinds of analysis to data coming in from its external audience. To get started, they want to understand the sentiment tied to users’ comments across the university’s various online channels: Twitter, Facebook, Instagram, the blog, the website, and so on. Later, that data could also be shared with other internal systems in different departments within the institution.
They soon figured out that the very first step toward that goal would be a standardized data model, and this is where Microsoft’s Common Data Model approach comes into the picture.
They have also decided that, as a Proof of Concept (PoC), they’re going to assess comments from both Twitter and the blog. So, that is what we’re going to tackle from now on.
Understanding how data is currently captured and stored throughout the various systems
Americas University’s official blog – one of the data sources selected for the PoC – currently sits on Azure and responds at blog.americasuniversity.net (go ahead and give it a try). After assessing the environment, we found that it is WordPress-based and that its data is held in a MySQL database, also in Azure.
Figure 1 showcases the structure of the table that holds comments data in Azure Database for MySQL.
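Since Figure 1 isn’t reproduced here, a rough sketch of what one row of that table looks like may help. The column names below follow the stock WordPress `wp_comments` schema, so treat them as an assumption about AU’s actual table:

```python
# One row of a WordPress comments table, sketched as a Python dict.
# Column names follow the stock wp_comments schema -- an assumption
# about the actual structure shown in Figure 1.
sample_blog_comment = {
    "comment_ID": 182,                       # primary key
    "comment_author": "Jane Doe",            # author's display name
    "comment_date": "2019-05-10 14:23:07",   # when the comment was posted
    "comment_content": "Loved the new campus library!",
    "comment_approved": "1",                 # moderation flag
}

print(sample_blog_comment["comment_content"])
```

Notice how few of these columns we actually care about for sentiment analysis: essentially the author, the date, and the content.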
When it comes to Twitter, no data is being monitored within AU’s boundaries. What they currently have for social media is a handful of dashboards built on top of third-party marketing tools to see “what’s happening”, Twitter included.
In order to standardize and analyze the data, we first need to capture tweets (initially by monitoring specific hashtags), bring them in, and store them somewhere. So, let’s solve this problem first and then move on to the CDM definition.
Monitoring and storing tweets
There are several ways to monitor Twitter events, grab the information needed, and send it along. Considering that the Americas University team in charge of this doesn’t want to spend much effort developing a custom routine, and that they have Azure available for “free usage”, we’re going to recommend a service called Logic Apps to accomplish this.
As Azure Logic Apps has pretty comprehensive documentation, if you don’t yet know the service, I’ll leave it to you to get to know at least its basics.
It is also important to know at this point that, as we are in PoC territory, the hashtag the marketing team has defined to be monitored on Twitter (it will serve as the trigger for the collection process) is “#americasuniv”. This is because there is a marketing campaign running that encourages students to use that hashtag.
We have also decided to store the tweets coming out of the Logic Apps flow in an Azure Cosmos DB container with the SQL API enabled. Why Cosmos? Well, the specifics of that decision are a matter for another post on the blog.
Figure 2 highlights what we’ve got for this event-driven piece of the solution, gluing together the elements mentioned earlier.
Here is a high-level explanation of the architecture presented in Figure 2:
- A Twitter user posts some content using the hashtag “#americasuniv”.
- Logic Apps is constantly monitoring Twitter, looking for content tagged with the #americasuniv expression. When a match appears, it triggers the Logic Apps flow, which is composed of several internal tasks and ends with the recording of the “tweet workload” in Cosmos.
- Cosmos DB is then updated with the new data.
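The steps above boil down to reshaping the trigger’s output into a document Cosmos DB will accept. Here is a minimal sketch of that mapping, assuming field names like `TweetedBy` and `TweetText` for the tweet payload (the real workload format is the one shown in Figure 5):

```python
import uuid

def tweet_to_cosmos_doc(tweet: dict) -> dict:
    """Shape a captured tweet into the document recorded in Cosmos DB.

    The input keys (TweetedBy, CreatedAtIso, TweetText) are assumed names
    for the Twitter trigger's payload, not the flow's actual schema.
    """
    return {
        "id": str(uuid.uuid4()),   # Cosmos DB documents need a string id
        "author": tweet["TweetedBy"],
        "date": tweet["CreatedAtIso"],
        "message": tweet["TweetText"],
    }

doc = tweet_to_cosmos_doc({
    "TweetedBy": "some_student",
    "CreatedAtIso": "2019-05-10T14:23:07Z",
    "TweetText": "Enrollment day at #americasuniv!",
})
print(doc["author"])  # → some_student
```

In the actual solution this reshaping happens inside the Logic Apps flow itself, not in custom code; the function above just makes the transformation explicit.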
Figure 3 below showcases the working flow we’ve got in Logic Apps.
Next, to make sure this process is in fact happening, we can take a look at the run history provided by Logic Apps. It should list all the runs from the past few minutes or so. Happily, Figure 4 shows us a couple of runs, which means that our flow is being executed as expected.
Great! Data is now arriving in Cosmos DB as it’s supposed to. Of course, there are configuration details here for both Cosmos DB and Logic Apps but, as these two topics aren’t the focus of this post, I’ll leave them aside for now so we can move forward and reach the CDM piece of the solution.
But before we dive into CDM, I would encourage you to take a look at Figure 5 so you’re aware of the format of the workload we’re recording in Cosmos DB.
Defining a Common Data Model for AU
As we saw in the first post of the series, the value of a Common Data Model lies in setting up a data pattern in such a way that different applications can take advantage of a centralized repository of data.
That being the case, the CDM itself has to be planned to cover all the data pieces shared by the systems in the ecosystem it serves (or will serve).
In AU’s particular case, we already know that the ultimate goal of this project is to provide a solution able to extract sentiments, record them in the model, and make that data available to the whole set of applications that need to consume it somehow. So, our CDM has to be ready to cover this.
The very first thing we did was analyze both databases (blog and tweets) to identify similarities. Because we’re dealing with two very simple data structures, this job was pretty easy to accomplish. As we expand the analysis to multiple systems in the future, it could get considerably more complex, though. But for now, let’s keep it simple.
Below you can find the definitions we were able to make for our still-small CDM.
- At this point, to prove the concept, we only need one Entity, which we will name “Comments”.
- Comments is going to hold only five attributes:
- [id][string]: Uniquely identifies the record within the Entity.
- [postAuthor][string]: Comment’s author name.
- [postDate][string]: Date when the comment was originally posted.
- [postMessage][string]: The comment itself.
- [postSentiment][string]: The sentiment related to the content of the current post, as returned by the Sentiment Analysis API.
- As we are keeping it simple for now, we are not going to create any Virtual Entities or Aggregated Attributes.
- We are going to need to leverage the CDM structural files listed in Figure 6 below. You can download those files here. These files are known as “schema documents”. They bring tons of definitions already settled by the Open Data Initiative (ODI), so you can either use them as-is or build on top of them if that makes sense.
- Following CDM’s directory layout pattern, locally (meaning, on a dev workstation) we’re going to adopt the following structure for our CDM:
- clickstream (that name can be whatever makes sense to you)
  - Import file, Manifest file, Entity files
- Foundation files
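To make the entity concrete before touching any JSON, here is the Comments entity sketched as a plain Python dataclass, one field per attribute defined above:

```python
from dataclasses import dataclass, fields

@dataclass
class Comment:
    """The "Comments" entity of our CDM, one field per attribute.

    Every attribute is a string, matching the definitions above
    (even postDate, which the PoC deliberately keeps as a string).
    """
    id: str
    postAuthor: str
    postDate: str
    postMessage: str
    postSentiment: str

c = Comment("1", "Jane Doe", "2019-05-10", "Loved it!", "positive")
print([f.name for f in fields(c)])
```

Both a blog comment and a tweet can be projected onto these five fields, which is exactly the point of a common model.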
Great! For now, we’re settled in terms of definitions and initial configurations for our local CDM modeling. If you have that in place already, you should be ready to move forward.
Next, we are going to need to create the following CDM files within the “clickstream” directory defined earlier:
Comments.cdm.json: this is the JSON file that holds the definition of our Comments entity. Please refer to the Gist below to see its definition.
Important considerations regarding the implementation above:
- Line 2: Points to the file “schema.cdm.json”, in CDM’s root folder. This is the file that sets up lots of definitions for CDM’s schema.
- Lines 4-11: Imports the files “foundations.cdm.json” and “meanings.cdm.json”. They are Corpus files used to determine data types and meanings for CDM’s entities and attributes.
- Lines 12-40: Attribute definitions. You can think of these attributes as columns of a table in a relational database.
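If you can’t open the Gist, the shape of such an entity file is roughly the following. This is an approximation built in Python purely for illustration; the key names follow the public CDM schema-document format and may not match the Gist line-for-line:

```python
import json

# Rough approximation of Comments.cdm.json -- the authoritative version
# is the Gist referenced above; key names here are illustrative only.
comments_entity = {
    "imports": [
        {"corpusPath": "/foundations.cdm.json"},
        {"corpusPath": "/meanings.cdm.json"},
    ],
    "definitions": [{
        "entityName": "Comments",
        "hasAttributes": [
            {"name": name, "dataType": "string"}
            for name in ("id", "postAuthor", "postDate",
                         "postMessage", "postSentiment")
        ],
    }],
}

print(json.dumps(comments_entity, indent=2))
```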
clickstream.manifest.cdm.json: this is the file that serves as the entry point to our CDM. It means that whatever storage service relies on this CDM definition will use this file to “discover” everything under the hood.
Important consideration regarding the implementation above:
- Lines 6-12: This is where we set up the entity we just created to be “discoverable”, following CDM’s pattern.
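Again for illustration only, a manifest along these lines declares the entity as a local, discoverable one. The exact key names are an approximation, not a copy of the Gist:

```python
import json

# Sketch of clickstream.manifest.cdm.json: the entry point that makes the
# Comments entity "discoverable". Key names are approximations of the
# actual file referenced in the Gist above.
clickstream_manifest = {
    "manifestName": "clickstream",
    "entities": [{
        "type": "LocalEntity",
        "entityName": "Comments",
        "entityPath": "Comments.cdm.json/Comments",
    }],
}

print(json.dumps(clickstream_manifest, indent=2))
```

The `entityPath` convention (`<file>.cdm.json/<Entity>`) is what lets a consuming service jump from the manifest to the entity definition.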
Done. That’s all we need for our PoC with CDM. Next, let’s validate what we did and see whether this CDM would be workable from any compatible Azure service.
Validating the CDM built
Wait, how do we validate whether a given CDM we’ve built is valid or not? That’s a great question indeed!
To help with that, Microsoft makes a Common Data Model Navigator tool available on GitHub. By leveraging it, you can validate and explore CDM structures, both those already made available online by the ODI and the ones you build locally (our case), by simply uploading them into the tool. So, this is what we’re going to do to validate our creation.
To make it happen, just browse to the link provided above and then, in the top-left menu, click the “Load from files…” button. It will allow you to select your CDM’s root directory (schemaDocuments).
Next, the tool asks you to confirm the upload of all the files under that directory. You must confirm to move forward (Figure 9).
Next, select the manifest file that describes the CDM we just built (/clickstream/clickstream.manifest.cdm.json). That action triggers the loading of the CDM’s internal elements into the visual explorer. From that moment on, you will be able to both validate and explore those elements (Figure 10).
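If you’d rather run a quick offline sanity check before reaching for the Navigator, a tiny helper of our own (not part of any SDK) can at least confirm that every entityPath declared in a manifest points at an existing file under the CDM root:

```python
import os
import tempfile

def unresolved_entity_paths(root: str, manifest: dict) -> list:
    """Return entityPath values whose backing .cdm.json file is missing.

    Paths follow the "<file>.cdm.json/<Entity>" convention, so we check
    that the file portion exists under the CDM root directory.
    """
    missing = []
    for entity in manifest.get("entities", []):
        file_part = entity["entityPath"].split("/")[0]
        if not os.path.isfile(os.path.join(root, file_part)):
            missing.append(entity["entityPath"])
    return missing

# Demo against a throwaway directory standing in for /clickstream.
with tempfile.TemporaryDirectory() as root:
    manifest = {"entities": [{"entityName": "Comments",
                              "entityPath": "Comments.cdm.json/Comments"}]}
    print(unresolved_entity_paths(root, manifest))  # file absent: reported
    open(os.path.join(root, "Comments.cdm.json"), "w").close()
    print(unresolved_entity_paths(root, manifest))  # file present: []
```

This doesn’t replace the Navigator’s schema-level validation; it only catches broken references between the manifest and the entity files.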
Done! CDM is ready.
I will stop here. There is a lot more to come when implementing this “in production”, and that is what we’re going to do in the third (and final) post of the series.
For now, I’ll leave you with what we have thus far, which is already a lot to digest, hoping that it will help you tie down the concepts we discussed in the first post of the series.