One of the key needs of customers implementing analytics solutions in the cloud is moving data between clouds. Indeed, most of the customers I've worked with, after learning about Azure's capabilities for data and analytics, want to bring data from their current provider (whether that is another public cloud, such as AWS, or an on-prem datacenter) into Azure's services to see how it works.
This post explores a solution for a Proof-of-Concept (POC) scenario brought by Americas University, a large university on an accelerated journey to the cloud. What we're going to explore here is only a small part of the entire proposed solution, which, like everything else in the cloud, is composed of lots of different pieces.
We're going to bring data from an existing AWS RDS database into Azure SQL Database by leveraging Azure Data Factory, a data integration service available in Azure. Let's dive in.
Today, Americas University is facing a serious problem with its data estate. Looking ahead at market trends, they have identified a couple of areas that matter to them (among others, personalized learning and proactive engagement with students driven by AI engines), but they cannot tackle them properly due to limitations they've mapped in their current data estate. They highlighted the following major limitations:
- Hundreds of different solutions (both developed internally and bought from external vendors), each with its own database.
- Siloed data: the applications mentioned in the bullet above barely interact with each other (except through some integration APIs developed internally).
- Security control is a huge challenge, as there are lots of different environments and stakeholders (employees and vendors' operators) tied to them.
- Different data engines (Oracle, SQL Server, MySQL/MariaDB, PostgreSQL, MongoDB, and more) make managing the data difficult and make data centralization harder to implement.
- Data is also spread across different environments, including public cloud providers and on-prem, which adds the complexity of a hybrid data model.
Looking for a way to overcome this situation with an enterprise data analytics solution that could centralize data, improve management, and provide unified data models to feed AI engines, we assessed the environment, gathered all the details, and came up with a solution proposal (please refer to Figure 1 for an architecture view), part of which must prove itself feasible through a POC.
The POC consists of proving to the customer that Azure Data Factory can bring data from AWS RDS (MySQL) into an Azure SQL Database (a data mart) in an automated way. A simple "copy data" operation would be enough for the customer to consider the POC successful.
The proposed solution
Figure 1 below showcases the ideal solution based upon the environmental assessment we went through with the customer.
Figure 1 is quite self-explanatory, but a couple of bullet points can help clarify it for those not familiar with the services involved:
- As mentioned earlier, the customer has a bunch of different data sources spread over different cloud providers (both public and private), and because of their new strategy of centralizing everything on Azure, we have to bring that data in. That's why we're going to need an extension of Azure Data Factory called the Self-hosted Integration Runtime.
- Databases sitting on Azure services will be accessed directly by Azure Data Factory.
- Azure Data Factory is going to land the ingested data in Azure Data Lake Storage Gen2. We need this because documents will be ingested later on as well.
- Then, Azure Databricks comes into play to process and normalize the different data arriving in Data Lake Storage. The normalized data is then sent to Azure Synapse Analytics (formerly SQL DW).
- Azure Synapse Analytics will allow them to run the high-performance queries they require on top of the normalized data.
- Finally, we're proposing Azure Analysis Services for the analytical data marts, serving data to Power BI clients, APIs, or any other consuming client.
There are several intrinsic details here, but since we're focusing on the POC side of things, meaning that in this post we'll only see how to prove to Americas University's technical teams that Data Factory is a good choice for handling the ETL part, I'm bypassing the details related to the other parts of the project.
POC: Bringing data into Azure
To prove the concept to Americas University's team, we need to show them that Azure Data Factory, together with another Azure service (in this case, Azure SQL Database), can drive a data migration strategy by bringing data in from a remote data source.
Their choice of data source for this POC? AWS RDS. Why? Because they have a bunch of solutions acquired from third-party companies that use RDS as the primary environment for their databases, so from their standpoint, the value of the POC shows when we prove we can bring that data into Azure in the first place. Makes sense, right?
Figure 2 below summarizes the implementation we are about to do for this POC.
The AWS environment and pre-work needed
The data pointed out by the customer for us to import resides in a single relational database based on AWS RDS MySQL, as you can see below.
As you probably know, a widely accepted best practice for databases residing on any public cloud platform is to keep them private, that is, to avoid exposing them to public access over the internet. Fortunately for Americas University, this database's deployment follows that pattern.
In AWS, the mechanism usually used to control access to resources like databases and virtual machines is called a "Security Group". Security Groups act much like a firewall: you configure both inbound and outbound rules, defining the protocol, port range, and source or destination for the traffic.
This means that, before doing anything on the Azure side, we had to work with both Americas University's IT team and the company that owns the database to allow us to connect to that particular database through an EC2 instance in the same VPC. You will soon understand why we need an EC2 instance connected to the database.
The result of that work was the addition of one new inbound entry on the security group protecting "contentsearchdb", as you can see in Figure 4.
Ok, great. Now our MySQL database in AWS is ready to accept connections from that particular subnet (10.0.1.0/28).
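It's worth sanity-checking that the private IP the EC2 instance will get actually falls inside the range the new rule allows. Python's standard `ipaddress` module makes that trivial; the subnet below is the one from the rule, while the instance IP is a hypothetical example (the post doesn't state the actual one):

```python
from ipaddress import ip_address, ip_network

# The inbound rule added to the security group (subnet from Figure 4).
allowed_subnet = ip_network("10.0.1.0/28")

# Hypothetical private IP for the EC2 instance (illustrative only).
ec2_ip = ip_address("10.0.1.4")

print(ec2_ip in allowed_subnet)                    # True: the rule covers this address
print(ip_address("10.0.2.4") in allowed_subnet)    # False: outside the /28
```

Remember that a /28 only covers 16 addresses (10.0.1.0 through 10.0.1.15), so the instance must land in that small range.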
Next, we had to add an additional entry to the security group that holds the EC2 instance, directing traffic to the subnet that holds the database; otherwise, the communication wouldn't be bi-directional as we need it to be. For that purpose, we added the outbound rule presented in Figure 5. As you can see, traffic back to the internet was already allowed.
Once we had the networking configuration settled, we were able to build the EC2 instance we needed. I'm not going to detail this process here because AWS's documentation does a nice job of explaining it, so please refer to that content to get this step done. In the end, we were able to see our EC2 instance up and running, both in the AWS Web Console and through an RDP connection, as presented in Figure 6.
We're going to come back to this VM soon, but as a next step, we have to get the Data Factory side of the story settled. So, let's dive into it.
Deploying Azure Data Factory
Next, we need to set up our Data Factory service in Azure. The easiest way to get this done is through the Azure portal UI, as described in this tutorial. So, let's switch over to the Azure portal to make it happen.
I went through this process myself, and after a couple of minutes waiting for the deployment to complete, I ended up viewing the new service instance's blade, as displayed in Figure 7.
Next, we have to deploy an extension of Azure Data Factory called the Self-hosted Integration Runtime, but before we do, let me explain what it is and why we need it to make the POC viable.
Sometimes, Azure Data Factory has to ingest data from sources sitting in remote places. By remote places, I mean somewhere outside Azure's datacenters, like an external cloud provider (AWS, GCP, and others) or even an on-prem datacenter.
Databases hosted in those places usually run under the datacenter's own network policies and restrictions on requests coming from the internet (we have already mentioned that this is a best practice, so nothing to complain about).
Thus, to be able to bring the data in, Azure Data Factory has to rely on an additional component (an agent) it can establish communication with, which acts as a bridge connecting the Azure side to the database provider's side (in our case, AWS RDS).
This is where the Self-hosted Integration Runtime comes into play. Basically, as the name suggests, it is nothing but a Data Factory runtime that can be installed and configured on a Windows virtual machine. Because the virtual machine bearing the runtime is deployed on the same network as the database, communication can be established both locally (at the database level) and remotely (at the Data Factory level). Please see Figure 8 (focusing on the bottom) for additional clarification of this concept.
That’s why we needed an EC2 instance provisioned and up to local communication within the database’s VPC in AWS.
Ok, let’s configure the self-host runtime. As you might imagine, there are two sides to be configured:
- The Azure side: where we tie the runtime to an existing Data Factory.
- The AWS side: where we tie the Azure side to the runtime running locally on the EC2 instance we created earlier.
Let’s start with the Azure side. As I mentioned, the pre-requisite to deploy a new instance of a self-host runtime in Azure is, obviously, having a Data Factory already in place. That’s why we deployed it in the first place.
We are now ready to provision our runtime, then. We have basically two ways to follow to overcome this: 1) Through Powershell; 2) Through the Data Factory UI; I’ll pick the first option just because it is quicker than going through the portal. Both the procedures are very well described in here. Figure 9 showcases the success of that deployment.
Now, we have to go back to the EC2 instance to finish the runtime configuration on that side, as described here. At the end of the process, we were able to see the screen presented in Figure 8, indicating that connectivity was successfully established.
Done. With this, Azure Data Factory can now access the MySQL database's schema and perform the requested operations on top of it. Time to get back to the Azure Data Factory UI to set up the last mile of our work here: copying data from RDS MySQL into a SQL Database.
Configuring a “Copy data” operation
If yourself ain’t aware of it, Azure Data Factory has a dedicated web console, throughout which operations and pipelines can be created and managed. To have access to it, from the Azure Data Factory’s overview blade within the Azure Portal, give a click on “Author & Monitor” option, available in the middle of the screen, as highlighted by Figure 9.
This action is going to head you towards Azure Data Factory Management Studio (Figure 10). From there, we’ll be able to create our copy routine.
We are going to copy some data from the original data source (RDS MySQL) into an Azure SQL Database, as described by the architecture exposed through Figure 2, right? So, what we have to do is give a click on the “Copy data” option, as you can see by observing the Figure 10.
Then, the tool is going to present you to a Copy data configuration blade with a series of steps where, the first one, would be an initial description of the copy activity that will be performed. The information I’ve used can be seen in Figure 11, below.
Next, I was asked to configure a linked service for the source database. This is where we point the Azure Data Factory engine to the database available in AWS. I selected MySQL as the database engine so the proper connectors would be picked (Figure 12).
Next, the tool asked me to set up the data source's connection properties. Figure 13 shows how this part ended up configured on my end. Please observe the piece of the configuration highlighted in red. That is where we tell Azure Data Factory that we're going to take advantage of the self-hosted integration runtime already set up; in my case, it is called "au-selfhosted-ir". Thanks to this piece, and to the credentials provided, the Data Factory service will be able to communicate with the MySQL database sitting on AWS.
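Behind the wizard's UI, this step produces a linked service definition in JSON. A rough sketch of its shape, written here as a Python dict (the structure follows Data Factory's linked service schema as I understand it; server and credential values are placeholders, and the linked service name is invented for illustration):

```python
# Sketch of the linked service JSON the "Copy data" wizard generates behind
# the scenes. Server/credentials below are placeholders, not real values.
linked_service = {
    "name": "MySqlSourceLinkedService",  # hypothetical name
    "properties": {
        "type": "MySql",
        "typeProperties": {
            "connectionString": "server=<rds-endpoint>;port=3306;database=contentsearchdb;uid=<user>;pwd=<password>"
        },
        # This is the part highlighted in red in Figure 13: it routes the
        # connection through the self-hosted integration runtime.
        "connectVia": {
            "referenceName": "au-selfhosted-ir",
            "type": "IntegrationRuntimeReference"
        }
    }
}

print(linked_service["properties"]["connectVia"]["referenceName"])
```

The `connectVia` block is the key piece: without it, Data Factory would try to reach the database directly from Azure and fail, since the RDS instance is private.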
The tool will then look into the objects available in the data source and give you the chance to select what to bring. In this particular case, there is only one table, called "metadata", so we're going to select it.
Now, we go through a very similar configuration process, but for the destination. First, we select the engine we are going to move the data to. In our case, Azure SQL Database is the pick (Figure 15).
Then, the tool will give you the chance to map objects already existing in the target environment to the data source. Since this was a first move, I created a table called "Metadata" in the Azure SQL Database instance in advance, so that when we reached this point, we were ready to proceed. As Figure 16 shows, the tool already knows the source object. All you have to do is select the destination object and click "Next".
The tool then gives you the chance to go through a column mapping process. Because I took care to create a destination table that respects the original data types, the tool automatically brings a suggested mapping, which you can accept or adjust. I made some small adjustments and ended up with the mapping shown in Figure 17.
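To give an idea of what creating that destination table "in advance" involves, here is a hypothetical helper illustrating the kind of MySQL-to-Azure-SQL type correspondences you'd aim for. This is not Data Factory's internal mapping table, just a sketch of sensible pairings:

```python
# Illustrative MySQL -> Azure SQL (T-SQL) type pairings; NOT an official table.
MYSQL_TO_AZURE_SQL = {
    "int": "int",
    "bigint": "bigint",
    "varchar": "nvarchar",
    "text": "nvarchar(max)",
    "datetime": "datetime2",
    "decimal": "decimal",
}

def suggest_type(mysql_type: str) -> str:
    """Return a suggested Azure SQL type for a MySQL column type."""
    return MYSQL_TO_AZURE_SQL.get(mysql_type.lower(), "nvarchar(max)")  # safe fallback

print(suggest_type("VARCHAR"))   # nvarchar
print(suggest_type("datetime"))  # datetime2
```

Matching types up front is what lets the wizard propose a clean one-to-one mapping instead of forcing conversions during the copy.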
Now, we go through an important final configuration piece. Basically, we are asked to define both the "Fault tolerance" and "Performance" behaviors for the copy process. The settings I chose can be seen in Figure 18, below.
Next, the system will present you with a summary of the configuration that has been done (Figure 19). If everything looks good, all you have to do is click "Next". If you notice something wrong with the configuration, you can always edit the sections by clicking the "Edit" options placed at the top right of each section.
Once you click "Next" in the previous step, the system will start copying data into the new database. As showcased by Figures 20 and 21, everything worked fine with this configuration process (at least, that is what the tool is showing me).
If you want to double-check the success of this operation by seeing the data sitting in the new database, just go to the new database context and verify it. I did that, and the result was the one presented in Figure 22.
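Beyond eyeballing the rows, a simple programmatic double-check is to compare row counts and a content fingerprint between source and destination. A database-agnostic sketch of that idea, where the inputs are assumed to be lists of row tuples fetched with whatever client you prefer (e.g., PyMySQL on the source, pyodbc on the destination):

```python
import hashlib

def fingerprint(rows):
    """Order-insensitive fingerprint of a result set (a list of row tuples)."""
    digest = hashlib.sha256()
    for row in sorted(repr(r) for r in rows):
        digest.update(row.encode("utf-8"))
    return digest.hexdigest()

def copy_verified(source_rows, dest_rows):
    """True when both the row count and the content fingerprint match."""
    return (len(source_rows) == len(dest_rows)
            and fingerprint(source_rows) == fingerprint(dest_rows))
```

Sorting before hashing makes the check independent of row order, which matters because the copy doesn't guarantee insertion order matches the source.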
Great! Concept proved and the global project can now move on.
There are some important aspects to highlight here. The major ones are listed below.
- Azure Data Factory as a solution for data integration: every time I've been involved in a data project like this, Azure Data Factory has successfully played a central role. The fact that it can be extended to connect to external data sources through integration runtimes, plus the high availability of the service provided by Azure, makes it a viable solution. It also brings native integration with most of the main data engines in the market, which adds flexibility to the entire process.
- Cost: as with everything in cloud computing, Azure Data Factory can be very cost-effective if correctly configured. It is important to have a clear view of data volumes, data traffic, and so on up front. That will allow you to configure the tool so it doesn't process data inefficiently.
- Performance: Data Factory has proven itself a very performant tool. In this POC, we were able to move 105,827 rows between cloud providers in under a minute using a basic configuration. There is an option available to increase the number of parallel processes and boost performance for large data-movement scenarios.
- Integration with Azure services: this is another important aspect, as Data Factory integrates easily with other Azure services. You could use it to ingest data and drop it off in Azure Data Lake, Blob Storage, Databricks, Azure Machine Learning Studio, and beyond. It simplifies (a lot) the work of moving data around.
- Distributed big data architectures not only viable, but simplified: solutions like Azure Data Factory simplify the process of creating distributed big data architectures, as the data can sit anywhere and still be brought together into Azure effortlessly.
- Appealing for education: data centralization and management is one of the great pain points in education today. Azure Data Factory can drastically simplify the process of ingesting, normalizing, and even processing data. I strongly recommend looking into it.
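On the performance point above: the parallelism knob is the copy activity's `parallelCopies` setting. A rough sketch of where it sits in the activity definition, written as a Python dict (the activity name is invented, and the numeric value is illustrative, not a recommendation):

```python
# Sketch of a copy activity definition with explicit parallelism.
# "CopyMetadataTable" and the value 8 are illustrative choices.
copy_activity = {
    "name": "CopyMetadataTable",
    "type": "Copy",
    "typeProperties": {
        "source": {"type": "MySqlSource"},
        "sink": {"type": "AzureSqlSink"},
        # Raise this to run more parallel copy processes for large moves;
        # by default, Data Factory picks a value on its own.
        "parallelCopies": 8,
    },
}
```

For a single small table like this POC's, the default was already enough; tuning matters once volumes grow.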
Hope it helps!