In 2018 I completed 16 years of working professionally with technology. It is a considerable amount of time, isn't it? Starting in 2010, I had the opportunity to work directly and indirectly on a large number of projects involving Azure. Most of them? App migrations from on-prem to the cloud (we're still there, but that's a conversation for another moment). What I couldn't imagine at that point was that all the experience accruing until then would become extremely useful in one of the most challenging migration projects I have ever worked on. This post is all about sharing the experience I had leading that project.
I was working as a Senior Technical Evangelist at Microsoft in Brazil when I was assigned to help a big customer (the Secretary of Education of Sao Paulo, SEDUC) migrate their most important application from their local datacenter up to Azure.
The project was all about moving a massive web application which, at that point, served 4 million unique users a day, from a local government data center up to Azure. The application was written in .NET (to be more specific, ASP.NET MVC 4 with C#) and, at that point, the application's data was sitting on top of an Oracle RAC.
The system (which was built from scratch by SEDUC's technology department) was designed to cover all the main aspects of the academic, pedagogical, and operational areas inside the public schools all over Sao Paulo State in Brazil, therefore becoming a complete and, at the same time, complex ERP solution. Some of their goals with that tool include:
- Simplify, optimize, and improve the agility of administrative procedures within the institutions.
- Enable parents to follow their kids' school life in many different aspects and, at the same time, measure their learning progress.
- Promote digital inclusion.
- Generate a unique knowledge base to collect insights by analyzing different parameters of the teaching/learning approach adopted, using it as critical mass for strategic decisions going forward.
- Make the process of generating official documents more agile and straightforward for students and their relatives.
Some numbers about the environment at the time we started assessing it in order to design the ideal environment on Azure:
- Five million unique hits a day.
- Seventeen million logins a year.
- Twenty-seven thousand schools actively using the system on a daily basis.
- Ten million students registered, with data kept since 1997.
Why did they move?
There were three big reasons for the change, featured below.
1. The local datacenter wasn't able to provide the cloud resources required by the application.
Because the data center the application was running in at that point wasn't ready to offer the at-scale capabilities required by mission-critical applications (like auto-scaling, disaster recovery, fault tolerance, and beyond) to afford the high demand often faced by the application, lots of virtual machines had to be manually allocated to overcome that. Tons of problems started to arise as a result. Just to mention some of them:
- Frequent outages caused by not having automation in place to support the application's demand.
- Environments often left over-provisioned after peaks (also a consequence of not having automation implemented).
- No automation at all.
- Missing technical expertise on the DC's technical team to build the very specialized environments needed to extract the best of the technology stack being used.
- A backup and disaster recovery strategy based entirely on manual procedures.
- Relatively frequent data loss, since the environment couldn't store the enormous amount of data generated by the application out of the box.
2. Extremely high costs
The amount of money Seduc spent every year to keep that challenging environment up and running was huge, really huge. I can't give you the exact number here for compliance reasons, but I can easily tell you that it was in the dozens of millions of reais (the Brazilian currency; using the conversion rate to the US dollar back then, around US$ 1.00 = R$ 2.80). Putting it in perspective, we're talking about a governmental institution, where the way resources are consumed is extremely relevant, which makes it an even more sensitive aspect.
I still remember one of the first talks I had with Seduc's CIO, where he said: "Fabricio, this new architecture on Azure needs to be efficient, robust and, foremost, extremely cost-effective. I won't be completely happy with this project if I pay more than 50% of what I pay today." What a challenge!
SEDUC had a huge (and, in my opinion, very noble) goal in targeting aggressive savings; they wanted to redirect the money saved toward accelerating innovation on top of their platform. They wanted to implement AI, for example, to better predict important pedagogical aspects and student behavior, and to get there, they would need to invest both in technology and in the specialized skills to accomplish that. They were also looking for optimization within the environment (implementing DevOps, for instance) because they knew it could free up developers' time to deliver more business value rather than spending too much time fixing errors in production.
How did we assess the environment to identify challenges and make technical decisions?
There is no single formula for migrating an application/environment to the cloud. There are so many variables tied to each case that every migration process differs from the next. Still, experience working with the cloud over time shows that the best way to figure out what a given migration looks like is to have an accurate conversation with the right people on the customer's side. So that's what we did.
As a result of that interview process, we were able to understand, among many other details, what kind of migration we were talking about, and also to map out everything about the environment's complexities related to it. Besides that, we were able to accurately track the engineering effort (code work) needed to make the application meet the business requirements.
Some questions I often bring with me, and which I always try to raise in conversations with a project's stakeholders, are listed below.
- Is it a new application being developed (or to be developed), or an existing one? The answer to this question will give you an idea of how to approach the solution. Generally speaking, applications still in the development process are good candidates for PaaS, so the answer gives you insight into which platform-as-a-service offerings could support the app in production. If the answer is "an existing application", you'll need to go deeper and collect more details prior to suggesting something. SEDUC's project was an existing app, so we needed to go further.
- What is the technology stack sustaining the application? Can you describe it, including framework versions? Is it written in Java? C#? Python? Node? Ruby? Having a clear vision of the set of core technologies of the application being migrated is crucial. The answer to this question, combined with the one provided to question 1, will give you a good idea of the environment itself and of what kinds of offers you could leverage in an architectural proposal. For SEDUC, the answer was ASP.NET MVC 4 + C# + HTML5 + CSS 3 + Oracle RAC, plus some SOAP and REST APIs also written in C#.
- Does the application have environment customizations? For instance, is the application tied to a specific version of IIS, Apache, or the operating system? From a cloud migration standpoint, it is essential to understand these details of the application being migrated, because critical constraints can surface in such a way that they completely change how you were planning to build the architecture in the cloud. Back to SEDUC's project: we identified some significant constraints which made us slightly change the initial approach from 100% PaaS to a mix of PaaS and IaaS to meet the business rules.
- Does the database run on a dedicated server? Does it take advantage of any feature available only in particular versions of the product? It is pretty common to find scenarios where the database takes advantage of features available only in a particular version of a product, so it is essential to understand the way the customer has built that database. For instance: any MySQL database running on-prem or in another cloud is supposed to be runnable on Azure Database for MySQL, right? Not necessarily. At this point (and I'm pretty sure this will change shortly), Azure Database for MySQL doesn't support the MyISAM storage engine, which means that objects inside a database built on top of that engine will not work. To make it happen, you would need to convert from MyISAM to InnoDB, which is natively supported by the Azure service. As mentioned before, the application we were migrating relied on Oracle RAC as its data repository, with a bunch of queries specifically designed to extract the best performance from the RAC, and we needed to rewrite the whole thing for SQL Server on Azure.
- Does the application require a static IP to work properly? Here is a potential deal breaker for migrations that intend to rely 100% on PaaS services. Why? Because most PaaS services delivered by public clouds have DNS names attached to them rather than fixed public IPs. This usually happens because these services are dynamically allocated and managed by the cloud, and since that management is internal, the cloud does the IP/DNS assignments dynamically. The potential deal breaker resides in the fact that tons of applications run under strict firewall rules, where access is usually allowed only from specific IPs. Of course, there are several different ways to approach this, but it remains a potential risk factor for PaaS migrations. It was one of the constraints we found while talking with the customer on SEDUC's project.
- Does the application/service need to be auto-scalable, or would you scale up and down manually? Here is another essential aspect to map. Depending on how the application's users behave, you might need the ability to scale the underlying infrastructure (or at least parts of it) automatically. Usually, applications that require this receive random access, with no distinct load pattern; because you never know when your users will flood your infrastructure, you need to be prepared to go up and down automatically. On the other hand, applications with a familiar access pattern can opt to scale resources manually (although doing it automatically is possible as well). Back to SEDUC: we decided to automatically scale part of the infrastructure; for the other components, we kept manual management.
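To illustrate the MyISAM point above: the conversion itself is typically a per-table DDL change. A minimal sketch follows (the schema and table names are made up for illustration; always rehearse the conversion on a copy of the database first):

```sql
-- Find MyISAM tables in a database named `app_db` (name assumed for illustration).
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'app_db'
  AND engine = 'MyISAM';

-- Convert one table (here `student_records`, a made-up name) to InnoDB in place.
ALTER TABLE app_db.student_records ENGINE = InnoDB;
```

Keep in mind the conversion can surface engine-specific details (full-text indexes on older MySQL versions, for example), so it deserves testing rather than a blind batch run.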
Other helpful questions (which we also used in our conversations with SEDUC) would be:
- Is there any backup/DR strategy currently in place? How does it work?
- Does the application rely on a file server? If yes, does it use NFS? SMB? Something else?
- Does the application require a considerable amount of IOPS for data storage?
- Does the application rely on only one database, or multiple?
- What about deployment? How does your team deploy updates to the application's environment?
- Does the application use a CDN? Which one?
- Does the application use any cache layer (like Memcached, Redis, and so forth)?
- Does the application need to not be exposed to the internet?
- Is there any security standard the application needs to comply with (PCI, for instance)?
Migrations of mission-critical applications are usually not simple. Sometimes they are, but based on what I have seen so far, that is very rare. Especially when it comes to moving something from bare metal (which was the case for the Oracle RAC) to virtual machines on top of a public cloud like Microsoft Azure.
Below you'll find the featured (but not all of the) challenges we found at that point, after extensive conversations with the customer's technical teams. Please note they weren't only technical challenges.
- A four-terabyte database sitting on top of Oracle RAC, with nearly five thousand queries specifically written to extract the best of the RAC. It was a challenge because we were not moving Oracle to Oracle: we were asked to migrate the database from Oracle RAC to SQL Server, which necessarily meant rewriting the whole thing.
- Applications querying the database directly and ad hoc.
- An external job was doing a daily "data load", automatically bringing data from the mainframe into the RAC.
- An extremely customized environment, with wide usage of extensions to support the application's routines.
- There was no file storage service in place. All the files arriving in the application were stored locally, meaning at the application server level.
- No cache, no CDN, no scaling strategy.
- A monolithic ASP.NET MVC 4 application which wasn't designed to be hosted in a distributed environment (meaning auto-scalable, fault-tolerant, load-balanced, and so forth). Some engineering effort would have to be applied to get it "broken" apart.
- ASP.NET sessions spread all over the application were persisting important information (about users, institutions, operational procedures, and so forth) at the server level, which made the application stateful. We would need to add some engineering effort on top of it.
- At that point, Azure only offered the classic deployment model, which made the process of distributing the application across cloud services a bit more complex, due to the limitations bound to that model.
- Both the application and its data needed to be placed in the Brazil region which, back then, was very limited in the variety of resources available.
- The whole migration had to be done within 90 days. The agreement with the old provider was about to expire, and the decision to go to Azure had been made only a few months before the project started, which meant that, if the system weren't migrated to Azure within that window of opportunity, it would go down for lack of an environment to run in, critically impacting SEDUC's reputation as a technology provider, since all the schools in Sao Paulo state relied on that application to get the enrollment process done.
- We only had October, November, and December to effectively get the migration done and, as you may know, there is the Christmas holiday at the end of December, which meant we would lose some days.
- We had to build an environment on Azure that would be cost-effective and robust and, at the same time, offer satisfactory performance.
Finally, after several days going deep into the application's architecture and working model, we were able to establish some certainties and, based on them, define the migration approach and execution flow.
High level conclusions about the application
- Overall, a good architecture, but several adjustments would need to be made to get the application as scalable and distributed as it was supposed to be.
- By writing some code (engineering effort), Storage Blobs and Storage Queues would be viable.
- By writing some code, Azure Redis Cache would be viable.
- Because the application was strongly tied to a specific version of the web container (IIS 8) and its features, and because we wouldn't have enough time to change the application to be more generic, Web Apps were not an option at that moment.
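To give an idea of what "writing some code" for Storage Blobs meant in practice, here is a minimal sketch using the classic WindowsAzure.Storage SDK that was current at the time. This is not the project's actual code; the helper class and container name are made up for illustration:

```csharp
using System.IO;
using Microsoft.WindowsAzure.Storage;       // classic WindowsAzure.Storage NuGet package
using Microsoft.WindowsAzure.Storage.Blob;

public static class BlobUploadSketch // hypothetical helper, not SEDUC's code
{
    public static void Upload(string connectionString, string localFilePath)
    {
        CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
        CloudBlobClient client = account.CreateCloudBlobClient();

        // "uploads" is an assumed container name for illustration.
        CloudBlobContainer container = client.GetContainerReference("uploads");
        container.CreateIfNotExists();

        // Store the file as a block blob instead of writing it to the local disk.
        CloudBlockBlob blob = container.GetBlockBlobReference(Path.GetFileName(localFilePath));
        using (FileStream stream = File.OpenRead(localFilePath))
        {
            blob.UploadFromStream(stream);
        }
    }
}
```

The point of the change is exactly what the assessment uncovered: files stop living at the application server level, so any instance behind the load balancer can serve them.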
High level conclusions about the database
- Because there were lots of integrations between the Oracle RAC and SEDUC's mainframe, we decided to go ahead and create a worker role (under the classic resources) to do that work for us.
- Also, to make the migration from Oracle to SQL Server doable in a short period, we agreed that each schema on the Oracle RAC would arrive in SQL Server as a new database.
- Because SEDUC needed 100% control over the database environment and, at that point, Azure SQL didn't support some native features of on-prem SQL Server, we decided to build a SQL Server cluster on Azure on top of virtual machines with SSD disks.
A team of thirty people (across SEDUC's technical team, the partner's team, and Microsoft's team) worked together to get the migration done, and I personally had the opportunity to lead it.
We started by redoing the database. Basically, the database specialists recreated an Oracle environment on Azure to move the database off the Oracle RAC and have a "working environment". From there, they were able to analyze every single query designed for the RAC and started transforming them in a dev/test environment with SQL Server.
In parallel, the development team started looking for every single part of the application where changes would be needed. Primarily, they were looking for portions of code generating sessions in the application server's memory and replacing them with code that moved those sessions to Azure Redis Cache. Beyond that, they were wrapping up the whole upload process and moving it to Blob Storage.
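The session work can be pictured with the kind of configuration the Microsoft.Web.RedisSessionStateProvider NuGet package documents: instead of the default in-memory provider, ASP.NET is pointed at a custom session state provider backed by Azure Redis Cache. The host name and key below are placeholders, not SEDUC's real values:

```xml
<!-- web.config fragment (placeholder values for illustration) -->
<system.web>
  <sessionState mode="Custom" customProvider="RedisSessionProvider">
    <providers>
      <add name="RedisSessionProvider"
           type="Microsoft.Web.Redis.RedisSessionStateProvider"
           host="your-cache.redis.cache.windows.net"
           accessKey="YOUR_ACCESS_KEY"
           ssl="true" />
    </providers>
  </sessionState>
</system.web>
```

With the provider externalized like this, any web node can pick up a user's session, which is what makes the application stateless enough to load-balance and scale.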
At the same time, database specialists and developers worked in partnership to redo the queries within the application, since they had changed completely to support SQL Server.
What hard work! Fortunately, after 90 days we got the application up and running on Azure. At that point, some issues were identified, and the team kept up the hard work to get everything fixed.
After a couple of months, the application was running correctly, performing very well, and doing so cost-effectively. The final version of the application (which is currently running on Azure) is about 70% less expensive than its on-prem version.
How did this project evolve?
As I mentioned earlier, the migration to Azure was the first wave. A second wave started immediately after (meaning weeks after) the move was completed. The second wave comprised:
- Application’s refactoring to run on top of Web Apps (App Service Environment).
- Database refactoring to support Azure SQL Databases.
- Implementation of continuous integration and continuous deployment for both dev and production environments through Visual Studio Team Services (VSTS).
- Queries performance tuning.
- Application code quality measurement using SonarQube.
Because this project was so successful, SEDUC has since moved lots of other projects to Azure and is nowadays one of the most important education customers in Latin America.
The migration itself has become a case study for Microsoft and is currently available through the link below (in PT-BR). Enjoy!
Impact to the customer
Projects like this always bring tons of learning and, of course, impact for everyone involved with them, but especially for the customer, so from now on I want to use this section to talk a bit about the impact this migration generated within SEDUC.
First, the migration taught SEDUC's internal teams that everything is possible in terms of technology if you are willing to make it happen. If you think carefully about it, this migration seemed unlikely at the beginning. Why? The application was running on bare metal servers, with its data sitting on an Oracle RAC. How could we beat that, from a performance/efficiency standpoint, with virtual servers in a public cloud? The answer? Commitment to optimization. Of course, there was a lot of work involved here; however, we made it work thanks to the commitment of the technical teams. Also, every single aspect of the Azure services was exhausted to provide the best performance available and, from that moment on, SEDUC's team was pretty confident that everything they wanted could be done on Azure.
From a business perspective, the project showed SEDUC that Microsoft could be the right partner to support them on their Digital Transformation journey (and it was just the beginning). Later on, SEDUC moved more than 100 applications to Azure, including 90% of those which used to run only on mainframes. This movement made SEDUC one of the most relevant Azure customers in Latin America.
Overall, because SEDUC moved almost 100% of their applications to Azure, they nowadays spend at least 50% less on the infrastructure side compared to when they hosted everything themselves and, beyond that, they are much more effective thanks to the automation implemented on Azure.
Pretty cool, huh? I'm so glad to have been part of it and, realistically, to have witnessed how a well-planned and well-executed cloud project can make a customer's life better and more agile.