Recently, I wrote a blog post here featuring the rationale Ernani Martins (a fellow Customer Engineer at Google Cloud) and I followed to modernize Moodle, a 20-year-old application widely used across the world.
Today, I would like to present and discuss some results we gathered after stress-testing a generic Moodle environment that implements the architecture proposed by Modern Moodle on Google Cloud. The tests were set up, observed and collected by Calriz, a Google Cloud partner in Brazil, in partnership with Ernani and me, and then shared with Google Cloud.
Most of the Moodle environments we've seen in Latin America (over 50%) are mild, meaning they have up to 5K simultaneous users consuming resources at a given time. Usually, that corresponds to 15K to 20K users in total. There are bigger scenarios, obviously (for instance, a single customer in Brazil hosts over 1 million students in Moodle, with several customizations and much higher concurrency), but they're just not that common.
That's why we decided to concentrate the initial set of tests on 1K simultaneous users first and then evolve the test to 5K simultaneous users. By doing so, we make sure the solution covers most of the scenarios in the region.
We're in the process of testing the architecture for bigger scenarios as well, and I will share the results when they're available, but I can tell you right now: Kubernetes elasticity and the scalability of Google Cloud services are playing very well together.
Creating the test plan
To perform the tests, we followed Moodle's recommendation and took advantage of a utility the tool offers for that purpose. It is accessible both through the command line and the web interface, as depicted by Figure 1.
Next, we selected the size of the test we wanted to run (in our case, 1,000 simultaneous users over 6 loops with a ramp-up rate of 100), then a target course, "Becoming a cloud engineer" (a full course), and finally created a new test plan by hitting the "Create test plan" button, as presented by Figure 2.
Moodle's test plan leverages JMeter, a very popular open source tool for performance testing. If this is the first time you're reading or hearing about it, I strongly recommend taking a look at the content provided by the link above.
The action depicted by Figure 2 generates a test file that can be imported into JMeter's interface and run from there. In case you want to play with JMeter or even reproduce this test, a tutorial describing how to install JMeter on your local machine is available here.
In order for the test plan tool to work, you need to set $CFG->tool_generator_users_password in Moodle's config.php. It was an adjustment we needed to make at the container level.
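For reference, the change is a one-line addition to config.php; the password value below is a placeholder, not the one used in the test:

```php
<?php
// config.php (excerpt) — enables Moodle's test plan generator tool.
// Replace the placeholder with your own secret value.
$CFG->tool_generator_users_password = 'replace-with-a-strong-password';
```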
The procedure outlined in this section was then repeated to generate the test plan for 5K simultaneous users. Moodle's test plan generator only supports up to 10K simultaneous users; however, you can update JMeter's test file yourself for a bigger scenario if you would like.
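If you want to go beyond the generator's 10K limit, the user count lives in the `.jmx` file's ThreadGroup element. A minimal sketch of that edit (the element and property names follow JMeter's standard `.jmx` schema; the file paths are hypothetical):

```python
import xml.etree.ElementTree as ET

def set_thread_count(jmx_path: str, out_path: str, users: int) -> None:
    """Rewrite the ThreadGroup user count in a JMeter test plan file."""
    tree = ET.parse(jmx_path)
    # JMeter stores the number of virtual users as a stringProp
    # named "ThreadGroup.num_threads" inside each ThreadGroup.
    for prop in tree.iter("stringProp"):
        if prop.get("name") == "ThreadGroup.num_threads":
            prop.text = str(users)
    tree.write(out_path, encoding="utf-8", xml_declaration=True)

# Example (hypothetical paths):
# set_thread_count("moodle_testplan.jmx", "moodle_testplan_15k.jmx", 15000)
```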
As mentioned earlier, we performed the test against a Moodle environment implemented according to the architecture described by Figure 3.
Below you can find the infrastructure we used to perform those tests. Figure 4 showcases the infrastructure at its peak.
- Private Google Kubernetes Engine (GKE) cluster with a minimum of 3 and a maximum of 24 nodes on n2-standard-2 virtual machines, with a limit of 256 simultaneous pods (1 core + 1 GB of memory each).
- Cloud SQL instance with up to 8 vCPUs, 32 GB of RAM and 10 GB of storage.
- Filestore instance with 2.5 TiB of storage in the Basic SSD tier.
- Memorystore for Redis instance with 2 GB of RAM.
Results for 1K simultaneous users
First, we wanted to identify the maximum load a pod would be able to support with the configuration defined above (1 core + 1 GB RAM) so that we could extrapolate that number to predict the infrastructure needed for future scenarios.
Upon execution of the test, we learned that a single pod can hold up to 50 concurrent users (it varies a bit depending on the type of operation being executed).
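Under that assumption, the theoretical pod count for a given concurrency is a simple ceiling division. A sketch (the 50-users-per-pod figure is the empirical value observed above, not a guarantee; real runs allocate a bit more, as shown next):

```python
import math

USERS_PER_POD = 50  # empirical capacity observed for a 1-core / 1 GB pod

def pods_needed(concurrent_users: int,
                users_per_pod: int = USERS_PER_POD) -> int:
    """Estimate the minimum number of pods for a concurrency level."""
    return math.ceil(concurrent_users / users_per_pod)

print(pods_needed(1000))   # → 20 (theoretical minimum for the 1K test)
print(pods_needed(5000))   # → 100
```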
The pod autoscaling process triggered automatically when memory usage exceeded 80% utilization (a recommendation aimed at preventing slowness from spreading across the platform), therefore behaving well and respecting the autoscaling rules previously defined. Figure 5 shows peaks for the pods.
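The 80% rule can be expressed as a standard Kubernetes HorizontalPodAutoscaler. The manifest below is a sketch under assumed names (`moodle` deployment, `moodle-hpa`), not the exact resource used in the test:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: moodle-hpa          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: moodle            # hypothetical deployment name
  minReplicas: 1
  maxReplicas: 256          # matches the cluster's pod limit above
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```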
For 1,000 concurrent users, 24 pods with the same configuration were allocated to support the demand instead of the 20 predicted (following the variation trend mentioned early on). Figure 6 shows CPU utilization (averaging 40%) for Moodle's pods under stress.
Figure 7 shows memory usage (varying between 50% and 60%).
Figure 8 illustrates Cloud SQL memory utilization (pretty stable around 3.5 GiB – 3.7 GiB).
Overall, it was pleasant to see the application behaving very well with 24 pods and the autoscaling rule set to 80% (for both CPU and memory usage). We observed an extremely high rate of successful requests (98.02%) against 1.92% of failures. Among the errors, the largest share (over 20%) is related to loading Moodle's initial page (probably due to the large amount of asynchronous calls performed by jQuery under the hood).
The initial page also showed the highest response times. Login and logout presented some slowness as well, as evidenced by Figure 9: yellow refers to the initial page, green to the logout process and dark yellow to the login process.
Results for 5K simultaneous users
Pod-wise, the test for 5K simultaneous users followed the trend observed for 1K, meaning both CPU and memory utilization remained consistent over time. For 5K users we had an average of 120 concurrent pods under the same autoscaling rules established for 1K.
Cloud SQL didn't demand any changes to its infrastructure. We ended up seeing that it could actually be one tier lower and still support both scenarios: its memory utilization never exceeded 12 GiB, as you can see in Figure 10, and its CPU utilization never exceeded 5%.
We noticed, though, that Memorystore for Redis demanded more processing power. It hit 100% CPU utilization several times and stayed around 1.6 GiB of memory utilization thanks to an eviction policy we set up during infrastructure deployment. Figure 11 shows the evidence.
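As an illustration of that kind of eviction policy, an LRU policy can be applied to a Memorystore for Redis instance through its configuration; the instance name and region below are placeholders, and this is not necessarily the exact policy used in the test:

```shell
# Sketch: apply an LRU eviction policy to a Memorystore for Redis instance.
gcloud redis instances update moodle-cache \
    --region=us-central1 \
    --update-redis-config maxmemory-policy=allkeys-lru
```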
In response to that, we scaled Memorystore for Redis up to the next available tier. As of today, that's the recommended approach for those scenarios.
We are working to enable separate read and write replicas (a PHP configuration change is needed) as an alternative solution in the future, but it is not available yet.
The adjustment allowed the test to succeed, ending with an incredibly high success rate for requests: 99.84%, with only 0.16% failing.
The scenario also showcased a couple of additional aspects:
- Response time for requests rarely exceeded 3000 milliseconds.
- Throughput averaged 20 requests per second.
- Moodle's official throughput recommendation is 120 requests per minute. The current architecture delivered 1,200 requests per minute, 10x that recommendation.
- We observed a number of cold starts for the pods, due to the limitation of physical resources in the cluster. Ideally, clusters would need to be scaled either horizontally or vertically to avoid those situations.
- With 5K users it became clear that forum utilization demands a lot from the infrastructure.
As stated earlier, we keep creating new test scenarios to guarantee everything works properly under different circumstances. However, the results above already allow us to extrapolate numbers and predict eventual infrastructure demands.
For instance, under normal conditions, for a 10K-simultaneous-user scenario, based on the above, we would need around 225 concurrent pods (Figure 12) with autoscaling rules set up with an 80% threshold.
With that, it is also possible to predict cost and establish a cost per user per hour (imagine you are charging your customer on a "per user" basis). You know the number of users supported per pod, you know the total count of users, and you have a view of the infrastructure cost. Isn't it only a matter of doing some math? That's governance serving cost effectiveness.
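That math can be sketched as follows. All prices here are made-up placeholders for illustration, not Google Cloud list prices:

```python
import math

def cost_per_user_hour(concurrent_users: int,
                       users_per_pod: int,
                       pod_cost_per_hour: float,
                       fixed_cost_per_hour: float = 0.0) -> float:
    """Estimate infrastructure cost per concurrent user per hour.

    fixed_cost_per_hour covers components that don't scale per pod
    (database, cache, shared storage).
    """
    pods = math.ceil(concurrent_users / users_per_pod)
    total = pods * pod_cost_per_hour + fixed_cost_per_hour
    return total / concurrent_users

# Hypothetical numbers: 5K users, 50 users/pod, $0.05 per pod-hour,
# $2.00/hour for database, cache and storage combined.
print(round(cost_per_user_hour(5000, 50, 0.05, 2.00), 6))  # → 0.0014
```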
The performance tests we ran against the proposed architecture ended up providing very useful insights about its current implementation model (which is very satisfactory from a performance perspective).
We continue to work to make the architecture even more effective moving forward so that more institutions can benefit from it. If you want to contribute and help us build the best environment for hosting Moodle out there, hit the solution repository on GitHub and send over your pull request. We'll be happy to review and accept it.
That’s it. Cheers!