
Smart Scaling for Web Traffic Surges

During the Little League® World Series, our site cannot go down.

This is an unequivocal truth for the Little League® team. Site outages for a major organization are never good—but if they happen during your shining moments, when millions of eyes are on you? Disastrous.

Every year millions of viewers are drawn to the Little League World Series through multiple channels, including Little League’s website. While the main Little League site has consistent traffic, the World Series section of the site can see a surge of over one million page views in a single day. And their existing hosting was buckling under the load.

So how do you ensure that network infrastructure can scale to accommodate a traffic tsunami?

First, what we didn’t do: provision larger servers that Little League didn’t want or need to pay for and maintain for 11 months of the year. While over-provisioning servers is the age-old way of handling predictable traffic surges, we much prefer the smart and scalable way.

We’ll walk you through how to leverage the Amazon Web Services (AWS) cloud stack to create a system that scales to handle traffic as needed—and only as needed—then scales back down gracefully and seamlessly. The overarching strategy: scalable container clusters for reactive scaling and cost control, backed by managed services for persistent storage.

Here’s the playbook—and the tech roster—we used to ensure a scalability win for Little League (too many sports analogies?).

Containers On Demand

Little League’s infrastructure runs on AWS. The setup takes advantage of container technology, using Amazon Elastic Container Service (ECS) with a multi-container Docker configuration. With deployable containers, it’s easy to spin up additional containers at a moment’s notice. All incoming requests hit a load balancer, which distributes traffic evenly among the available servers. And if latency or CPU utilization—the two metrics we decided to focus on—crosses a predetermined threshold in any zone, the system automatically adds servers and containers and rebalances the load. We regularly tune these thresholds based on site performance.
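
To make the scaling rule concrete, here is a minimal sketch in Python with boto3 of a target-tracking auto scaling policy on an ECS service’s average CPU utilization. The cluster name, service name, capacity bounds, and target value are placeholders, not Little League’s actual settings.

    import boto3

    # Application Auto Scaling manages the ECS service's desired task count.
    autoscaling = boto3.client("application-autoscaling")

    # Hypothetical cluster/service names and capacity bounds, for illustration.
    resource_id = "service/little-league-cluster/world-series-web"

    autoscaling.register_scalable_target(
        ServiceNamespace="ecs",
        ResourceId=resource_id,
        ScalableDimension="ecs:service:DesiredCount",
        MinCapacity=2,    # baseline for the quiet months
        MaxCapacity=20,   # headroom for the World Series surge
    )

    # Target tracking: add tasks when average CPU crosses the target,
    # then scale back in once the surge passes.
    autoscaling.put_scaling_policy(
        PolicyName="cpu-target-tracking",
        ServiceNamespace="ecs",
        ResourceId=resource_id,
        ScalableDimension="ecs:service:DesiredCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 60.0,  # percent CPU; tuned over time
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
            },
            "ScaleOutCooldown": 60,
            "ScaleInCooldown": 300,
        },
    )

A companion policy keyed to load balancer latency can be wired up in a similar way, driven by a CloudWatch alarm on response time rather than a predefined metric.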

The Lineup

  • Amazon Elastic Container Service (Amazon ECS)
  • Docker
  • Elastic Load Balancing

Distribution & Diversification

We didn’t want to put all our containers in one basket, so to speak. While AWS doesn’t go down often, we have to plan for the rare occasions when it does. To ensure that Little League’s sites stay available come fire, flood, or anything else, the servers are distributed across two different Availability Zones, providing extra redundancy in case of an AWS outage. If one zone goes down, all traffic can be routed through the other, and baseball fans will be none the wiser.
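
As a rough sketch of the idea, here is how an Application Load Balancer can be attached to subnets in two different Availability Zones using boto3. The subnet and security group IDs are placeholders, and in practice this would live in infrastructure-as-code rather than a one-off script.

    import boto3

    elbv2 = boto3.client("elbv2")

    # Placeholder subnet IDs, one per Availability Zone. Registering subnets
    # in two AZs lets the load balancer keep serving traffic from the healthy
    # zone if the other one goes dark.
    response = elbv2.create_load_balancer(
        Name="littleleague-web-alb",              # hypothetical name
        Subnets=["subnet-aaaa1111", "subnet-bbbb2222"],
        SecurityGroups=["sg-0123456789abcdef0"],  # placeholder
        Scheme="internet-facing",
        Type="application",
    )

    print(response["LoadBalancers"][0]["DNSName"])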

The Lineup

  • AWS Availability Zones (multi-AZ deployment)

Managed Persistent Data Services

Managing persistent data on your own can take a lot of effort—maintaining backups, ensuring data consistency, and keeping up with routine maintenance. Our Little League setup uses several managed AWS services, which keep the system safe with built-in auto-scaling, backups, and transparent updates. This means Little League can sleep well knowing their data is safe—no human intervention required. For storing content, we use Amazon Relational Database Service (Amazon RDS). For caching, which helps speed up the site, we use Amazon ElastiCache. Although managed services tend to cost a bit more up front, their built-in features keep maintenance costs lower over the long term.
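
To illustrate what the managed layer handles for you, here is a minimal boto3 sketch of provisioning an RDS instance with Multi-AZ failover and automated backups enabled. The identifier, engine, instance size, and credentials are placeholders rather than the production configuration.

    import boto3

    rds = boto3.client("rds")

    # Placeholder values for illustration only.
    rds.create_db_instance(
        DBInstanceIdentifier="littleleague-content-db",
        Engine="mysql",                  # assumption; use the engine the CMS needs
        DBInstanceClass="db.m5.large",   # placeholder size
        AllocatedStorage=100,            # GiB
        MasterUsername="admin",
        MasterUserPassword="change-me",  # use Secrets Manager in practice
        MultiAZ=True,                    # automatic failover to a standby replica
        BackupRetentionPeriod=7,         # days of automated backups
        StorageEncrypted=True,
    )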

The Lineup

  • Amazon Relational Database Service (Amazon RDS)
  • Amazon ElastiCache

Blazing-fast Image Delivery

Image management, storage, and delivery can be a major hassle. To take some of the burden off Little League, we used Cloudinary’s content delivery network (CDN), automating the entire image lifecycle and allowing the Little League team to manage images on the fly. The solution simplifies and automates the process of manipulating and delivering images, optimizing them for every device, at any bandwidth. And because it’s a CDN, Cloudinary geographically distributes images to edge caches, so Little League can deliver shots of the action much faster and with less stress on the servers.
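
To give a feel for the on-the-fly manipulation, here is a minimal sketch using Cloudinary’s Python SDK to build a transformed, auto-optimized delivery URL. The cloud name and image ID are made up, and the site itself would typically do this from its own templates.

    import cloudinary
    from cloudinary import CloudinaryImage

    # Placeholder account configuration.
    cloudinary.config(cloud_name="demo", api_key="...", api_secret="...")

    # Build a delivery URL for a hypothetical action shot: resized to 800px
    # wide, with format and quality chosen automatically per device and
    # bandwidth, and served from the CDN edge.
    url = CloudinaryImage("world-series/game-winning-hit.jpg").build_url(
        width=800,
        crop="fill",
        fetch_format="auto",
        quality="auto",
    )
    print(url)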

The Lineup

  • Cloudinary

Caching the Entire Field

A strategic coach positions the team to cover the entire field, from home plate to every corner of the outfield. In that spirit, we use several levels of caching throughout the site. First, closest to home plate, OPcache is enabled at the PHP level, so compiled bytecode is reused instead of being recompiled on every request. Covering the infield is ElastiCache, which is configured to cache heavy queries and rendered output so we can serve them from memory instead of making a full round trip to the database. Finally, the outfield is covered by Amazon CloudFront, which serves a static cache of full pages and assets for the fastest possible page loads.
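
Here is a minimal sketch of that middle caching layer, assuming the ElastiCache cluster runs the Redis engine. The endpoint, key names, and query helper are hypothetical; the pattern is simply to check the cache first and only fall through to the database on a miss.

    import json
    import redis

    # Hypothetical ElastiCache (Redis) endpoint.
    cache = redis.Redis(host="littleleague-cache.abc123.use1.cache.amazonaws.com", port=6379)

    def run_heavy_bracket_query(year: int) -> dict:
        # Stand-in for the real database query.
        return {"year": year, "games": []}

    def get_bracket(year: int) -> dict:
        """Return bracket data, served from cache whenever possible."""
        key = f"bracket:{year}"
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)

        # Cache miss: run the heavy query, then store the result with a TTL
        # so it expires and refreshes on its own.
        result = run_heavy_bracket_query(year)
        cache.set(key, json.dumps(result), ex=300)  # cache for 5 minutes
        return result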

The Lineup

  • OPcache
  • Amazon ElastiCache
  • Amazon CloudFront

Searching with Player Stats in Mind

Little League has some very large data sets, like years’ worth of game brackets, game scores, and player stats. This data is too big and complex to query efficiently with a standard relational database, so we called in a pinch hitter: Elasticsearch, run as a managed service that takes care of regular maintenance, scaling, and backups. Using Elasticsearch, we indexed all of Little League’s historical data and can now query that large data set at lightning speed, with added benefits like fuzzy search, query weighting, and more. With tens of thousands of records, Elasticsearch knocked it out of the park.
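
As a rough sketch of the pattern, here is how indexing and fuzzy-searching a player record might look with the official Elasticsearch Python client (8.x-style API). The endpoint, index name, fields, and sample record are made up for illustration.

    from elasticsearch import Elasticsearch

    # Hypothetical endpoint for the managed cluster.
    es = Elasticsearch("https://search.example.com:9200")

    # Index a made-up historical player record.
    es.index(
        index="players",
        id="1947-maple-st-3",
        document={"name": "Sample Player", "team": "Maple Street", "year": 1947},
    )

    # Fuzzy search tolerates typos like "Sampel", and boosting the name field
    # weights it more heavily than the team name when scoring results.
    results = es.search(
        index="players",
        query={
            "multi_match": {
                "query": "Sampel",
                "fields": ["name^3", "team"],
                "fuzziness": "AUTO",
            }
        },
    )

    for hit in results["hits"]["hits"]:
        print(hit["_source"]["name"], hit["_score"])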

The Lineup

  • Elasticsearch

Monitoring Rules and Regulations

The last thing you want is for your system to fail silently, then have a million fans screaming at you on Twitter. So in addition to scalable infrastructure, we put proactive monitoring in place so we can react fast if a real issue occurs. Amazon CloudWatch keeps an eye on infrastructure metrics; if certain thresholds are hit, alarms fire and automatically send real-time notifications to an active Slack channel. We use New Relic for application performance monitoring (APM) to make sure our code performance stays healthy, and BugSnag for exception monitoring of PHP and JavaScript errors, so we can proactively fix issues as they happen rather than waiting for a customer to tell us. With these monitoring systems in place, we can see immediately if something is wrong and how the system reacts to overall stress. We watch these thresholds and alarms in real time—just like an umpire, making sure everyone is playing by the rules.
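
As a minimal sketch of the alerting side, here is a boto3 call that creates a CloudWatch alarm on ECS CPU utilization and sends it to an SNS topic, which can in turn feed a Slack channel. The names, threshold, and ARN are placeholders.

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Placeholder SNS topic; a subscriber (for example, a small Lambda) can
    # forward these notifications into the team's Slack channel.
    alarm_topic_arn = "arn:aws:sns:us-east-1:123456789012:ops-alerts"

    cloudwatch.put_metric_alarm(
        AlarmName="world-series-web-high-cpu",  # hypothetical name
        Namespace="AWS/ECS",
        MetricName="CPUUtilization",
        Dimensions=[
            {"Name": "ClusterName", "Value": "little-league-cluster"},
            {"Name": "ServiceName", "Value": "world-series-web"},
        ],
        Statistic="Average",
        Period=60,              # evaluate every minute
        EvaluationPeriods=3,    # three bad minutes in a row
        Threshold=80.0,         # percent CPU
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[alarm_topic_arn],
    )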

The Lineup

  • Amazon CloudWatch
  • New Relic
  • BugSnag
  • Slack

Practice. Practice. Practice.

You might remember your Little League coach telling you that practice makes perfect. We don’t build a system without also testing the exact situation it needs to face—and the pressure it needs to withstand. So before the main event, we held a scrimmage against ourselves. Using an open-source tool called Locust, we created a load-testing system that could simulate a wave of fans checking stats. Locust works by orchestrating a swarm of worker servers and pointing them at a site to generate traffic. We used Amazon EC2 to spin up that swarm and instructed it to hammer the site with our expected level of traffic. When the system stood up to the test, everyone was confident we were ready for the main event.
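
For reference, a Locust test is just a small Python file describing what a simulated fan does. This is a minimal sketch with made-up paths and weights, not the actual test plan we ran.

    from locust import HttpUser, task, between


    class FanUser(HttpUser):
        """Simulates a fan refreshing World Series pages. Paths are hypothetical."""

        # Each simulated fan waits 1-5 seconds between page views.
        wait_time = between(1, 5)

        @task(3)
        def view_scores(self):
            self.client.get("/world-series/scores")

        @task(1)
        def view_bracket(self):
            self.client.get("/world-series/bracket")

From there, the locust command line points the swarm at a staging copy of the site and ramps the number of simulated users up to, and past, the expected surge.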

The Lineup

  • Locust
  • Amazon EC2

DevOps On Deck

Even with all the time in the world to practice and perfect, things don’t always go as planned. So despite our confidence in the system we built, we still do on-call coverage during the main event. We have all our metrics piped into a shared Slack channel. We also have email alerts and a DevOps team member waiting and watching in the bullpen, ready to be called in at any sign of trouble.

Batter Up!

The 2019 Little League World Series came and went, and the servers handled the traffic tsunami, gracefully scaling from a few thousand visitors a day to over one million. When the dust settled, the site had served over 23 million page views during the event. Not bad stats, if we do say so ourselves.

Getting Ready for Next Season

While everything went smoothly this season, we never consider a system 100% done. We carefully analyzed how the system performed and looked for ways to increase performance and reduce overall cost going forward. We noticed that some servers never hit their thresholds, so we were able to reduce their size and save money over time while still maintaining the performance we expected. And having seen the overall on-demand cost across 12 months, we’ve been able to start proactively reserving resources—saving Little League even more. That’s what we like to call…a home run. (Sorry, had to do it.)