At Google, the future is multi-arch; AI and automation are helping us get there


In this edition we’re hearing from Partha Ranganathan, VP Engineering Fellow, and Wolff Dobson, Developer Relations Engineer, who are driving the transition of Google's monorepo towards architecture neutrality for production services. After the launch of Axion, our custom Arm®-based CPUs, Google's production services quickly needed to compile binaries for both x86 and Arm at the same time. This was a huge challenge, and the team pursued a multi-architecture approach for several reasons:

  1. All of the code used for production services is visible in a vast monorepo.
  2. Most of the structural changes we need to build, run, and debug multiarch applications are done.
  3. Existing automation like Rosie and the recently developed CHAMP allows us to keep expanding release and rollout targets without much intervention on our part.
  4. Last but not least, LLM-based automation will allow us to address much of the remaining long tail of applications for a multi-ISA Google fleet.

To read even more about what we learned, don't miss the paper itself. And to learn about our chip designs and how we’re operating a more sustainable cloud, you can read about Axion at g.co/cloud/axion.


Google Axion processors, our first custom Arm®-based CPUs, mark a major step in delivering both performance and energy efficiency for Google Cloud customers and our first-party services, providing up to 65% better price-performance and up to 60% better energy efficiency than comparable instances on Google Cloud.

We put Axion processors to the test: running Google production services. Now that our clusters contain both x86 and Axion Arm-based machines, Google's production services are able to run tasks simultaneously on multiple instruction-set architectures (ISAs). Today, this means most binaries that compile for x86 now need to compile to both x86 and Arm at the same time — no small thing when you consider that the Google environment includes over 100,000 applications!

We recently published a preprint of a paper called "Instruction Set Migration at Warehouse Scale" about our migration process, in which we analyze 38,156 commits we made to Google's giant monorepo, Google3.

To make a long story short, the paper describes the combination of hard work, automation, and AI we used to get to where we are today. We currently serve Google services, including YouTube, Gmail, and BigQuery, in production on Arm and x86 simultaneously. We have migrated more than 30,000 applications to Arm, with Arm hardware fully subscribed and more servers deployed each month.

Let's take a brief look at two steps on our journey to make Google multi-architecture, or ‘multiarch’: an analysis of migration patterns, and exploring the use of AI in porting the code. For more, be sure to read the entire paper.

Migrating all of Google's services to multiarch

Going into a migration from x86-only to Arm and x86, both the multiarch team and the application owners assumed that we would be spending time on architectural differences such as floating point drift, concurrency, intrinsics such as platform-specific operators, and performance.

At first, we migrated some of our top jobs like F1, Spanner, and Bigtable using typical software practices, complete with weekly meetings and dedicated engineers. In this early period, we found evidence of the above issues, but not nearly as many as we expected. It turns out modern compilers and tools like sanitizers have shaken out most of the surprises. Instead, we spent the majority of our time working on issues like:

  • Fixing tests that broke because they overfit to our existing x86 servers
  • Updating intricate build and release systems, usually for our oldest and highest-traffic services
  • Resolving rollout issues in production configurations
  • Taking care to avoid destabilizing critical systems
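As a hypothetical illustration of the first item (not taken from Google's codebase), one way a test can overfit to a platform is by comparing against a bit-exact golden value recorded there. The sketch below simulates a one-ULP difference with `math.nextafter` and contrasts an exact check with a tolerance-based one. The function names, the golden value, and the tolerance are all assumptions chosen for the example.

```python
import math


def overfit_check(result: float, golden: float) -> bool:
    """Bit-exact comparison against a value recorded on one platform."""
    return result == golden


def neutral_check(result: float, expected: float, rel_tol: float = 1e-9) -> bool:
    """Tolerance-based comparison that survives last-bit differences."""
    return math.isclose(result, expected, rel_tol=rel_tol)


# Hypothetical golden value, as if recorded on the original platform.
golden_x86 = 0.1 + 0.2

# Simulate a result that differs by one unit in the last place, the kind of
# drift a different instruction selection could produce (assumption for the
# example; math.nextafter requires Python 3.9 or later).
other_arch = math.nextafter(golden_x86, 1.0)

print(overfit_check(other_arch, golden_x86))  # False
print(neutral_check(other_arch, golden_x86))  # True
```

The tolerance should be chosen per test from what the computation can actually guarantee; 1e-9 here is only a placeholder.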

Moving a dozen applications to Arm this way absolutely worked, and we were proud to get things running on Borg, our cluster management system. As one engineer remarked, "Everyone fixated on the totally different toolchain, and [assumed] surely everything would break. The majority of the difficulty was configs and boring stuff."

And yet, it's not sufficient to migrate a few big jobs and be done. Although ~60% of our running compute is in our top 50 applications, the curve of usage across the remaining applications in Google's monorepo is relatively flat. The more jobs that can run on multiple architectures, the easier it is for Borg to fit them efficiently into cells. For good utilization of our Arm servers, then, we needed to address this long list of the remaining 100,000+ applications.
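To make the packing argument concrete, here is a toy sketch. It is not Borg's scheduling algorithm: the greedy placement rule, the cell capacities, and the job sizes are all invented for illustration. It only shows why jobs that can run on either architecture are easier to fit than jobs pinned to one.

```python
def place(jobs, capacity):
    """Greedily place jobs and return the total CPUs placed.

    jobs: list of (cpus, allowed_arches) tuples.
    capacity: dict mapping architecture name to free CPUs in the cell.
    """
    free = dict(capacity)  # copy so the caller's dict is not modified
    placed = 0
    for cpus, arches in jobs:
        # Architectures this job may use that still have enough room.
        candidates = [a for a in arches if free.get(a, 0) >= cpus]
        if not candidates:
            continue  # the job does not fit anywhere in this toy cell
        # Prefer the architecture with the most free capacity.
        best = max(candidates, key=lambda a: free[a])
        free[best] -= cpus
        placed += cpus
    return placed


# Hypothetical cell with equal x86 and Arm capacity.
cell = {"x86": 10, "arm": 10}

x86_only = [(4, ("x86",))] * 4          # four jobs pinned to x86
multiarch = [(4, ("x86", "arm"))] * 4   # the same jobs, built for both ISAs

print(place(x86_only, cell))   # 8
print(place(multiarch, cell))  # 16
```

With the pinned jobs, half of the demand goes unplaced while the Arm capacity sits idle; once the same jobs are multiarch, all of them fit.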

The multiarch team could not effectively reach out to so many application owners; just setting up the meetings would have been cost-prohibitive! Instead, we have relied on automation, helping to minimize involvement from the application teams themselves. Read more about the analysis here.


