Sunday, March 9, 2025

Continuously Improving Developer Productivity at Snowflake


People often ask me, “Why did you join Snowflake, and why did you choose to work on developer productivity?” I joined Snowflake to learn from world-class engineers and be part of the highly collaborative culture. These have been the secret sauce to Snowflake’s rocket-ship growth. Snowflake was embarking on a remarkable transformation of developer productivity, and I had to jump on the rocket ship as it was taking off!

Let’s start from the beginning. In 2012, Benoit Dageville and Thierry Cruanes founded Snowflake. Experienced engineers in the database space, they wrote a lot of the code that is still powering Snowflake today. Our founders believe every engineer should write code, regardless of seniority. Being hands-on allows our senior architects and CTO to see the on-the-ground reality of developer productivity and more accurately understand the daily struggles, feasibility of timelines and solution trade-offs.

Snowflake experienced rapid growth and hired top talent, resulting in a quickly expanding codebase. As a one-product company, Snowflake’s product inevitably became more complex and feature-rich over time. Consequently, over the years, our test collateral grew unchecked, the development environment became increasingly intricate, and builds and tests slowed significantly, all of which hurt developer productivity. Additionally, the infrastructure supporting our systems was unreliable under heavy load, which forced developers into manual retries and cost them time. Early efforts to centralize and scale our tooling were unsuccessful.

In early 2023, we kicked off initiatives to improve continuous integration (CI) and developer environments, but they made only slow progress. By late 2023, the strategy required a rethink to get it moving.

We identified four key ingredients to the reboot: leadership priority, measurement, customer connection and accountability.

Leadership priority

Our leadership defines developer productivity as a high priority for the entire company, with monthly CEO and executive leadership team reviews. Snowflake’s CTO, S Muralidhar, is deeply involved in defining company-wide goals and leading cross-company teams doing the actual work. The developer productivity team received a surge in head count to staff up. In the meantime, volunteers across the company stepped up to refactor code and improve test setup and run times to accelerate the replatforming of our build system and developer environments.

Measurement

Measurement is key to understanding whether we are making progress. It’s vital to measure success as perceived from a customer’s perspective, and not by our ability to hit internal milestones. To do so, we collect a combination of qualitative and quantitative metrics:

  • Product operations metrics: build system uptime; total number of build system commands per day

  • User-perceived product metrics: build and test latency; percentage of workspaces that are healthy; adoption of new systems 

  • User-journey output metrics: average PR count per developer per week; PR lifecycle latency through each stage from PR submission, first review and last review to merge

  • Overall developer sentiment: quarterly CSAT (survey-based); top struggle areas

We set ambitious quarterly goals based on metrics and keep a close watch on them. In our weekly operations review, we deep-dive into anomalies in metrics and drive to resolution. When metrics are not sufficiently connected to the developer experience, we revise them and collect updated user feedback to ensure consistency with users’ lived experience.
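As an illustration of the PR lifecycle latency metric described above, here is a minimal Python sketch. It assumes each PR is available as a record with four timestamps; the `PullRequest` class, its field names and the choice of the median as the summary statistic are hypothetical and are not taken from Snowflake's tooling.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class PullRequest:
    # Hypothetical record; field names are assumptions for this sketch.
    submitted: datetime
    first_review: datetime
    last_review: datetime
    merged: datetime


def stage_latencies_hours(pr: PullRequest) -> dict[str, float]:
    """Split one PR's lifetime into the stages named in the text."""
    def hours(start: datetime, end: datetime) -> float:
        return (end - start).total_seconds() / 3600

    return {
        "submit_to_first_review": hours(pr.submitted, pr.first_review),
        "first_to_last_review": hours(pr.first_review, pr.last_review),
        "last_review_to_merge": hours(pr.last_review, pr.merged),
    }


def median_stage_latencies(prs: list[PullRequest]) -> dict[str, float]:
    """Median latency per stage across a set of PRs (empty input gives {})."""
    if not prs:
        return {}
    per_pr = [stage_latencies_hours(pr) for pr in prs]
    return {stage: median(p[stage] for p in per_pr) for stage in per_pr[0]}
```

Breaking the latency down by stage, rather than reporting one submission-to-merge number, shows whether time is lost waiting for a first review, iterating on reviews or waiting to merge.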

Customer connection

Our customers are discerning Snowflake engineers. In order to earn their trust and build the virtuous cycle of feedback and iteration, we need to demonstrate that we 1) understand their experiences, 2) are making steady progress and 3) reprioritize and iterate based on their feedback.

When we push for changes, our customers fall into three categories: early adopters, followers and laggards. Early adopters are key because they give us the critical feedback we need to iterate, and as trusted insiders they promote innovations to their teams and drive adoption. We identified early adopters as champions in each team and conducted regular interviews to keep a pulse on progress and sentiment.

We used feedback from quarterly surveys as a key input to quarterly planning. We conducted frequent user interviews across product groups and analyzed the results by seniority, primary programming language and geographical location. We used this feedback to fine-tune priorities in monthly sprints. As each geographical location differs in its team rhythm and preferred mode of communication, we went to each feature area team to understand the best forum in which to engage with the local developers. And we always closed the loop with customers to share progress on their top pain points and get their feedback on the roadmap.

Accountability

We treated developer tools like a customer-facing product with published service-level agreements. We demonstrated accountability in weekly emails and biweekly updates in product all-hands. Transparency helps build customer trust and keeps feedback flowing. 

We were able to deliver progress through measurement, validate improvements through customer connection and demonstrate accountability with updates to our customers. The focus accelerated the development and adoption of four innovations: Bazel (build system), cloud workspace (dev environment), SnowCI (continuous integration) and Snowfort (regression test framework). Each of the four innovation areas allowed us to overcome technical and organizational challenges.

Bazel

Our legacy build systems, Maven and CMake, led to a proliferation of build scripts and inconsistent tooling, steepening the learning curve for developers working across different components. Local builds were unreliable and required frequent retries and “clean builds” to succeed, which frustrated developers. By adopting Bazel, we achieved consistent, reliable and fast builds. Additionally, remote execution on Buildbarn allowed us to massively parallelize build and test, drastically reducing latency.

Cloud workspace

The old developer environment on MacBook VMs struggled to meet the growing demand for building and testing the product. Users were limited to one active VM at a time, and resetting a broken VM could take 30 minutes and sometimes much longer. To address this, we introduced ephemeral cloud workspaces running on Kubernetes, preloaded with the latest checkpoint code, cached builds, IDE, testing tools and anything else the user would need for instant productivity. A user can spin up a workspace in two minutes and maintain several workspaces in parallel, each with a fast intra-data-center connection directly to the Bazel Buildbarn cache.

Dev environment migration is the most disruptive change to dev workflows. Overcoming inertia and driving adoption took more than a great product. We leveraged the early adopters as insider champions, used fun and creative outreach to motivate the laggards, went team by team to showcase the benefits and closed the loop with round after round of feedback.

SnowCI

Historically, our CI system was powered by sprawling, disjointed Jenkins scripts that had grown over the years. Fixing and optimizing these was challenging due to poor testing capabilities and limited team understanding of the system’s entire configuration. By building a clean-slate, YAML-based system internally, named SnowCI, to power our CI workflows, we forced a cleanup of these scripts and moved to a much more understandable, conventional and modern CI system. This has allowed us to consistently measure the stages of CI, delete large portions of custom, unreliable workflows and focus on speed and reliability as our core principles.
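To make the idea of a declarative pipeline with per-stage measurement concrete, here is a minimal Python sketch. It is not SnowCI: the `PIPELINE` structure, the stage and step names and the `run_pipeline` function are all hypothetical, and a plain dictionary stands in for what a YAML file would deserialize into.

```python
import time
from typing import Callable

# Hypothetical pipeline definition. A YAML file would load into a structure
# like this; the stage and step names are illustrative only.
PIPELINE = {
    "stages": [
        {"name": "build", "steps": ["compile"]},
        {"name": "test", "steps": ["unit", "regression"]},
    ]
}


def run_pipeline(pipeline: dict, run_step: Callable[[str], bool]) -> list[dict]:
    """Run stages in order, recording duration and outcome for each one."""
    results = []
    for stage in pipeline["stages"]:
        start = time.monotonic()
        # all() stops at the first failing step within the stage.
        ok = all(run_step(step) for step in stage["steps"])
        results.append({
            "stage": stage["name"],
            "ok": ok,
            "seconds": time.monotonic() - start,
        })
        if not ok:
            break  # Later stages are skipped once a stage fails.
    return results
```

Because the pipeline is data rather than script, one runner can time every stage the same way, which is the kind of consistent per-stage measurement the paragraph above describes.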

Snowfort

Our legacy test framework was created over a decade ago. It had no isolation to let tests start and end in a clean state, no guardrails to prevent inconsistency and no abstraction to prevent duplicate boilerplate setup code. We built Snowfort, Snowflake’s regression testing framework, to address these shortcomings. It also allowed us to break large test collateral into small batches, making it possible to retry failed tests quickly. Buildbarn allowed us to parallelize test execution, reducing CI latency. Test isolation allowed us to drastically reduce test flakiness, improving CI reliability.
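The isolation and shared-setup ideas can be sketched with Python's standard `unittest` module. This is not Snowfort's API; `IsolatedTestCase` and the example tests are hypothetical and only illustrate the general technique of giving every test its own clean state through a common base class.

```python
import tempfile
import unittest
from pathlib import Path


class IsolatedTestCase(unittest.TestCase):
    """Hypothetical base class: every test gets its own scratch directory."""

    def setUp(self) -> None:
        tmp = tempfile.TemporaryDirectory()
        # Registered cleanups run even when the test fails, so no state
        # leaks into the next test.
        self.addCleanup(tmp.cleanup)
        self.workdir = Path(tmp.name)


class ExampleTests(IsolatedTestCase):
    def test_writes_file(self) -> None:
        (self.workdir / "data.txt").write_text("hello")
        self.assertEqual(len(list(self.workdir.iterdir())), 1)

    def test_starts_empty(self) -> None:
        # Passes regardless of test order because nothing is shared.
        self.assertEqual(list(self.workdir.iterdir()), [])
```

Saved as a module, the tests run with `python -m unittest`. Since neither test depends on what the other left behind, they can be executed in any order, split into batches or retried individually.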

Progress

Between February and December 2024, adoption of the four new systems increased from 10-20% to 90-95%. Builds are now much more reliable, with latency stable at eight minutes. CI latency and reliability across different stages of the pipeline have improved three- to sixfold. The ephemeral developer environments spin up with fully cached builds and tools in two minutes, fully preserving settings between sessions. The Snowfort regression test framework makes it easy to author reliable tests with proper isolation. As a result, developer sentiment has significantly improved. In nine months, the percentage of dissatisfied customers has dropped from 58.9% to 21.4%, and the percentage of satisfied customers has increased from 17.8% to 42.3%.
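The sentiment figures above can be restated as percentage-point changes. The short Python sketch below does only that arithmetic. The "neither" share is an assumption: it treats everyone not counted as satisfied or dissatisfied as a single remaining group, which the survey figures quoted here do not confirm.

```python
def point_change(before: float, after: float) -> float:
    """Change in percentage points, rounded to one decimal place."""
    return round(after - before, 1)


# Figures quoted in the text.
dissatisfied_before, dissatisfied_after = 58.9, 21.4
satisfied_before, satisfied_after = 17.8, 42.3

dissatisfied_delta = point_change(dissatisfied_before, dissatisfied_after)
satisfied_delta = point_change(satisfied_before, satisfied_after)

# Assumption: the remainder is one "neither" group.
neither_before = round(100 - dissatisfied_before - satisfied_before, 1)
neither_after = round(100 - dissatisfied_after - satisfied_after, 1)

print(dissatisfied_delta, satisfied_delta, neither_before, neither_after)
```

Under that assumption, dissatisfaction fell by 37.5 points, satisfaction rose by 24.5 points and the remaining share grew from 23.3% to 36.3%.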

Our work is not done. We have much work ahead of us to make developers’ inner-loop experiences friction-free and even more delightful. We are investing in Bazel and our IDE strategy to deliver sub-minute local builds and a fluid authoring and debugging experience for every programming language at Snowflake. We are investing in SnowCI so that a PR can go from authoring to merge in less than twenty minutes. With the four key ingredients mentioned above in place, we are well positioned to continue to serve our customers.
