How we reduced Cocoon's build time from 24 minutes to under 10 by adopting better tooling, upgrading NextJS, caching aggressively, and reorganizing our CI steps
We recently got our builds at Cocoon down from 24 minutes to <10 minutes. It took two engineers about five days to do this, which was also all the time we were allotted to try to make it faster.
We thought it was worth investing the time because research shows that every 5 minutes saved on build time reduces time to merge by 1-2 hours, which adds up quickly. Also, working in development at Cocoon had recently felt like working in molasses, the unfortunate result of us shipping a little too quickly without prioritizing technical debt.
Since we only had five days to make things better, we had to pick the strategies that would give us the most bang for our buck. We called this Project Turbobuild. Here is the combo of strategies we used to speed up the build.
- Better tooling
  - Our back-end tests unfortunately touch our database a lot. We could go through the test cases that touch the DB one by one and make each of them faster, but with only five days we wanted a strategy that would speed up the entire suite without a large time investment per test.
  - We looked into using Knapsack Pro to load balance and parallelize our test suite better. Because it dynamically balances the tests while the workers are running, the back-end tests get distributed quite evenly and the workers all finish around the same time. It was a bit difficult to set up with Jest, but my friend and co-worker Nick figured out the magic combination of commands and env variables to get the Jest version working with our test suite.
  - In total, a step that took 24 minutes got split across 5 workers, and with the right config it got shortened to 8 minutes per worker. This was by far the lowest-hanging fruit with the highest impact in Turbobuild.
- Upgrades
  - We upgraded NextJS from 12 → 14, which was a huge speed boost because 14 uses SWC under the hood, and because there’s more aggressive tree shaking when hot-reloading and building the app.
  - We went from needing to build/reload around 30k modules to around 3k modules. Still far too many for an app of our size IMHO, but I will take a 90% reduction and a 3-4x speed boost from an upgrade that took 2-3 days at most.
  - It shaved around a minute off the build in our CI step, which was not the biggest gain on this list. For this step, the real win was the vast improvement in local build time.
- Caching
  - Nick started aggressively caching different parts of the CI build, which shaved around 3-4 minutes. He did this because we tended to build our app before using it in end-to-end and integration tests, and you can actually cache certain parts of the build using `actions/cache@v3`.
  - This is something we’ve wanted to do for a while, but without more time to research we didn’t know the right caching strategy or which parts of the build to cache. With the five days we had, we finally had enough time to actually try it out.
  - Overall, he was able to shave around 2-3 minutes by caching different parts of the build more aggressively.
- Reorganization
  - We realized we had some blocking steps in CI that didn’t need to block other steps. The sooner you can start running parts of the build in parallel, the faster the overall test suite will be.
  - We found one major step that didn’t need to block another step, so we removed the blocker and let it rip.
  - By removing one blocking step, we were able to speed up our build by another 3 minutes.
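For intuition on why the dynamic balancing mentioned under "Better tooling" helps, here is a minimal sketch that compares a fixed up-front split with a shared queue that idle workers pull from. It is a simplified model, not Knapsack Pro's actual algorithm or API: the per-file durations are made up, and `staticSplit` and `dynamicQueue` are hypothetical helpers written only for this illustration.

```typescript
// Hypothetical per-file test durations in seconds (made-up numbers, not our real suite).
const durations: number[] = [120, 30, 30, 120, 30, 30, 120, 30, 30];
const workers = 3;

// Static split: files are assigned round-robin before the run starts.
function staticSplit(times: number[], workerCount: number): number[] {
  const totals = new Array<number>(workerCount).fill(0);
  times.forEach((t, i) => {
    totals[i % workerCount] += t;
  });
  return totals;
}

// Dynamic queue: whichever worker frees up first pulls the next file.
function dynamicQueue(times: number[], workerCount: number): number[] {
  const totals = new Array<number>(workerCount).fill(0);
  for (const t of times) {
    // The worker with the smallest running total is the next one to become idle.
    let idle = 0;
    for (let w = 1; w < workerCount; w++) {
      if (totals[w] < totals[idle]) idle = w;
    }
    totals[idle] += t;
  }
  return totals;
}

// The step finishes when the slowest worker finishes.
console.log(Math.max(...staticSplit(durations, workers))); // 360
console.log(Math.max(...dynamicQueue(durations, workers))); // 210
```

With these made-up numbers, the round-robin split happens to put every slow file on the same worker, while the queue keeps the workers close to each other. Real suites will behave differently, but the shape of the win is the same: the step is only as fast as its slowest worker.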
All in all, removing waste is the golden path in true TPS methodology. I think saving 1-2 minutes is usually negligible in the grand scheme of things, but if you can make a process almost 3x faster and remove 50%+ of the time spent waiting, it is a GOOD five-day investment, especially when it comes to engineers’ time. This is what our build steps looked like at the end of the experiment:
Our total build time across 12 parallelized steps is now <10 minutes per build. I suspect this will increase each engineer’s leverage by tightening the feedback loop, which will be great for shipping things 😄
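When steps run in parallel, the wall-clock time of a build is set by its longest chain of dependent steps, which is why removing a single unnecessary blocker can pay off. The sketch below illustrates that idea under loud assumptions: the step names, durations, and dependencies are all hypothetical and are not our real pipeline, the helper `wallClockMinutes` is written just for this example, and it assumes unlimited parallel runners and no dependency cycles.

```typescript
// A CI step: how long it runs and which steps it waits for (all values hypothetical).
interface Step {
  minutes: number;
  needs: string[];
}

type Pipeline = Record<string, Step>;

// Wall-clock time with unlimited parallel runners: the longest dependency chain.
function wallClockMinutes(pipeline: Pipeline): number {
  const finish = new Map<string, number>();
  const finishTime = (name: string): number => {
    const cached = finish.get(name);
    if (cached !== undefined) return cached;
    const step = pipeline[name];
    // A step starts once every step it depends on has finished.
    const start = Math.max(0, ...step.needs.map(finishTime));
    const done = start + step.minutes;
    finish.set(name, done);
    return done;
  };
  return Math.max(...Object.keys(pipeline).map(finishTime));
}

const before: Pipeline = {
  install: { minutes: 2, needs: [] },
  lint: { minutes: 3, needs: ["install"] },
  build: { minutes: 4, needs: ["install", "lint"] }, // lint blocks build unnecessarily
  e2e: { minutes: 5, needs: ["build"] },
};

const after: Pipeline = {
  ...before,
  build: { minutes: 4, needs: ["install"] }, // lint now runs alongside build
};

console.log(wallClockMinutes(before)); // 14
console.log(wallClockMinutes(after)); // 11
```

The total amount of work is the same in both pipelines. Only the dependency graph changed, and the critical path got shorter by the length of the step that no longer blocks.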
A big lesson for me here is that it is absolutely worth a week of engineering time to keep the shipping process buttery smooth, especially while we don’t have dedicated product infrastructure engineers. And also to be thankful if your start-up or company has the money/bandwidth for product infra engineers (you’d be surprised at how many small companies just don’t have the money for that kind of role, especially if you’ve never worked at one).
I also think we were able to execute so quickly because of my past experience doing some product infra work on the side at Gusto, and because I had acutely felt the pain of slow build times while doing product work at Cocoon. At the end of the day, at a start-up, you should be encouraged to fix your own problems and be empowered by your manager and team to do just that. I’m lucky enough to work at a place right now where my team trusts me enough to “let me cook” and do things like this.