Hello! Thank you for your interest in my CV.

Each entry in my CV is expanded upon in the form of a lighthearted blog post :-}

There are way too many words in here! These are in part notes for myself (so I don’t forget!) and in part a detailed account of what I accomplished. Maybe click on something on the table of contents that you find interesting.

Some details and numbers are intentionally left vague in the interest of my previous employers.

As part of the Kubernetes Task force, contributed to Hopin’s initial k8s set up, its migration from Heroku, and its hyperscaling ambitions

As soon as I joined Hopin, I asked to jump into the most exciting project they had, which was this one: a cross-team project to bring the Hopin monolith from Heroku to Kubernetes on AWS.

The infra stack for this consisted of:

  • Terraform with terragrunt for the clusters, using our own module for Kubernetes
  • GitHub Actions to apply the terraform (to multiple AWS accounts)
  • Fluxcd to sync the cluster components and their config through Gitops
  • Various cluster components, most notable ones being cert-manager, external-dns, cluster-autoscaler, istio and keda
  • Kustomize and Helm in various places (through Flux)
  • Cloudflare for load balancing

I had a lot of fun onboarding myself and learning about the tools. I joined with the project about 2/3 done, so my contributions were modest.

A few things I can remember are:

  • Reworking the release step that ran terragrunt apply so it could handle multiple AWS accounts
  • Shellscripting around terragrunt run-all so we only applied the terraform in folders that had changes (we learned that terragrunt was a poor choice, but by then it was too late); see the sketch after this list
  • Adding read-only access functionality to clusters
  • Sorting out cluster based security groups and OIDC so that pods could connect to other hosts (ex: databases) and assume roles
  • And in the process I documented all my progress for future hires: how to onboard yourself (tools used & project structure), general cluster management (how to create & delete clusters and troubleshoot flux), how to grant people access, and a very rough guide on how to release a Ruby service to a cluster. I would later work on an entire project related to this
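
For flavour, here is roughly what the reworked release step looked like, written from memory as a GitHub Actions sketch. The account names, role ARN, repo layout (accounts/<account>/…) and the diff-against-the-previous-commit shortcut are all illustrative rather than the real thing:

name: terragrunt apply
on:
  push:
    branches: [main]

jobs:
  apply:
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # OIDC federation into AWS
      contents: read
    strategy:
      matrix:
        account: [development, staging, production]   # hypothetical account names
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # full history so we can diff against the previous commit
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/terraform-apply-${{ matrix.account }}   # placeholder
          aws-region: eu-west-1
      # terragrunt/terraform installation omitted for brevity
      - name: apply only the folders that changed
        run: |
          # terragrunt folders for this account touched by the push (simplified)
          changed=$(git diff --name-only HEAD~1 HEAD -- "accounts/${{ matrix.account }}/" \
            | xargs -r -n1 dirname | sort -u)
          if [ -z "$changed" ]; then
            echo "no changes for this account"
            exit 0
          fi
          terragrunt run-all apply --terragrunt-non-interactive \
            $(for d in $changed; do echo "--terragrunt-include-dir $d"; done)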

I also remember doing plenty of troubleshooting:

  • “Why is our IO so slow?” (I/O in gp2 volumes scales proportionally to size, they’re too small)
  • “Why is cloudflare giving me a handshake error?” (you set the host headers wrong)
  • “Why can’t I pull from ECR?” (cluster has no permissions to pull), “Why is external-dns not working?” (cluster has no permissions to change this zone), “Why is ingress not working?” (you didn’t add the ingress tags to the public subnets), “Why is cert-manager not working?” (oh, we got rate limited by letsencrypt) and so on…

Participating in the project is one of the fondest professional memories I have. Everything moved very quickly, every day brought a new challenge, and everyone was as sharp as they were kind.

Reflecting on it, I could say:

What we did well

  • We collectively broke down a large project into manageable chunks of work and delivered features and improvements every week
  • Team cohesion was superb. There was no project lead. We all knew where we had to get to and worked together to achieve it as fast as we could

What we did not do so well

  • Developing many of our terraform modules in-house was unnecessary. The widely used terraform-aws-eks public terraform module for EKS would’ve been sufficient
  • The project structure was too complex. There were too many layers of kustomize that included kustomize that included kustomize that were then loaded by Fluxcd. Writing kustomize patches for individual environments was a painful experience
  • We chose some tools that did not serve us well. terragrunt being the most obvious. But some, like Istio, were not necessary. A simpler ingress solution would also have worked.
  • Overall we built too many bridges to nowhere. We overengineered for hypothetical future scenarios that never came to happen. We should’ve been more modest with our engineering

How I would’ve done things differently

I probably should have argued more strongly in favour of keeping things simple when I had the chance. I remember we used Flux’s multitenancy feature (flux using git repos hierarchically) and I remember thinking “we don’t need this, we should keep it simple” but didn’t argue against it.

I didn’t because making the case for simplicity would’ve been difficult. The overall company culture at the time was “lets gooooo hypergrowth!!1!one” with 20 to 40 people joining every other week. It seemed that growth would never end and we would fix everything later when we had multiple SRE teams. We were wrong.

Perhaps nothing would’ve come out of it, but I should’ve spoken my piece anyway.

Wrote the k8s-sample-app, an app that documented & standardized how to release new services at Hopin

The Kubernetes task force worked on bringing the Hopin monolith to AWS and k8s. But what about new services? Hopin was the fastest growing start up in Europe at the time, so there were going to be a lot of new services. However, information on how to release software to our new infra was missing.

I went to my manager at the time and pitched the idea to him:

  1. Hopin is a SaaS company. Releasing software is a big deal, but the process to release new software is not documented! :o
  2. I know how to do this
  3. I can write it all down
  4. Gimme two or three weeks to do it

My manager said yes and I got to work. Here’s what the project covered:

Part 1: Local set up

  • Setting up a Dockerfile for your project (with multiple base images & Gitlab docker cache usage when run there)
  • Setting up docker-compose.yaml for local development. Covered how to access dev resources in AWS (like S3) for local dev as well as setting up the usual services locally (PostgreSQL & Redis).
  • A brief explanation of AWS Systems Manager Parameter Store, Chamber and how to use them

We evaluated localstack but chose not to use it; the licensing was too expensive.
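
To give an idea of the shape of it, here is a minimal sketch of the kind of docker-compose.yaml the guide walked through (service names, ports and image versions here are illustrative, not the real thing):

# docker-compose.yaml (illustrative)
services:
  app:
    build: .
    ports:
      - "3000:3000"
    environment:
      DATABASE_URL: postgres://postgres:postgres@db:5432/app_development
      REDIS_URL: redis://redis:6379
      # credentials of an IAM user with enough permissions to reach dev
      # resources in AWS (S3, Parameter Store via chamber, etc.)
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
    depends_on:
      - db
      - redis
  db:
    image: postgres:13
    environment:
      POSTGRES_PASSWORD: postgres
  redis:
    image: redis:6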

Part 2: Terraform and AWS resources

  • An explanation of the different environments at Hopin and their purposes (production, staging, shared, development, etc)
  • A brief overview of our terraform repository and how to use it
  • An overview and examples of different AWS permissions and resources required for a service.
    • What you needed for local development (an IAM user with sufficient permissions)
    • What you needed for build time (a build time IAM role & ECR repo for your service)
    • Permissions during run time (a runtime IAM role for your service)

Terraform examples for the above were made available; you could copy and paste them and make them work with almost no changes.

This chapter also covered how to use our terraform modules to set up a relational database and cache if your project needed them. The cache was easy; a database was more involved, as you had to apply the terraform and then later create a user with the root credentials from inside the k8s cluster (we had no VPN at this time). Still, all of it was documented so you could copy and paste your way to success :}.

Part 3: Release process

This part contained a very simple Gitlab release process. The steps were build and run tests on push, deploy to staging on merge, and a manual step to deploy to production also on merge.
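
The pipeline was along these lines (a simplified sketch from memory; $ECR_REPO, deploy.sh and the test command are placeholders):

# .gitlab-ci.yml (simplified sketch)
stages: [build, test, deploy]

build:
  stage: build
  script:
    # ECR login omitted for brevity
    - docker build -t "$ECR_REPO:$CI_COMMIT_SHORT_SHA" .
    - docker push "$ECR_REPO:$CI_COMMIT_SHORT_SHA"

test:
  stage: test
  script:
    - bundle exec rspec                   # placeholder test command

deploy_staging:
  stage: deploy
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'   # i.e. after a merge
  script:
    - ./deploy.sh staging "$CI_COMMIT_SHORT_SHA"

deploy_production:
  stage: deploy
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'
      when: manual                        # someone has to press the button
  script:
    - ./deploy.sh production "$CI_COMMIT_SHORT_SHA"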

I suspected release processes would diverge from whatever example I wrote here, as every team would adopt whatever process suited them and their software best, and this would be reflected in their release pipeline.

I gave people the simplest thing that worked and let them take it from there.

Part 4: Kubernetes resources

This step covered all the k8s related resources you needed to release your project on our stack.

The API objects we used were namespace, deployment, service, gateway, virtualservice and serviceaccount.

Beyond giving users a working set of yaml files that could be reused with just a few changes, this also covered:

  • Resource requests, limits and their function
  • Readiness and liveness probes
  • How Kustomize worked (there was a base folder and an overlay folder with per-environment patches)
  • How the serviceaccount was linked to an IAM role for run time permissions
  • How istio gateway and virtualservice resources were interpreted by external-dns (DNS entries were created)
  • How to access other services (use the internal service if possible, don’t do it through the internet)

I ALSO LINKED TO A LIST (what list?)
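
To give a feel for the shape of those yaml files, here is a heavily trimmed sketch of just the serviceaccount and deployment (every name, number and ARN below is a placeholder):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-service
  namespace: my-service
  annotations:
    # links the serviceaccount to an IAM role via the cluster's OIDC provider,
    # which is what gives the pods their runtime AWS permissions
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/my-service-runtime
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
  namespace: my-service
spec:
  replicas: 2
  selector:
    matchLabels: {app: my-service}
  template:
    metadata:
      labels: {app: my-service}
    spec:
      serviceAccountName: my-service
      containers:
        - name: my-service
          image: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-service:v1.0.0
          ports:
            - containerPort: 3000
          resources:
            requests: {cpu: 250m, memory: 512Mi}
            limits: {memory: 512Mi}
          readinessProbe:
            httpGet: {path: /healthz, port: 3000}
          livenessProbe:
            httpGet: {path: /healthz, port: 3000}
            initialDelaySeconds: 10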

Part 4.1 Troubleshooting!

Many people in the org were new to k8s, so the first time was not always the charm.

This covered:

  • The absolute basics (get/describe pods, logs, events) and possible failure causes (image failing to download, container crashing, failed liveness/readiness probes)
  • Port forwarding (is your service working as intended?)
  • Validating your environment variables (is everything there from parameter store?)
  • Debugging connectivity to other EC2 based resources (ex: dbs) as well as the IAM permissions of your deployment
  • Debugging your ingress

Part 5: Cloudflare

We hosted most public-facing applications at Hopin behind a Cloudflare load balancer. This showed how to create one, which was easy – just about 2-3 terraform resources to copy and paste.

Part 6: Monitoring

This gave a very brief overview of the existing container/k8s monitoring on datadog and its useful dashboards.

There was an observability team in the company at this time – so they were the folks to follow up with for any additional set up.

Part 7: Advanced topics

This briefly described some of Istio’s functionality (HTTP routing, circuit breaking, timeouts, rate limiting, etc)

Only HTTP routing on Istio was used by teams. The rest was never investigated.
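
For the curious, HTTP routing with Istio amounts to a VirtualService along these lines (hosts and names here are made up, this is just a generic sketch):

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service
  namespace: my-service
spec:
  hosts:
    - my-service.hopin.com        # external-dns picks this up and creates the DNS record
  gateways:
    - my-service-gateway
  http:
    - match:
        - uri:
            prefix: /api/v1/myservice
      route:
        - destination:
            host: my-service      # the k8s service
            port:
              number: 3000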

So how did it all go?

The project was a hit! About 15 new services went live in the year after the sample app was finished. Every single service organically adopted this app as a template.

Later down the road the standardization was also very useful to perform changes to all the services (ex: make them deploy to an additional cluster)

What we did well

I was able to identify a need in the organization, offer a solution, work on it, and deliver it. The solution had great positive impact!

What we did not do so well

I think I stopped a bit short. Time was very difficult to come by back then, so after I got the bare minimum out the door, I had to attend to other tasks. I should’ve tried coming back to this to iteratively add a thing or two.

This would’ve been a great project to segue us into an internal developer platform set up, but layoffs started happening and this could not be accomplished.

What I would have done differently

As I look back, a few things were missing that would’ve been valuable:

  • A few sample datadog monitors could’ve been left in place for users to copy and paste. A basic blackbox HTTP endpoint check and a running pod counter would’ve gone a long way. A brief write up on SLI metrics and an example SLO would’ve been very beneficial as well
  • I should’ve at least added a CVE scanner to container images and/or consulted with the folks in InfoSec about what they would’ve liked to have
  • At this time we were aspiring to do releases using GitOps. An example wouldn’t have hurt
  • I never added a PodDisruptionBudget to the example, but should have (a minimal sketch follows this list)
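
For completeness, the kind of minimal PodDisruptionBudget I had in mind (names are placeholders):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service
  namespace: my-service
spec:
  minAvailable: 1       # keep at least one pod up during node drains and upgrades
  selector:
    matchLabels:
      app: my-service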

Drove our AWS cost categorization & reduction effort that saved Hopin about 1 million a year

This project was really cool. It started off boring, but it eventually went to interesting places, and before long I was hanging out with VPs of finance on video calls and presenting things at the company all hands.

But how did we get there? Before this, we had:

  • Only a very rough idea of where the money was going. We could only get spend by service (like S3 and EC2)
  • Terraformed projects with all sorts of mismatching tags
  • A huge AWS bill

Part 1: How do we tag all our AWS resources?

The first conversation that happened was between us and the folks in finance. We learned that they were interested in knowing spend by department, by project, and whether the spend was to serve customers or just for R&D. We agreed to also make spend by environment available (prod, staging, etc) and later combine these in a report.

So far so good, we had our requirements, but… how do we do this? Do we add tags to every resource? Do we add tags as required arguments to modules? Do we do something else?

Eventually we stumbled upon the default_tags argument to the AWS provider and that just made sense. This is what we had to add to every instance of our AWS provider config:

provider "aws" {
  default_tags {
    tags = {
      department  = "finance"
      project     = "email sending service"
      environment = "production"
    }
  }
}

And it would propagate tags to all resources, even if they were being called from a module. Neat!

These tags were to be a requirement moving forward, so we also:

  • Created a terraform module named aws_tag_checker that received your default_tags as an argument and checked if they were in the correct format. project could be anything, but department and environment had to be chosen from a set of predefined values for reporting (ex: finance and production)
  • A pipeline step was added to check that you were calling aws_tag_checker from every folder that had terraform file changes. If you were not, the pipeline would stop.

In other words, if you didn’t have the tags required by the org, you couldn’t apply the terraform.
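
Roughly, the check boiled down to a CI step like this (a simplified sketch; the diff command, paths and step name are illustrative, not the real thing):

# hypothetical check step in the terraform pipeline
- name: require aws_tag_checker in changed folders
  run: |
    # folders touched by this branch (simplified; assumes origin/main is fetched)
    changed_dirs=$(git diff --name-only origin/main...HEAD -- '*.tf' '*.hcl' \
      | xargs -r -n1 dirname | sort -u)
    for dir in $changed_dirs; do
      if ! grep -rq 'aws_tag_checker' "$dir"; then
        echo "ERROR: $dir has terraform changes but does not call aws_tag_checker" >&2
        exit 1
      fi
    done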

Fun trivia: Terraform had no “panic” mechanism as of 0.13.6. If you passed an invalid tag to aws_tag_checker, it would make terraform crash on the init phase by trying to open a file that did not exist.

“Adding is easy, we got this” I thought, before stumbling upon what awaited me…

Part 2: Oh god the config drift

Hopin was a start up that grew very quickly, so some hacking and clickops took place. I naively tried creating a few pull requests to apply the above tags and immediately stumbled upon terraform config drift.

I hacked together some Golang that looped over all our terraform folders, ran terraform plan on them, and summarized that into a CSV, with a link to the actual plans uploaded to version control. We had a few hundred terraform folders and several dozen had some sort of config drift.

Most config drift was solvable: I remember some RDS instances were bumped up in size, some S3 buckets had policies done by hand, and some resources had previously been deleted but the terraform was still there.

Folder by folder, I adjusted the terraform to match the drift, and created PRs applying the tags.

By the end of it I remember about 3 folders in non-production environments with terraform that was so misconfigured that I just couldn’t get it to work. It was always something related to S3 bucket policies for some reason. Ultimately I applied the tags by hand through the console, left the required_tags in the directory, and moved on.

Part 3: The grind

With the drift solved, I hacked together some more Golang to loop over our terraform folders, add the required_tags functionality to them and create a pull request.

Each pull request contained 2 to 4 folders, and since almost every resource had tags added, the output of terraform plan for a single folder would often be thousands of lines.

Much credit to my teammates who helped review those over a few weeks, this would not have been possible without them! I think we went through about a hundred pull requests. Even though they reviewed them, I would always review and re-review them on my own. I really did not want anything going wrong.

Nothing went wrong!

Part 4: My friends in finance are happy

With the tagging complete, the folks in finance were able to explain to senior leadership where our massive AWS spend was going, as well as reach out to individual teams requesting that they lower their spend.

It was a great feeling to see them satisfied with our work. I was also surprised to find out how awesome the folks in finance were. And finance is no joke.

Part 5: Let’s go lower our bill

When Hopin was in hyperscaling mode, the way things were done was “just get it done, we’ll worry about cost later”

After the first round of layoffs, later had definitely arrived. We went through some low-hanging fruit first:

  • The environment used for testing at scale (which had lots of large compute instances)
  • Abandoned projects. A large monitoring cluster was in this category

Just getting rid of those two got us 6 digits of savings per year. But there was much more we could do!

Compute & k8s

I went on datadog to double check the cpu and memory requirements for every pod we had. As it turns out, we were requesting way more than we needed on all environments.

On non-production environments, the CPU requests of almost everything could be lowered to 10m, as services do nothing most of the time, save for the occasional visit. The memory request depended on how much memory the service used at idle; I usually went for $idleMem * 1.3 and that was good enough.
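
In practice a typical non-production container spec ended up with something like this (the memory number is illustrative):

resources:
  requests:
    cpu: 10m        # nearly idle outside the occasional visit
    memory: 400Mi   # roughly idle memory * 1.3
  limits:
    memory: 400Mi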

Since we needed more memory than we needed CPU on non-production environments, I switched all the cluster instances there to t3a.xlarge. This was (I think) the instance with the cheapest memory (per GB) available at the time.

On production we kept our instance types, but lowered the CPU requests of most services based on their actual usage as well, allowing denser scheduling on nodes.

It is hard to quantify how much this saved as some of our capacity scaled with load. Our clusters became smaller by 20-50% in compute capacity on most environments after these changes.

S3

Hopin stored a lot of recordings, and these were all being stored in S3 Standard storage.

Recordings are rarely accessed, however. After going on Cost Explorer and calculator.aws and doing some math taking into account storage cost, access cost, and our usage patterns across several buckets, we found that changing from S3 Standard to the S3 Glacier Instant Retrieval storage class lowered that spend by about 70%.

We had petabytes of recordings, so this was a significant improvement.

Aurora RDS

Another culprit of our large spend was our overly cautious relational database disaster recovery plan. We were taking database snapshots every few hours and saving them for several days.

To our surprise, this cost a small fortune. We scaled back to once a day snapshots for short term storage and once a week snapshots for long term storage.

I investigated our current RDS capacity vs actual usage and we were waaaay overprovisioned. I wrote a runbook on sizing down our RDS instances (create smaller reader, failover, restart pods) and we eventually sized this down to a more modest capacity.

Other improvements

We compared AWS Compute Savings plans, AWS EC2 Instance Savings Plans and EC2 reserved instances.

Reserved instances were what worked best for us and gave us the most savings.

We decided to buy a bunch of reserved instances based on our projected usage for the year. Sadly this got stalled waiting for approval from finance due to the sale of our department, but the savings would’ve been great!

Part 6: Savings were proudly presented on the company all hands (hoorah)

I put together about a dozen slides highlighting our savings with RDS and S3 as these were the most impactful ones to present at the all hands.

I described our previous spend, our analysis of the problem, the changes we applied, and the amount we saved. Being able to present this made me very proud of the work I accomplished together with my team.

I had a lot of fun and maybe spoke too fast, I only had 5 minutes to present! I thanked everyone in the team and included Excel memes in the slides.

What we did well

I’d say we executed this project almost as well as we could.

What we did not do so well

Sometimes when I volunteered to lower the spend of certain services, I was told “Oh, that’s $department’s spend, that’s their problem” and, well, fair enough – but I usually see myself as someone who works for the company, not just $department.

In the end I gave other teams some pointers on how to reduce their spend (mostly around k8s resource requests) by creating some tables with their current requests and what the requests should actually be.

What I would have done differently

I should’ve (diplomatically and within reason) advocated for a closer collaboration with other teams from the beginning – we all benefited in the end.

Created the second iteration of our k8s based compute infra to host hopin.com

In 2023, the EKS version used on the first iteration of our k8s platform reached end of life, meaning we could not recreate the cluster if we had to.

Hopin had a rough 2022 with 3 rounds of layoffs. Out of a team of 12, after the third round I was the sole SRE in Events. We consolidated under a single events engineering team, with some of the folks who did backend also knowing a bit of their way around the infra as well.

To succeed moving forward, we had to:

  • Keep it as simple as we could & leverage ready-made solutions by the community
  • Still have it be backwards compatible with the previous platform as we didn’t have an engineering budget for big changes

The main changes done from our first set up were:

  • We adopted the terraform-aws-eks TF module, as well as the terraform-aws-eks-blueprints TF module for ArgoCD
  • We got rid of terragrunt and all our custom pipeline hacks in favour of Atlantis
  • We used an existing TF monorepo (previous clusters had a separate repo)
  • We redid the kustomize – now every environment imports from a base folder with a patch, if needed (no multiple includes; see the sketch after this list)
  • We adopted ArgoCD and Karpenter in favour of Fluxcd and cluster-autoscaler. Other businesses inside the company used ArgoCD, so we consolidated on that.
  • We kept most of the components: cert-manager, external-dns, istio, datadog agent, keda and opentelemetry
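
The kustomize layout boiled down to one small kustomization per environment pointing at a single shared base, something like this (file names are illustrative):

# overlays/production/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                  # the one shared base, no nested includes
patches:
  - path: replica-count.yaml    # optional per-environment patch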

The yaml for the hopin monolith was generated by a tool named cdk8s. This was introduced during the migration to AWS and after a while no one liked having a bunch of typescript around just to generate yaml.

I suggested we replace it with envsubst and 2 variables in plain yaml with a folder per environment (prod & stg) and we went for it. Yes, plain yaml and envsubst. No kustomize, no helm, no nothing. We didn’t need it.
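
A sketch of what that looks like (the two variables we actually used may have been different, and every name below is a placeholder):

# prod/deployment.yaml, rendered at release time with:
#   export IMAGE_TAG=... REPLICAS=...
#   envsubst < prod/deployment.yaml | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hopin-monolith
  namespace: hopin
spec:
  replicas: ${REPLICAS}
  selector:
    matchLabels: {app: hopin-monolith}
  template:
    metadata:
      labels: {app: hopin-monolith}
    spec:
      containers:
        - name: web
          image: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/hopin-monolith:${IMAGE_TAG}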

Over the next months we adjusted IAM policies and security group settings to take the new clusters into account, and added extra release steps (and yaml) to various release processes so they released to both old and new clusters simultaneously. Eventually everything was deployed to the new cluster. We then gradually shifted traffic with an eye on Datadog over several weeks, and our migration was complete.

What we did well

The platform creation was a success! There were zero infrastructure-related issues during the development & completion of this project.

What we did not do so well

Nitpick: I found Fluxcd easier to use and maintain than ArgoCD. Flux having its own terraform provider for bootstrapping, as well as better variable support, made me enjoy using it more. We eventually planned to release using GitOps with Argo, but that never came to pass, so the whole GitOps tool switch didn’t accomplish much (but now I know two tools, so that’s cool at least!)

What I would have done differently

I would’ve budgeted some time during platform set up to investigate, do and document an EKS version upgrade on our platform to see if there was anything we needed to be concerned about. EKS releases are supported only for a short time, I wish they had an LTS offering.

Other accomplishments: simplifying TF modules, creating log based SLOs, writing Prometheus exporters and helping junior SREs

Simplifying Terraform modules

I remember we had some very unwieldy terraform modules when I started. One recurrent use case was creating an S3 bucket + CloudFront distribution + DNS entry combo to host a frontend application.

To do this you had to do something like:

  • Instantiate an S3 module
  • Instantiate a Cloudfront + DNS module
  • Figure out how to plug Cloudflare into all of this with records for HTTP validation (no module)

Some of the modules even needed extra resources that had to be terraformed outside the module to make them work. This made for a terrible experience.

I took a page from the Configuration Design and Best Practices chapter of the SRE workbook and created a module that did all of the above and only required 2 variables:

module "s3-cdn" {
  source         = "modules/s3-cdn"
  s3_bucket_name = myfrontend
  dns_entry      = myfrontend.hopin.com
}

You didn’t need to ask the user of your module any more questions. Other variables were made available with sane defaults that would work out of the box. We ended up deploying a bunch of static assets with this module.

Creating log based SLOs

We wanted SLOs for our monolith application, but had zero application metrics to work with.

But we did have application logs! The logs had the HTTP path of the request, the HTTP status, and the time to respond to the request. From these we could extrapolate:

  • The application component involved (ex: /api/v1/someservice/)
  • The HTTP method (GET, POST)
  • Successful requests (200) and failed requests (500)
  • Response time in milliseconds

So far so good! These could give us response time and success rate SLIs. Our first idea was to configure the Datadog integration for Cloudflare logs and then extract metrics from the logs in Datadog. We soon discovered this was going to cost 15k a month (Datadog is wildly expensive), so we had to come up with another plan.

The alternate plan was:

  • Configure Cloudflare to ship logs to S3
  • Write a Logstash pipeline to parse all these logs and extract the relevant metric values from them
  • Use dogstatsd to push these metrics to datadog

It cost us about $10 a month to ship the logs to S3 and process them. That was much better!

Sadly however layoffs came along and this project was shelved. It would’ve been cool to set up at least some aspirational SLOs and then take it from there.

Writing Prometheus exporters

We ran into two AWS resource limits that at the time I could not find metrics or exporters for:

  • We bumped into the limit on the number of Parameter Store entries for an AWS account (10k)
  • We bumped into the limit on the number of images in a few ECR repositories (10k)

Both were due to faulty dev environment clean up processes after deployments. Reaching the ECR limit would block releases for the service that used the repository, but reaching the Parameter Store limit would block releases for everything that used that AWS account.

I wrote exporters to track both metrics. Alerts on these actually fired a few times until the responsible teams fixed their processes, so that was cool :}

I left behind a template on how to write and deploy an exporter and import the metrics into Datadog, but sadly nothing came of it as everyone started getting laid off soon afterwards.

Helping Junior SREs

I very fondly remember helping two SREs accomplish their first tasks in the company and helping them understand our tech stack. Soon enough they were getting after it on their own!