Stephen Schlie


SRE/Platform/DevOps person who has a passion for solving complex problems and enjoys automating pain points away. I have worked in DevOps organizations for the better part of a decade, focusing on Kubernetes for the past 8 years.
I enjoy teaching people as much as I do learning new things myself. I find both to be incredibly rewarding when you get something to finally click, or when you are able to help someone get to that point on something they may have been struggling with.


Experience

Veritas Technologies - Senior Principal Site Reliability Engineer - (2023-2024)

SRE team lead for the ASP FedRAMP project. Led and educated the team in best practices for a cloud-native deployment of the environment. Was given the opportunity to start from scratch, with some requirements on which tools we needed for certain tasks. Working inside Azure and deploying to AKS, we stood up a FIPS-compliant environment and automated most maintenance updates.

Major accomplishments:

  • Set up workflow for Terraform using contained and tagged modules deployed with Terragrunt
    • Some basic things like linting and code reviews for the modules, automated tagging on merge
    • Linting and plans for deployments with more code reviews, applies done on merge
  • Deployed FIPS-compliant AKS cluster
    • Set up automation to rebuild 3rd party software with FIPS-validated modules, e.g. Istio, Argo
    • Automated node updates to stay compliant with FedRAMP guidelines on CVEs
  • Set up GitOps deployment model using Argo

Blue Owl - Senior Platform Engineer - (2020-2023)

Developed and maintained our platform built on top of AWS, Terraform, and Kubernetes. I spent most of my days writing a mixture of Terraform, k8s (kustomize/Helm), and Go/Python/Bash, mixing in a bit of CI/CD here and there in GitLab and GitHub.

  • Developed extendable and flexible Terraform modules for SREs to implement
  • Developed and maintained custom Helm charts for engineering teams to use and extend for deployments
  • Developed and maintained GitLab CI templates for teams to use in building CI pipelines
  • Focused on developer enablement by building a self-service platform

Major accomplishments:

  • Led charge of upgrading our k8s clusters from ancient versions and streamlined process to make future upgrades as painless as possible.
  • Developed CI/CD pipeline to integrate, test, and deploy our application stack.
  • AWS cost savings project to reduce AWS spend by roughly 40%
    • A mix of cleaning cruft, rightsizing, and replacing some EKS node groups with spot instances
  • Rewrote our ephemeral environments platform to convert it from Terraform 0.11 to 1.2
  • Implemented GitOps deployment flow with Flux
    • Orchestrated automated rollouts of new versions to our development environment
    • Simplified deployments by dropping Helm in favor of kustomize for internal deployments

Public Library of Science - Site Reliability Engineer - (2018-2020)

Part of a small SRE team at a non-profit, maintaining a tech stack that was varied for the organization's size. Our stack was built by an engineering org that was given free choice of what to use for its apps, which ranged across Java, Rust, Python, and Ruby. My day-to-day was a mix of old-school sysadmin tasks keeping servers up, while helping SWEs and QA convert over to containers and integrate things like Prometheus into their apps.

  • Maintained on-premises and cloud infrastructure in AWS
  • IaC tooling (Terraform, AWS CDK, Pulumi) and CasC (Salt) for deploying services
  • Led the effort in evaluating which IaC tool to adopt
  • Wrote custom Salt modules to cover some odds and ends
  • Developed several CLI tools to ease developer experience for local development
    • Automate standing up a local dev environment with Vagrant and Salt
    • Tool to manage secrets in Vault
    • Automate common clean up and maintenance tasks for local, dev, and production environments
  • Developed CLI and Web applications using Golang and Python

Major accomplishments:

  • Led effort to containerize our applications, and held training sessions on Docker and containers in general for engineering and QA
  • Trained engineers on IaC and how to deploy to AWS
  • Migrated from old Icinga-based alerting system to a more modern Prometheus, Grafana, and Alertmanager stack
  • Trained engineers on how to integrate Prometheus into their applications
  • Led the charge on container orchestration to prevent us from re-inventing the wheel

Tigera - Software Engineer - (2016-2018)

Part of the small, dynamic engineering team which developed and supported Project Calico and CNX products.

  • Developer on Project Calico and CNX, primarily using Golang
  • Worked primarily on open source versions of Calico and supporting code bases (libcalico-go, calicoctl, Canal, etc.); more information at https://github.com/projectcalico/
  • Improved documentation, both internal and customer facing

Major accomplishments:

  • Integrated Calico and Canal into several 3rd party Kubernetes deployment tools such as kops
  • Integrated Calico into GKE
  • Converted Calico from storing state in etcd to TPRs, then again from TPRs to CRDs

Dell Networking - DevOps Engineer - (2015-2016)

Worked with team to streamline policies and implement new infrastructure for labs. Managed lab infrastructure and created automation and tools to speed up workflow.

  • Project planning for infrastructure improvements
  • Tool creation for the lab team and developers
    • Managing DNS, DHCP, QualiSystems, and other infrastructure
    • Connecting to resources over SSH or serial
    • Automating the adoption of new and missing resources into QualiSystems
  • Managed lab server infrastructure (Linux and Windows environment)

Major accomplishments:

  • Created Python library to wrap QualiSystems TestShell API to ease the creation of tools based around this product
  • Set up policy for static DHCP assignments to equipment in labs, with automated DNS additions when deploying equipment
  • Created VMware ESXi cluster for engineering “VM Farm” to reduce turnaround time for VM requests from engineering
  • Wrote tool to assist in deploying VMs via ESXi and PXE boot, and to ensure details are populated in QualiSystems

Technical Experience

Game Development
I enjoy game development on the side. I participate in game jams with a buddy of mine, and sometimes solo if they allow CC0 art assets. I am also working on a larger game with the same buddy that we have been plugging away at for a few years.
Open Source
I’ve made minor contributions to various open source projects over my career, the most notable being to Kubernetes, though that has stagnated over the last few years due to becoming a dad and not having the spare time.
Programming Languages
Golang: Probably my most proficient language at this point. I worked on Project Calico during my time at Tigera, where working with it every day accelerated my proficiency. I find Go a great fit for things like CLI tools, as passing a single binary around is much easier than a pyenv or container.

Python: I have used Python extensively in the past. At Dell, I wrote a set of tools in it to automate provisioning of resources, tying into QualiSystems, the lab management software they picked.

GDScript: Not very relevant, but a fun Python-like language used in the Godot game engine. I enjoy making small games in my spare time, occasionally participating in game jams.

Random side projects

  • Built a 10-node Kubernetes cluster at home from ARM SBCs (blog post)
    • Automated node bring-up and maintenance via Ansible
    • Used kubeadm for deployment of cluster
    • Three “stacked” control nodes running HA using kube-vip for load balancing