Kubeflow Walkthrough

November 17, 2021

Setting up a Kubeflow Pipeline

I. Overview

In order to set up a client to run some hefty TensorFlow models in a production setting, I opted to try Kubeflow because of its promise to leverage containerized code and manage computing resources. To my surprise, it took me 23 hours to set up, despite my professional experience with the suite of Google Cloud Platform (GCP) tools and leaning on friends who deploy to Kubeflow frequently. This post aims to walk through what I implemented and cover some of my pitfalls, in the hopes of equipping others for a more pleasant setup 🙂.

What is Kubeflow?

Kubeflow is a platform that provides a set of tools for running machine learning (ML) models end-to-end on top of a Kubernetes cluster. Kubeflow Pipelines is an extension that allows us to automate and deploy ML workflows. I liked this Udemy course for walking through GCP setup with a preexisting pipeline (always search for a coupon code for Udemy; courses should never be more than $10-$20). This article is helpful for setting up a simple pipeline from scratch. Let's get started!

II. GCP Set Up

We'll get started by setting up an AI Platform Pipeline.

  1. Set your Viewer and Kubernetes Engine Admin permissions. You can confirm they are in place with:

    gcloud projects get-iam-policy ${PROJECT_ID} --flatten="bindings[].members" --format="table(bindings.role, bindings.members)" --filter="bindings.role:roles/container.admin OR bindings.role:roles/viewer"

  2. Within GCP, go to AI Platform > Dashboard and click `Configure`. Create a new cluster and then deploy that cluster.

  3. Once that cluster/pipeline is created, go to the Kubernetes cluster, delete the default node pool, and create a new node pool with the memory you'll need (I needed 64GB) AND "Container-Optimized OS with Docker (cos)". THIS WASTED A LOT OF TIME FOR ME. It turns out that because I was running a script within a Docker container, I had jobs that should have succeeded but instead ran indefinitely without an error, because I was using a node pool that was not optimized for Docker. The only online mention of this behavior I could find was here.

  4. In case you need to debug or examine the Kubernetes pods (compute resources), we'll want to set up the kubectl CLI. On a GCP page, activate the Cloud Shell (icon in top left with [>-]). Enter the code snippet below.

    gcloud container clusters get-credentials ${CLUSTER_NAME} --project ${PROJECT_ID} --zone ${ZONE}
    kubectl get nodes -o wide
  5. Navigate to AI Platform > Open Pipelines Dashboard > Pipeline. We'll switch to writing our custom pipeline to upload here.

III. Writing our Custom Pipeline

In order to upload a pipeline, we will need to (1) create our pipeline driver, which will compile our code into a .yaml file, and (2) create our components for executing our code.

  1. The pipeline driver code is below. Running it outputs a .yaml file that renders as the pipeline image at the top of the blog post. Importantly, (1) memory has to be explicitly requested and limited, and (2) to run tasks in the correct order, each task must accept the output of the previous task. The data processing component can be cached, as is handled in the `configure_task` function.

  2. Next we want to set up our components. We'll walk through the data handling component's setup. The data handler component will have two files: data_handler.py and data_handler.yaml. Note that in order to call the train component next, we need to save an output file to pass along. In this case, the output file contains the path to our Google Cloud Storage (GCS) bucket where our workflow data, model weights, and train outputs are saved.
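As an illustration, the core of data_handler.py can be as simple as a function that does the data work and then writes the GCS working directory to the output path Kubeflow supplies; the function and path names here are hypothetical, not the exact code from this project.

```python
# data_handler.py -- a minimal sketch; names and paths are illustrative.
from pathlib import Path


def handle_data(bucket: str, output_path: str) -> str:
    """Stage the workflow data, then record the GCS working directory so
    downstream components (train, report) can receive it as an input."""
    workdir = f"{bucket}/workflow-data"
    # ... download, clean, and split data under `workdir` here ...

    # Kubeflow passes `output_path` in; whatever is written here becomes
    # this component's output and can be fed to the train component.
    out = Path(output_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(workdir)
    return workdir
```

The matching data_handler.yaml then declares the bucket as an input and the result file as an output, and invokes this function from the container's command.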

  3. Our train and reporting components are implemented in the same way: with .py and .yaml files. Similarly, the train component saves a result text file containing the GCS file paths.

IV. Create our Docker Image

Before we can create/upload our .yaml file, we need a Docker image with our code for the Kubeflow Pipeline to reference.

  1. Create a Docker account if you don't have one and download Docker locally. Confirm you are logged in, and freeze your requirements.

    docker login
    pip freeze > requirements.txt
  2. Add a Dockerfile to your root folder.
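A minimal Dockerfile for this kind of setup might look like the following; the Python version and file layout are assumptions, so adjust them to your project.

```dockerfile
# Base image with Python; pick the version your code targets.
FROM python:3.8-slim

WORKDIR /app

# Install dependencies first so this layer is cached across code changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the pipeline component code into the image.
COPY . .
```

No CMD is needed here, since each component's .yaml specifies the command to run inside the container.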

  3. Build your Docker image; this may take a while the first time. Confirm that Google Container Registry (GCR) is enabled in GCP. This is where our Docker image will live. Tag your image to a GCR path within your project, then push to GCR.

    docker build -t <docker_image_name> .
    docker tag <docker_image_name> gcr.io/${PROJECT_ID}/<docker_image_name>
    docker push gcr.io/${PROJECT_ID}/<docker_image_name>

V. Run our Pipeline

Once our docker image is created and updated with our containerized code, we're ready to deploy.

  1. Create our pipeline_driver.yaml by running our script.

    python pipeline_driver.py
  2. Upload this file to our AI Platform Kubeflow pipeline and create a run with the desired parameters. Hopefully you now have a running pipeline!

• • •

I'm sure there's room for improvement in this post: please let me know how it can be clarified/updated. Email me at ashe.magalhaes@gmail.com!



© 2022 Ashe Magalhaes.