Pierce Lamb: Creating a Machine Learning Pipeline on AWS Sagemaker Part One: Intro & Set Up

Or rather: creating a reusable ML Pipeline, initiated by a single config file and five user-defined functions, that performs classification, is fine-tuning-based and distributed-first, runs on AWS Sagemaker, and uses Huggingface Transformers, Accelerate, Datasets & Evaluate, PyTorch, wandb and more.


This post originally appeared on VISO Trust’s Blog

This is the introductory post in a three-part series. To jump to the other posts, check out Creating a ML Pipeline Part 2: The Data Steps or Creating a ML Pipeline Part 3: Training and Inference.

Introduction

On the Data & Machine Learning team at VISO Trust, one of our core goals is to provide Document Intelligence to our auditor team. Every document that passes through the system is subject to collection, parsing, reformatting, analysis, reporting and more. Part of that intelligence is automatically determining what type of document has been uploaded into the system. Knowing what type of document has entered the system allows us to perform specialized analysis on that document.

The task of labeling or classifying a thing is a traditional use of machine learning; however, classifying an entire document (which, for us, can be 300+ pages) is on the bleeding edge of machine learning research. At the time of this writing, researchers are racing to use the advances in Deep Learning, and specifically in Transformers, to classify documents. In fact, at the outset of this task, I performed some research on the space with keywords like “Document Classification/Intelligence/Representation” and came across nearly 30 different papers that use Deep Learning and were published between 2020 and 2022. For those familiar with the space, these include names like LayoutLM/v2/v3, TiLT/LiLT, SelfDoc, StructuralLM, Longformer/Reformer/Performer/Linformer, UDOP and many more.

This result convinced me that trying a multitude of these models would be a better use of our time than trying to decide which was the best among them. As such, I decided to pick one and use the experience of fine-tuning it as a proof-of-concept to build a reusable ML pipeline the rest of my team could use. The goal was to reduce the time to perform an experiment from weeks to a day or two. This would allow us to experiment with many of the models quickly to decide which are the best for our use case.

The result of this work was an interface where an experimenter writes a single config file and five user-defined functions, which then automatically kick off data reconciliation, data preparation, training or tuning, and inference testing.

When I set out on that proof-of-concept (pre-ML Pipeline), it took over a month to collect and clean the data, prepare the model, perform inference and get everything working on Sagemaker with distributed training. Since building the ML Pipeline, we’ve used it repeatedly to quickly experiment with new models, retrain existing models on new data, and compare the performance of multiple models. The time required to perform a new experiment is about half a day to a day on average. This has enabled us to iterate incredibly fast, getting models into production in our Document Intelligence platform quickly.

What follows is a description of the above Pipeline; I hope that it will save you from some of the multi-day pitfalls I encountered building it.

ML Experiment Setup

An important architectural decision we made at the beginning was to keep experiments isolated and easily reproducible. Every time an experiment is performed, it has its own set of raw data, encoded data, docker files, model files, inference test results, etc. This makes it easy to trace a given experiment across repos/S3/metrics tools, and to see where a model came from once it is in production. However, one trade-off worth noting is that training data is copied separately for every experiment; for some orgs this simply may be infeasible, and a more centralized solution is necessary. With that said, what follows is the process of creating an experiment.

An experiment is created in an experiments repo and tied to a ticket (e.g. JIRA) like EXP-3333-longformer. This name will follow the experiment across services; for us, all storage occurs on S3, so in the experiment's bucket, objects will be saved under the EXP-3333-longformer parent directory. Furthermore, in wandb (our tracker), the top level group name will be EXP-3333-longformer.

Next, example stubbed files are copied in and modified to the particulars of the experiment. This includes the config file and user-defined function stubs mentioned above. Also included are two docker files: one represents the dependencies required to run the pipeline, while the other represents the dependencies required to run the stages that execute on AWS Sagemaker: data preparation, training or tuning, and inference. Both of these docker files are made simple by extending from base docker files maintained in the ML pipeline library; the intent is that they only need to include extra libraries required by the experiment. This follows the convention established by AWS’s Deep Learning Containers (DLCs) and, in fact, our base sagemaker container starts by extending one of these DLCs.

There is an important trade-off here: we use one monolithic container to run three different steps on Sagemaker. We preferred a simpler setup for experimenters (one dockerfile) over having to create a different container per Sagemaker step. The downside is that, for a given step, the container will likely contain some unnecessary dependencies, which makes it larger. Let’s look at an example to solidify this.

In our base Sagemaker container, we extend:

FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04

This gives us PyTorch 1.10.2 with CUDA 11.3 bindings, transformers 4.17, Python 3.8 and Ubuntu 20.04, all ready to run on the GPU. You can see available DLCs here. We then add sagemaker-training, accelerate, evaluate, datasets and wandb. Now, when an experimenter goes to extend this image, they only need to worry about any extra dependencies their model might need. For example, a model might depend on detectron2, which is an unlikely dependency among other experiments. So the experimenter only needs to extend the base sagemaker container, install detectron2, and be done worrying about dependencies.
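For instance, an experiment that needs detectron2 might use a sagemaker.Dockerfile as small as the following sketch; the image URI is a placeholder for our base sagemaker container on ECR, not the real name:

FROM <account-id>.dkr.ecr.us-west-2.amazonaws.com/<base-sagemaker-container>:<tag>

# The only experiment-specific concern: the extra model dependency.
RUN pip install --no-cache-dir 'git+https://github.com/facebookresearch/detectron2.git'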

With the base docker containers in place, the files needed for the start of an experiment would look like:

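Concretely, a new experiment directory starts out with just the six files described below:

EXP-3333-longformer/
    settings.ini
    sagemaker.Dockerfile
    run.Dockerfile
    run.sh
    build_and_push.sh
    user_defined_funcs.py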

In brief, these files are:

  • settings.ini: A single (gitignored) configuration file that takes all settings for every step of the ML pipeline (copied into the dockerfiles)
  • sagemaker.Dockerfile: Extends the base training container discussed above and adds any extra model dependencies. In many cases the base container itself will suffice.
  • run.Dockerfile: Extends the base run container discussed above and adds any extra run dependencies the experimenter needs. In many cases the base container itself will suffice.
  • run.sh: A shell script that builds and runs run.Dockerfile.
  • build_and_push.sh: A shell script that builds and pushes sagemaker.Dockerfile to ECR.
  • user_defined_funcs.py: Contains the five user defined functions that will be called by the ML pipeline at various stages (copied into the dockerfiles). We will discuss these in detail later.

These files represent the necessary and sufficient requirements for an experimenter to run an experiment on the ML pipeline. As we discuss the ML pipeline, we will examine these files in more detail. Before that discussion, however, let’s look at the interface on S3 and wandb. Assume that we’ve set up and run the experiment as shown above. The resulting directories on S3 will look like:

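Roughly the following; the sub-directory names here are illustrative, but each experiment gets its own prefix and each run its own numbered sub-prefix holding that run's raw data, encoded data, model files and inference test results:

s3://<experiments-bucket>/
    EXP-3333-longformer/
        run_1/
            raw_data/
            encoded_data/
            model/
            inference_test_results/
        run_2/
            ...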

The run_number will increment with each subsequent run of the experiment. This run number will be replicated in wandb and also prefixed to any deployed endpoint for production so the exact run of the experiment can be traced through training, metrics collection and production. Finally, let’s look at the resulting wandb structure:

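Something like this, with the experiment name as the top-level group and each run prefixed with its run number (the run names here are illustrative):

EXP-3333-longformer    (wandb group)
    run_1-training
    run_1-inference-test
    run_2-training
    ...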

I hope that getting a feel for the interface of the experimenter will make it easier to understand the pipeline itself.

The ML pipeline

The ML pipeline will (eventually) expose some generics that specific use cases can extend to modify the pipeline for their purposes. Since it was recently developed in the context of one use case, we will discuss it in that context; however, below I will show what it might look like with multiple:

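Something along these lines, where each use case lives in its own package alongside the shared pipeline code (the use-case package names here are placeholders):

ml_pipeline/                  (the shared, generic pipeline)
document_classification/      (the current use case)
some_future_use_case/         (a later use case)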

Let’s focus in on ml_pipeline:

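It breaks down into the three folders described in the next paragraphs:

ml_pipeline/
    environment/    (base docker containers for the framework and for Sagemaker)
    lib/            (the ML pipeline implementation)
    test/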

The environment folder will house the files for building the base containers we spoke of earlier, one for running the framework and one for any code that executes on Sagemaker (preprocessing, training/tuning, inference). These are named using the same conventions as AWS DLCs so it is simple to create multiple versions of them with different dependencies. We will ignore the test folder for the remainder of this blog.

The lib directory houses our implementation of the ML pipeline. Let’s zoom in again on just that directory.

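A simplified view; apart from run_framework.py, the file names below are illustrative of the responsibilities covered in the rest of this post:

lib/
    run_framework.py      (entry point that orchestrates the steps)
    pipeline_config.py    (MLPipelineConfig, parsed from settings.ini)
    validation.py         (config and user-defined-function validation)
    steps/                (data reconciliation, preparation, training/tuning, inference)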

Let’s start with run_framework.py since that will give us an eagle-eye view of what is going on. The skeleton of run_framework will look like this:

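Module paths, helper names and signatures in the sketch below are assumptions based on the description that follows, not the exact implementation:

import importlib

from lib.pipeline_config import MLPipelineConfig   # settings.ini parsed via python-decouple
from lib.validation import validate_config
from lib.steps import (
    run_data_reconciliation,
    run_data_preparation,
    run_training,
    run_tuning,
    run_inference_test,
)


def main() -> None:
    config = MLPipelineConfig()

    # Import the experiment's use-case-specific functions, selected by a config
    # value, and validate everything before any real work starts.
    use_case = importlib.import_module(f"{config.USE_CASE}.user_defined_funcs")
    validate_config(config, use_case)

    if config.RUN_RECONCILIATION:
        run_data_reconciliation(config, use_case)
    if config.RUN_PREPARATION:
        run_data_preparation(config, use_case)
    if config.RUN_TRAINING:
        run_training(config, use_case)
    if config.RUN_TUNING:
        run_tuning(config, use_case)
    if config.RUN_INFERENCE:
        run_inference_test(config, use_case)


if __name__ == "__main__":
    main()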

The settings.ini file a user defines for an experiment will be copied into the same dir (BASE_PACKAGE_PATH) inside each docker container and parsed into an object called MLPipelineConfig(). In our case, we chose to use Python Decouple to handle config management. In this config file, the initial settings are RUN_RECONCILIATION/PREPARATION/TRAINING/TUNING/INFERENCE, so the pipeline is flexible about exactly which steps an experimenter wants to run. These values drive the conditionals above.

Note the importlib line. This line allows us to import use-case-specific functions and pass them into the steps (shown here is just data reconciliation) using an experimenter-set config value for the use case.

The moment the config file is parsed, we want to run validation to identify misconfigurations now instead of in the middle of training. Without getting into too much detail on the validation step, here is what the function might look like:

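The helper names _validate_funcs and _validate_run_num come from the description below; the wrapper, the config attribute names and the stub-detection heuristic are illustrative:

import inspect

import boto3

# The five user-defined function names are covered in parts two and three,
# so this tuple is just a placeholder here.
USER_DEFINED_FUNC_NAMES: tuple = ()


def validate_config(config, use_case_module) -> None:
    _validate_funcs(use_case_module)
    _validate_run_num(config)


def _validate_funcs(module) -> None:
    # Each user-defined function must exist and must have been overridden,
    # i.e. its body is no longer the stubbed-out `pass` (a simple heuristic).
    for name in USER_DEFINED_FUNC_NAMES:
        func = getattr(module, name, None)
        if func is None or inspect.getsource(func).strip().endswith("pass"):
            raise ValueError(f"{name} must be implemented in user_defined_funcs.py")


def _validate_run_num(config) -> None:
    # Fail fast if this RUN_NUM already has output under the experiment's S3
    # prefix, rather than discovering the collision an hour into training.
    s3 = boto3.client("s3")
    prefix = f"{config.EXPERIMENT_NAME}/run_{config.RUN_NUM}/"
    response = s3.list_objects_v2(Bucket=config.S3_BUCKET, Prefix=prefix, MaxKeys=1)
    if response.get("KeyCount", 0) > 0:
        raise ValueError(
            f"RUN_NUM {config.RUN_NUM} already exists at s3://{config.S3_BUCKET}/{prefix}"
        )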

The _validate_funcs function ensures that functions with those definitions exist and that they are not simply defined as pass (i.e. the user has actually created and implemented them). The user_defined_funcs.py file above defines them all as pass, so a user must overwrite them to execute a valid run. _validate_run_num throws an exception if the settings.ini-defined RUN_NUM already exists on S3. This saves us from common pitfalls that could occur an hour into a training run.

We’ve gotten to the point now where we can look at each pipeline step in detail. You can jump to the second and third post via these links: Part Two: The Data Steps, Part Three: Training and Inference.


Nonprofit Drupal posts: April Drupal for Nonprofits Chat

Join us TOMORROW, Thursday, April 20 at 1pm ET / 10am PT, for our regularly scheduled call to chat about all things Drupal and nonprofits. (Convert to your local time zone.)

No pre-defined topics on the agenda this month, so join us for an informal chat about anything at the intersection of Drupal and nonprofits.  Got something specific on your mind? Feel free to share ahead of time in our collaborative Google doc: https://nten.org/drupal/notes!

All nonprofit Drupal devs and users, regardless of experience level, are always welcome on this call.

This free call is sponsored by NTEN.org and open to everyone. 

  • Join the call: https://us02web.zoom.us/j/81817469653

    • Meeting ID: 818 1746 9653
      Passcode: 551681

    • One tap mobile:
      +16699006833,,81817469653# US (San Jose)
      +13462487799,,81817469653# US (Houston)

    • Dial by your location:
      +1 669 900 6833 US (San Jose)
      +1 346 248 7799 US (Houston)
      +1 253 215 8782 US (Tacoma)
      +1 929 205 6099 US (New York)
      +1 301 715 8592 US (Washington DC)
      +1 312 626 6799 US (Chicago)

    • Find your local number: https://us02web.zoom.us/u/kpV1o65N

  • Follow along on Google Docs: https://nten.org/drupal/notes

View notes of previous months' calls.

Security advisories: Drupal core - Moderately critical - Access bypass - SA-CORE-2023-005

Project: Drupal core
Date: 2023-April-19
Security risk: Moderately critical 13∕25 AC:Basic/A:None/CI:Some/II:None/E:Theoretical/TD:All
Vulnerability: Access bypass
Description:

The file download facility doesn't sufficiently sanitize file paths in certain situations. This may result in users gaining access to private files that they should not have access to.

Some sites may require configuration changes following this security release. Review the release notes for your Drupal version if you have issues accessing private files after updating.

This advisory is covered by Drupal Steward.

We would normally not apply for a release of this severity. However, in this case we have chosen to apply Drupal Steward security coverage to test our processes.

Drupal 7

  • All Drupal 7 sites on Windows web servers are vulnerable.
  • Drupal 7 sites on Linux web servers are vulnerable with certain file directory structures, or if a vulnerable contributed or custom file access module is installed.

Drupal 9 and 10

Drupal 9 and 10 sites are only vulnerable if certain contributed or custom file access modules are installed.

Solution: 

Install the latest version:

All versions of Drupal 9 prior to 9.4.x are end-of-life and do not receive security coverage. Note that Drupal 8 has reached its end of life.


LN Webworks: 7 ways to enhance your ecommerce Website and online sales with Drupal

Drupal is an open-source content management system that enables companies to create captivating e-commerce websites and online stores. It has given new life to the world of digital commerce. Fortune 500 companies like Tesla and General Electric have unleashed the power of Drupal Commerce to create cutting-edge digital experiences for their customers. This brings one question to mind: “What makes this content management system the popular choice of these eminent companies?” A brief yet all-encompassing answer is that this software complements the ever-evolving digital trends. As consumer behavior changes and evolves with the technological revolution, Drupal helps you match strides with it and consistently deliver the best digital experience. Given that, it wouldn’t be wrong to call it a stepping stone to creating a thriving online store.

Golems GABB: Cleaning Up Database to Speed Up Development Cycles

Editor, Tue, 04/18/2023 - 17:37

If you are developing websites, you want the process to be as pleasant as possible. You probably agree that the time spent on the work is also an important factor; all of us developers want to build websites as quickly as possible while spending as little effort and energy as we can.
In this article, we will look at Drupal performance optimization and how you can clear the database tables of Drupal modules and MySQL tables. The following information will help you speed up development cycles and, among other things, help you feel a little happier after a hard day's work! But first of all, you need to understand what a database table is.

Consensus Enterprises: Aegir5: Front-end UI architecture

In our previous post, we looked at the Tasks and Operations which form the building blocks for the user interface in Aegir5. Here we’ll look at the additional entities required to support the Kubernetes-based backend framework. It is worth noting that Aegir has always had a tension between Developer and SysAdmin use-cases. We’ll cover this in more depth in a later post. For the moment, we’re focused on the Developer use-case.

Peoples Blog: Fix Colima connection refused error: failed to get Info from .lima/colima/ha.sock on Mac

This article is about fixing a single error that you may see with Colima on Mac machines. It might be a simple and specific issue, but people who are facing it will be grateful for the solution provided. While running Colima on your Mac, you generally run into this issue when you power off or shut down your Mac without stopping the Colima service (and de

PreviousNext: Why a culture of open-source contribution is good for your business

Contributing makes good business sense, especially when open-source technology, such as Drupal, is at the core of everything you do (pun intended!). 

by Owen Lansbury / 19 April 2023

Based on a talk given at EverythingOpen 2023. A video of that presentation is also available at the end of this article.

Why do we contribute to the Drupal community? 

Adopting a formalised approach to contribution helps our business stay sustainable in the long term. It also has the added benefit of helping everyone else in the open-source community.


Reputation

Over the years at PreviousNext, we’ve honed a deep expertise in Drupal. That’s because we’ve doubled down and avoided diluting our technical offering. We’re all in for Drupal. 

This level of knowledge sees us regularly referred to clients looking for hard hitters in the Drupal space. Our expertise is particularly appealing, as it happens, for our Higher Education and Government clients. Being Australia’s only Platinum Certified Drupal Partner can only help in this regard.

Our Drupal Association profile records all our contributions as ‘credits’. These determine our ranking as a certified partner, demonstrating our commitment to Drupal as a technology and a community.


We focus on raising our Drupal profile using means other than traditional marketing methods. Our team attends events, volunteers at DrupalSouth, presents at conferences, sponsors the DrupalSouth CodeSprint, and takes on community leadership roles. 

This level of involvement cements our position as a leading Drupal provider. It also gives all members of our team (including those who are non-technical) additional opportunities to be part of the community and raise their profiles.

Professional development

I like to refer to Drupal as a ‘do-ocracy’. Everyone is welcome, and all help is welcome. Open-source and open-handed. It’s the same sense of community that we value at PreviousNext.

When someone first joins our business, we often use open-source contributions as the primary method of onboarding them. This induction method encourages them to develop best practices in their coding and use their involvement in the Drupal project as part of their ongoing professional development.

An offshoot of this is the chance to build relationships and be mentored by people external to our organisation. It’s a unique opportunity to broaden our collective perspectives and work alongside (and become!) some of the brightest minds in open-source tech.

A happier team

Avoiding team member burnout or a lacklustre approach to work is vital for us as a smaller organisation. Instead, we help staff to scratch those different ‘itches’.

Working on contrib helps to maintain interest and passion by giving staff time to work on projects that aren’t run-of-the-mill client engagements. It also exposes our team to larger initiatives than they might otherwise work on.

Staff retention

A happier team, in turn, leads to a more stable team over the long term. Our retention rates have steadied at around three times the industry average. 

This tendency towards longevity also facilitated our decision to make PreviousNext employee-owned.

How do we contribute? An established framework 

Enshrined in our Staff Handbook is the hope that employees at PreviousNext will use 20% of their time for contrib (the remaining 80% is billable client work). If a team member chooses not to contribute, they work closer to fully billable hours.

We don’t expect staff to contribute outside their employed hours, though many do out of their own interest.

With a robust time-tracking and self-management culture, this approach works well and leads to a productive, well-run company.

We’ve also baked open-source contributions into our regular ‘Hackdays’. These are days when our developers get together and innovate. This focused work feeds into our client projects and becomes part of our Drupal contributions. 

Other methods for ensuring a regular flow of code include directly sponsoring developers, which helps us maintain our partnership status. 

We also use project-based sponsorship to contribute patches and new modules to the Drupal ecosystem. The clients for these projects also receive credits for sponsoring this development.

Being a good Drupal citizen

Open-source contribution isn’t just about altruism. It also shouldn’t be viewed as a drain on a business’s income generation. It’s about recognising that our businesses depend on a technological ecosystem that in turn relies on as many of us playing our part to advance it as possible.

When it comes to Drupal, the result of these contributions is a platform that commands a 10% share of the top 10,000 most visited websites globally. Clearly, though, there is more to be done to promote Drupal even further. It’s something we can all get behind, because when our chosen open-source platform thrives, so do our businesses.

 

Watch the video