Deploying Hugging Face LLM Models with MLRun

Hugging Face has become a leading model repository, offering user-friendly tools for building, training and deploying ML models and LLMs. Combined with MLRun, an open-source platform that automates data preparation, tuning, validation and optimization of ML models and LLMs over elastic resources, Hugging Face empowers data scientists and engineers to bring their models to production more quickly and efficiently.

This blog post introduces Hugging Face and MLRun, demonstrating the benefits of using them together. It is based on the webinar “How to Easily Deploy Your Hugging Face Models to Production”, which includes a live demo of deploying a Hugging Face model with MLRun. The demo covers data preparation, a real application pipeline, post-processing and model retraining.

You can also watch the webinar, featuring Julien Simon, Chief Evangelist at Hugging Face, Noah Gift, MLOps expert and author, and Yaron Haviv, co-founder and CTO of Iguazio (acquired by McKinsey).

Hugging Face and LLMs

Hugging Face has gained recognition for its open-source library, Transformers, which provides easy access to pre-trained models, including LLMs such as BERT, GPT-2, GPT-3 and T5. These models can be used for various NLP tasks such as text generation, classification, translation, summarization and more.

By providing a repository of pre-trained models that users can fine-tune for specific applications, Hugging Face significantly reduces the time and resources required to develop powerful NLP systems. This enables a broader range of organizations to leverage advanced language technologies, thus democratizing access to LLMs.

The impact of Hugging Face’s LLMs spans various industries, including healthcare, finance, education and entertainment. For instance, in healthcare, LLMs can assist in analyzing medical records, extracting relevant information and supporting clinical decision-making. In finance, these models can enhance customer service through chatbots and automate the analysis of financial documents.

Now let’s see how Hugging Face LLMs can be operationalized.

Deploying Your Hugging Face LLM Model with MLRun

MLRun is an open-source MLOps orchestration framework that enables managing continuous ML and gen AI applications across their lifecycle, quickly and at scale. Capabilities include:

  • Automating data preparation, tuning, validation and model optimization
  • Deploying scalable real-time serving and application pipelines that include models, data and business logic
  • Built-in observability and monitoring for data, models and resources
  • Automated retraining and re-tuning
  • Flexible deployment options (multi-cloud, hybrid and on-prem)

Using MLRun with Hugging Face

Deploying Hugging Face models to production is streamlined with MLRun. Below, we’ll outline the steps to build a serving pipeline with your model and then retrain or calibrate it with a training flow that processes data, optimizes the model and redeploys it.

Workflow #1: Building a Serving Pipeline

  1. Start by setting up a new project in MLRun.
  2. Add a Serving Function – Define a serving function with the necessary steps. A basic serving function may include receiving a message, pre-processing, performing sentiment analysis with the Hugging Face model and post-processing. You can expand this with additional steps and branching as needed.

Hugging Face models are integrated into MLRun, so you only need to specify the models you want to use.

  3. Simulate Locally – MLRun provides a simulator for your serving function, allowing you to test it locally.
  4. Test the Model – Push requests into the pipeline to verify its functionality. Debug as necessary.
  5. Deploy the Model – Deploy the model as a real-world endpoint. This involves running a simple command, with MLRun handling the backend processes like building containers, pushing to repositories, and serving the pipeline. The result is an elastic, auto-scaling service (see the sketch after this list).
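To make the flow concrete, here is a minimal sketch of such a serving graph. The project name (huggingface-demo), step file (serving_steps.py), step names and the sentiment model shown (distilbert-base-uncased-finetuned-sst-2-english) are all illustrative assumptions, not the only way to wire this up.

# serving_steps.py -- illustrative step implementations for the serving graph
from transformers import pipeline


def preprocess(event: dict) -> dict:
    # Normalize the incoming request into a list of texts
    event["inputs"] = [str(text) for text in event.get("inputs", [])]
    return event


class SentimentStep:
    """Runs a Hugging Face sentiment-analysis pipeline on the event inputs."""

    def __init__(self, context=None, name=None,
                 model_name="distilbert-base-uncased-finetuned-sst-2-english", **kwargs):
        self.classifier = pipeline("sentiment-analysis", model=model_name)

    def do(self, event: dict) -> dict:
        event["predictions"] = self.classifier(event["inputs"])
        return event


def postprocess(event: dict) -> dict:
    # Keep only the label/score pairs in the response
    return {"results": [{"label": p["label"], "score": round(p["score"], 4)}
                        for p in event["predictions"]]}


# In a notebook or driver script: wire the graph, simulate locally, then deploy
import mlrun

project = mlrun.get_or_create_project("huggingface-demo", context="./")
serving_fn = project.set_function("serving_steps.py", name="sentiment-serving",
                                  kind="serving", image="mlrun/mlrun")

graph = serving_fn.set_topology("flow", engine="async")
(graph.to(handler="preprocess")
      .to(class_name="SentimentStep", name="sentiment")
      .to(handler="postprocess")
      .respond())

# Steps 3-4: simulate the pipeline locally and push a test request through it
mock = serving_fn.to_mock_server()
print(mock.test(path="/", body={"inputs": ["MLRun and Hugging Face work well together!"]}))

# Step 5: build and deploy as an elastic, auto-scaling endpoint
# serving_fn.deploy()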

Workflow #2: Building a Training Pipeline

  1. Begin by creating a new project in MLRun.
  2. Register Training Functions – Define the training functions, including the training methods, evaluation criteria and any other necessary information.
  3. Set the Workflow – Outline the training steps, such as preparing datasets, training based on the prepared data, optimizing the model, and deploying the function. Models can be deployed to various environments (production, development, staging) simultaneously. These workflows can be triggered automatically with CI systems (see the sketch after this list).
  4. Run the Pipeline – Execute the training pipeline, which can be monitored through MLRun’s UI. Since MLRun supports Hugging Face, training artifacts are saved for comparisons, experiment tracking, and more.
  5. Test the Pipeline – Verify that the model’s predictions have changed following the training.
  6. Deploy the newly trained model.

Integrating Hugging Face with MLRun significantly shortens the model development, training, testing, deployment and monitoring processes. This helps operationalize gen AI effectively and efficiently.

FAQs

What is the significance of deploying LLM applications?

Deployment transforms models from research prototypes into real-world tools that deliver value. It enables organizations to embed AI capabilities like chatbots, copilots, analytics assistants, and domain-specific knowledge engines into workflows, making advanced reasoning and natural language understanding accessible at scale. It also ensures that models can be integrated with enterprise systems, comply with governance requirements, and provide measurable ROI. Without deployment, LLMs remain experiments rather than operational assets.

Can I fine-tune a Hugging Face model before deployment?

Hugging Face models can be fine-tuned before deployment to better suit specific domains, tasks, or data. Hugging Face provides libraries such as PEFT (Parameter-Efficient Fine-Tuning) that make fine-tuning more accessible and cost-effective. Fine-tuning should be followed by evaluation and testing to ensure the model generalizes well. 
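As a rough sketch of what parameter-efficient fine-tuning can look like before deployment, the snippet below wraps a small classification model with LoRA adapters via PEFT and trains on a subset of a public dataset. The model, dataset and hyperparameters are illustrative choices, not recommendations.

# Minimal LoRA fine-tuning sketch with Hugging Face PEFT (all choices are illustrative)
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Wrap the base model with low-rank adapters; only the adapter weights are trained
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                         target_modules=["q_lin", "v_lin"], task_type="SEQ_CLS")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Tokenize a small slice of the IMDB sentiment dataset
dataset = load_dataset("imdb")
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)
tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
    eval_dataset=tokenized["test"].shuffle(seed=42).select(range(500)),
)
trainer.train()
trainer.evaluate()  # evaluate after fine-tuning before considering deployment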

What are common challenges faced when retraining LLM models?

  • Compute cost – Large models require significant GPU or TPU resources, making frequent retraining expensive.
  • Data quality – If the new training data is noisy, biased, or incomplete, it can degrade performance.
  • Catastrophic forgetting – Retraining on new data can cause the model to lose accuracy on previously learned knowledge.
  • Compliance – Especially if retraining data includes sensitive or personally identifiable information.
  • Operational complexity – Managing model versions, tracking experiment metadata, and ensuring reproducibility require strong MLOps practices.

How do I monitor the performance of my deployed LLM applications?

Measure latency, throughput, error rates, accuracy, groundedness, hallucination rates and relevance. Tools like MLRun support this natively, or they can integrate with your monitoring tool of choice.
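For example, MLRun's built-in model monitoring can be enabled on a deployed serving function. The project and function names below follow the earlier sketches and are illustrative:

# Enable MLRun's built-in tracking on the serving function (names are illustrative)
import mlrun

project = mlrun.get_or_create_project("huggingface-demo", context="./")
serving_fn = project.get_function("sentiment-serving")

serving_fn.set_tracking()   # stream inference events to MLRun's model monitoring
serving_fn.deploy()         # redeploy so tracking takes effect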

Are Hugging Face models production-ready?

Many models on Hugging Face’s Model Hub are community-contributed and vary in quality, documentation and licensing, so deploying them directly without adaptation can introduce risks. For enterprise use, organizations often need to fine-tune, harden and validate models before declaring them production-ready.

How does MLRun support flexible deployment options for AI models?

MLRun is an open-source MLOps orchestration framework that enables flexible deployment of AI models across environments. It supports running models on Kubernetes, serverless functions, batch jobs, or real-time pipelines, in the cloud or on-premises, making it easier to adapt deployment to specific workloads. This flexibility ensures that organizations can choose the most cost-efficient and scalable option for each use case.
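As a small illustration, the same MLRun project can register the same logic under different runtimes, for example a batch training job alongside a real-time serving endpoint. File, function and handler names here are hypothetical:

# Registering functions with different runtime kinds in one project (names are illustrative)
import mlrun

project = mlrun.get_or_create_project("huggingface-demo", context="./")

# Batch training job (runs as a Kubernetes job)
project.set_function("trainer.py", name="trainer", kind="job",
                     image="mlrun/mlrun", handler="train")

# Real-time, auto-scaling serving endpoint (Nuclio-based serverless function)
project.set_function("serving_steps.py", name="sentiment-serving",
                     kind="serving", image="mlrun/mlrun")

# Run the batch job on the cluster (or pass local=True to run it locally)
project.run_function("trainer", params={"model_name": "distilbert-base-uncased"})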

Learn more about MLRun and Hugging Face for your gen AI workflows.
