Deploying Hugging Face LLM Models with MLRun

Hugging Face has become a leading model repository, offering user-friendly tools for building, training and deploying ML models and LLMs. Combined with MLRun, an open-source platform that automates data preparation, tuning, validation and optimization of ML models and LLMs over elastic resources, Hugging Face empowers data scientists and engineers to bring their models to production more quickly and efficiently.

This blog post introduces Hugging Face and MLRun, demonstrating the benefits of using them together. It is based on the webinar “How to Easily Deploy Your Hugging Face Models to Production”, which includes a live demo of deploying a Hugging Face model with MLRun. The demo covers data preparation, a real application pipeline, post-processing and model retraining.

You can also watch the webinar, featuring Julien Simon, Chief Evangelist at Hugging Face, Noah Gift, MLOps expert and author, and Yaron Haviv, co-founder and CTO of Iguazio (acquired by McKinsey).

Hugging Face and LLMs

Hugging Face has gained recognition for its open-source library, Transformers, which provides easy access to pre-trained models, including LLMs such as BERT, GPT-2, GPT-3 and T5. These models can be used for various NLP tasks such as text generation, classification, translation, summarization and more.

By providing a repository of pre-trained models that users can fine-tune for specific applications, Hugging Face significantly reduces the time and resources required to develop powerful NLP systems. This enables a broader range of organizations to leverage advanced language technologies, thus democratizing access to LLMs.

The impact of Hugging Face’s LLMs spans various industries, including healthcare, finance, education and entertainment. For instance, in healthcare, LLMs can assist in analyzing medical records, extracting relevant information and supporting clinical decision-making. In finance, these models can enhance customer service through chatbots and automate the analysis of financial documents.

Now let’s see how Hugging Face LLMs can be operationalized.

Deploying Your Hugging Face LLM Model with MLRun

MLRun is an open-source MLOps orchestration framework that enables managing continuous ML and gen AI applications across their lifecycle, quickly and at scale. Capabilities include:

  • Automating data preparation, tuning, validation and model optimization
  • Deploying scalable real-time serving and application pipelines that include models, data and business logic
  • Built-in observability and monitoring for data, models and resources
  • Automated retraining and re-tuning
  • Flexible deployment options (multi-cloud, hybrid and on-prem)

Using MLRun with Hugging Face

Deploying Hugging Face models to production is streamlined with MLRun. Below, we’ll outline the steps to build a serving pipeline with your model and then retrain or calibrate it with a training flow that processes data, optimizes the model and redeploys it.

Workflow #1: Building a Serving Pipeline

  1. Start by setting up a new project in MLRun.
  2. Add a Serving Function – Define a serving function with the necessary steps. A basic serving function may include receiving a message, pre-processing, performing sentiment analysis with the Hugging Face model and post-processing. You can expand this with additional steps and branching as needed.

Hugging Face models are integrated into MLRun, so you only need to specify the models you want to use.

  3. Simulate Locally – MLRun provides a simulator for your serving function, allowing you to test it locally.
  4. Test the Model – Push requests into the pipeline to verify its functionality. Debug as necessary.
  5. Deploy the Model – Deploy the model as a real-world endpoint. This involves running a simple command, with MLRun handling the backend processes like building containers, pushing to repositories, and serving the pipeline. The result is an elastic, auto-scaling service (see the sketch after this list).
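To make the flow concrete, here is a minimal sketch of such a serving graph. The project name (huggingface-demo), step file (serving_steps.py), step names and the sentiment model shown (distilbert-base-uncased-finetuned-sst-2-english) are all illustrative assumptions, not the only way to wire this up.

# serving_steps.py -- illustrative step implementations for the serving graph
from transformers import pipeline


def preprocess(event: dict) -> dict:
    # Normalize the incoming request into a list of texts
    event["inputs"] = [str(text) for text in event.get("inputs", [])]
    return event


class SentimentStep:
    """Runs a Hugging Face sentiment-analysis pipeline on the event inputs."""

    def __init__(self, context=None, name=None,
                 model_name="distilbert-base-uncased-finetuned-sst-2-english", **kwargs):
        self.classifier = pipeline("sentiment-analysis", model=model_name)

    def do(self, event: dict) -> dict:
        event["predictions"] = self.classifier(event["inputs"])
        return event


def postprocess(event: dict) -> dict:
    # Keep only the label/score pairs in the response
    return {"results": [{"label": p["label"], "score": round(p["score"], 4)}
                        for p in event["predictions"]]}


# In a notebook or driver script: wire the graph, simulate locally, then deploy
import mlrun

project = mlrun.get_or_create_project("huggingface-demo", context="./")
serving_fn = project.set_function("serving_steps.py", name="sentiment-serving",
                                  kind="serving", image="mlrun/mlrun")

graph = serving_fn.set_topology("flow", engine="async")
(graph.to(handler="preprocess")
      .to(class_name="SentimentStep", name="sentiment")
      .to(handler="postprocess")
      .respond())

# Steps 3-4: simulate the pipeline locally and push a test request through it
mock = serving_fn.to_mock_server()
print(mock.test(path="/", body={"inputs": ["MLRun and Hugging Face work well together!"]}))

# Step 5: build and deploy as an elastic, auto-scaling endpoint
# serving_fn.deploy()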

Workflow #2: Building a Training Pipeline

  1. Begin by creating a new project in MLRun.
  2. Register Training Functions – Define the training functions, including the training methods, evaluation criteria and any other necessary information.
  3. Set the Workflow – Outline the training steps, such as preparing datasets, training based on the prepared data, optimizing the model, and deploying the function. Models can be deployed to various environments (production, development, staging) simultaneously. These workflows can be triggered automatically with CI systems (see the sketch after this list).
  4. Run the Pipeline – Execute the training pipeline, which can be monitored through MLRun’s UI. Since MLRun supports Hugging Face, training artifacts are saved for comparisons, experiment tracking, and more.
  5. Test the Pipeline – Verify that the model’s predictions have changed following the training.
  6. Deploy the newly trained model.

Integrating Hugging Face with MLRun significantly shortens the model development, training, testing, deployment and monitoring processes. This helps operationalize gen AI effectively and efficiently.

FAQs

What is the significance of deploying LLM applications?

Deployment transforms models from research prototypes into real-world tools that deliver value. It enables organizations to embed AI capabilities like chatbots, copilots, analytics assistants, and domain-specific knowledge engines into workflows, making advanced reasoning and natural language understanding accessible at scale. It also ensures that models can be integrated with enterprise systems, comply with governance requirements, and provide measurable ROI. Without deployment, LLMs remain experiments rather than operational assets.

Can I fine-tune a Hugging Face model before deployment?

Hugging Face models can be fine-tuned before deployment to better suit specific domains, tasks, or data. Hugging Face provides libraries such as PEFT (Parameter-Efficient Fine-Tuning) that make fine-tuning more accessible and cost-effective. Fine-tuning should be followed by evaluation and testing to ensure the model generalizes well. 
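As a rough sketch of what parameter-efficient fine-tuning can look like before deployment, the snippet below wraps a small classification model with LoRA adapters via PEFT and trains on a subset of a public dataset. The model, dataset and hyperparameters are illustrative choices, not recommendations.

# Minimal LoRA fine-tuning sketch with Hugging Face PEFT (all choices are illustrative)
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Wrap the base model with low-rank adapters; only the adapter weights are trained
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                         target_modules=["q_lin", "v_lin"], task_type="SEQ_CLS")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Tokenize a small slice of the IMDB sentiment dataset
dataset = load_dataset("imdb")
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)
tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
    eval_dataset=tokenized["test"].shuffle(seed=42).select(range(500)),
)
trainer.train()
trainer.evaluate()  # evaluate after fine-tuning before considering deployment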

What are common challenges faced when retraining LLM models?

  • Compute cost – Large models require significant GPU or TPU resources, making frequent retraining expensive.
  • Data quality – If the new training data is noisy, biased, or incomplete, it can degrade performance.
  • Catastrophic forgetting – Retraining on new data can cause the model to lose accuracy on previously learned knowledge.
  • Compliance – Especially if retraining data includes sensitive or personally identifiable information.
  • Operational complexity – Managing model versions, tracking experiment metadata, and ensuring reproducibility require strong MLOps practices.

How do I monitor the performance of my deployed LLM applications?

Measure latency, throughput, error rates, accuracy, groundedness, hallucination rates and relevance. Tools like MLRun support this natively, or they can integrate with your monitoring tool of choice.
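For example, MLRun's built-in model monitoring can be enabled on a deployed serving function. The project and function names below follow the earlier sketches and are illustrative:

# Enable MLRun's built-in tracking on the serving function (names are illustrative)
import mlrun

project = mlrun.get_or_create_project("huggingface-demo", context="./")
serving_fn = project.get_function("sentiment-serving")

serving_fn.set_tracking()   # stream inference events to MLRun's model monitoring
serving_fn.deploy()         # redeploy so tracking takes effect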

Are Hugging Face models production-ready?

Many models on Hugging Face’s Model Hub are community-contributed and vary in quality, documentation and licensing, so deploying them directly without adaptation can introduce risks. For enterprise use, organizations often need to fine-tune, harden and validate models before declaring them production-ready.

How does MLRun support flexible deployment options for AI models?

MLRun is an open-source MLOps orchestration framework that enables flexible deployment of AI models across environments. It supports running models on Kubernetes, serverless functions, batch jobs, or real-time pipelines, in the cloud or on-premises, making it easier to adapt deployment to specific workloads. This flexibility ensures that organizations can choose the most cost-efficient and scalable option for each use case.
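As a small illustration, the same MLRun project can register the same logic under different runtimes, for example a batch training job alongside a real-time serving endpoint. File, function and handler names here are hypothetical:

# Registering functions with different runtime kinds in one project (names are illustrative)
import mlrun

project = mlrun.get_or_create_project("huggingface-demo", context="./")

# Batch training job (runs as a Kubernetes job)
project.set_function("trainer.py", name="trainer", kind="job",
                     image="mlrun/mlrun", handler="train")

# Real-time, auto-scaling serving endpoint (Nuclio-based serverless function)
project.set_function("serving_steps.py", name="sentiment-serving",
                     kind="serving", image="mlrun/mlrun")

# Run the batch job on the cluster (or pass local=True to run it locally)
project.run_function("trainer", params={"model_name": "distilbert-base-uncased"})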

Learn more about MLRun and Hugging Face for your gen AI workflows.
