Deploying Hugging Face LLM Models with MLRun
Hugging Face has become a leading model repository, offering user-friendly tools for building, training and deploying ML models and LLMs. Combined with MLRun, an open-source platform that automates data preparation, tuning, validation and optimization of ML models and LLMs over elastic resources, Hugging Face empowers data scientists and engineers to bring their models to production more quickly and efficiently.
This blog post introduces Hugging Face and MLRun, demonstrating the benefits of using them together. It is based on the webinar “How to Easily Deploy Your Hugging Face Models to Production”, which includes a live demo of deploying a Hugging Face model with MLRun. The demo covers data preparation, a real application pipeline, post-processing and model retraining.
You can also watch the webinar, featuring Julien Simon, Chief Evangelist at Hugging Face, Noah Gift, MLOps expert and author, and Yaron Haviv, co-founder and CTO of Iguazio (acquired by McKinsey).
Hugging Face has gained recognition for its open-source library, Transformers, which provides easy access to pre-trained models. These include language models such as BERT, GPT-2, T5 and others. These models can be used for various NLP tasks such as text generation, classification, translation, summarization and more.
By providing a repository of pre-trained models that users can fine-tune for specific applications, Hugging Face significantly reduces the time and resources required to develop powerful NLP systems. This enables a broader range of organizations to leverage advanced language technologies, thus democratizing access to LLMs.
The impact of Hugging Face’s LLMs spans various industries, including healthcare, finance, education and entertainment. For instance, in healthcare, LLMs can assist in analyzing medical records, extracting relevant information and supporting clinical decision-making. In finance, these models can enhance customer service through chatbots and automate the analysis of financial documents.
Now let’s see how Hugging Face LLMs can be operationalized.
MLRun is an open-source MLOps orchestration framework that enables managing continuous ML and gen AI applications across their lifecycle, quickly and at scale, with capabilities spanning data preparation, training, serving and monitoring.
Deploying Hugging Face models to production is streamlined with MLRun. Below, we’ll outline the steps to build a serving pipeline with your model and then retrain or calibrate it with a training flow that processes data, optimizes the model and redeploys it.
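To give a feel for the shape of such a serving pipeline, here is a minimal pure-Python sketch. The step and class names are hypothetical, and the model is a stub standing in for a real Hugging Face model; in practice, MLRun chains steps like these into a serving graph.

```python
# Minimal sketch of a serving pipeline: preprocess -> predict -> postprocess.
# The model is a stub standing in for a Hugging Face model; all class and
# function names here are illustrative, not MLRun or Hugging Face APIs.

class Preprocess:
    def run(self, event: dict) -> dict:
        # Normalize the raw request into model-ready input.
        event["text"] = event["text"].strip().lower()
        return event

class StubModel:
    def run(self, event: dict) -> dict:
        # A real step would invoke a Hugging Face model here.
        event["label"] = "positive" if "good" in event["text"] else "negative"
        return event

class Postprocess:
    def run(self, event: dict) -> dict:
        # Shape the model output into the API response.
        return {"prediction": event["label"]}

def serve(event: dict) -> dict:
    # Chain the steps in order, as a serving graph would.
    for step in (Preprocess(), StubModel(), Postprocess()):
        event = step.run(event)
    return event

print(serve({"text": "  This is GOOD news  "}))  # → {'prediction': 'positive'}
```

A retraining flow would sit alongside this pipeline: it reprocesses data, retrains or optimizes the model, and swaps the model step without changing the surrounding graph.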
Hugging Face models are integrated into MLRun, so you only need to specify the models you want to use.
Integrating Hugging Face with MLRun significantly shortens the model development, training, testing, deployment and monitoring processes, helping operationalize gen AI effectively and efficiently.
Deployment transforms models from research prototypes into real-world tools that deliver value. It enables organizations to embed AI capabilities like chatbots, copilots, analytics assistants, and domain-specific knowledge engines into workflows, making advanced reasoning and natural language understanding accessible at scale. It also ensures that models can be integrated with enterprise systems, comply with governance requirements, and provide measurable ROI. Without deployment, LLMs remain experiments rather than operational assets.
Hugging Face models can be fine-tuned before deployment to better suit specific domains, tasks, or data. Hugging Face provides libraries such as PEFT (Parameter-Efficient Fine-Tuning) that make fine-tuning more accessible and cost-effective. Fine-tuning should be followed by evaluation and testing to ensure the model generalizes well.
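One common way to make "evaluation and testing" concrete is an evaluation gate: only promote the fine-tuned model if it beats the baseline on a held-out set. The sketch below uses stub predictions and hypothetical names, purely to illustrate the pattern.

```python
# Sketch of a post-fine-tuning evaluation gate (illustrative only):
# promote the fine-tuned model only if it beats the baseline on held-out data.

def accuracy(predictions, labels):
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def should_deploy(baseline_preds, finetuned_preds, labels, margin=0.0):
    # Require the fine-tuned model to improve on the baseline by `margin`.
    return accuracy(finetuned_preds, labels) >= accuracy(baseline_preds, labels) + margin

labels    = ["pos", "neg", "pos", "pos"]
baseline  = ["pos", "pos", "pos", "neg"]  # 2/4 correct
finetuned = ["pos", "neg", "pos", "neg"]  # 3/4 correct
print(should_deploy(baseline, finetuned, labels))  # → True
```

In a real flow the predictions would come from the baseline and fine-tuned Hugging Face models, and the gate would sit at the end of the training pipeline before redeployment.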
Measure latency, throughput, error rates, accuracy, groundedness, hallucination rates, and relevance. Tools like MLRun can be used either natively or by integrating with your monitoring tool of choice.
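The operational metrics above (latency, throughput, error rate) can be derived from a simple request log; quality metrics like groundedness and hallucination rate need an evaluator or judge model and are out of scope here. A minimal stdlib sketch, with a hypothetical log format:

```python
# Sketch: computing serving metrics from a request log.
# The log schema (latency_ms, ok) and window length are hypothetical.
import statistics

requests = [
    {"latency_ms": 120, "ok": True},
    {"latency_ms": 95,  "ok": True},
    {"latency_ms": 310, "ok": False},
    {"latency_ms": 140, "ok": True},
]

latencies = [r["latency_ms"] for r in requests]
p50 = statistics.median(latencies)                            # median latency
error_rate = sum(not r["ok"] for r in requests) / len(requests)
window_s = 2.0                                                # observation window (s)
throughput = len(requests) / window_s                         # requests per second

print(f"p50 latency: {p50} ms, error rate: {error_rate:.0%}, throughput: {throughput} req/s")
```

A monitoring setup would compute these over a sliding window and alert on drift from the baseline, whether natively in MLRun or in an external monitoring tool.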
Many models on Hugging Face’s Model Hub are community-contributed and vary in quality, documentation, and licensing, so deploying Hugging Face models directly, without adaptation, can introduce risks. For enterprise use, organizations often need to fine-tune, harden, and validate models before declaring them production-ready.
MLRun enables flexible deployment of AI models across environments. It supports running models on Kubernetes, as serverless functions, in batch jobs, or in real-time pipelines, in the cloud or on-premises, making it easier to adapt deployment to specific workloads. This flexibility ensures that organizations can choose the most cost-efficient and scalable option for each use case.
Learn more about MLRun and Hugging Face for your gen AI workflows.