LLMOps with Metaflow
In today’s rapidly evolving landscape, Large Language Models (LLMs) are emerging as indispensable tools for businesses, driving tasks ranging from content summarization to dynamic content generation. With this widespread adoption comes the pressing need for streamlined, automated processes to develop and manage LLMs effectively in production environments. This blog delves into the pivotal role of such practices in maximizing the potential of LLMs, with a focus on leveraging Metaflow as a framework.
Large Language Model Operations
Large Language Model Operations (LLMOps), in short, can be defined as MLOps for LLMs. It encompasses the set of best practices, techniques and tools for the operational management of LLMs in production, from fine-tuning to deployment and maintenance. It includes:
- Infrastructure management — streamlining the technical backbone for robust and efficient model operations.
- Prompt-response management — refining LLM-based applications through continuous prompt-response optimization and quality control.
- Data and workflow orchestration — automating and managing scalable workflows and data pipelines for efficient development and deployment of LLMs in production.
- Model reliability and evaluation — continuously monitoring performance to ensure correctness of the outputs and to address biases.
- Security and compliance — ensuring compliance with best practices and standards, along with the use of tools to harden against adversarial attacks.
- Adaptation to technology evolution — seamlessly integrating newer, more capable models and improvements in fine-tuning, so that applications can easily leverage the progress made in the ecosystem.
A well-thought-out LLMOps practice brings efficiency to businesses by:
- helping them adopt technologies and develop applications faster to maintain their competitive position;
- helping them scale their applications and deployments reliably;
- helping them continuously monitor their deployments, ensuring quality of service and improving the customer experience.
Designing an entire LLMOps platform would probably take a book and is beyond the scope of this blog. So, this blog focuses on building an LLMOps workflow using Metaflow.
Workflow
For the purpose of this article, we are going to focus on building a pipeline for the development (fine-tuning), evaluation and deployment of an LLM.
This pipeline could also be built using Kubeflow or as an Airflow DAG. However, Metaflow offers a simpler way to get started, and its integration with Airflow and other cloud solutions makes it an ideal tool for anyone starting to learn and build such pipelines.
[Figure: LLMOps workflow for a large-document summarization application]
The above figure shows the LLMOps workflow in the context of a large-document summarization application. This is a simplified LLMOps workflow, concerned with:
- Train/Test data preparation
- Model Selection
- Model fine-tuning
- Model evaluation
- Model deployment
Metaflow
Metaflow is an open-source, human-friendly Python framework developed by Netflix to help data scientists build and manage Machine Learning Operations (MLOps) workflows. Technically, Metaflow can be used to build workflows for applications as well as operations.
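Before building the full pipeline, here is a minimal sketch (a toy flow, not part of the LLMOps pipeline itself) showing the basic building blocks of Metaflow: a flow is a class deriving from FlowSpec, each @step is a method, self.next() wires the steps into a DAG, and anything assigned to self is persisted as an artifact available to downstream steps.

from metaflow import FlowSpec, step

class HelloFlow(FlowSpec):
    """A minimal Metaflow flow: two steps wired with self.next()."""

    @step
    def start(self):
        ## Anything assigned to self becomes an artifact,
        ## available to downstream steps and after the run.
        self.message = "hello, LLMOps"
        self.next(self.end)

    @step
    def end(self):
        print(self.message)

if __name__ == "__main__":
    HelloFlow()

Running this file with `python hello_flow.py run` (any filename works) executes the DAG locally; the LLMOps flow below follows exactly the same pattern, just with more steps.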
1. Install Metaflow on your system [1].
2. The different steps of the Metaflow workflow and their objectives in an LLMOps workflow are shown in the following figure:
[Figure: Metaflow steps and their objectives in the LLMOps workflow]
from metaflow import FlowSpec, step, IncludeFile
import pandas as pd

"""
A simple LLMOps workflow using the OpenAI SDK.
Modify the SDK calls depending upon your environment.
I'm considering only the ROUGE score for evaluation here.
"""

class LLMOps(FlowSpec):
    """
    A very simple and uncomplicated LLMOps workflow.
    """
    train_data = IncludeFile("train_data", default="./train_data.csv")
    test_data = IncludeFile("test_data", default="./test_data.csv")

    def get_client(self):
        ## Create the OpenAI client on demand: client objects are not
        ## reliably picklable, so they should not be stored as artifacts.
        from openai import OpenAI
        return OpenAI()

    def summarize(self, client, model, prompt):
        ## Helper: request a summary from a chat model.
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=self.max_tokens)
        return response.choices[0].message.content

    def mean_rouge(self, client, model, df):
        ## Helper: mean ROUGE-L F1 of the model's summaries against references.
        from rouge_score import rouge_scorer
        scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
        sum_scores = len_scores = 0
        for ix, row in df.iterrows():
            summary = self.summarize(client, model, row['prompt'])
            scores = scorer.score(row['completion'], summary)
            sum_scores += scores['rougeL'].fmeasure
            len_scores += 1
        return sum_scores / len_scores

    @step
    def start(self):
        ## Candidate models (illustrative; check which models support fine-tuning).
        self.foundation_model = ["gpt-3.5-turbo", "gpt-4-turbo-preview"]
        self.prompt = "Summarize the following text:"
        self.max_tokens = 1024
        self.status_check_interval = 60   ## seconds between job status checks
        self.threshold = 0.3              ## minimum acceptable mean ROUGE-L score (example value)
        self.next(self.data_acquisition)

    @step
    def data_acquisition(self):
        from io import StringIO
        ## IncludeFile artifacts arrive as strings, so wrap them in StringIO.
        self.train_df = pd.read_csv(StringIO(self.train_data), index_col=0)
        self.test_df = pd.read_csv(StringIO(self.test_data), index_col=0)
        self.next(self.data_preparation)

    @step
    def data_preparation(self):
        def append_prompt(value):
            return self.prompt + " " + value
        self.train_df.dropna(inplace=True)
        self.test_df.dropna(inplace=True)
        self.train_df['prompt'] = self.train_df['prompt'].apply(append_prompt)
        self.test_df['prompt'] = self.test_df['prompt'].apply(append_prompt)
        ## Fan out: score each candidate foundation model in parallel.
        self.next(self.model_run, foreach="foundation_model")

    @step
    def model_run(self):
        ## Model run is a simple scoring pass; replace it with your SDK calls.
        self.model = self.input
        client = self.get_client()
        self.score = self.mean_rouge(client, self.model, self.train_df)
        self.next(self.join)

    @step
    def join(self, inputs):
        ## We need the model with the maximum mean ROUGE score.
        best = max(inputs, key=lambda inp: inp.score)
        self.selected_model = best.model
        ## Carry the shared artifacts past the join; branch-specific ones are excluded.
        self.merge_artifacts(inputs, exclude=['model', 'score'])
        self.next(self.model_tuning)

    @step
    def model_tuning(self):
        client = self.get_client()
        training_data_filename = "./training_data.jsonl"
        ## Note: OpenAI fine-tuning expects JSONL in its chat-message format;
        ## adapt this export to match your data.
        self.train_df.to_json(training_data_filename, orient="records", lines=True)
        training_file = client.files.create(
            file=open(training_data_filename, "rb"),
            purpose="fine-tune")
        response = client.fine_tuning.jobs.create(
            training_file=training_file.id,   ## substitute with yours
            model=self.selected_model)
        self.ft_jobid = response.id
        self.ft_status = response.status
        self.next(self.model_evaluation)

    @step
    def model_evaluation(self):
        from time import sleep
        client = self.get_client()
        ## Poll the fine-tuning job until it finishes.
        while self.ft_status not in ["succeeded", "failed", "cancelled"]:
            sleep(self.status_check_interval)
            self.ft_status = client.fine_tuning.jobs.retrieve(self.ft_jobid).status
        if self.ft_status == "succeeded":
            job = client.fine_tuning.jobs.retrieve(self.ft_jobid)
            ## Evaluate the fine-tuned model on the held-out test set.
            self.score = self.mean_rouge(client, job.fine_tuned_model, self.test_df)
        else:
            self.score = None
        self.next(self.end)

    @step
    def end(self):
        if self.score is not None and self.score > self.threshold:
            print("Fine-tuned model is hosted on OpenAI and ready to use")
        else:
            print("It seems the model needs reworking")

if __name__ == "__main__":
    LLMOps()
Now the workflow can be deployed on Airflow or other orchestrators and cloud providers through the integrations provided by Metaflow.
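As a rough sketch, assuming the flow above is saved as llmops_flow.py (a hypothetical filename) and Metaflow's Airflow integration is configured, the flow can be run locally and compiled into an Airflow DAG along these lines:

# Run the flow locally
python llmops_flow.py run

# Compile the flow into an Airflow DAG file for deployment
python llmops_flow.py airflow create llmops_dag.py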
Building an LLMOps workflow is challenging in a real-life scenario; if you would like help in designing one for your use case, please feel free to reach out. Don’t forget to leave feedback and suggestions for future topics in the comments. Thanks!