LLMOps with Metaflow
In today’s rapidly evolving landscape, Large Language Models (LLMs) are emerging as indispensable tools for businesses, driving tasks ranging from content summarization to dynamic content generation. With this widespread adoption comes the pressing need for streamlined, automated processes to develop and manage LLMs effectively in production environments. This blog delves into the pivotal role of such practices in maximizing the potential of LLMs, with a focus on leveraging Metaflow as a framework.
Large Language Model Operations
Large Language Model Operations (LLMOps), in short, can be defined as MLOps for LLMs. It encompasses the set of best practices, techniques and tools for the operational management of LLMs in production, from fine-tuning to deployment and maintenance. It includes:
- Infrastructure management — streamlining the technical backbone for robust and efficient model operations.
- Prompt-response management — refining LLM-based applications through continuous prompt-response optimization and quality control.
- Data and workflow orchestration — automating and managing scalable workflows and data pipelines for efficient development and deployment of LLMs in production.
- Model reliability and evaluation — continuously monitoring performance to ensure correctness of the outputs and to address biases.
- Security and compliance — ensuring compliance with best practices and standards, along with the use of tools to harden against adversarial attacks.
- Adaptation to technology evolution — seamlessly integrating newer, more capable models and improvements in fine-tuning, so that applications can easily leverage the progress made in the ecosystem.
A well-thought-out LLMOps practice brings efficiency to businesses by:
- helping them adopt technologies and develop applications faster to maintain their competitive position;
- helping them scale their applications and deployments reliably;
- helping them continuously monitor their deployments, ensuring quality of service and improving the customer experience.
Designing an entire LLMOps platform would probably take a book and is beyond the scope of this blog. So, this blog focuses on building an LLMOps workflow using Metaflow.
Workflow
For the purpose of this article, we are going to focus on building a pipeline for the development (fine-tuning), evaluation and deployment of an LLM.
This pipeline could also be built using Kubeflow or as an Airflow DAG. However, Metaflow offers a simpler way to get started, and its integration with Airflow and other cloud solutions makes it an ideal tool for anyone starting to learn and build such pipelines.
[Figure: LLMOps workflow for a large-document summarization application]
The above figure shows the LLMOps workflow in the context of a large-document summarization application. This is a simplified LLMOps workflow, concerned with:
- Train/Test data preparation
- Model Selection
- Model fine-tuning
- Model evaluation
- Model deployment
Metaflow
Metaflow is an open-source, human-friendly Python framework developed by Netflix to help data scientists build and manage Machine Learning Operations (MLOps) workflows. Technically, Metaflow can be used to build workflows for applications as well as operations.
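Before building the full pipeline, here is a minimal sketch (a toy flow, not part of the LLMOps pipeline itself) showing the basic building blocks of Metaflow: a flow is a class deriving from FlowSpec, each @step is a method, self.next() wires the steps into a DAG, and anything assigned to self is persisted as an artifact available to downstream steps.

from metaflow import FlowSpec, step

class HelloFlow(FlowSpec):
    """A minimal Metaflow flow: two steps wired with self.next()."""

    @step
    def start(self):
        ## Anything assigned to self becomes an artifact,
        ## available to downstream steps and after the run.
        self.message = "hello, LLMOps"
        self.next(self.end)

    @step
    def end(self):
        print(self.message)

if __name__ == "__main__":
    HelloFlow()

Running this file with `python hello_flow.py run` (any filename works) executes the DAG locally; the LLMOps flow below follows exactly the same pattern, just with more steps.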
1. Install Metaflow on your system [1].
2. The different steps of the Metaflow workflow and their objectives in an LLMOps workflow are shown in the following figure:
[Figure: Metaflow steps and their objectives in the LLMOps workflow]
from metaflow import FlowSpec, step, IncludeFile
import pandas as pd

"""
A simple LLMOps workflow using the OpenAI SDK.
Modify the SDK calls depending upon your environment.
I'm considering only the ROUGE score for evaluation here.
"""

class LLMOps(FlowSpec):
    """
    A very simple and uncomplicated LLMOps workflow.
    """
    train_data = IncludeFile("train_data", default="./train_data.csv")
    test_data = IncludeFile("test_data", default="./test_data.csv")

    def get_client(self):
        ## Create the OpenAI client on demand: client objects are not
        ## reliably picklable, so they should not be stored as artifacts.
        from openai import OpenAI
        return OpenAI()

    def summarize(self, client, model, prompt):
        ## Helper: request a summary from a chat model.
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=self.max_tokens)
        return response.choices[0].message.content

    def mean_rouge(self, client, model, df):
        ## Helper: mean ROUGE-L F1 of the model's summaries against references.
        from rouge_score import rouge_scorer
        scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
        sum_scores = len_scores = 0
        for ix, row in df.iterrows():
            summary = self.summarize(client, model, row['prompt'])
            scores = scorer.score(row['completion'], summary)
            sum_scores += scores['rougeL'].fmeasure
            len_scores += 1
        return sum_scores / len_scores

    @step
    def start(self):
        ## Candidate models (illustrative; check which models support fine-tuning).
        self.foundation_model = ["gpt-3.5-turbo", "gpt-4-turbo-preview"]
        self.prompt = "Summarize the following text:"
        self.max_tokens = 1024
        self.status_check_interval = 60   ## seconds between job status checks
        self.threshold = 0.3              ## minimum acceptable mean ROUGE-L score (example value)
        self.next(self.data_acquisition)

    @step
    def data_acquisition(self):
        from io import StringIO
        ## IncludeFile artifacts arrive as strings, so wrap them in StringIO.
        self.train_df = pd.read_csv(StringIO(self.train_data), index_col=0)
        self.test_df = pd.read_csv(StringIO(self.test_data), index_col=0)
        self.next(self.data_preparation)

    @step
    def data_preparation(self):
        def append_prompt(value):
            return self.prompt + " " + value
        self.train_df.dropna(inplace=True)
        self.test_df.dropna(inplace=True)
        self.train_df['prompt'] = self.train_df['prompt'].apply(append_prompt)
        self.test_df['prompt'] = self.test_df['prompt'].apply(append_prompt)
        ## Fan out: score each candidate foundation model in parallel.
        self.next(self.model_run, foreach="foundation_model")

    @step
    def model_run(self):
        ## Model run is a simple scoring pass; replace it with your SDK calls.
        self.model = self.input
        client = self.get_client()
        self.score = self.mean_rouge(client, self.model, self.train_df)
        self.next(self.join)

    @step
    def join(self, inputs):
        ## We need the model with the maximum mean ROUGE score.
        best = max(inputs, key=lambda inp: inp.score)
        self.selected_model = best.model
        ## Carry the shared artifacts past the join; branch-specific ones are excluded.
        self.merge_artifacts(inputs, exclude=['model', 'score'])
        self.next(self.model_tuning)

    @step
    def model_tuning(self):
        client = self.get_client()
        training_data_filename = "./training_data.jsonl"
        ## Note: OpenAI fine-tuning expects JSONL in its chat-message format;
        ## adapt this export to match your data.
        self.train_df.to_json(training_data_filename, orient="records", lines=True)
        training_file = client.files.create(
            file=open(training_data_filename, "rb"),
            purpose="fine-tune")
        response = client.fine_tuning.jobs.create(
            training_file=training_file.id,   ## substitute with yours
            model=self.selected_model)
        self.ft_jobid = response.id
        self.ft_status = response.status
        self.next(self.model_evaluation)

    @step
    def model_evaluation(self):
        from time import sleep
        client = self.get_client()
        ## Poll the fine-tuning job until it finishes.
        while self.ft_status not in ["succeeded", "failed", "cancelled"]:
            sleep(self.status_check_interval)
            self.ft_status = client.fine_tuning.jobs.retrieve(self.ft_jobid).status
        if self.ft_status == "succeeded":
            job = client.fine_tuning.jobs.retrieve(self.ft_jobid)
            ## Evaluate the fine-tuned model on the held-out test set.
            self.score = self.mean_rouge(client, job.fine_tuned_model, self.test_df)
        else:
            self.score = None
        self.next(self.end)

    @step
    def end(self):
        if self.score is not None and self.score > self.threshold:
            print("Fine-tuned model is hosted on OpenAI and ready to use")
        else:
            print("It seems the model needs reworking")

if __name__ == "__main__":
    LLMOps()
Now the workflow can be deployed on Airflow or other orchestrators and cloud providers through the integrations provided by Metaflow.
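As a rough sketch, assuming the flow above is saved as llmops_flow.py (a hypothetical filename) and Metaflow's Airflow integration is configured, the flow can be run locally and compiled into an Airflow DAG along these lines:

# Run the flow locally
python llmops_flow.py run

# Compile the flow into an Airflow DAG file for deployment
python llmops_flow.py airflow create llmops_dag.py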
Building an LLMOps workflow is challenging in a real-life scenario; if you would like help in designing one for your use case, please feel free to reach out. Don’t forget to leave feedback and suggestions for future topics in the comments. Thanks!