Fine-tuning a Mistral Language Model with Anyscale

Cover Image for Fine-tuning a Mistral Language Model with Anyscale

Let’s say you’re an engineer at a large company, tasked with productionizing a large language model (LLM) for a new client-facing chat interface.

While pre-trained models are the best place to start in most cases, as time goes on you may want to control the overall behavior and “feel” of chatting with this model beyond what the base model offers.

This can be done through clever prompt engineering. However, it can severely limit the length of the input chat, and costs for every inference call will increase. There is another option—you can fine-tune your model on a dataset, and instill new behavioral patterns in your model without having to do prompt engineering at all.

How to fine-tune a Mistral 7B model

In this tutorial, we'll guide you through the process of fine-tuning a Mistral7B open-source LLM. We'll use MindsDB to interact with the model and Anyscale Endpoints to host it. If you needed to fine-tune another model, the process would be somewhat similar.

Please reach out to us via Slack for help.

What is MindsDB?

MindsDB is an open-source AI platform for developers that connects AI/ML models with real-time data. It provides tools and automation to easily build and maintain personalized AI solutions.

With MindsDB, you can bring state-of-the-art AI models—like OpenAI, LLama2, Cohere, Mistral, and others—together with hundreds of data sources, including enterprise databases and third-party apps like PostgreSQL, MongoDB, Snowflake, Slack, Shopify, and more. The platform offers hundreds of integrations to various data sources, APIs, and machine learning frameworks.

What is Anyscale?

In this last category, Anyscale Endpoints comes into play. It is a service offered by Anyscale, creators of the Ray framework which powers some of the most cutting-edge AI products in the world, like OpenAI’s ChatGPT. Their endpoints enable easy and optimized access (think fast, and cheaper than alternatives) to open-source LLMs like the Mistral 7B model.

In what follows, we will compare the resulting fine-tuned model against its original version, inspecting differences in behavior and explaining what is actually happening behind the scenes. We will also take a look at how fine-tuning relates to prompt engineering, another fundamental component when building robust pipelines for generative AI applications in the LLM era.

The tutorial is structured in steps, to facilitate following along. Let’s get started:

Step 1: Get your prerequisites in order

  1. Get a local deployment of MindsDB working in your machine.

  2. Make sure you install the requirements for the Anyscale Endpoints integration.

  3. Additionally, you need to sign up for Anyscale Endpoints.

Step 2: Load data

We will be using this dataset, which contains triples of context-question-answers that relate a SQL table (its original “create” query) with a natural language question prompted by a user, and a ground truth “correct” answer query that returns the information this user is after.

Even though the dataset contains well over 70,000 examples, for the purposes of this tutorial we will consider only the first 300 examples. The data is rearranged into “long” format, with conversations/chats being vertically stacked as rows. That is, every row has a “content” message and a “role," which can be “system,” “assistant,” (our model) or “user.”

The content on each system row is where we define a couple of things:

  1. that the task is to translate a question into a SQL query, and

  2. what is the relevant reference SQL table creation statement. Notice that assistant rows are our definition of what a perfect answer would be.

Data can be easily checked by entering the following:

SELECT * FROM example_db.demo_data.anyscale_endpoints_ft_sample_data LIMIT 10;

This retrieves a sample from our example Postgres database, which should look similar to the screenshot below:

Notice how the assistant’s answers are the queries, without any accompanying explanations. This is the main effect we want to observe when compared to the starting base model. Now, let’s set that up.

Step 3: Set up the base model

a. Create an instance of the Anyscale integration in your MindsDB project:

CREATE ML_ENGINE anyscale_endpoints FROM anyscale_endpoints;

b. Create a vanilla Mistral 7B model with it (the specific model name we’re using here is mistralai/Mistral-7B-Instruct-v0.1):

PREDICT answer
engine = 'anyscale_endpoints',
model_name = 'mistralai/Mistral-7B-Instruct-v0.1',
prompt_template = '{{content}}',
api_key = 'your-api-key-here';

c. At this point, you can already query the model and get some answers out of it, like this:

SELECT answer FROM ft_sql WHERE content = 'hi!';

d. In fact, why not try one of the many examples we left out from the original dataset? Let’s see:

SELECT answer FROM ft_sql WHERE content = 'Hi! I have created a SQL table with this command: “CREATE TABLE employees (country VARCHAR)”. How should I query my database to know how many employees are living in Canada?';

You should get back something like:

While the above is a correct answer, it is a bit verbose. It's also perhaps not the best output format if you’re deploying this model in the context of a broader software application like a code co-pilot. What if we wanted the output to be succinct?

At this point, we should highlight that the main effect of fine-tuning as executed on a pre-trained large language model is that it will tend to affect the overall behavior and alignment of it, rather than perfectly memorizing all factual information contained in new examples. This of course depends on many variables, but it is the overall trend when dealing with large datasets.

If, instead, what you need is to extract specific real data as stored somewhere else, then your best bet would be to use a RAG setup. Do note these are not mutually exclusive, you definitely can employ both techniques at the same time.

Step 4: Fine-tune the model

e. With this dataset in hand, and the simplicity of both MindsDB and Anyscale Endpoints, we can fine-tune a Mistral 7B large language model with a single SQL command:

FINETUNE ft_sql FROM example_db (
SELECT * FROM demo_data.anyscale_endpoints_ft_sample_data LIMIT 300

f. This command will trigger a fine-tuning run on the Anyscale Endpoints platform. As the model has 7 billion parameters and the dataset is tiny, this process should take around 15 minutes, but depending on these two things it could take longer.

You will get a notification in the email you used to sign up for Anyscale Endpoints (like in the screenshot), after which the MindsDB model will also show as ready to go.

Visually, the exact pipeline built here will look like the following diagram:

At this point, we can expand on what is happening. Mistral 7B is a large neural network that closely follows the “transformer” architecture.

For such a model, “supervised fine-tuning” roughly means slightly modifying some weights of the pre-trained base model. It will do this by trying to minimize an error metric on a portion of the new data, along with some special considerations (e.g. lower learning rate).

One fundamental difference compared to the base pre-training stage is that data is labeled, as opposed to unlabeled (the label provides “supervision”.) As you can see from the steps above, we provide the response expected of an ideal “assistant,” and this will inform the training procedure to obtain a model that is better at producing these answers than the original one.

It is also crucial that moving forward, users interact with the model in a similar way to that shown in the data used for fine-tuning.

We can try the model out like this:

SELECT answer FROM ft_sql WHERE content = 'Hi! I have created a SQL table with this command: “CREATE TABLE employees (country VARCHAR)”. How should I query my database to know how many employees are living in Canada?';

Output answer column should be: SELECT COUNT(*) FROM employees WHERE country = "Canada". Much more succinct!

MindsDB features automatic model version control, and you access previous model versions through the MODEL_NAME.VERSION notation. We can already see how the answer is much more direct and to the point when compared to the original version, which you can retrigger by entering:

SELECT answer FROM ft_sql.1 WHERE content = 'Hi! I have created a SQL table with this command: “CREATE TABLE employees (country VARCHAR)”. How should I query my database to know how many employees are living in Canada?' USING max_tokens=1000;

This will give you a similar answer to the one in section 3.d.

As further clarification, while our newly fine-tuned model does behave quite differently, it is important to still check accuracy metrics empirically, as it is known that LLMs need “space” to produce better answers. This space is measured in tokens, which directly leads to a lengthier response.

This means that while lengthy answers may be more difficult to parse, depending on the task this could actually be a desirable property to increase accuracy in downstream tasks. (So, evaluate accordingly.)

Step 5: What about prompt engineering?

There is another interesting observation here. As demonstrated back when OpenAI released GPT-2, large language models can effectively learn “in-context,” which in today’s parlance equates to the so-called “prompt”.

This is why prompt engineering has been described as absolutely crucial in setting up LLM pipelines. (And with good reason.) Evidence shows that one well picked in-context example could have as much impact as dozens or hundreds of particular instances over which a model has been fine-tuned.

We’re in a perfect position to actually try this out.

Let’s take the base model once more, and modify the prompt so that it reads slightly differently:

CREATE MODEL ft_sql_succinct
PREDICT answer
engine = 'anyscale_endpoints',
model_name = 'mistralai/Mistral-7B-Instruct-v0.1',
prompt_template = 'Answer with the correct SQL query only, no explanations whatsoever. Here is the question: {{content}}';

If you query this model with the same content input, you will get: “SELECT COUNT(*) FROM employees WHERE country = 'Canada';”, which is precisely the effect we’re after.

This shows that for simple cases, good prompt templating can go a long way.

When is prompt engineering not enough?

Most LLM providers recommend you start with prompt engineering before performing any fine-tuning, and with good reason. We’ve just seen how effective it can be. When you fine-tune, you’re generating a new set of artifacts which makeup your new model, somewhere. In this case, the MindsDB-Anyscale stack handles the infrastructure overhead, but additional training and serving concerns are introduced (i.e. your solution is now more expensive to run) when compared to a single flexible prompt that can be used on base models at inference time.

The flipside is that if you have tons of data to fine-tune on, or if your behavioral constraints are such that prompt engineering won’t cut it, then fine-tunes become a much more attractive option. For example, your base model could have a particularly short token limit, let’s say 2K tokens total.

This means your prompt cannot include many examples, and depending on the nature of your LLM application, the intended behavior may require more data before the model is able to achieve this behavior. Or perhaps there is one such prompt that works, but it is so long that its added cost on inference is very high. In both these scenarios, it will make sense to fine-tune first (perhaps complemented with a good prompt).

On the other hand (and as previously mentioned) if what you need is recalling facts, then RAG is pretty much a necessary complement. It will massively help the model in avoiding fabricated information. We will explore RAG more closely in an upcoming guide.

Tips for getting started

As a final note, this fine-tuning run used 1.6 million tokens in our experiment. As of December 2023, the pricing for LLM fine tuning in Anyscale Endpoints works out to a grand total of $6.6 US dollars for the run. Not bad! Even so, when trying this out yourself, be careful to start small and set hard limits to avoid a huge unexpected bill at the end of the month.

We’ve seen how MindsDB and Anyscale Endpoints are a powerful combination to fine-tune open source large language models with your own data in a cost-effective and simple way. We’ve also explored the impact of fine tuning on a model’s behavior, and how it relates to prompt engineering. We hope this is useful, and good luck with your projects.

Check us out on GitHub.