The training and inference of large language models (LLMs) involve high energy consumption and long response times. These challenges limit the deployment of LLMs in resource-constrained environments. To address these challenges, Platform for AI (PAI) provides the model distillation feature. This feature transfers knowledge from LLMs to small models, which significantly reduces the model size and computing resource requirements while retaining most of the performance. This allows LLMs to be applied in more actual application scenarios. This topic uses the Qwen2 model to describe how to develop a data augmentation and model distillation solution for LLMs.
Working process
The following steps describe the complete development process of the solution:
Prepare instruction data
You can prepare a training dataset based on the data format requirements and data preparation strategies.
(Optional) Use an instruction augmentation model
You can use the preset instruction augmentation model Qwen2-1.5B-Instruct-Exp or Qwen2-7B-Instruct-Exp in Model Gallery of Platform for AI (PAI) to automatically augment similar instructions based on the semantics of instructions in the training dataset that you prepared. Instruction augmentation helps improve the generalization of distillation training for LLMs.
(Optional) Use an instruction optimization model
You can use the preset instruction optimization model Qwen2-1.5B-Instruct-Refine or Qwen2-7B-Instruct-Refine in Model Gallery of PAI to optimize the instructions and augmented instructions in the training dataset that you prepared. Instruction optimization helps improve the language generation capabilities of LLMs.
Deploy a teacher model service to generate a response
You can use the preset teacher LLM Qwen2-72B-Instruct in Model Gallery of PAI to generate a response to the instructions in the training dataset and distill the knowledge of the teacher LLM.
Distill and train a student model
You can use the generated instruction-response dataset in Model Gallery of PAI to distill and train a smaller student model that fits your actual application scenarios.
Prerequisites
Before you perform the operations that are described in this topic, make sure that you have made the following preparations:
PAI-DLC and PAI-EAS are activated and the default workspace is created. For more information, see Activate PAI and create a default workspace.
An OSS bucket is created to store training data and the model file. For more information, see Create buckets.
Prepare instruction data
For information about how to prepare instruction data, see Data preparation strategies and Data format requirements.
Data preparation strategies
To improve the effectiveness and stability of model distillation, you can prepare instruction data based on the following strategies:
Prepare at least several hundred data records. A larger amount of data helps improve the effectiveness of the model.
Make sure that the training dataset has a broad and balanced distribution. For example, task scenarios are diverse, and input and output data covers both short and long forms. If the data involves multiple languages, such as Chinese and English, make sure that the language distribution is balanced.
Handle abnormal data. A small amount of abnormal data can have a significant impact on the fine-tuning effectiveness. We recommend that you cleanse the data based on rules and filter out abnormal data in the training dataset, as shown in the sketch after this list.
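The following minimal sketch shows one possible rule-based cleansing pass over a dataset in the format described in the next section. The specific rules (deduplication and a length threshold) and the threshold value are assumptions that you should adapt to your own data:
import json

# A minimal rule-based cleansing sketch. The rules below (drop empty records,
# drop overly long instructions, and drop exact duplicates) and the threshold
# are examples only; adapt them to your own data.
MAX_CHARS = 2000  # Assumed upper bound for instruction length.

with open("input.json") as fp:
    data = json.load(fp)

seen = set()
cleaned = []
for record in data:
    instruction = record.get("instruction", "").strip()
    if not instruction or len(instruction) > MAX_CHARS:
        continue  # Drop empty or overly long instructions.
    if instruction in seen:
        continue  # Drop exact duplicates.
    seen.add(instruction)
    cleaned.append({"instruction": instruction})

with open("cleaned.json", "w") as f:
    json.dump(cleaned, f, ensure_ascii=False)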
Data format requirements
A training dataset is a JSON file in which each record contains the instruction field. The following sample code provides an example of instruction data:
[
    {
        "instruction": "What were the main measures taken by governments to stabilize financial markets during the 2008 financial crisis?"
    },
    {
        "instruction": "In the context of increasing climate change, what important actions have governments taken to promote sustainable development?"
    },
    {
        "instruction": "What were the main measures taken by governments to support economic recovery during the bursting of the tech bubble in 2001?"
    }
]
(Optional) Use an instruction augmentation model
Instruction augmentation is a common prompt engineering technique for LLMs. It automatically expands a user-provided training dataset to achieve data augmentation.
For example, you can provide the following inputs:
How to make fish-fragrant shredded pork?
How to prepare for the Graduate Record Examination (GRE) exam?
What can I do if I am misunderstood by a friend?
The model returns the following results:
Teach me how to make mapo tofu.
Provide a detailed guide on how to prepare for the Test of English as a Foreign Language (TOEFL) exam.
How will you adjust your mindset if you encounter setbacks in your work?
The diversity of instructions affects how well LLMs generalize during training, and instruction augmentation is an efficient way to improve the effectiveness of the generated student models. PAI provides the following independently developed instruction augmentation models based on the Qwen2 base model: Qwen2-1.5B-Instruct-Exp and Qwen2-7B-Instruct-Exp. You can deploy a model service in Model Gallery of PAI with a few clicks.
Deploy a model service
You can deploy an instruction augmentation model as an online service in Elastic Algorithm Service (EAS) by performing the following steps:
Go to the Model Gallery page.
Log on to the PAI console.
In the upper-left corner, select a region based on your business requirements.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to use.
In the left-side navigation pane, choose QuickStart > Model Gallery.
In the model list of the Model Gallery page, search for Qwen2-1.5B-Instruct-Exp or Qwen2-7B-Instruct-Exp and click Deploy in the desired model card.
In the Deploy panel, the parameters in the Model Service Information and Resource Deployment Information sections are automatically configured. You can modify the parameters based on your business requirements. After you configure the parameters, click Deploy.
In the Billing Notification message, click OK.
The service details page appears. When the value of the Status parameter changes to Running, the model service is deployed.
Call a model service
After you deploy a model service, you can use an API for model inference. For more information, see Deploy an LLM as a service. The following example shows how to call the model service from a client:
Obtain the endpoint and token of the model service.
In the Basic Information section of the service details page, click View Call Information.
In the Call Information dialog box, view and save the endpoint and token of the model service to your on-premises machine.
In the terminal, create and run the following Python script to call the model service:
import argparse
import json
import requests
from typing import List


def post_http_request(prompt: str,
                      system_prompt: str,
                      host: str,
                      authorization: str,
                      max_new_tokens: int,
                      temperature: float,
                      top_k: int,
                      top_p: float) -> requests.Response:
    headers = {
        "User-Agent": "Test Client",
        "Authorization": f"{authorization}"
    }
    pload = {
        "prompt": prompt,
        "system_prompt": system_prompt,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "do_sample": True,
        "eos_token_id": 151645
    }
    response = requests.post(host, headers=headers, json=pload)
    return response


def get_response(response: requests.Response) -> List[str]:
    data = json.loads(response.content)
    output = data["response"]
    return output


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--top-k", type=int, default=50)
    parser.add_argument("--top-p", type=float, default=0.95)
    parser.add_argument("--max-new-tokens", type=int, default=2048)
    parser.add_argument("--temperature", type=float, default=1)
    parser.add_argument("--prompt", type=str, default="Sing me a song.")
    args = parser.parse_args()

    prompt = args.prompt
    top_k = args.top_k
    top_p = args.top_p
    temperature = args.temperature
    max_new_tokens = args.max_new_tokens

    host = "EAS HOST"
    authorization = "EAS TOKEN"

    print(f" --- input: {prompt}\n", flush=True)
    system_prompt = "I want you to play the role of an instruction creator. Your goal is to take inspiration from [a given instruction] and create an instruction."
    response = post_http_request(
        prompt, system_prompt,
        host, authorization,
        max_new_tokens, temperature, top_k, top_p)
    output = get_response(response)
    print(f" --- output: {output}\n", flush=True)
Take note of the following parameters:
host: the endpoint of your model service.
authorization: the token of your model service.
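The sample scripts in this topic use the EAS HOST and EAS TOKEN placeholders. As one option, you can read these values from environment variables instead of hardcoding them. The variable names in the following sketch are assumptions, not a PAI requirement:
import os

# Optional: read the service endpoint and token from environment variables.
# The names EAS_HOST and EAS_TOKEN are example names.
host = os.environ["EAS_HOST"]
authorization = os.environ["EAS_TOKEN"]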
Augment multiple instructions at a time
You can call the preceding EAS model service to augment multiple instructions at a time. The following sample code provides an example on how to read a custom JSON training dataset and call the model service for instruction augmentation. Create and run the following Python script in the terminal:
import requests
import json
import random
from tqdm import tqdm
from typing import List

input_file_path = "input.json"  # The name of the input file.
with open(input_file_path) as fp:
    data = json.load(fp)

total_size = 10  # The expected number of data records after expansion.
pbar = tqdm(total=total_size)
while len(data) < total_size:
    prompt = random.sample(data, 1)[0]["instruction"]
    system_prompt = "I want you to play the role of an instruction creator. Your goal is to take inspiration from [a given instruction] and create an instruction."
    top_k = 50
    top_p = 0.95
    temperature = 1
    max_new_tokens = 2048
    host = "EAS HOST"
    authorization = "EAS TOKEN"
    response = post_http_request(
        prompt, system_prompt,
        host, authorization,
        max_new_tokens, temperature, top_k, top_p)
    output = get_response(response)
    temp = {
        "instruction": output
    }
    data.append(temp)
    pbar.update(1)
pbar.close()

output_file_path = "output.json"  # The name of the output file.
with open(output_file_path, 'w') as f:
    json.dump(data, f, ensure_ascii=False)
Take note of the following parameters:
host: the endpoint of your model service.
authorization: the token of your model service.
input_file_path and output_file_path: the on-premises paths of the input and output dataset files.
The definitions of the post_http_request and get_response functions are the same as the definitions of the corresponding functions in the Python script in Call a model service.
You can also use the LLM-GenerateInstructionData (DLC) component of Machine Learning Designer to augment instructions. For more information, see Custom pipelines.
(Optional) Use an instruction optimization model
Instruction optimization is another common prompt engineering technique for LLMs. It automatically optimizes the instructions in a user-provided training dataset to generate more detailed instructions, which enable LLMs to return more detailed responses.
For example, you can provide the following inputs to an instruction optimization model:
How to make fish-fragrant shredded pork?
How to prepare for the GRE exam?
What can I do if I am misunderstood by a friend?
The model returns the following results:
Provide a detailed recipe of Chinese Sichuan-style fish-fragrant shredded pork. The recipe contains a list of specific ingredients, such as vegetables, pork, and spices, and detailed cooking instructions. If possible, recommend side dishes and other main courses that pair well with this dish.
Provide a detailed guide, including registration for the GRE test, required materials, test preparation strategies, and recommended review materials. If possible, recommend some effective practice questions and mock exams to help me prepare for the exam.
Provide a detailed guide to teach me how to be calm and rational, and communicate effectively to solve the problem when I am misunderstood by my friends. Provide some practical suggestions, such as how to express my thoughts and feelings, and how to avoid aggravating misunderstandings, and provide specific dialogue scenarios and situations so that I can better understand and practice.
The level of detail in the instructions affects the outputs of LLMs, and instruction optimization is an efficient way to improve the effectiveness of the generated student models. PAI provides the following independently developed instruction optimization models based on the Qwen2 base model: Qwen2-1.5B-Instruct-Refine and Qwen2-7B-Instruct-Refine. You can deploy a model service in Model Gallery of PAI with a few clicks.
Deploy a model service
Go to the Model Gallery page.
Log on to the PAI console.
In the upper-left corner, select a region based on your business requirements.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to use.
In the left-side navigation pane, choose QuickStart > Model Gallery.
In the model list of the Model Gallery page, search for Qwen2-1.5B-Instruct-Refine or Qwen2-7B-Instruct-Refine and click Deploy in the desired model card.
In the Deploy panel, the parameters in the Model Service Information and Resource Deployment Information sections are automatically configured. You can modify the parameters based on your business requirements. After you configure the parameters, click Deploy.
In the Billing Notification message, click OK.
The service details page appears. When the value of the Status parameter changes to Running, the model service is deployed.
Call a model service
After you deploy a model service, you can use an API for model inference. For more information, see Deploy an LLM as a service. The following example shows how to call the model service from a client:
Obtain the endpoint and token of the model service.
In the Basic Information section of the service details page, click View Call Information.
In the Call Information dialog box, view and save the endpoint and token of the model service to your on-premises machine.
In the terminal, create and run the following Python script to call the model service:
import argparse
import json
import requests
from typing import List


def post_http_request(prompt: str,
                      system_prompt: str,
                      host: str,
                      authorization: str,
                      max_new_tokens: int,
                      temperature: float,
                      top_k: int,
                      top_p: float) -> requests.Response:
    headers = {
        "User-Agent": "Test Client",
        "Authorization": f"{authorization}"
    }
    pload = {
        "prompt": prompt,
        "system_prompt": system_prompt,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "do_sample": True,
        "eos_token_id": 151645
    }
    response = requests.post(host, headers=headers, json=pload)
    return response


def get_response(response: requests.Response) -> List[str]:
    data = json.loads(response.content)
    output = data["response"]
    return output


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--top-k", type=int, default=2)
    parser.add_argument("--top-p", type=float, default=0.95)
    parser.add_argument("--max-new-tokens", type=int, default=256)
    parser.add_argument("--temperature", type=float, default=0.5)
    parser.add_argument("--prompt", type=str, default="Sing me a song.")
    args = parser.parse_args()

    prompt = args.prompt
    top_k = args.top_k
    top_p = args.top_p
    temperature = args.temperature
    max_new_tokens = args.max_new_tokens

    host = "EAS HOST"
    authorization = "EAS TOKEN"

    print(f" --- input: {prompt}\n", flush=True)
    system_prompt = "Optimize this instruction and change it to a more detailed and specific instruction."
    response = post_http_request(
        prompt, system_prompt,
        host, authorization,
        max_new_tokens, temperature, top_k, top_p)
    output = get_response(response)
    print(f" --- output: {output}\n", flush=True)
Take note of the following parameters:
host: the endpoint of your model service.
authorization: the token of your model service.
Optimize multiple instructions at a time
You can call the preceding EAS model service to optimize multiple instructions at a time. The following sample code provides an example on how to read a custom JSON training dataset and call the model service for instruction optimization. Create and run the following Python script in the terminal:
import requests
import json
import random
from tqdm import tqdm
from typing import List

input_file_path = "input.json"  # The name of the input file.
with open(input_file_path) as fp:
    data = json.load(fp)

pbar = tqdm(total=len(data))
new_data = []
for d in data:
    prompt = d["instruction"]
    system_prompt = "Optimize the following instruction."
    top_k = 50
    top_p = 0.95
    temperature = 1
    max_new_tokens = 2048
    host = "EAS HOST"
    authorization = "EAS TOKEN"
    response = post_http_request(
        prompt, system_prompt,
        host, authorization,
        max_new_tokens, temperature, top_k, top_p)
    output = get_response(response)
    temp = {
        "instruction": output
    }
    new_data.append(temp)
    pbar.update(1)
pbar.close()

output_file_path = "output.json"  # The name of the output file.
with open(output_file_path, 'w') as f:
    json.dump(new_data, f, ensure_ascii=False)
Take note of the following parameters:
host: the endpoint of your model service.
authorization: the token of your model service.
input_file_path and output_file_path: the on-premises paths of the input and output dataset files.
The definitions of the post_http_request and get_response functions are the same as the definitions of the corresponding functions in the Python script in Call a model service.
You can also use the LLM-OptimizeInstructionData (DLC) component of Machine Learning Designer to optimize instructions. For more information, see Custom pipelines.
Deploy a teacher model service to generate a response
Deploy a model service
After you optimize the instructions in the training dataset, you can deploy a teacher model service to generate a response by performing the following steps:
Go to the Model Gallery page.
Log on to the PAI console.
In the upper-left corner, select a region based on your business requirements.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to use.
In the left-side navigation pane, choose QuickStart > Model Gallery.
In the model list of the Model Gallery page, search for Qwen2-72B-Instruct and click Deploy in the model card.
In the Deploy panel, the parameters in the Model Service Information and Resource Deployment Information sections are automatically configured. You can modify the parameters based on your business requirements. After you configure the parameters, click Deploy.
In the Billing Notification message, click OK.
The service details page appears. When the value of the Status parameter changes to Running, the model service is deployed.
Call a model service
After you deploy a model service, you can use an API for model inference. For more information, see Deploy an LLM as a service. The following example shows how to call the model service from a client:
Obtain the endpoint and token of the model service.
In the Basic Information section of the service details page, click View Call Information.
In the Call Information dialog box, view and save the endpoint and token of the model service to your on-premises machine.
In the terminal, create and run the following Python script to call the model service:
import argparse
import json
import requests
from typing import List


def post_http_request(prompt: str,
                      system_prompt: str,
                      host: str,
                      authorization: str,
                      max_new_tokens: int,
                      temperature: float,
                      top_k: int,
                      top_p: float) -> requests.Response:
    headers = {
        "User-Agent": "Test Client",
        "Authorization": f"{authorization}"
    }
    pload = {
        "prompt": prompt,
        "system_prompt": system_prompt,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "do_sample": True,
    }
    response = requests.post(host, headers=headers, json=pload)
    return response


def get_response(response: requests.Response) -> List[str]:
    data = json.loads(response.content)
    output = data["response"]
    return output


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--top-k", type=int, default=50)
    parser.add_argument("--top-p", type=float, default=0.95)
    parser.add_argument("--max-new-tokens", type=int, default=2048)
    parser.add_argument("--temperature", type=float, default=0.5)
    parser.add_argument("--prompt", type=str)
    parser.add_argument("--system_prompt", type=str)
    args = parser.parse_args()

    prompt = args.prompt
    system_prompt = args.system_prompt
    top_k = args.top_k
    top_p = args.top_p
    temperature = args.temperature
    max_new_tokens = args.max_new_tokens

    host = "EAS HOST"
    authorization = "EAS TOKEN"

    print(f" --- input: {prompt}\n", flush=True)
    response = post_http_request(
        prompt, system_prompt,
        host, authorization,
        max_new_tokens, temperature, top_k, top_p)
    output = get_response(response)
    print(f" --- output: {output}\n", flush=True)
Take note of the following parameters:
host: the endpoint of your model service.
authorization: the token of your model service.
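The sample scripts in this topic assume that each request succeeds. When you send a large number of requests, you may want to add basic error handling. The following sketch wraps the post_http_request and get_response functions from the preceding script in a simple retry loop; the retry count and wait time are assumptions:
import time
import requests

# A simple retry wrapper around the post_http_request and get_response
# functions defined in the preceding script. The retry count and wait time
# are example values; adjust them based on your service quotas.
def call_with_retries(prompt, system_prompt, host, authorization,
                      max_new_tokens, temperature, top_k, top_p,
                      max_retries=3, wait_seconds=5):
    for attempt in range(max_retries):
        try:
            response = post_http_request(
                prompt, system_prompt, host, authorization,
                max_new_tokens, temperature, top_k, top_p)
            if response.status_code == 200:
                return get_response(response)
        except requests.RequestException:
            pass  # Retry after network errors.
        time.sleep(wait_seconds)
    raise RuntimeError(f"The request failed after {max_retries} attempts.")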
Use the teacher model service to label multiple instructions at a time
The following sample code provides an example on how to read a custom JSON training dataset and call the preceding teacher model service to label multiple instructions at a time. Create and run the following Python script in the terminal:
import json
from tqdm import tqdm
import requests
from typing import List

input_file_path = "input.json"  # The name of the input file.
with open(input_file_path) as fp:
    data = json.load(fp)

pbar = tqdm(total=len(data))
new_data = []
for d in data:
    system_prompt = "You are a helpful assistant."
    prompt = d["instruction"]
    print(prompt)
    top_k = 50
    top_p = 0.95
    temperature = 0.5
    max_new_tokens = 2048
    host = "EAS HOST"
    authorization = "EAS TOKEN"
    response = post_http_request(
        prompt, system_prompt,
        host, authorization,
        max_new_tokens, temperature, top_k, top_p)
    output = get_response(response)
    temp = {
        "instruction": prompt,
        "output": output
    }
    new_data.append(temp)
    pbar.update(1)
pbar.close()

output_file_path = "output.json"  # The name of the output file.
with open(output_file_path, 'w') as f:
    json.dump(new_data, f, ensure_ascii=False)
Take note of the following parameters:
host: the endpoint of your model service.
authorization: the token of your model service.
input_file_path and output_file_path: the on-premises paths of the input and output dataset files.
The definitions of the post_http_request and get_response functions are the same as the definitions of the corresponding functions in the Python script in Call a model service.
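For reference, the generated output.json file has the following structure. This is the instruction-response dataset that is used to train the student model in the next section; the response text below is illustrative only:
[
    {
        "instruction": "What were the main measures taken by governments to stabilize financial markets during the 2008 financial crisis?",
        "output": "The response generated by the teacher model."
    }
]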
Distill and train a student model
Train a model
After you obtain a response from a teacher model service, you can train a student model in Model Gallery of PAI without the need to write code. This greatly simplifies the model development process. In this example, the Qwen2-7B-Instruct model is used to describe how to use the prepared training data to train a model in Model Gallery of PAI. Perform the following steps:
Go to the Model Gallery page.
Log on to the PAI console.
In the upper-left corner, select a region based on your business requirements.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to use.
In the left-side navigation pane, choose QuickStart > Model Gallery.
In the model list of the Model Gallery page, search for and click the Qwen2-7B-Instruct model.
In the upper-right corner of the model details page, click Train.
In the Train panel, configure the following key parameters. Use the default settings of the other parameters.
Dataset Configuration
Training dataset: Select OSS file or directory from the drop-down list and perform the following steps to select the Object Storage Service (OSS) directory of the dataset file:
Click the icon. In the Select OSS file dialog box, select an OSS bucket.
Click Upload File and follow the instructions to upload a dataset file to the OSS directory.
Click OK.
Default value: None.
Model Output Path
model: Click the icon to select an OSS directory. Default value: None.
tensorboard: Click the icon to select an OSS directory. Default value: None.
Computing resources
Job Resource: Configure the resource specifications. The system recommends appropriate resource specifications. Default value: None.
Hyper-parameters
learning_rate: The learning rate during model training. The value must be of the FLOAT type. Default value: 5e-5.
num_train_epochs: The number of training epochs. The value must be of the INT type. Default value: 1.
per_device_train_batch_size: The amount of data used by each GPU in one training iteration. The value must be of the INT type. Default value: 1.
seq_length: The length of the text sequence. The value must be of the INT type. Default value: 128.
lora_dim: The inner dimensions of the low-rank matrices that are used in Low-Rank Adaptation (LoRA) or QLoRA training. The value must be of the INT type. Set this parameter to a value greater than 0. Default value: 32.
lora_alpha: The LoRA or QLoRA weights. The value must be of the INT type. This parameter takes effect only if you set the lora_dim parameter to a value greater than 0. Default value: 32.
load_in_4bit: Specifies whether to load the model in 4-bit quantization. The value must be of the BOOLEAN type. Valid values: true and false. This parameter takes effect only if you set the lora_dim parameter to a value greater than 0 and the load_in_8bit parameter to false. Default value: true.
load_in_8bit: Specifies whether to load the model in 8-bit quantization. The value must be of the BOOLEAN type. Valid values: true and false. This parameter takes effect only if you set the lora_dim parameter to a value greater than 0 and the load_in_4bit parameter to false. Default value: false.
gradient_accumulation_steps: The number of gradient accumulation steps. The value must be of the INT type. Default value: 8.
apply_chat_template: Specifies whether the algorithm combines the training data with the default chat template to optimize the model output. The value must be of the BOOLEAN type. Valid values: true and false. In this example, a Qwen2 model is used in the following format:
Question: <|im_end|>\n<|im_start|>user\n + instruction + <|im_end|>\n
Answer: <|im_start|>assistant\n + output + <|im_end|>\n
Default value: true.
system_prompt: The system prompt used to train the model. The value must be of the STRING type. Default value: You are a helpful assistant.
After you configure the parameters, click Train.
In the Billing Notification message, click OK.
The training job details page appears.
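For reference, when the apply_chat_template parameter is set to true, the training algorithm combines each instruction-output pair with the Qwen2 chat template in the format described above. The following sketch only illustrates how a record maps to that format; you do not need to apply the template yourself:
# Illustrative only: shows how an instruction-output record maps to the Qwen2
# chat format described above. The training algorithm applies the template for
# you when apply_chat_template is set to true.
record = {
    "instruction": "What were the main measures taken by governments to stabilize financial markets during the 2008 financial crisis?",
    "output": "..."  # The response generated by the teacher model.
}
question = "<|im_end|>\n<|im_start|>user\n" + record["instruction"] + "<|im_end|>\n"
answer = "<|im_start|>assistant\n" + record["output"] + "<|im_end|>\n"
print(question + answer)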
Deploy a model service
After you train a model, you can deploy the model as an online service in EAS by performing the following steps:
In the upper-right corner of the training job details page, click Deploy.
In the Deploy panel, the parameters in the Model Service Information and Resource Deployment Information sections are automatically configured. You can modify the parameters based on your business requirements. After you configure the parameters, click Deploy.
In the Billing Notification message, click OK.
The service details page appears. When the value of the Status parameter changes to Running, the model service is deployed.
Call a model service
After you deploy a model service, you can use an API for model inference. For more information, see Deploy an LLM as a service.
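As a quick test, you can reuse the post_http_request and get_response functions from the preceding scripts to send a request to the deployed student model service. The following sketch assumes that the student service accepts the same request format as the preceding services; replace EAS HOST and EAS TOKEN with the call information of the new service:
# A minimal test call to the distilled student model service. post_http_request
# and get_response are the functions defined in Call a model service, and the
# request format is assumed to be the same as for the preceding services.
host = "EAS HOST"            # The endpoint of the student model service.
authorization = "EAS TOKEN"  # The token of the student model service.

prompt = "What were the main measures taken by governments to stabilize financial markets during the 2008 financial crisis?"
system_prompt = "You are a helpful assistant."

response = post_http_request(
    prompt, system_prompt,
    host, authorization,
    max_new_tokens=2048, temperature=0.5, top_k=50, top_p=0.95)
print(get_response(response))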
References
For more information about EAS, see EAS overview.
You can use Model Gallery of PAI to train and deploy models such as Llama-3, Qwen1.5, and Stable Diffusion V1.5 in different scenarios. For more information, see Scenario-specific practices.