Constitutional AI

Fine-Tuning GPT-2 for Code Generation with Constitutional AI

May 26, 2024

In recent years, large language models like GPT-2 have shown remarkable capabilities across natural language processing tasks. Using a small model like GPT-2 for code generation, however, gets comparatively little attention. In this blog post, we'll fine-tune GPT-2 to generate code snippets from instructions, while incorporating principles of Constitutional AI so that the generated code aligns with desired behaviors and constraints.

What is Constitutional AI?

Constitutional AI is an approach that aims to align AI systems with a set of predefined principles or "constitution" that guides their behavior. By incorporating these principles during the training process, we can steer the model to generate outputs that adhere to certain guidelines, such as producing safe, ethical, and reliable code.
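
As a rough illustration, the supervised phase of Constitutional AI has the model critique its own draft against a written principle and then revise it, and the revised drafts become fine-tuning data. The sketch below shows that loop with a pluggable `generate` callable. The principle wording, the prompt templates, and the `critique_and_revise` name are assumptions made for this example, not part of any library.

```python
from typing import Callable, Sequence

# Illustrative principles for code generation; the wording is an assumption.
CONSTITUTION = (
    "The code must not execute untrusted input with eval or exec.",
    "The code should be readable and include a short docstring.",
)

def critique_and_revise(task: str,
                        generate: Callable[[str], str],
                        principles: Sequence[str] = CONSTITUTION) -> str:
    """Draft a response, then critique and revise it once per principle."""
    draft = generate(f"Instruction: {task}\nOutput:")
    for principle in principles:
        # Ask the model to judge its own draft against one principle.
        critique = generate(
            f"Principle: {principle}\nCode:\n{draft}\n"
            "Critique the code against the principle:"
        )
        # Ask the model to rewrite the draft using that critique.
        draft = generate(
            f"Code:\n{draft}\nCritique: {critique}\n"
            "Rewrite the code to address the critique:"
        )
    return draft
```

In practice `generate` would wrap a language model call; any function that maps a prompt string to a completion string fits.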

Fine-Tuning GPT-2 for Code Generation

To fine-tune GPT-2 for code generation, we'll use the Hugging Face Transformers library with PyTorch. Here's a step-by-step breakdown of the process:

Load the Dataset: We start by loading a dataset containing instructions, input examples, and corresponding code snippets. In this example, we use the "iamtarun/python_code_instructions_18k_alpaca" dataset from the Hugging Face Datasets library.
Format the Data: We define functions to format each data point, which consists of an instruction, an input example, and the expected code output. The instruction and input are joined into a prompt string, and the expected output is appended after GPT-2's end-of-sequence token.
Load GPT-2 Model and Tokenizer: We load the pre-trained GPT-2 medium model and its associated tokenizer using the Transformers library. We also add a special padding token to handle variable-length sequences and resize the model's embeddings to match the enlarged vocabulary.
Encode the Data: We encode the formatted dataset using the GPT-2 tokenizer, converting the text into numerical representations that the model can understand. We truncate and pad the sequences to ensure consistent lengths.
Create a Custom Dataset and DataLoader: We define a custom PyTorch Dataset class to encapsulate the encoded data and create a DataLoader for efficient batch processing during training.
Fine-Tune the Model: We fine-tune the GPT-2 model using the prepared dataset. We use the AdamW optimizer and train for a specified number of epochs. To handle memory constraints, we employ gradient accumulation, where we accumulate gradients over multiple batches before updating the model parameters.
Generate Code: After fine-tuning, we can use the trained model to generate code snippets based on given instructions. We define a generate_code function that takes a prompt as input and generates the corresponding code using the fine-tuned GPT-2 model.
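
To make the gradient accumulation step concrete, here is a small worked example in plain Python. It is a sketch under simple assumptions (a one-parameter linear model, mean squared error, equally sized micro-batches, made-up data points) and only illustrates why dividing each micro-batch loss by `accumulation_steps` reproduces the gradient of one larger batch.

```python
def mse_grad(w, batch):
    """Gradient of mean squared error for y ~ w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Toy data, invented for this example.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.5
micro_batch_size = 2      # same as batch_size=2 in the training loop
accumulation_steps = 2    # the post uses 8; 2 keeps the example small

# Gradient from one large batch of micro_batch_size * accumulation_steps samples.
full_grad = mse_grad(w, data)

# Accumulated gradient: each micro-batch gradient is divided by
# accumulation_steps, mirroring `loss = loss / accumulation_steps`.
accumulated = 0.0
for step in range(accumulation_steps):
    start = step * micro_batch_size
    micro_batch = data[start:start + micro_batch_size]
    accumulated += mse_grad(w, micro_batch) / accumulation_steps

effective_batch_size = micro_batch_size * accumulation_steps
print(effective_batch_size, abs(full_grad - accumulated) < 1e-9)  # 4 True
```

With the post's settings (batch size 2, 8 accumulation steps), the effective batch size is 16.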

Incorporating Constitutional AI Principles

To align the generated code with desired principles, we can incorporate Constitutional AI techniques during the fine-tuning process. Here are a few examples:

Filtering the Dataset: We can curate the training dataset to include only code snippets that adhere to specific coding standards, best practices, or ethical guidelines. This helps the model learn from high-quality and principled code examples.
Prompt Engineering: We can carefully craft the instructions and prompts to guide the model towards generating code that follows certain principles. For example, we can include prompts that emphasize code readability, efficiency, or security considerations.
Post-Processing and Validation: After generating code, we can apply post-processing steps to validate and refine the generated snippets. This can involve running static code analysis tools, checking for common vulnerabilities, or verifying adherence to coding conventions.
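
A minimal sketch of how one rule check could serve both dataset filtering and post-generation validation, using Python's standard `ast` module. The banned-call list and the `check_snippet` and `filter_examples` names are assumptions for illustration; the `output` key matches the dataset field of the same name.

```python
import ast

# Illustrative rule set; which calls to forbid is an assumption for this sketch.
BANNED_CALLS = {"eval", "exec"}

def check_snippet(code: str) -> list:
    """Return a list of rule violations; an empty list means the snippet passed."""
    try:
        tree = ast.parse(code)
    except SyntaxError as err:
        return [f"syntax error: {err.msg}"]
    problems = []
    for node in ast.walk(tree):
        # Flag direct calls such as eval(...) or exec(...).
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in BANNED_CALLS):
            problems.append(f"banned call: {node.func.id}")
    return problems

def filter_examples(examples):
    """Keep only training examples whose 'output' field passes every check."""
    return [ex for ex in examples if not check_snippet(ex["output"])]

examples = [
    {"output": "def add(a, b):\n    return a + b"},
    {"output": "result = eval(user_input)"},
    {"output": "def broken(:"},
]
clean = filter_examples(examples)
print(len(clean))  # 1
```

The same `check_snippet` function can be run on generated code before it is shown to a user. A syntax check plus a short deny-list is only a starting point; a real pipeline would likely add a linter or security scanner.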

Code

!pip install transformers datasets torch tqdm

import torch
from torch.optim import AdamW  # the AdamW export in transformers is deprecated
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from datasets import load_dataset
from tqdm import tqdm

# Use MPS (Metal Performance Shaders) on Apple Silicon if available; otherwise CUDA, then CPU
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
print(f"Using device: {device}")

Using device: cuda

# Load dataset
dataset = load_dataset("iamtarun/python_code_instructions_18k_alpaca")['train']

# Sample data
print("Sample Data:")
print(dataset[0])

Downloading readme: 100% 905/905 [00:00<00:00, 76.9kB/s]
Downloading data: 100%|██████████| 11.4M/11.4M [00:00<00:00, 38.9MB/s]
Generating train split: 100% 18612/18612 [00:00<00:00, 95128.66 examples/s]

Sample Data:
{'instruction': 'Create a function to calculate the sum of a sequence of integers.', 'input': '[1, 2, 3, 4, 5]', 'output': '# Python code\ndef sum_sequence(sequence):\n  sum = 0\n  for num in sequence:\n    sum += num\n  return sum', 'prompt': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nCreate a function to calculate the sum of a sequence of integers.\n\n### Input:\n[1, 2, 3, 4, 5]\n\n### Output:\n# Python code\ndef sum_sequence(sequence):\n  sum = 0\n  for num in sequence:\n    sum += num\n  return sum'}


# Function to format data for training
def format_data(instruction, input_text, output_text):
    user_prompt = f"Instruction: {instruction}\nInput: {input_text}"
    assistant_response = f"Output: {output_text}"
    return user_prompt, assistant_response

# Function to format prompts
def format_prompt(messages):
    return "\n".join([msg['content'] for msg in messages])

# Formatting dataset
formatted_data = [format_data(item['instruction'], item['input'], item['output']) for item in dataset]
print("Formatted Data Sample:")
print(formatted_data[0])

# Load GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium").to(device)

# Adding special tokens
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))

# Create dataset class
class DatasetClass(Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        # The encodings are already tensors (return_tensors='pt'), so index them directly
        return {key: val[idx] for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

# Encode data
train_encodings = tokenizer([f"{q} {tokenizer.eos_token} {a}" for q, a in formatted_data], truncation=True, padding=True, return_tensors='pt')

# Check the encoding output to ensure indices are in range
print("Sample Encoded Data:")
print(train_encodings.input_ids[0])

# Create DataLoader
train_dataset = DatasetClass(train_encodings)
train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)

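# Fine-tuning parameters
optimizer = AdamW(model.parameters(), lr=1e-5)
epochs = 3
accumulation_steps = 8  # effective batch size = 2 * 8 = 16

model.train()
for epoch in range(epochs):
    epoch_loss = 0.0
    optimizer.zero_grad()
    for i, batch in enumerate(tqdm(train_loader), start=1):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        # Ignore padding positions in the loss (-100 is the label value the loss skips)
        labels = input_ids.clone()
        labels[attention_mask == 0] = -100
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        # Scale the loss so the accumulated gradient is an average over the accumulated batches
        loss = outputs.loss / accumulation_steps
        loss.backward()
        epoch_loss += outputs.loss.item()
        # Update every accumulation_steps batches, and once more for a final partial group
        if i % accumulation_steps == 0 or i == len(train_loader):
            optimizer.step()
            optimizer.zero_grad()
        # Free up memory
        del input_ids, attention_mask, labels, outputs, loss
        torch.cuda.empty_cache()
    print(f"Epoch {epoch + 1}/{epochs} Mean loss: {epoch_loss / len(train_loader):.4f}")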

# Put the model in evaluation mode
model.eval()

# Function to generate code based on a prompt
def generate_code(prompt, max_length=125):
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    # Pass the attention mask along with the input IDs so generate() does not have to infer it
    outputs = model.generate(**inputs, max_length=max_length, pad_token_id=tokenizer.eos_token_id)
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_text

# Testing the model
test_prompt = "Write me a python function that adds 2 numbers together."
generated_code = generate_code(test_prompt)
print(f"Generated Code:\n{generated_code}")

# Function for interactive code generation
def interactive_code_generation():
    while True:
        prompt = input("Enter instruction (or 'exit' to stop): ")
        if prompt.lower() == 'exit':
            break
        generated_code = generate_code(prompt)
        print(f"Generated Code:\n{generated_code}\n")

interactive_code_generation()

Formatted Data Sample:
('Instruction: Create a function to calculate the sum of a sequence of integers.\nInput: [1, 2, 3, 4, 5]', 'Output: # Python code\ndef sum_sequence(sequence):\n  sum = 0\n  for num in sequence:\n    sum += num\n  return sum')
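
The test prompt in the code is free-form, while the training strings follow the layout `Instruction: ...`, `Input: ...`, the end-of-sequence token, then `Output: ...`. A prompt that mirrors that layout may give the fine-tuned model a more familiar starting point. The helper below is a minimal sketch: `build_inference_prompt` and `extract_completion` are hypothetical names, and the default end-of-sequence string assumes GPT-2's `<|endoftext|>` token.

```python
GPT2_EOS = "<|endoftext|>"  # GPT-2's end-of-sequence token (tokenizer.eos_token)

def build_inference_prompt(instruction: str, input_text: str = "",
                           eos_token: str = GPT2_EOS) -> str:
    """Mirror the training layout '<prompt> <eos> Output: <code>', minus the code."""
    user_prompt = f"Instruction: {instruction}\nInput: {input_text}"
    return f"{user_prompt} {eos_token} Output:"

def extract_completion(generated_text: str) -> str:
    """Return the text after the first 'Output:' marker, or the whole text if absent."""
    return generated_text.split("Output:", 1)[-1].strip()

prompt = build_inference_prompt("Write a Python function that adds two numbers.")
print(repr(prompt))
```

The result of `build_inference_prompt` can be passed to `generate_code` in place of the raw instruction, and `extract_completion` can trim the echoed prompt from the decoded text.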

Conclusion

Fine-tuning GPT-2 for code generation using Constitutional AI principles offers exciting possibilities for automating code development while ensuring the generated code aligns with desired behaviors and constraints. By leveraging the power of large language models and incorporating principled training techniques, we can create AI systems that generate high-quality, safe, and reliable code.

However, it's important to note that generated code should always be reviewed and tested thoroughly before deployment in production environments. While Constitutional AI can guide the model towards generating principled code, it's ultimately the responsibility of developers to ensure the correctness and security of the generated code.

Al-Ekram’s Substack
