Customizing and fine-tuning LLMs: What you need to know (2024)

How to write function in Python to reverse a string
How to write SQL query to select users from a database by age
How to implement binary search in Java

How often do you have to break your flow, leave your IDE, and search for answers to questions like the ones above? And how often do you get distracted along the way and end up watching cat videos instead of getting back to work? (It happens to the best of us, including GitHub’s VP of Developer Relations, Martin Woodward.)

It doesn’t have to be that way. Getting AI coding assistance directly in the workspace has been found to reduce context switching and conserve a developer’s mental energy. When integrated directly into workspaces, these tools become familiar enough with a developer’s code to quickly provide tailored suggestions. Now, without getting sidetracked, developers can get customized answers to coding questions like:

Can you suggest a better way to structure my code for scalability?
Can you help me debug this function? It's not returning the expected results.
Can you help me understand this piece of code in this repository?

But how do AI coding assistants provide customized answers? What can organizations and developers do to receive more tailored solutions? And how, ultimately, do customized AI coding assistants benefit organizations as a whole?

We talked to Alireza Goudarzi, a senior machine learning researcher at GitHub, to get the answers. ⬇️

How AI coding assistants provide customized answers

When it comes to problem solving, context is everything.

Business decision makers use information gathered from internal metrics, customer meetings, employee feedback, and more to make decisions about what resources their companies need. Meanwhile, developers use details from pull requests, a folder in a project, open issues, and more to solve coding problems.

Large language models, or LLMs, do something similar:

  • Generative AI coding tools are powered by LLMs, which are sets of algorithms trained on large amounts of code and human language.
  • Today’s LLMs are structured as transformers, a kind of architecture that makes the model good at connecting the dots between data. Following the transformer architecture is what enables today’s LLMs to generate responses that are more contextually relevant than previous AI models.
  • Though transformer LLMs are good at connecting the dots, they need to learn what data to process and in what order.
  • A generative AI coding assistant in the IDE can be instructed to use data from open files or code written before and after the cursor to understand the context around the current line of code and suggest a relevant code completion.
  • As a chatbot in an IDE or on a website, a generative AI coding assistant can provide guidance by using data from indexed repositories, customized knowledge bases, developer-provided input in a prompt or query, and even search engine integrations.

All input data—the code, query, and additional context—passes through something called a context window, which is present in all transformer-based LLMs. The size of the context window represents the capacity of data an LLM can process. The window can’t hold an infinite amount of data, though context windows continue to grow larger. Because the window is limited, prompt engineers have to figure out what data, and in what order, to feed the model so it generates the most useful, contextually relevant responses for the developer.
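To make that concrete, here’s a minimal sketch in Python (not GitHub Copilot’s actual implementation) of how an assistant might pack prioritized context into a fixed window before sending a prompt to the model. The window size and the whitespace-based token count are simplifications for illustration.

```python
# A minimal sketch (not GitHub Copilot's actual implementation) of packing
# prioritized context into a fixed context window.
MAX_CONTEXT_TOKENS = 4096  # hypothetical window size

def count_tokens(text: str) -> int:
    # Real systems use the model's tokenizer; splitting on whitespace
    # is a rough stand-in for illustration.
    return len(text.split())

def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Pack the highest-priority chunks first, stopping when the window is full."""
    budget = MAX_CONTEXT_TOKENS - count_tokens(query)
    selected = []
    for chunk in context_chunks:  # assumed to be ordered most relevant first
        cost = count_tokens(chunk)
        if cost > budget:
            break
        selected.append(chunk)
        budget -= cost
    return "\n\n".join(selected + [query])

prompt = build_prompt(
    "Can you suggest a better way to structure my code for scalability?",
    ["<code before the cursor>", "<code after the cursor>", "<contents of neighboring tabs>"],
)
```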

How to customize your LLM

Customizing an LLM is not the same as training it. Training an LLM means building the scaffolding and neural networks to enable deep learning. Customizing an LLM means adapting a pre-trained LLM to specific tasks, such as generating information about a specific repository or updating your organization’s legacy code into a different language.

There are a few approaches to customizing your LLM: retrieval augmented generation, in-context learning, and fine-tuning.

We broke these down in this post about the architecture of today’s LLM applications and how GitHub Copilot is getting better at understanding your code. Here’s a recap.

Retrieval-augmented generation (RAG)

RAG typically uses something called embeddings to retrieve information from a vector database. Vector databases are a big deal because they transform your source code into retrievable data while maintaining the code’s semantic complexity and nuance.

In practice, that means an LLM-based coding assistant using RAG can generate relevant answers to questions about a private repository or proprietary source code. It also means that LLMs can use information from external search engines to generate their responses.

If you’re wondering what a vector database is, we have you covered:

  • Vector databases store embeddings of your repository code and documentation. The embeddings are what make your code and documentation readable by an LLM. (This is similar to the way programming languages are converted into binary for a computer to understand.)
  • As developers code in an IDE, algorithms transform code snippets in the IDE into embeddings. Algorithms then make approximate matches between the embeddings that are created for those IDE snippets and the embeddings already stored in the vector database.
  • When asking a question to a chat-based AI coding assistant, the questions and requests written in natural language are also transformed into embeddings. A similar process to the one described above takes place: the embeddings created for the natural language prompts are matched to embeddings already stored in vector databases.

Vector databases and embeddings allow algorithms to quickly search for approximate matches (not just exact ones) on the data they store. This is important because if an LLM’s algorithms only make exact matches, it could be the case that no data is included as context. Embeddings improve an LLM’s semantic understanding, so the LLM can find data that might be relevant to a developer’s code or question and use it as context to generate a useful response.
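Here’s a minimal sketch of that retrieval step. It assumes a hypothetical embed() function that turns text into a vector (for example, via an embedding model API) and a small in-memory store; a real vector database would use approximate nearest neighbor indexes instead of the brute-force cosine similarity shown here.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query: str, store: list[tuple[str, list[float]]], embed, k: int = 3) -> list[str]:
    """Return the k stored snippets whose embeddings are closest to the query's embedding."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    return [text for _, text in sorted(scored, reverse=True)[:k]]

# The retrieved snippets are then placed into the prompt as additional context
# before the model generates its answer.
```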

Have questions about what data GitHub Copilot uses and how?

Read this for answers to frequently asked questions and visit the GitHub Copilot Trust Center for more details.

In-context learning

In-context learning, a method sometimes referred to as prompt engineering, is when developers give the model specific instructions or examples at the time of inference (that is, while they’re typing or voicing a question or request). By providing these instructions and examples, the LLM understands that the developer is asking it to infer what they need and will generate a contextually relevant output.

In-context learning can be done in a variety of ways, like providing examples, rephrasing your queries, and adding a sentence that states your goal at a high level.
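For example, a few-shot prompt bundles the instruction, a handful of labeled examples, and the new input into the request itself; no model weights change. The task and examples below are illustrative only.

```python
# A few-shot prompt: the instruction, a couple of labeled examples, and the new
# input are all supplied at inference time.
few_shot_prompt = """\
You are a code reviewer. Classify each function name as 'descriptive' or 'vague'.

Function name: get_user_by_email
Label: descriptive

Function name: do_stuff
Label: vague

Function name: process
Label:"""

# The model infers the pattern from the examples and completes the final label.
```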

Discover more LLM prompting tips in our guide

Fine-tuning

Fine-tuning your model can result in a highly customized LLM that excels at a specific task. There are two ways to customize your model with fine-tuning: supervised learning and reinforcement learning from human feedback (RLHF).

Under supervised learning, there is a predefined correct answer that the model is taught to generate. Under RLHF, there is high-level feedback that the model uses to gauge whether its generated response is acceptable or not.

Let’s dive deeper.

Supervised learning

This method is when the model’s generated output is evaluated against an intended or known output. For example, you know that the sentiment behind a statement like “This sentence is unclear” is negative. To evaluate the LLM, you’d feed this sentence to the model and ask it to label the sentiment as positive or negative.

If the model labels it as positive, then you’d adjust the model’s parameters (variables that can be weighed or prioritized differently to change a model’s output) and try prompting it again to see if it can classify the sentiment as negative.

But even smaller models can have over 300 million parameters. That’s a lot of variables to sift through and adjust (and readjust). This method also requires time-intensive labeling: each input sample needs an output labeled with exactly the correct answer, such as “Negative” for the example above. That label gives the output something to measure against so adjustments can be made to the model’s parameters.
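As a toy illustration of that loop, the sketch below fine-tunes a small pre-trained model on the sentiment-labeling task using the Hugging Face transformers and datasets libraries (assuming they’re installed). Fine-tuning a generative LLM follows the same supervised pattern of labeled inputs and outputs, just at a far larger scale.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Each input sample is labeled with exactly the correct answer (0 = negative, 1 = positive).
data = Dataset.from_dict({
    "text": ["This sentence is unclear.", "The new docs made setup easy."],
    "label": [0, 1],
})

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

tokenized = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True)

# The Trainer compares the model's predicted labels against the known labels
# and adjusts the model's parameters to reduce the difference.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sentiment-finetune",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
)
trainer.train()
```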

Reinforcement learning from human feedback (RLHF)

RLHF requires either direct human feedback or a reward model that’s trained to mimic human feedback (by predicting whether a user will accept or reject the output from the pre-trained LLM). The learnings from the reward model are passed to the pre-trained LLM, which adjusts its outputs based on user acceptance rate.

The benefit to RLHF is that it doesn’t require supervised learning and, consequently, expands the criteria for what’s an acceptable output. For example, with enough human feedback, the LLM can learn that if there’s an 80% probability that a user will accept an output, then it’s fine to generate.
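The sketch below shows only the reward-model scoring step, with hypothetical reward_model and generate_candidates functions standing in for the real components. Full RLHF goes further and uses those reward scores to update the LLM’s weights with a reinforcement learning algorithm.

```python
# A simplified sketch of the reward-model step in RLHF. Both reward_model
# (predicts the probability that a user would accept an output) and
# generate_candidates (samples outputs from the pre-trained LLM) are
# hypothetical stand-ins.
ACCEPTANCE_THRESHOLD = 0.8  # e.g., an 80% predicted chance that a user accepts

def best_acceptable_output(prompt: str, generate_candidates, reward_model):
    candidates = generate_candidates(prompt, n=4)
    scored = [(reward_model(prompt, c), c) for c in candidates]
    acceptable = [pair for pair in scored if pair[0] >= ACCEPTANCE_THRESHOLD]
    if not acceptable:
        return None  # during training, low scores become a learning signal instead
    return max(acceptable)[1]  # return the highest-scoring acceptable output
```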

For more on LLMs and how they process data, read:

  • The architecture of today’s LLM applications
  • How LLMs can do things they weren’t trained to do
  • How generative AI is changing the way developers work
  • How GitHub Copilot is getting better at understanding your code
  • What developers need to know about generative AI
  • A developer’s guide to prompt engineering and LLMs

How to customize GitHub Copilot

GitHub Copilot’s contextual understanding has evolved over time. The first version could only consider the file you were working on in your IDE as context. We then expanded the context to neighboring tabs, which are all the open files in your IDE that GitHub Copilot can comb through to find additional context.

Just a year and a half later, we launched GitHub Copilot Enterprise, which uses an organization’s indexed repositories to provide developers with coding assistance that’s customized to their codebases. With GitHub Copilot Enterprise, organizations can tailor GitHub Copilot suggestions in the following ways:

  • Index their source code repositories in vector databases, which improves semantic search and gives their developers a customized coding experience.
  • Create knowledge bases, which are Markdown files from a collection of repositories that provide GitHub Copilot with additional context through unstructured data, or data that doesn’t live in a database or spreadsheet.

In practice, this can benefit organizations in several ways:

  • Enterprise developers gain a deeper understanding of your organization’s unique codebase. Senior and junior developers alike can prompt GitHub Copilot for code summaries, coding suggestions, and answers about code behavior. As a result of this streamlined code navigation and comprehension, enterprise developers implement features, resolve issues, and modernize code faster.
  • Complex data is quickly translated into organizational knowledge and best practices. Because GitHub Copilot receives context through the repositories and documentation your organization chooses to index, developers receive coding suggestions and guidance that are more useful because they align with organizational knowledge and best practices.
    It’s not just developers: non-developer and cross-functional team members can also use natural language to prompt Copilot Chat in GitHub.com for answers and guidance on relevant documentation or existing solutions. Data and solutions captured in repositories become more accessible across the organization, improving collaboration and increasing awareness of business goals and practices.
  • Faster pull requests create smart, efficient, and accessible development workflows. With GitHub Copilot Enterprise, developers can use GitHub Copilot to generate pull request summaries directly in GitHub.com, helping them communicate clearly with reviewers while also saving valuable time. For developers reviewing pull requests, GitHub Copilot can be used to help them quickly gain a strong understanding of proposed changes and, as a result, focus more time on providing valuable feedback.

GitHub Copilot Enterprise is now generally available.

Read more about GitHub’s most advanced AI offering, and how it’s customized to your organization’s knowledge and codebase.

Best practices for customizing your LLM

Customized LLMs help organizations get more value out of all of the data they have access to, even if that data is unstructured. Using this data to customize an LLM can reveal valuable insights, help you make data-driven decisions, and make enterprise information easier to find overall.

Here are our top tips for customizing an LLM.

Select an AI solution that uses RAG

As we mentioned above, not all of your organization’s data will be contained in a database or spreadsheet. A lot of data comes in the form of text, like code documentation.

Organizations that opt into GitHub Copilot Enterprise will have a customized chat experience with GitHub Copilot in GitHub.com. GitHub Copilot Chat will have access to the organization’s selected repositories and knowledge base files (also known as Markdown documentation files) across a collection of those repositories.

Adopt innersource practices

Kyle Daigle, GitHub’s chief operating officer, has previously shared the value of adapting open source communication best practices for internal teams, a process known as innersource. One of those best practices is writing things down and making them easily discoverable.

How does this practice pay off? It produces more documentation, which means more context an AI tool can use to generate solutions tailored to your organization. Effective AI adoption requires establishing this foundation of context.

Moreover, developers can use GitHub Copilot Chat in their preferred natural language—from German to Telugu. That means more documentation, and therefore more context for AI, improves global collaboration. All of your developers can work on the same code while using their own natural language to understand and improve it.

Here are Daigle’s top tips for innersource adoption:

  1. If you like what you hear, record it and make it discoverable (and remember: plenty of video and productivity tools now provide AI-powered summaries and action items).
  2. If you come up with a useful solution for your team, share it out with the wider organization so they can benefit from it, too.
  3. Offer feedback on publicly shared information and solutions. But remember to critique the work, not the person.
  4. If you request a change to a project or document, explain why you’re requesting that change.

✨ Bonus points if you add all of these notes to your relevant GitHub repositories and format them in Markdown.

How do you expand your LLM results?

The answer lies in search engine integration.

Transformer-based LLMs have impressive semantic understanding even without embeddings and high-dimensional vectors. This is because they’re trained on a large amount of unlabeled natural language data and publicly available source code. They also use a self-supervised learning process, in which they learn basic objectives from a portion of the input data and then apply what they’ve learned to the rest of the input.

When a search engine is integrated into an LLM application, the LLM is able to retrieve search engine results relevant to your prompt because of the semantic understanding it’s gained through its training. That means an LLM-based coding assistant with search engine integration (made possible through a search engine’s API) will have a broader pool of current information that it can retrieve information from.
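In a sketch, that integration might look like the following, where web_search is a hypothetical wrapper around a search engine’s API and complete is a hypothetical call to the LLM; the key idea is that fresh search results are appended to the prompt as extra context before generation.

```python
# A minimal sketch of search engine integration. web_search and complete are
# hypothetical stand-ins for a search engine's API and the LLM call.
def answer_with_search(question: str, web_search, complete) -> str:
    results = web_search(question, num_results=3)  # e.g., a list of {"title", "snippet"} dicts
    context = "\n".join(f"- {r['title']}: {r['snippet']}" for r in results)
    prompt = (
        "Use the following search results to answer the question.\n\n"
        f"Search results:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return complete(prompt)
```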

Why does this matter to your organization?

Let’s say a developer asks an AI coding tool a question about the most recent version of Java. However, the LLM was trained on data from before the release, and the organization hasn’t updated its repositories’ knowledge with information about the latest release. The AI coding tool can still answer the developer’s question by conducting a web search to retrieve the answer.

A generative AI coding assistant that can retrieve data from both custom and publicly available data sources gives employees customized and comprehensive guidance.

The path forward

50% of enterprise software engineers are expected to use machine-learning powered coding tools by 2027, according to Gartner.

Today, developers are using AI coding assistants to get a head start on complex code translation tasks, build better test coverage, tackle new problems with creative solutions, and find answers to coding-related questions without leaving their IDEs. With customization, developers can also quickly find solutions tailored to an organization’s proprietary or private source code, and build better communication and collaboration with their non-technical team members.

In the future, we imagine a workspace that offers more customization for organizations. For example, your ability to fine-tune a generative AI coding assistant could improve code completion suggestions. Additionally, integrating an AI coding tool into your custom tech stack could feed the tool with more context that’s specific to your organization and from services and data beyond GitHub.

Tags:

  • AI
  • AI Insights
  • generative AI
  • GitHub Copilot
  • GitHub Copilot Enterprise
  • LLM
