I want to explore ways to enhance the content review process using a large language model (LLM), a.k.a. AI for peer reviewing. Last week, I tested this idea, and here is what I learned.
Although AI tools and techniques will keep changing, and some of what I describe here may become outdated, the basic concepts remain constant. In my quest to augment peer reviewing of technical content with LLMs, I began experimenting with existing tools. The idea is to have an app or service that checks content for style issues and suggests updates.
Some models give better output than others. My main focus, however, was understanding available techniques, applying them to my use case, and then comparing the results for these techniques rather than comparing LLMs. I used only publicly available LLMs that I could run directly on my local machine.
Note that I list only a single example for each experiment. However, I tried multiple prompt variations and different models to get a better response.
Here are some of the models I tested:
- mistral:7b-instruct
- llama3:8b
- phi3:3.8b-mini-instruct-4k-fp16
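These tags follow the naming convention that Ollama uses for locally runnable models, so I’ll assume that setup for the sketch below. It is illustrative only, not necessarily the exact tooling I used:

```python
# Minimal sketch: send a single prompt to a locally running model.
# Assumes an Ollama server on the default port and a model that has already
# been pulled, for example with `ollama pull llama3:8b`.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def ask_model(model: str, prompt: str) -> str:
    """Send one prompt to a local model and return the response text."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    response = requests.post(OLLAMA_URL, json=payload, timeout=300)
    response.raise_for_status()
    return response.json()["response"]

print(ask_model("llama3:8b", "List any style issues in this sentence: "
                             "'The feature can be enabled by the administrator.'"))
```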
Experiment 1: Prompting techniques
I used a system prompt to define the context for the model and asked it to check the content for style issues. We use the IBM Style Guide and the Red Hat Supplementary Style Guide for technical content, so I used that as a reference. My assumption was that the model would be able to identify and suggest corrections for style issues based on the prompt. I assumed the creators had already trained the model on these style guides.
| Prompt |
| --- |
| As a technical editor, review the [Input] text and create a bulleted list of any grammar or stylistic issues you identify, following the guidelines from the IBM Style Guide and the Red Hat Supplementary Style Guide. List only the issues without additional commentary. Then, as a Technical Writer familiar with the IBM Style Guide and the Red Hat Supplementary Style Guide, revise the [Input] text to address the identified issues. Provide only the suggested text. |

| Input |
| --- |
| Previously, the net.core.busy_read, net.core.busy_poll, and kernel.numa_balancing sysctl parameters did not exist within the RT kernel and were therefore unsupported. With this release, the openshift-node-performance-rt profile is added and included if an RT kernel is detected, which drops the unsupported kernel parameters before they are applied. |

| Response |
| --- |
| - Use lowercase letters for acronyms and abbreviations unless they are part of a title or heading - Use proper capitalization in headings: “With this release, the OpenShift-node-performance-RT profile is added…” - Add article to “if an RT kernel is detected”: “If an RT kernel is detected…” Issues: - Incorrect use of capitals and lowercase in net.core.busy_read, net.core.busy_poll, and kernel.numa_balancing (use all lowercase) - Missing articles before “these release” and “which drops” - Inconsistent use of hyphen: use en dash instead for range of numbers, e.g., “RT kernel is detected, which drops the unsupported net.core.busy_read, net.core.busy_poll, and kernel.numa-balancing sysctl parameters before they are applied.” With this release, if an RT kernel is detected, the OpenShift-node-performance-RT profile is added and included. This drops the unsupported net.core.busy_read, net.core.busy_poll, and kernel.numa-balancing sysctl parameters before they are applied. |
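If you want to reproduce this kind of experiment programmatically, you can split the reviewer instructions into a system message and put the text under review in the user message. The sketch below assumes the same local Ollama setup as before and uses shortened, illustrative text:

```python
# Sketch: send the reviewer instructions as a system message and the text
# under review as the user message, using the local Ollama chat endpoint.
import requests

system_prompt = (
    "You are a technical editor familiar with the IBM Style Guide and the "
    "Red Hat Supplementary Style Guide. List the grammar and style issues "
    "in the user's text, then provide the revised text."
)

text_to_review = (
    "Previously, the net.core.busy_read, net.core.busy_poll, and "
    "kernel.numa_balancing sysctl parameters did not exist within the RT kernel "
    "and were therefore unsupported."
)

payload = {
    "model": "mistral:7b-instruct",
    "messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": text_to_review},
    ],
    "stream": False,
}
reply = requests.post("http://localhost:11434/api/chat", json=payload, timeout=300)
reply.raise_for_status()
print(reply.json()["message"]["content"])
```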
Observations
Based on the output, the model identified some style issues. However, it was inconsistent. Sometimes, it even made up issues when there were none. Because of these inconsistencies, relying on prompting techniques alone doesn’t work when reviewing content.
Experiment 2: Retrieval Augmented Generation (RAG) techniques
Next, I tried RAG techniques, which combine the power of LLMs with custom knowledge to get more contextually aware responses. I tried both ready-to-use tools and a RAG pipeline that I handled programmatically, using the PDF version of the Red Hat Supplementary Style Guide as the knowledge source.
Out-of-the-box RAG tools
While several options are available, I picked AnythingLLM to test this approach because it is available as a standalone application without additional dependencies.
I uploaded the supplementary style guide as a reference and asked it to review the text.
| Prompt |
| --- |
| Your task is to review the following text based on your familiarity with the Red Hat Supplementary Style guide and provide suggestions for updating it. Heat blacklists any servers in the list from receiving updated heat deployments. After the stack operation completes, any blacklisted servers remain unchanged. You can also power off or stop the os-collect-config agents during the operation. |

| Response |
| --- |
| - Use active voice: “Heat blacklists servers from receiving updated deployments.” - Use present tense for the second sentence: “After the stack operation completes, any blacklisted servers remain unchanged.” - Clarify what is being referred to as “agents” in the third sentence. Suggested text: Heat blacklists servers from receiving updated deployments. After completing a stack operation, any servers on the blacklist remain unchanged. You can also power off or stop the os-collect-config agents during this process. |
Observations
Unsurprisingly, many of the responses were off the mark. RAG works best for retrieving matching information from the supplied sources. In this case, it didn’t help with reviewing the content against the style guide.
Programmatically handling the RAG pipeline
Because I wasn’t getting the desired results with AnythingLLM, I wanted to explore what happens inside the RAG workflow and how it uses the information I provide. So I attempted to build my own RAG pipeline programmatically.
I didn’t have to create everything from scratch, as many open-source repositories provide the code for handling all the bits involved. The key pieces in the RAG pipeline are:
- Chunking text: Breaking the text down into smaller, meaningful segments.
- Embedding models: Models that transform raw text into vector representations (embeddings).
- Vector databases: Databases that store the embeddings along with associated metadata.
As I learned more about these pieces, I found that there are various chunking strategies that directly affect the response you get, and that different embedding models create embeddings differently. It’s the combination of these choices that decides the quality of the response you get from your model. 🤯
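To make the moving parts concrete, here is a minimal sketch of such a pipeline. It is not the exact code I used; the library choice (ChromaDB with its default embedding model) and the naive fixed-size chunking are assumptions for illustration:

```python
# Minimal RAG pipeline sketch: chunk a style guide, embed the chunks,
# store them in a vector database, and retrieve the most relevant ones
# to build a review prompt. Library choices here are illustrative.
import chromadb

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size chunking with a small overlap between chunks."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# 1. Chunk the source document (plain text already extracted from the style guide PDF).
style_guide_text = open("supplementary-style-guide.txt").read()
chunks = chunk_text(style_guide_text)

# 2. Embed and store the chunks. Chroma applies a default embedding model
#    if you do not supply your own embedding function.
client = chromadb.Client()
collection = client.create_collection(name="style-guide")
collection.add(documents=chunks, ids=[f"chunk-{i}" for i in range(len(chunks))])

# 3. Retrieve the chunks most relevant to the text under review.
text_to_review = "Heat blacklists any servers in the list from receiving updated heat deployments."
results = collection.query(query_texts=[text_to_review], n_results=3)
context = "\n\n".join(results["documents"][0])

# 4. Combine the retrieved context and the text into a prompt for the model.
prompt = (
    "Using the following style guide excerpts:\n"
    f"{context}\n\n"
    f"Review this text for style issues and suggest corrections:\n{text_to_review}"
)
print(prompt)  # Send this to a local model, for example with the earlier helper.
```

Swapping the chunking strategy or the embedding model changes what gets retrieved, which is exactly where the quality differences show up.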
Observations
Like the out-of-the-box tools, my RAG pipeline didn’t work as expected. While it was good at retrieving information, it could not examine content for style-related issues.
Experiment 3: Fine-tuning techniques
This led me to experiment with fine-tuning the LLM, that is, teaching an existing model how to review the content. I tried to test this idea using InstructLab.
I picked the conscious language rule from the Supplementary Style Guide and tried fine-tuning the default model with that rule.
The results? Perfecto! 🎉
It did very well, and the model learned to update the text based on this rule.
| Prompt |
| --- |
| How can I update the language in this paragraph to use conscious Language? The current configuration uses a whitelist to filter incoming traffic. To add a new IP address, you need to update the list manually. |

| Response |
| --- |
| The current configuration uses an allowlist to filter incoming traffic. To add a new IP address, you need to update the list manually. Explanation: Whitelist can be perceived as exclusive, implying that only certain IP addresses are allowed. Instead, allowlist is more inclusive, indicating that the list includes the allowed IP addresses. This revision aligns with the principle of conscious language, which promotes inclusive and respectful communication. |
Observations
The prompt here differs from my other experiments, but I wrote it this way because this is the only rule I fine-tuned the model on. This leads me to believe that we need to teach the rules to an LLM so that it can review content against our style guide. Also, since InstructLab generates synthetic training data, and we want the resulting model to stay grounded in our actual rules, we may not be able to use InstructLab for this fine-tuning. This is an assumption based on my limited knowledge of InstructLab.
What’s next?
Fine-tuning a model on custom data requires a custom data set. The most common data sets I saw online require prompt, input (optional), and response information. The more examples we put in, the better the model performs after fine-tuning.
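For example, a couple of records in that prompt/input/response shape might look like the following. The rules and sentences here are made up purely to show the structure:

```python
# Illustrative instruction-tuning records in the common
# prompt / input (optional) / response shape. A real data set would
# contain many such examples for each style rule.
training_examples = [
    {
        "prompt": "Rewrite the input so that it uses conscious language.",
        "input": "Add the server to the whitelist to allow incoming traffic.",
        "response": "Add the server to the allowlist to allow incoming traffic.",
    },
    {
        "prompt": "Rewrite the input so that it uses active voice.",
        "input": "The configuration file is read by the service at startup.",
        "response": "The service reads the configuration file at startup.",
    },
]
```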
Next, I plan to test how a model performs based on learning from a small set of rules.
Stay tuned for more updates on this journey!
Meanwhile, this quest has been a great learning experience. I have learned a lot about LLMs, RAG pipelines, and fine-tuning models. I am excited to see how this can be applied to other use cases.
Here are some of the useful tools I found during this exercise:
- Chat with your documents
- Run LLMs locally or expose them as API endpoints