Large language models (LLMs) like Llama 3.1 are powerful, but their knowledge is frozen at the point of their last training. What if you want to ground an LLM with your own private, domain-specific data? You could fine-tune it, but that requires massive resources. The smarter, more accessible approach is Retrieval-Augmented Generation (RAG).
RAG lets you combine the impressive generative power of a local LLM with your own knowledge base. In this blog post, we'll build a RAG pipeline that turns Llama 3.1 into a Wireshark expert, using data from the Wireshark Wiki. We'll accomplish this using Ollama to run our LLM.
Why RAG over fine-tuning?
Fine-tuning involves retraining the base model on a new, specialized dataset. While it results in a highly customized model, the process is:
- Resource-intensive: Requires powerful GPUs and significant compute time.
- Data-demanding: Needs a large, carefully curated dataset.
- Costly: Can be prohibitively expensive for individuals or small teams.
- Static: New information can only be incorporated by repeating the whole process.
RAG, on the other hand, is a lightweight and dynamic solution:
- Efficient: Leverages the base model's capabilities, using your private data for context.
- Flexible: Easily updates the knowledge base by adding or removing documents without retraining.
- Grounded: Reduces the risk of "hallucinations" by providing the LLM with relevant, verifiable facts.
The blueprint: How our Wireshark RAG works
Our pipeline has two main phases (a minimal code sketch follows the list):
1. Ingestion:
- We'll scrape content from the Wireshark Wiki.
- The raw text will be broken into smaller, manageable chunks.
- An Ollama embedding model will convert these text chunks into numerical vectors.
- These vectors and their corresponding text chunks will be stored in a vector database.
2. Querying:
- When a user asks a question, we'll use the same embedding model to create a vector from the query.
- The system will search the vector database for the most relevant document chunks based on semantic similarity.
- The retrieved chunks will be added to the user's prompt as context.
- The full, augmented prompt will be sent to the local Llama 3.1 model (served by Ollama) for a grounded answer.
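To make the flow concrete, here is a minimal sketch of both phases in Python. It assumes the official ollama Python client is installed (pip install ollama) and that Ollama is running locally, and it keeps the chunks in a plain in-memory list with naive fixed-size chunking instead of a real vector database. Treat it as an illustration of the idea, not a production implementation.
# rag_sketch.py: minimal illustration of the ingestion and querying phases above.
# Assumptions: the "ollama" Python client is installed, Ollama is on its default port,
# and Llama 3.1 is used for both generation and embeddings (as in this post);
# a dedicated embedding model such as nomic-embed-text is a common alternative.
import math
import ollama

MODEL = "llama3.1"

def embed(text: str) -> list[float]:
    # Turn a chunk (or a query) into a numerical vector via Ollama's embeddings API.
    return ollama.embeddings(model=MODEL, prompt=text)["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Ingestion: split the scraped wiki text into chunks and store (vector, chunk) pairs.
def ingest(raw_text: str, chunk_size: int = 500) -> list[tuple[list[float], str]]:
    chunks = [raw_text[i:i + chunk_size] for i in range(0, len(raw_text), chunk_size)]
    return [(embed(chunk), chunk) for chunk in chunks]

# Querying: embed the question, retrieve the most similar chunks, augment the prompt.
def ask(question: str, store: list[tuple[list[float], str]], top_k: int = 3) -> str:
    query_vec = embed(question)
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    context = "\n\n".join(chunk for _, chunk in ranked[:top_k])
    prompt = (
        "Use only the context below to answer.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    reply = ollama.chat(model=MODEL, messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]
Open WebUI, which we set up below, performs the same chunk, embed, retrieve, and augment steps for us, so we won't have to write this code ourselves.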
Step-by-step: Building your local Wireshark expert
Step 1: Install Ollama and pull the models
First, install Ollama and pull the necessary models (for example, ollama pull llama3.1). We'll use Llama 3.1 instead of GPT-OSS-20b (as configured in the above tutorial) for generation and for creating our vector embeddings.
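If you prefer to script this step, the ollama Python client can pull models programmatically; this is purely optional and assumes the client is installed.
# Optional: pull the model from Python instead of the CLI (assumes "pip install ollama").
import ollama

ollama.pull("llama3.1")
# ollama.pull("nomic-embed-text")  # optional dedicated embedding model; an assumption, not required by this post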
Step 2: Install Open WebUI
Open WebUI is an extensible, feature-rich, and user-friendly self-hosted AI platform designed to operate entirely offline. It supports various LLM runners like Ollama and OpenAI-compatible APIs, with a built-in inference engine for RAG, making it a powerful AI deployment solution.
Make sure Docker is installed and running, then start Open WebUI:
docker run -d \
--name open-webui \
--restart always \
--network host \
-e OLLAMA_BASE_URL=http://localhost:11434 \
-v open-webui:/app/backend/data \
ghcr.io/open-webui/open-webui:main
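Because the container uses host networking, Open WebUI should come up on its default port 8080 and reach Ollama on port 11434 (both port numbers are the projects' defaults, an assumption if you have changed them). A quick sanity check, as a sketch:
# Sanity check: confirm Ollama and Open WebUI respond on their default ports.
# Assumes the "requests" package is installed and both services run on this host.
import requests

for name, url in [("Ollama", "http://localhost:11434"), ("Open WebUI", "http://localhost:8080")]:
    try:
        print(f"{name}: HTTP {requests.get(url, timeout=5).status_code}")
    except requests.ConnectionError:
        print(f"{name}: not reachable at {url}")
Once both respond, open http://localhost:8080 in a browser to continue with the steps below.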
Creating the knowledge base profile for our model
Now create a profile and upload the resources; in our case we will add the Wireshark document.
We add the Wireshark Basics document as a reference to ground the model.
Our model configuration is now complete and ready to use.
To use the model, follow the screenshots below.
Let's try a prompt, for example asking how to write a capture filter for HTTP traffic. Likewise, you can build your own specialized assistant, such as one for pentesting, red teaming, or blue teaming.