How to Set Up a Private Knowledge Base for Your Team Using Open-Source RAG Tools

22 May 2026 | By Ashley Marshall

Install Qdrant as your vector database, use LlamaIndex or LangChain to handle document ingestion and retrieval, and pair it with a local LLM served by Ollama. AnythingLLM packages the same pieces behind a browser interface if you would rather not write code. The whole stack runs on-premise, keeps data under your control, and incurs no per-query API fees once it is running.

Your team's best thinking is probably trapped in emails, shared drives, and people's heads. A private knowledge base built on open-source RAG tools can surface it on demand - without sending a single byte to a third-party AI provider.

Why Private Knowledge Bases Matter Right Now

The case for a private knowledge base has sharpened considerably in 2026. UK businesses operate under UK GDPR - carrying the same data transfer restrictions as EU GDPR but with the ICO as the domestic regulator. Sending internal documents to a third-party AI service, whether that is ChatGPT, Gemini, or any hosted RAG product, means those documents are processed outside your infrastructure. For most general queries that is acceptable. For anything touching client data, financial records, HR information, or competitive strategy, it is not.

But the compliance angle is only part of the story. The more compelling reason is knowledge retention. When a senior employee leaves, or when institutional memory is scattered across Confluence, SharePoint, Google Drive, and old email threads, retrieval becomes expensive - if it happens at all. A well-configured RAG system changes that dynamic. Instead of searching, your team asks a question and gets a synthesised answer drawn from your actual documents, not the public internet.

The good news: the open-source tooling has reached genuine production maturity in 2026. Tools like Qdrant, with over 29,000 GitHub stars and 250 million downloads as of March 2026, are not hobbyist experiments. They are running in production at HubSpot, Canva, and Tripadvisor. The vector database market was valued at $1.73 billion in 2024 and is projected to reach $10.6 billion by 2032, and the open-source tier of that market is now ready for business use.

For UK teams, the data residency argument is clear. Run the stack on your own servers, or on a UK-based cloud provider, and you maintain full control: no third-party telemetry, no model training on your data, no ambiguity in your audit trail. The ICO's guidance on AI systems increasingly expects organisations to be able to demonstrate where their data goes and who processes it. Self-hosted RAG makes that demonstration straightforward.

Understanding the RAG Stack: Five Decisions You Need to Make

Before picking tools, it helps to understand what a RAG system actually does. The acronym stands for Retrieval-Augmented Generation, and the architecture has five distinct layers, each requiring a deliberate decision.

Document ingestion layer. This is where your source documents enter the system. A good ingestion pipeline handles PDFs, DOCX files, Markdown, HTML, and plain text. LlamaIndex has the most comprehensive set of document loaders, including connectors for Notion, Confluence, Google Drive, and SharePoint, making it the right choice if your knowledge base spans multiple sources. LangChain covers this well too, but with a heavier abstraction layer that adds complexity without proportional benefit for smaller deployments.

Embedding model. Your documents are converted into numerical vectors by an embedding model. The choice here matters more than most teams realise. Using a cloud embedding API like OpenAI's text-embedding-3-large is convenient but sends your documents to a third-party server. For a genuinely private setup, run a local embedding model via Ollama. Models like nomic-embed-text or mxbai-embed-large run comfortably on a modern server with 16GB RAM and produce competitive embeddings without leaving your network.

Vector database. The numerical vectors are stored and indexed here. This is where retrieval happens. We cover the choice between Qdrant, Chroma, and Weaviate in the next section.

Retrieval layer. When a user asks a question, the query is converted to a vector and the database returns the most semantically similar document chunks. The retrieval configuration - how many chunks to return (top-K), the similarity threshold, whether to use hybrid search combining dense and sparse vectors - significantly affects answer quality.

LLM for synthesis. The retrieved chunks are passed as context to an LLM, which generates a coherent answer. For a fully private stack, Ollama with Llama 3.1 or Mistral handles this locally. For teams comfortable with API usage, you can point the synthesis step at a cloud LLM while keeping all document retrieval local.

The critical insight here, confirmed by multiple production case studies, is that the quality of your answers is almost entirely determined by what happens before the LLM sees the query. Retrieval quality is the dominant variable. This is why so many teams end up disappointed with RAG - they focus on the LLM and neglect the ingestion and retrieval layers entirely.
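
To make the retrieval settings concrete, here is a minimal sketch of top-K selection with a similarity threshold. It uses toy three-dimensional vectors and a hand-written cosine function; the chunk texts, the `retrieve` helper, and the 0.5 threshold are illustrative assumptions, and a real deployment would delegate this step to the vector database.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, chunks, top_k=4, min_score=0.5):
    """Return up to top_k (score, text) pairs scoring at least min_score."""
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]
    kept = [pair for pair in scored if pair[0] >= min_score]
    return sorted(kept, key=lambda p: p[0], reverse=True)[:top_k]

# Toy 3-dimensional "embeddings"; a real model produces hundreds of dimensions.
chunks = [
    ("Remote working policy", [0.9, 0.1, 0.0]),
    ("Expenses procedure", [0.1, 0.9, 0.0]),
    ("Office opening hours", [0.6, 0.2, 0.7]),
]
results = retrieve([1.0, 0.0, 0.0], chunks, top_k=2, min_score=0.5)
for score, text in results:
    print(f"{score:.2f}  {text}")
```

The threshold drops the expenses chunk before top-K is applied, which is the behaviour you want: a low-scoring chunk should never reach the LLM just because K had room for it.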

The Vector Database Decision: Qdrant, Chroma, or Weaviate

The vector database you choose sets the ceiling on your system's retrieval performance. In 2026, there are three serious open-source options for private deployments, each with a distinct use case.

Qdrant is the standout choice for production private knowledge bases. Built in Rust for maximum performance, it delivers p99 query latency of 30-40ms at scale. That is meaningfully faster than Weaviate at 50-70ms and significantly faster than managed services like Pinecone at 50-100ms. Qdrant raised a $50 million Series B in March 2026, confirming its position as core retrieval infrastructure for production AI systems.

What makes Qdrant particularly well-suited to private deployments is its composable architecture. Every aspect of retrieval is a tunable primitive: indexing strategy, scoring function, filtering logic, and ranking. You are not accepting opaque defaults and hoping they work. Hybrid search (dense vectors plus sparse keyword matching plus metadata filtering) works in a single query, which meaningfully improves recall on real-world business documents where users ask questions that mix semantic intent with specific terms like contract numbers or project codes.
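
One widely used way to merge a dense ranking with a keyword ranking is reciprocal rank fusion. The sketch below shows the general idea only; it is not a description of Qdrant's internals, and the chunk IDs, the contract number, and the constant k=60 are illustrative assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list.

    Each chunk scores 1 / (k + rank) for every list it appears in;
    k=60 is a commonly used constant, not a tuned value.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)

# Hypothetical results for a query mixing intent with a contract number.
dense_hits = ["c7", "c2", "c9"]    # semantic matches for "termination terms"
keyword_hits = ["c2", "c4", "c7"]  # exact matches for "CN-1042"
fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
print(fused)
```

Chunks that appear in both lists rise to the top, which is why hybrid search helps with queries that combine a concept with an exact identifier.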

Deploy Qdrant with a single Docker command:

docker run -p 6333:6333 -v $(pwd)/qdrant_storage:/qdrant/storage qdrant/qdrant

Chroma is the right choice for teams prototyping or running a smaller knowledge base of under 100,000 document chunks. It is simpler to configure, has excellent Python integration, and can run in-process without Docker if needed. The trade-off is that it does not scale as cleanly as Qdrant and lacks the filtering capabilities needed for complex queries across large document sets. For a pilot project, Chroma is a sensible starting point before committing to a production deployment.

Weaviate occupies the middle ground. It has strong multi-modal support covering text, images, and code, and a more opinionated schema system that some teams find helpful for structured data. The p99 latency of 50-70ms is competitive. Weaviate's query interface is powerful but adds a learning curve. For most UK teams building a document-oriented knowledge base, Qdrant delivers better raw performance with a shorter path to production.

Chunking Strategy: The Part Everyone Gets Wrong

Document chunking is responsible for more failed RAG deployments than any other single factor. When you ingest a document, you split it into pieces before embedding them. The size, position, and overlap of those chunks directly determine whether your retrieval system can find the right information when someone asks a question.

The most common mistake is using fixed-size chunks with no thought for document structure. Splitting a 3,000-word policy document into 512-token blocks with no overlap will frequently cut across important sentences, separating a concept from its context. The embedding for a mid-chunk fragment that begins with something like 'however, this does not apply in cases where...' carries almost no useful semantic signal because the context has been lost at the split point.

Practical chunking rules by document type:

For prose documents such as reports, policies, and FAQs: use 512-1024 token chunks with 10-20% overlap. LlamaIndex's SentenceSplitter handles this well, respecting sentence boundaries rather than cutting mid-sentence. LangChain's RecursiveCharacterTextSplitter is the closest equivalent in that ecosystem, though it measures chunk length in characters by default rather than tokens. The overlap setting matters: with 10-20% overlap, sentences near a chunk boundary appear in both adjacent chunks, so retrieval can surface them regardless of which chunk is returned.

For structured documents such as spreadsheets, tables, and numbered SOPs: split by logical unit rather than token count. A single row of a table or a numbered procedure step makes a better chunk than an arbitrary token window that cuts across row boundaries.

For code or technical documentation: split by function, class definition, or section heading rather than by character count. Preserving logical structure gives the embedding model something coherent to work with.
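
The prose rule above can be sketched in plain Python. This is an illustration of sentence-aware chunking with overlap, not the implementation used by SentenceSplitter; it counts words as a rough stand-in for tokens, and the function name and default sizes are assumptions.

```python
import re

def chunk_sentences(text, max_words=120, overlap_words=20):
    """Group whole sentences into chunks of roughly max_words words.

    Each new chunk starts with the trailing sentences of the previous one
    (up to overlap_words words) so boundary sentences appear in both.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        words_so_far = sum(len(s.split()) for s in current)
        if current and words_so_far + len(sentence.split()) > max_words:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward as the overlap.
            carried, count = [], 0
            for prev in reversed(current):
                count += len(prev.split())
                if count > overlap_words:
                    break
                carried.insert(0, prev)
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

demo = chunk_sentences(
    "A one two. B three four. C five six. D seven eight.",
    max_words=6,
    overlap_words=3,
)
for chunk in demo:
    print(chunk)
```

With these tiny limits each chunk holds two sentences and shares one with its neighbour, so no sentence is ever split and every boundary sentence is retrievable from either side.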

Beyond fixed-size splitting, consider semantic chunking: rather than splitting by token count, split when the topic changes. LlamaIndex supports this natively with its SemanticSplitterNodeParser. It is slower to ingest but meaningfully improves retrieval quality for long documents that cover multiple distinct topics in sequence.
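
The idea behind semantic chunking can be shown with a toy example: compare each sentence with the one before it and start a new chunk when similarity drops. The bag-of-words `toy_embed` function below is a stand-in for a real embedding model, and the 0.15 threshold is an arbitrary assumption; this is not how SemanticSplitterNodeParser is implemented.

```python
import math
import re
from collections import Counter

def toy_embed(sentence):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def semantic_chunks(sentences, threshold=0.15):
    """Start a new chunk whenever adjacent sentences fall below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if similarity(toy_embed(prev), toy_embed(cur)) < threshold:
            chunks.append([])
        chunks[-1].append(cur)
    return [" ".join(chunk) for chunk in chunks]

sentences = [
    "Staff may work remotely two days per week.",
    "Remote work days must be agreed with a manager.",
    "Expenses are reimbursed monthly.",
    "Submit expenses with receipts attached.",
]
topic_chunks = semantic_chunks(sentences)
for chunk in topic_chunks:
    print(chunk)
```

The split lands between the remote-working sentences and the expenses sentences because they share no vocabulary. A real embedding model detects the same shift even when the wording overlaps less obviously.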

One often-overlooked detail: metadata. Every chunk should carry metadata identifying its source document, page or section number, and date. Without this, your LLM cannot cite sources, users cannot verify answers, and you cannot audit why a particular answer was generated. LlamaIndex's SimpleDirectoryReader attaches file-level metadata such as the file name and path to each document it loads, and chunks inherit it; section numbers and version dates usually need to be added explicitly. Treat metadata as a first-class concern, not an afterthought.
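
As a minimal sketch of what chunk-level metadata might look like, the example below attaches source, section, and date to a chunk and renders a citation from them. The `Chunk` class, its field names, and the file name are hypothetical, chosen only to illustrate the shape of the data.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    text: str
    source: str   # file name or URL of the source document
    section: str  # page or section identifier
    updated: str  # ISO date the source was last modified

def format_citation(chunk):
    """Render a short, human-checkable source reference for an answer."""
    return f"{chunk.source}, {chunk.section} (updated {chunk.updated})"

chunk = Chunk(
    text="Staff may work remotely up to two days per week.",
    source="staff-handbook.pdf",  # hypothetical file name
    section="p. 14",
    updated="2026-01-12",
)
citation = format_citation(chunk)
print(citation)
```

Whatever structure you choose, the test is simple: given any answer, can a user open the cited document at the cited place and check it?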

The Full Setup: From Documents to Working Knowledge Base

There are two practical paths to a working private knowledge base, depending on your team's technical capacity and appetite for configuration.

Path A: AnythingLLM for non-technical teams. AnythingLLM is an all-in-one Docker application that handles document ingestion, embedding, vector storage, and LLM integration through a browser interface. No Python scripts, no configuration files, no command-line setup beyond the initial Docker run. You upload documents, connect a local LLM via Ollama, and start asking questions in a chat interface similar to ChatGPT.

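docker pull mintplexlabs/anythingllm
# STORAGE_DIR tells the app where the mounted volume lives, so workspaces persist across restarts
docker run -d -p 3001:3001 \
  -v ${PWD}/anythingllm_storage:/app/server/storage \
  -e STORAGE_DIR="/app/server/storage" \
  mintplexlabs/anythingllm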

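Open http://localhost:3001, configure Ollama as your LLM provider, and upload your first workspace documents. One catch: inside the AnythingLLM container, localhost refers to the container itself, so http://localhost:11434 will not reach an Ollama instance running on the host. On Docker Desktop the address is typically http://host.docker.internal:11434; on Linux you may need host networking or the host's IP address. AnythingLLM supports multiple workspaces with different document sets and different access permissions, so you can separate HR documents from client-facing knowledge, or give different teams access to different knowledge bases without a single shared index.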

Path B: LlamaIndex + Qdrant + Ollama for technical teams. This gives you full control over every component and scales to millions of documents. A working pipeline in Python:

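# Requires: pip install llama-index llama-index-vector-stores-qdrant \
#   llama-index-embeddings-ollama llama-index-llms-ollama qdrant-client
# and the models pulled locally: ollama pull nomic-embed-text; ollama pull llama3.1
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama
import qdrant_client

# Connect to the local Qdrant instance started with the Docker command above
client = qdrant_client.QdrantClient(host="localhost", port=6333)
vector_store = QdrantVectorStore(client=client, collection_name="team_knowledge")
# Without a storage context the index is built in memory and Qdrant stays empty
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Local embedding model served by Ollama
embed_model = OllamaEmbedding(
    model_name="nomic-embed-text",
    base_url="http://localhost:11434"
)
# Local LLM for synthesis; if omitted, LlamaIndex falls back to its default cloud LLM.
# The model tag "llama3.1" is an assumption; use whichever model you have pulled.
llm = Ollama(model="llama3.1", base_url="http://localhost:11434", request_timeout=120.0)

# Load, chunk, embed, and store everything under ./documents
documents = SimpleDirectoryReader("./documents").load_data()
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=embed_model
)

# Pass the local LLM explicitly so the synthesis step stays on your servers
query_engine = index.as_query_engine(llm=llm)
response = query_engine.query("What is our policy on remote working?")
print(response)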

This stack keeps every byte on your servers. The embedding model (nomic-embed-text via Ollama) runs locally. The vectors are stored in your Qdrant instance. The LLM queried for synthesis, also via Ollama, runs locally. For a team of 10 to 50 people with a knowledge base of a few thousand documents, this runs comfortably on a server with a modern CPU and 32GB RAM. Adding a GPU - even a consumer RTX 4070 - significantly improves local LLM response speed without changing the architecture.

For keeping the knowledge base current, set up a scheduled ingestion job (cron or similar) that runs LlamaIndex's SimpleDirectoryReader with recursive=True against your documents folder. The reader does not watch the folder itself; each scheduled run re-reads it, so the job should re-embed only new or changed files rather than inserting everything again. Qdrant handles incremental updates cleanly: you can add, update, or delete individual document chunks without rebuilding the entire index.
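
Incremental updates are easiest when each chunk has a stable ID, so that re-ingesting a file overwrites its old chunks instead of duplicating them. Qdrant accepts UUIDs as point IDs, and one way to get stable ones is to derive them from the source path and chunk position. This is a sketch under that assumption; the namespace URL and the `chunk_id` helper are hypothetical.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/team-knowledge")

def chunk_id(source_path, chunk_index):
    """Derive a stable UUID from the source path and chunk position."""
    return str(uuid.uuid5(NAMESPACE, f"{source_path}#{chunk_index}"))

first_run = chunk_id("policies/remote-working.md", 0)
second_run = chunk_id("policies/remote-working.md", 0)
other = chunk_id("policies/remote-working.md", 1)
print(first_run == second_run, first_run != other)
```

One caveat with this design: if an updated document produces fewer chunks than before, the leftover trailing chunks keep their old IDs and must be deleted explicitly.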

Common Failure Modes and How to Avoid Them

Understanding what goes wrong in RAG deployments is more useful than any setup guide. These are the failure modes that appear most frequently in production private knowledge bases.

Contradicting documents in the knowledge base. If your knowledge base contains multiple versions of the same policy - say, v1.2 and v2.0 of the same staff handbook - the retrieval layer will return chunks from both. The LLM will then synthesise a response that does not accurately match either version. Audit your documents before ingestion. Remove superseded versions. Use metadata to mark document dates and version numbers so the LLM can reason about recency when multiple sources conflict.
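
A pre-ingestion audit for superseded versions can be as simple as keeping the highest version of each title. The sketch below assumes you already have a title and a version tuple for every file; the dictionary layout and file names are hypothetical.

```python
def latest_versions(documents):
    """Keep only the highest-versioned entry for each document title."""
    newest = {}
    for doc in documents:
        current = newest.get(doc["title"])
        if current is None or doc["version"] > current["version"]:
            newest[doc["title"]] = doc
    return list(newest.values())

documents = [
    {"title": "Staff Handbook", "version": (1, 2), "path": "handbook-v1.2.pdf"},
    {"title": "Staff Handbook", "version": (2, 0), "path": "handbook-v2.0.pdf"},
    {"title": "Expenses Policy", "version": (1, 0), "path": "expenses-v1.0.pdf"},
]
to_ingest = latest_versions(documents)
print([doc["path"] for doc in to_ingest])
```

Storing versions as tuples rather than strings keeps the comparison numeric, so (1, 10) correctly sorts after (1, 9).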

Embedding model mismatch. If you ingest documents with one embedding model and later switch to another, your stored vectors are no longer comparable to new query vectors. If the new model has a different vector dimension, the database will reject the query outright. The more dangerous case is a model with the same dimension, where retrieval silently degrades: queries still return chunks, but similarity scores between vectors from different models are meaningless. Always use the same embedding model for ingestion and for query time. Document your model choice in your configuration. If you do change models, re-embed everything from scratch.
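
One way to enforce the same-model rule is to record the model name next to the index and check it on every run. The sketch below is one possible guard, not a feature of any of the tools above; the manifest file and the `check_embedding_model` helper are hypothetical.

```python
import json
from pathlib import Path

def check_embedding_model(manifest_path, model_name):
    """Record the model on first use; raise if a later run uses another."""
    path = Path(manifest_path)
    if not path.exists():
        path.write_text(json.dumps({"embedding_model": model_name}))
        return
    recorded = json.loads(path.read_text())["embedding_model"]
    if recorded != model_name:
        raise ValueError(
            f"Index was built with {recorded!r}, not {model_name!r}; "
            "re-embed the knowledge base before switching models."
        )
```

Calling this at the start of both the ingestion job and the query service turns a silent quality problem into an immediate, explicit error.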

Top-K retrieval calibration. A top-K of 4-5 chunks works for well-structured knowledge bases with clean documents, but check your framework's default rather than assuming it: LlamaIndex's query engine retrieves only 2 chunks unless you set similarity_top_k. For dense technical documentation, increase to 8-10 chunks. For simple conversational queries against a small knowledge base, 3 is often enough. Monitor the actual chunks being retrieved during testing. If you consistently see irrelevant chunks appearing in the top results, reduce K and tighten your similarity threshold rather than hoping the LLM ignores the noise.
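
A simple way to calibrate K is to write a handful of test questions, note which chunk should answer each, and measure how often that chunk appears in the top K. The sketch below computes that hit rate; the questions, chunk IDs, and rankings are made-up illustrative data.

```python
def hit_rate_at_k(results, expected, k):
    """Fraction of questions whose expected chunk appears in the top k."""
    hits = sum(1 for q, ranked in results.items() if expected[q] in ranked[:k])
    return hits / len(results)

# Ranked chunk IDs returned for each test question (hypothetical data).
results = {
    "remote working policy?": ["c3", "c8", "c1", "c5", "c2"],
    "expenses deadline?": ["c9", "c4", "c7", "c6", "c1"],
    "notice period?": ["c2", "c5", "c6", "c0", "c3"],
}
expected = {
    "remote working policy?": "c3",
    "expenses deadline?": "c6",
    "notice period?": "c1",
}
for k in (1, 3, 5):
    print(k, round(hit_rate_at_k(results, expected, k), 2))
```

If the hit rate stops improving as K grows, as it does for the third question here, the problem is in chunking or embedding rather than in K, and a larger K only adds noise.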

No metadata, no traceability. Without source citations attached to every answer, users cannot verify responses and you cannot debug incorrect outputs. This is particularly important in regulated industries where you may need to show an auditor exactly which document generated a particular answer. Treat metadata as a requirement, not a nice-to-have.

The knowledge base going stale. The most common long-term failure mode is simply not maintaining the index. Documents get updated in your shared drive but the RAG index does not reflect the changes. Build re-ingestion into your workflow from day one: either a scheduled job that processes recently modified files, or a webhook from your document management system that triggers re-indexing on change.
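
A scheduled re-ingestion job needs to know which files changed since the last run. One approach, sketched below, is to keep a manifest of content hashes and compare it with the folder on each run. The function names and manifest layout are assumptions, and the actual re-embedding and deletion calls are left to your pipeline.

```python
import hashlib
from pathlib import Path

def file_digest(path):
    """SHA-256 of a file's bytes, used to detect content changes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def diff_against_manifest(folder, manifest):
    """Compare files under folder with a {relative_path: digest} manifest.

    Returns (changed, deleted, new_manifest): paths to re-ingest, paths
    whose chunks should be removed, and the manifest for the next run.
    """
    root = Path(folder)
    current = {
        str(p.relative_to(root)): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
    changed = [p for p, digest in current.items() if manifest.get(p) != digest]
    deleted = [p for p in manifest if p not in current]
    return changed, deleted, current
```

Hashing content rather than trusting modification times catches files that were restored from backup or synced with a reset timestamp, and the deleted list lets you remove chunks for documents that no longer exist.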

Over-chunking small documents. A one-page FAQ split into 512-token chunks produces fragments too small to carry useful context. For short documents under 800 words, treat the entire document as a single chunk rather than splitting it. The chunk-size rules for long documents do not apply uniformly to short ones.
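
The short-document rule is easy to encode as a guard in front of whatever splitter you use. The sketch below uses the 800-word cut-off from the text and a naive fixed-window fallback purely for illustration; the function name and the 400-word window are assumptions.

```python
def plan_chunks(text, short_doc_words=800, chunk_words=400):
    """Return one chunk for short documents, fixed windows otherwise."""
    words = text.split()
    if len(words) < short_doc_words:
        return [text]
    return [
        " ".join(words[i:i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ]

print(len(plan_chunks("word " * 500)))   # short FAQ stays whole
print(len(plan_chunks("word " * 1000)))  # longer document is split
```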

When a Hosted Solution Is Actually the Right Call

Not every team should build and maintain their own RAG infrastructure. The self-hosted route makes sense when data privacy is non-negotiable, document volume is significant, and there is genuine technical capacity to maintain the stack. When those conditions are not all met, a hosted solution is the pragmatic choice.

A hosted RAG product - whether that is a feature within your existing productivity suite (Microsoft Copilot for Microsoft 365, Google's NotebookLM for Teams) or a specialist tool (Notion AI, Guru, Glean) - is probably the right call in the following circumstances.

Your documents contain no sensitive data. If the information in your knowledge base is already stored in cloud services and subject to those providers' terms, adding a hosted RAG layer does not meaningfully increase your risk profile. The compliance argument for self-hosting only applies when you are dealing with data that requires domestic residency or strict access controls.

Your team has no one who can maintain server applications. Open-source RAG stacks require monitoring, backup, and occasional debugging. Docker containers need updating. Qdrant needs disk space management as your vector index grows. Ollama models need version management. If that expertise does not exist in-house and you do not want to hire for it, the maintenance overhead will consume the cost savings quickly.

Your use case is exploratory. If you are trying to understand whether a knowledge base would actually change how your team works before committing to infrastructure, a hosted trial is a faster path to an honest answer. AnythingLLM's cloud tier or a free trial of a hosted product costs nothing to test for a month and gives you real usage data to inform the build decision.

The total cost of ownership calculation is worth doing explicitly. A self-hosted Qdrant plus LlamaIndex plus Ollama stack on a dedicated server costs roughly £30-60 per month in compute. A hosted productivity AI runs £20-30 per user per month. At five users, hosted costs £100-150 per month, a gap small enough that a few hours of maintenance time erases the self-hosted saving, so hosted usually wins. At 25 users, hosted runs £500-750 per month and self-hosted becomes compelling on both cost and compliance grounds, particularly with sensitive documents. Run the numbers for your team size before defaulting to either option.
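
The arithmetic is simple enough to script for your own team size. The sketch below uses the monthly figures quoted above and compares compute cost only; it deliberately excludes staff time for maintenance, which is the factor that usually tips small teams towards hosted.

```python
def monthly_costs(users, self_hosted=(30, 60), per_user=(20, 30)):
    """Return ((self_low, self_high), (hosted_low, hosted_high)) in GBP."""
    hosted = (per_user[0] * users, per_user[1] * users)
    return self_hosted, hosted

for users in (2, 5, 25):
    self_cost, hosted_cost = monthly_costs(users)
    print(
        f"{users:>2} users: self-hosted £{self_cost[0]}-{self_cost[1]}, "
        f"hosted £{hosted_cost[0]}-{hosted_cost[1]}"
    )
```

To make the comparison honest, add an estimate of monthly maintenance hours multiplied by an internal hourly rate to the self-hosted side before drawing a conclusion.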

Frequently Asked Questions

What hardware do I need to run a private RAG stack for a small team?

A server with a modern CPU and 32GB RAM handles a knowledge base of several thousand documents comfortably using Qdrant, LlamaIndex, and Ollama. Local LLM response speed improves significantly with a GPU - even a consumer RTX 4070 makes a noticeable difference. For larger knowledge bases, NVMe storage is important as Qdrant can be configured to keep vectors on disk when RAM is constrained.

How does a private RAG stack satisfy UK GDPR requirements?

Running the entire stack on-premise means your documents never leave your infrastructure, and no third-party data processor is involved in document storage or retrieval. On a UK-based cloud provider, that provider is your only processor and the data stays in the UK. Either way, you can document exactly where data is stored and who has access, which supports the ICO's expectations around AI system transparency. The key is that local embedding models (via Ollama) and a local vector database (Qdrant) mean no document content is transmitted externally.

What is the difference between LlamaIndex and LangChain for RAG?

LlamaIndex is optimised specifically for indexing and retrieval over your own data. It has more document loaders, better handling of different document types, and a cleaner API for building RAG pipelines. LangChain is a broader framework for building LLM applications, with RAG as one capability among many. For a knowledge base project, LlamaIndex is generally the more direct path. Teams already using LangChain for other AI features may prefer to keep their stack consistent.

How often should I re-index documents in my knowledge base?

For actively maintained knowledge bases, daily re-ingestion of recently modified documents is a reasonable default. Qdrant handles incremental updates cleanly - you do not need to rebuild the entire index when a single document changes. Set up a scheduled job that checks your source document directories for files modified in the last 24 hours and re-processes those. For knowledge bases that change infrequently, weekly re-indexing is sufficient.

Can I use cloud LLMs (like GPT-4 or Claude) with a private RAG stack?

Yes, with one caveat. Your vector index and source documents stay local, but the cloud LLM receives every retrieved chunk as context, so any sensitive text in those chunks does travel to the provider's API. That is a reasonable trade-off when the knowledge base holds nothing confidential; it is not when retrieval can surface client, HR, or financial data. For maximum privacy, use a local LLM via Ollama for synthesis as well as retrieval.

What embedding model should I use for a local private knowledge base?

Nomic-embed-text and mxbai-embed-large are both strong choices for local deployment via Ollama. They run comfortably on CPU with 16GB RAM and produce embeddings competitive with cloud-based alternatives. For code or technical documentation, nomic-embed-text-v1.5 has been optimised for mixed text and code. The most important constraint is consistency: once you choose an embedding model, use it for all ingestion and all queries, and do not change it without re-embedding your entire knowledge base.

How do I handle documents in different formats - PDFs, Word files, Markdown?

LlamaIndex's SimpleDirectoryReader automatically detects and handles PDFs, DOCX, Markdown, HTML, and plain text. For PDFs with complex layouts, consider using a dedicated PDF parser - LlamaParse (LlamaIndex's commercial parser) or pypdf for open-source use. Word documents are handled well by python-docx integration. The main challenge with PDFs is table extraction: tabular data in PDFs often loses its structure during ingestion and should be pre-processed or stored as CSV before being fed into the RAG pipeline.

Is AnythingLLM suitable for a team knowledge base or just individual use?

AnythingLLM's Docker version supports multi-user deployments with workspace-level permissions. You can create separate workspaces for different teams or document collections, assign users to specific workspaces, and control which LLM and embedding model each workspace uses. For teams of up to 50 users with straightforward knowledge base needs and no requirement to customise retrieval behaviour, AnythingLLM is a production-ready choice. Larger deployments or those requiring custom retrieval logic should use the LlamaIndex plus Qdrant approach instead.