Building a local Retrieval-Augmented Generation (RAG) application has become a powerful way to provide accurate, up-to-date, and contextually relevant AI-generated content. With the growing interest in using smaller, specialized models, Ollama has emerged as a viable alternative to larger AI providers like OpenAI or Google. This blog will guide you through building a localized RAG application using Ollama, an efficient and user-friendly platform for deploying and managing custom AI models.
Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval (Retrieval) with generative models (Generation). By retrieving relevant information from external knowledge bases, it enhances the knowledge accuracy and response quality of generative AI models (such as GPT). RAG technology is highly effective in solving complex tasks that require large amounts of external knowledge and is widely applied in scenarios such as question-answering systems, document generation, and knowledge retrieval.
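To make the retrieve-then-generate flow concrete, here is a rough sketch against Ollama's REST API using curl. The model names and the question are placeholders, and the retrieved context would normally come from a vector store such as Chroma DB rather than being pasted in by hand:
# 1. Embed the user question (document chunks are embedded the same way at indexing time)
$ curl http://localhost:11434/api/embeddings -d '{"model": "nomic-embed-text", "prompt": "What is our refund policy?"}'
# 2. Look up the most similar chunks in the vector store, then pass them to the chat model as context
$ curl http://localhost:11434/api/generate -d '{"model": "llama3", "prompt": "Answer using only the context below.\nContext: <retrieved chunks>\n\nQuestion: What is our refund policy?", "stream": false}'
ChatOllama automates exactly this loop: it stores the embeddings in Chroma DB and injects the retrieved chunks into the prompt for you.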
Ollama is designed for developers who want to create domain-specific, localized AI applications without relying on big tech companies like Microsoft or Google. It supports running language models locally or in a private cloud environment, offering more control and flexibility in terms of data privacy and model tuning.
Some advantages of using Ollama for building a RAG application:
Localized Deployment: Ollama allows you to deploy models on-premises, which is ideal for businesses or individuals who need localized control over their data.
Smaller, Domain-Specific Models: Ollama focuses on creating smaller, efficient models that are more targeted to specific use cases.
Cost-Effective: By reducing dependency on large cloud infrastructures, Ollama can help reduce operational costs.
Privacy: Ollama’s local deployment ensures that sensitive or proprietary data stays on your servers.
CUDA-capable GPU: Ensure you have an NVIDIA GPU installed.
Docker: Ensure that Docker is installed on your system. You can install Docker by following the official installation guide for your operating system (https://docs.docker.com/engine/install/).
NVIDIA Driver: Ensure that the appropriate NVIDIA driver is installed on your system. You can verify the installation with the nvidia-smi command, as shown after this list.
Other Toolkits: Node.js v20.11.1, pnpm v9.12.2
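A quick way to confirm these prerequisites from a terminal (the exact version output depends on your system):
$ nvidia-smi
$ docker --version
$ node -v
$ pnpm -v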
Ollama provides the backend infrastructure needed to run LLMs locally. To get started, head to Ollama's website (https://ollama.com) and download the application, then follow the instructions to set it up on your local machine. Once started, the Ollama server listens on http://localhost:11434 by default.
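Once Ollama is installed, you can check that the server is reachable and pull a chat model from the terminal; llama3 is just an example here, any model from the Ollama library will work:
$ curl http://localhost:11434
$ ollama pull llama3
$ ollama list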
Please refer to https://docs.trychroma.com/getting-started for Chroma DB installation instructions. We recommend running it in a Docker container:
# https://hub.docker.com/r/chromadb/chroma/tags
$ docker pull chromadb/chroma
$ docker run -d -p 8000:8000 chromadb/chroma
Note: There are many options for storing vector data locally in a RAG application, such as Chroma DB and Milvus. ChatOllama uses Chroma DB, so a Chroma server must be running locally whenever ChatOllama is running. Chroma DB is now available at http://localhost:8000.
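To confirm the Chroma server is reachable, you can call its heartbeat endpoint; the API path differs between Chroma releases, so use whichever matches your version:
$ curl http://localhost:8000/api/v1/heartbeat
# or, on newer Chroma releases:
$ curl http://localhost:8000/api/v2/heartbeat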
ChatOllama is an open-source chatbot based on LLMs. It supports a wide range of language models as well as knowledge base management. Now we can complete the setup needed to run ChatOllama. Clone the repository from GitHub and copy the .env.example file to .env:
$ git clone https://github.com/sugarforever/chat-ollama.git
$ cd chat-ollama
$ cp .env.example .env
Make sure to install the dependencies:
$ pnpm install
Run a migration to create your database tables with Prisma Migrate:
$ pnpm prisma-migrate
Make sure both Ollama Server and Chroma DB are running. Start the development server on http://localhost:3000:
$ pnpm dev
Open the ChatOllama web interface by pointing your browser to http://localhost:3000.
Go to Settings to set the host of your Ollama server.
Click the Models tab to manage models: list, download, or delete them. Note: Ollama supports nomic-embed-text, a popular text embedding model with a large context window. This is also the model I will use in ChatOllama.
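If you prefer the command line, the embedding model can also be pulled directly with the Ollama CLI:
$ ollama pull nomic-embed-text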
Click the Knowledge Bases tab to manage knowledge bases: select files or folders (or directly enter URLs), specify the text embedding model, set a name, and create a knowledge base.
Click on the created knowledge base to go to the chat screen.
Using Ollama to build a localized RAG application gives you the flexibility, privacy, and customization that many developers and organizations seek. By combining powerful retrieval tools with efficient generative models, you can provide highly relevant and up-to-date responses tailored to your specific audience or region.