Working with massive amounts of data and Large Language Models (LLMs) has become commonplace, and you’ve undoubtedly heard the buzzwords and acronyms that surround the topic. In this article we’ll focus on one of them that illustrates a common use case for Wikimedia Enterprise APIs in the AI space. RAG, or Retrieval-Augmented Generation, is a machine learning approach that integrates real-time retrieval of relevant information into the response generation process. When you prompt the LLM, the RAG process consults a separate data store containing hyper-specific content to improve the accuracy and context of the response. That data store, or retrieval corpus, could be a subset of Wikipedia articles or your company’s FAQs: content that can answer questions an LLM may not have insight into, or at least not to the level of detail you require.
In this article we’re going to build a local RAG-based LLM demo application to show how a small subset of Wikipedia articles can improve the responses generated by open-source LLMs. We’ll use Meta’s Llama 3 with Ollama and mxbai vector embeddings to build the local model and database, plus Python scripts that pull data from our Structured Contents endpoint, part of the On-demand API. We recommend signing up for a free Wikimedia Enterprise account so you can pull the most recent article data, but we provide some sample data as a fallback. All of the code and example responses are available in the accompanying GitHub repo. Ready to build a local LLM? Let’s go.
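To make the idea concrete before we start, here is a minimal sketch of the retrieve-then-generate loop we’ll be assembling, using the same tools we install below. The two toy documents and the collection name are placeholders for illustration, not the demo’s actual code:
import chromadb  # vector database
import ollama    # client for the local Ollama server

# A tiny retrieval corpus standing in for our subset of Wikipedia articles.
docs = ["Starship is a super heavy-lift launch vehicle developed by SpaceX.",
        "Wikipedia is a free online encyclopedia maintained by volunteers."]

client = chromadb.Client()
collection = client.create_collection(name="toy_docs")
for i, doc in enumerate(docs):
    emb = ollama.embeddings(model="mxbai-embed-large", prompt=doc)["embedding"]
    collection.add(ids=[str(i)], embeddings=[emb], documents=[doc])

# Retrieval: embed the prompt and fetch the most similar document.
prompt = "Who builds Starship?"
q_emb = ollama.embeddings(model="mxbai-embed-large", prompt=prompt)["embedding"]
context = collection.query(query_embeddings=[q_emb], n_results=1)["documents"][0][0]

# Generation: answer the prompt with the retrieved text as grounding context.
answer = ollama.generate(model="llama3",
                         prompt=f"Using this context: {context}\n\nAnswer this question: {prompt}")
print(answer["response"])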
1. Download and install Ollama and follow the quick setup instructions: https://ollama.com/download
2. Download the llama3 and mxbai-embed-large models. In a terminal console, type (llama3 is a 4.7GB download, mxbai-embed-large is 670MB):
ollama pull mxbai-embed-large
ollama run llama3
Notes:
- As of March 2024, the mxbai-embed-large model achieves SOTA performance for BERT-large-sized models on the MTEB benchmark. It outperforms commercial models like OpenAI’s text-embedding-3-large and matches the performance of models 20x its size.
- Llama 3 (8B) instruction-tuned models are fine-tuned and optimized for dialogue/chat use cases and outperform many of the available open-source chat models on common benchmarks.
- Bonus: if you have a powerful laptop/desktop, you might want to swap the Llama 3 8-billion-parameter model for the 70-billion-parameter version, which has better inference quality and more internal knowledge. To use 70B instead, run this command (note: this is a 40GB download):
ollama run llama3:70b
Then change the line of code in query.py that loads the llama3 model to model="llama3:70b".
3. Verify that Ollama is running and serving the model; the output should be a JSON object with an embedding array of floating-point numbers:
curl http://localhost:11434/api/embeddings -d '{
"model": "mxbai-embed-large",
"prompt": "Summarize the features of Wikipedia in 5 bullet points"
}'
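If you’d rather script that check, the same request through Python’s requests library should return the same JSON (this is just a convenience; the curl call above is sufficient):
import requests

# Ask the local Ollama server for an embedding of the prompt.
resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "mxbai-embed-large",
          "prompt": "Summarize the features of Wikipedia in 5 bullet points"},
)
resp.raise_for_status()
print(len(resp.json()["embedding"]), "dimensions")  # mxbai-embed-large produces 1024-dimensional vectors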
4. Clone our demo repo with all the code to get started:
git clone https://github.com/wikimedia-enterprise/Structured-Contents-LLM-RAG.git
5. Create a Python virtual environment, activate it, and install the Python packages from requirements.txt:
python3 -m venv myenv && source myenv/bin/activate
pip install -r requirements.txt
6. Edit the environment variables file to add your Wikimedia Enterprise API credentials. If you don’t have an account yet, sign up for free. Then rename sample.env to .env and add your Wikimedia Enterprise username and password:
WIKI_API_USERNAME=username
WIKI_API_PASSWORD=password
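For reference, the download script reads these credentials from the environment. A typical pattern for that, assuming the python-dotenv package (the repo’s exact loading code may differ), looks like:
import os
from dotenv import load_dotenv

load_dotenv()  # read the .env file created above into the process environment
username = os.getenv("WIKI_API_USERNAME")
password = os.getenv("WIKI_API_PASSWORD")
assert username and password, "Missing Wikimedia Enterprise credentials in .env"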
Note: If you have a slow internet connection, you can skip the next step and instead use the /dataset/en_sample.csv file, which contains the structured Wikipedia data that Step 7 would otherwise download.
7. Review the Python in get_dataset.py, which calls the Wikimedia Enterprise On-demand API for 500 English articles. We’re using the Structured Contents endpoint, which returns pre-parsed article body sections so you can obtain clean data without extra parsing. Run the download with this command:
python get_dataset.py
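For context, the request flow inside get_dataset.py is roughly: authenticate with your credentials, then call the Structured Contents endpoint for each article. A simplified sketch of that flow is below; the endpoint paths, filter fields, and response shape shown here are approximations, so treat get_dataset.py in the repo as authoritative:
import os
import requests

AUTH_URL = "https://auth.enterprise.wikimedia.com/v1/login"               # assumed auth endpoint
SC_URL = "https://api.enterprise.wikimedia.com/v2/structured-contents/"   # assumed Structured Contents base

# Exchange username/password for an access token.
tokens = requests.post(AUTH_URL, json={
    "username": os.getenv("WIKI_API_USERNAME"),
    "password": os.getenv("WIKI_API_PASSWORD"),
}).json()

# Fetch one article's pre-parsed content from English Wikipedia.
resp = requests.post(
    SC_URL + "Squirrel",
    headers={"Authorization": f"Bearer {tokens['access_token']}"},
    json={"filters": [{"field": "is_part_of.identifier", "value": "enwiki"}], "limit": 1},
)
article = resp.json()[0]
print(article["name"], "-", len(article.get("sections", [])), "top-level sections")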
Notes:
- In get_dataset.py, we use multithreading to download the dataset, using your CPU cores to send many requests at once. If you prefer to keep it simple, there is a less complex downloader that fetches the data sequentially, but it takes considerably longer. See the code in pipelineV1() and pipelineV2(): the first function runs sequentially, the second in parallel. Notice that we use thread locking to guarantee the array is appended to without a race condition; a simplified sketch of this pattern follows these notes.
- The script will first check whether you’ve downloaded new data and will fall back to the sample data if not.
- One important function in this code is clean_text(), which parses the HTML tags and extracts the plain text the LLM expects. Data tidying is a big part of the machine learning workflow, so review the code in clean_text() to understand the text-cleaning steps.
- Wikimedia Enterprise has a number of added-value APIs that give developers easier access to cleaned Wikimedia data. You don’t need to be a data scientist or AI expert to integrate Wikipedia/Wikidata knowledge into your systems. Visit our developer documentation portal for more API info.
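The parallel-download-plus-cleanup pattern described in these notes looks roughly like the sketch below. It is simplified and not the repo’s code; fetch_article_html() is a stand-in for the real Structured Contents request:
import re
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

from bs4 import BeautifulSoup

rows = []
rows_lock = Lock()  # guards the shared list so parallel appends don't race

def fetch_article_html(title: str) -> str:
    # Stand-in for the real API request made in get_dataset.py.
    return f"<p><b>{title}</b> is an example article.</p>"

def clean_text(html: str) -> str:
    # Strip HTML tags and collapse whitespace into the plain text the model expects.
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ")
    return re.sub(r"\s+", " ", text).strip()

def fetch_and_clean(title: str) -> None:
    html = fetch_article_html(title)
    with rows_lock:  # thread-safe append, as in pipelineV2()
        rows.append({"title": title, "text": clean_text(html)})

def pipeline_parallel(titles: list[str], workers: int = 8) -> None:
    # Many requests in flight at once, one per worker thread.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fetch_and_clean, titles))

pipeline_parallel(["Squirrel", "SpaceX Starship"])
print(len(rows), "articles cleaned")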
8. Review the Python in import.py, which imports the CSV data from step 7 and loads it into ChromaDB. Then run it:
python import.py
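The gist of import.py is: read each article row from the CSV, embed its text with mxbai-embed-large, and store the embedding alongside the original text in a persistent ChromaDB collection. Here is a simplified sketch; the file path, column names, and collection name are assumptions, so check the repo’s import.py for the real values:
import csv

import chromadb
import ollama

client = chromadb.PersistentClient(path="chroma_db")             # assumed on-disk location
collection = client.get_or_create_collection(name="wikipedia")   # assumed collection name

with open("dataset/en_sample.csv", newline="", encoding="utf-8") as f:  # assumed CSV path
    for i, row in enumerate(csv.DictReader(f)):
        text = row["text"]  # assumed column name
        # One embedding per article; stored with the raw text for later retrieval.
        emb = ollama.embeddings(model="mxbai-embed-large", prompt=text)["embedding"]
        collection.add(ids=[str(i)], embeddings=[emb], documents=[text],
                       metadatas=[{"title": row["title"]}])  # assumed column name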
9. Review the Python in query.py, which takes your prompt, queries ChromaDB for the most relevant articles, and passes them to Llama 3 to generate the response. Run the Streamlit web UI with:
streamlit run query.py
Your browser should open to localhost with the Streamlit UI: a box for your prompts and a checkbox that enables RAG, i.e. retrieving Wikipedia articles to inform the generated response. For the purposes of this demo, send a prompt and wait for the response, then enable RAG and send the same prompt again to see how much better informed it can be with localized data from Wikipedia. We’ve just pulled 500 random articles for this demo, but I did load some specific pages about SpaceX and Starship launches to help test RAG; it works really nicely. Have fun with it. We have a handful of example responses in the repo for reference.
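Under the hood, the retrieve-then-generate flow in query.py looks roughly like this sketch (not the repo’s exact code; the collection name and prompt template are assumptions):
import chromadb
import ollama
import streamlit as st

client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection(name="wikipedia")  # assumed collection name

st.title("Wikipedia RAG demo")
prompt = st.text_area("Your prompt")
use_rag = st.checkbox("Use RAG (retrieve Wikipedia context)")

if st.button("Generate") and prompt:
    final_prompt = prompt
    if use_rag:
        # Embed the prompt, pull the closest articles, and prepend them as context.
        q_emb = ollama.embeddings(model="mxbai-embed-large", prompt=prompt)["embedding"]
        docs = collection.query(query_embeddings=[q_emb], n_results=3)["documents"][0]
        final_prompt = ("Answer using this context:\n" + "\n\n".join(docs)
                        + f"\n\nQuestion: {prompt}")
    # Swap "llama3" for "llama3:70b" here if you pulled the larger model in step 2.
    response = ollama.generate(model="llama3", prompt=final_prompt)
    st.write(response["response"])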
If you’d like to remove the models when you’re done, use the commands below to free up space. You can also safely delete all the code and data from this demo; there are no other dependencies.
ollama rm mxbai-embed-large
ollama rm llama3
Congratulations on building your own RAG-based LLM demo application! Through this hands-on demo, we’ve explored the power of Retrieval-Augmented Generation by leveraging the Wikimedia Enterprise On-demand API’s Structured Contents endpoint. By comparing responses with and without RAG, you can see how real-time retrieval of hyper-specific content can significantly enhance the contextual accuracy and relevance of generated responses. The Wikimedia Enterprise API can be an effective solution for integrating up-to-date, structured data into your models and knowledge graphs, making it an invaluable tool for any AI developer.
Thank you for following along with this tutorial. We hope it has been useful, and if you’d like to see more articles like this, we’d love to hear from you. Feel free to reach out to the team using the feedback form in your account dashboard, hit up @ChuckReynolds on X, or check out the accompanying GitHub repository to create a pull request or open an issue. Happy coding!
— Wikimedia Enterprise Team