Build a Personal ChatGPT Using Your Data
Harnessing AI to Build Your Personal Secure Knowledge Navigator
Have you ever wished for your own AI assistant that knows your taste in literature, understands your notes, and can converse with the nuances of your favorite authors? Today, we're embarking on a unique journey of turning this concept into reality. We will guide you through building your own ChatGPT using personal data like your favorite books or articles.
This post is for technical readers who love tinkering, as well as general readers who cherish their collections of digital literature.
Why You Should Build Your ChatGPT
In this era of information overload, our digital lives are brimming with data, much of which contains valuable knowledge and insights. Daily, we consume vast amounts of digital content, from the latest news articles to in-depth academic papers, eBooks, and even personal notes. While this wealth of data provides us with immense knowledge, navigating through it or retrieving specific pieces of information can often be daunting.
That's where building your personal ChatGPT comes into the picture. By leveraging artificial intelligence, we can make our data more interactive, accessible, and useful.
Here are some key reasons why you should consider building your own ChatGPT:
1. Personalized Knowledge Navigator
Your personalized ChatGPT can function as a unique knowledge navigator, understanding your notes, recalling details from your favorite books, and contextually answering your questions. It can even simulate the writing styles of your favorite authors, providing a novel and engaging way to interact with your data.
2. Efficient Data Retrieval
Instead of manually searching your files for specific information, your ChatGPT can provide efficient and relevant responses to your queries. This enables faster decision-making and saves precious time.
3. Learning & Development
Building a chatbot from your data also serves as an excellent learning experience. It offers a hands-on introduction to exciting areas of AI, such as Natural Language Processing, information retrieval, and machine learning.
4. Privacy and Control
With your personal ChatGPT, you have complete control over your data. You can run it locally to ensure data privacy, and you decide what information to feed into your model. This is a significant advantage in a world where data privacy is a growing concern.
How to Build Your ChatGPT
Curious to get started? Visit this GitHub repository for a step-by-step guide and complete codebase.
Gathering Your Data
The first step involves gathering all your text data into a common source. You could have a variety of text files, PDFs, eBooks, and other forms of text data. You can even employ the Google Drive reader to index data straight from your Google Drive.
For our toy project, let's process a batch of PDF files located in a local folder. The function below will:
1. Iterate over all PDF files in the specified folder.
2. Extract the text from each PDF.
3. Write the extracted text into a `.txt` file in the provided output folder.
4. Split the text into chunks that fit within the model's token limit, measured with the GPT-2 tokenizer. Here we use the `RecursiveCharacterTextSplitter` class from the `langchain` library.
5. Return all chunks as a list.
```python
import os
import textract
from transformers import GPT2TokenizerFast
from langchain.text_splitter import RecursiveCharacterTextSplitter

def process_pdf_folder(pdf_folder_name, txt_folder_name):
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=24,
        # length_function must return a count, so wrap the tokenizer
        length_function=lambda text: len(tokenizer.encode(text)),
    )
    all_chunks = []
    for filename in os.listdir(pdf_folder_name):
        if filename.endswith(".pdf"):
            filepath = os.path.join(pdf_folder_name, filename)
            doc = textract.process(filepath)
            txt_filename = filename.replace(".pdf", ".txt")
            txt_filepath = os.path.join(txt_folder_name, txt_filename)
            with open(txt_filepath, 'w') as f:
                f.write(doc.decode('utf-8'))
            with open(txt_filepath, 'r') as f:
                text = f.read()
            chunks = text_splitter.create_documents([text])
            all_chunks.append(chunks)
    return all_chunks
```
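To build intuition for what `chunk_size` and `chunk_overlap` control, here is a minimal, dependency-free sketch of overlapping chunking. The function `split_with_overlap` is illustrative only; it is not part of `langchain`, whose splitter also tries to break on natural boundaries like paragraphs and sentences:

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that overlap by chunk_overlap."""
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Ten toy "tokens", windows of 4 that overlap by 1:
print(split_with_overlap(list(range(10)), chunk_size=4, chunk_overlap=1))
# -> [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

The overlap means each chunk repeats a little of its neighbor, so a sentence straddling a chunk boundary is still fully contained in at least one chunk.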
Indexing Your Data
Once you've gathered your data, you'll need to index it. This is where langchain comes into play. Langchain is a framework for building applications around language models; here, it creates document embeddings, numeric vectors that capture meaning, from the manageable chunks of text extracted from your files.
To store these embeddings, you can use either local storage or a cloud provider:
1. FAISS (Facebook AI Similarity Search) is a library that allows efficient similarity search and clustering of dense vectors. Your embeddings stored in a FAISS database can be effectively retrieved when interacting with your chatbot.
2. Pinecone is a vector database designed for machine learning applications. It's ideal if you prefer storing your embeddings in the cloud.
If you are interested in exploring other vector databases, I recommend watching Fireship's video to learn more about different vector databases.
Let us use FAISS for our toy project. For the chunks generated above, we create embeddings and store them in a single FAISS index:

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

embeddings = OpenAIEmbeddings()
# Flatten the per-file chunk lists and build one searchable index.
db = FAISS.from_documents(
    [chunk for chunks in all_chunks for chunk in chunks], embeddings
)
```
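Under the hood, a vector store answers one question: which stored vectors are closest to the query vector? Here is a pure-Python sketch of that idea using cosine similarity (no FAISS, toy three-dimensional vectors; real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest(query, index):
    """Return the stored text whose embedding is most similar to the query."""
    return max(index, key=lambda item: cosine_similarity(query, item[1]))[0]

# Toy "embeddings": in practice these come from an embedding model.
index = [
    ("yoga notes", [0.9, 0.1, 0.0]),
    ("cooking recipes", [0.0, 0.2, 0.9]),
]
print(nearest([0.8, 0.2, 0.1], index))  # -> yoga notes
```

FAISS does this same job, but with data structures that stay fast across millions of vectors.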
Get User Query
We now have a system that can accept a question and fetch the most suitable response from our database of document chunks. We take user queries and feed them into a ConversationalRetrievalChain instance, which wraps OpenAI's language model around our retriever.
```python
from langchain.chains import ConversationalRetrievalChain
from langchain.llms import OpenAI

chat_history = []
qa = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0.1), db.as_retriever())

while True:
    query = input("Enter a query (type 'exit' to quit): ")
    if query.lower() == "exit":
        break
    result = qa({"question": query, "chat_history": chat_history})
    chat_history.append((query, result['answer']))
    print(result['answer'])
```
Instead of rebuilding this, you can simply download the entire code snippet from this GitHub repository. Happy building!
Addressing Privacy Concerns
When dealing with personal data, we understand the paramount importance of privacy. Per OpenAI's API data-usage policy, data passed to the API is not used for training. Moreover, you can run the system locally with FAISS if you prefer not to send your index to the cloud. If you do use Pinecone for cloud storage, your embeddings live in your own private index, inaccessible to others.
A Personal Case Study
As a testament to the power of this tooling, I've indexed all the publicly available teachings of Swami Sivananda, a cherished author of mine. Using this, I've built a chatbot to engage in enlightening discussions, reflecting his wisdom. I invite you to experience a conversation with this AI version of Swami Sivananda at swamisivananda.ai. This exemplifies the unique and personal interactions you can create by building your own ChatGPT.
Remember, technology is a powerful tool, but our creativity transforms this tool into something magical. Keep exploring, and keep innovating!