Question and Answer with OpenAI and RedisVL#
This example shows how to use RedisVL to create a question and answer system using OpenAI’s API.
In this notebook we will:
Download a dataset of Wikipedia articles (thanks to OpenAI’s CDN)
Create embeddings for each article
Create a RedisVL index and store the embeddings with metadata
Construct a simple QnA system using the index and GPT-3.5
Improve the QnA system with LLM caching
The image below shows the architecture of the system we will create in this notebook.
Setup#
To run this example, you need a Redis Stack instance running locally (or you can spin one up for free on Redis Cloud). To start one locally, run the following command in your terminal:
docker run --name redis -d -p 6379:6379 -p 8001:8001 redis/redis-stack:latest
This will also provide the RedisInsight GUI at http://localhost:8001
Next, we will install the dependencies for this notebook.
# first we need to install a few things
%pip install pandas wget tenacity tiktoken openai==0.28.1
import wget
import pandas as pd
embeddings_url = 'https://cdn.openai.com/API/examples/data/wikipedia_articles_2000.csv'
wget.download(embeddings_url)
df = pd.read_csv('wikipedia_articles_2000.csv')
df = df.drop(columns=['Unnamed: 0'])
df.head()
 | id | url | title | text |
---|---|---|---|---|
0 | 3661 | https://simple.wikipedia.org/wiki/Photon | Photon | Photons (from Greek φως, meaning light), in m... |
1 | 7796 | https://simple.wikipedia.org/wiki/Thomas%20Dolby | Thomas Dolby | Thomas Dolby (born Thomas Morgan Robertson; 14... |
2 | 67912 | https://simple.wikipedia.org/wiki/Embroidery | Embroidery | Embroidery is the art of decorating fabric or ... |
3 | 44309 | https://simple.wikipedia.org/wiki/Consecutive%... | Consecutive integer | Consecutive numbers are numbers that follow ea... |
4 | 41741 | https://simple.wikipedia.org/wiki/German%20Empire | German Empire | The German Empire ("Deutsches Reich" or "Deuts... |
Data Preparation#
Text Chunking#
To create embeddings for the articles, we need to chunk the text into smaller pieces, because the OpenAI API limits how much text can be sent in a single embedding request. The code that follows draws heavily on a notebook by OpenAI.
TEXT_EMBEDDING_CHUNK_SIZE = 1000
EMBEDDINGS_MODEL = "text-embedding-ada-002"
def chunks(text, n, tokenizer):
    """Yield successive n-sized chunks from text.

    Split a text into smaller chunks of size n, preferably ending at the end of a sentence
    """
    tokens = tokenizer.encode(text)
    i = 0
    while i < len(tokens):
        # Find the nearest end of sentence within a range of 0.5 * n and 1.5 * n tokens
        j = min(i + int(1.5 * n), len(tokens))
        while j > i + int(0.5 * n):
            # Decode the tokens and check for full stop or newline
            chunk = tokenizer.decode(tokens[i:j])
            if chunk.endswith(".") or chunk.endswith("\n"):
                break
            j -= 1
        # If no end of sentence found, use n tokens as the chunk size
        if j == i + int(0.5 * n):
            j = min(i + n, len(tokens))
        yield tokens[i:j]
        i = j

def get_unique_id_for_file_chunk(title, chunk_index):
    return str(title+"-!"+str(chunk_index))
def chunk_text(record, tokenizer):
    """Return a list of chunk records (dicts with id, url, title, content, file_chunk_index) for one article."""
    chunked_records = []
    url = record['url']
    title = record['title']
    file_body_string = record['text']

    # Split the article into token chunks, then decode each chunk back to text
    # and prefix it with the article title
    token_chunks = list(chunks(file_body_string, TEXT_EMBEDDING_CHUNK_SIZE, tokenizer))
    text_chunks = [f'Title: {title};\n'+ tokenizer.decode(chunk) for chunk in token_chunks]

    for i, text_chunk in enumerate(text_chunks):
        doc_id = get_unique_id_for_file_chunk(title, i)
        chunked_records.append(({"id": doc_id,
                                 "url": url,
                                 "title": title,
                                 "content": text_chunk,
                                 "file_chunk_index": i}))
    return chunked_records
# Initialise tokenizer
import tiktoken
oai_tokenizer = tiktoken.get_encoding("cl100k_base")
records = []
for _, record in df.iterrows():
    records.extend(chunk_text(record, oai_tokenizer))
chunked_data = pd.DataFrame(records)
chunked_data.head()
 | id | url | title | content | file_chunk_index |
---|---|---|---|---|---|
0 | Photon-!0 | https://simple.wikipedia.org/wiki/Photon | Photon | Title: Photon;\nPhotons (from Greek φως, mean... | 0 |
1 | Photon-!1 | https://simple.wikipedia.org/wiki/Photon | Photon | Title: Photon;\nElementary particles | 1 |
2 | Thomas Dolby-!0 | https://simple.wikipedia.org/wiki/Thomas%20Dolby | Thomas Dolby | Title: Thomas Dolby;\nThomas Dolby (born Thoma... | 0 |
3 | Embroidery-!0 | https://simple.wikipedia.org/wiki/Embroidery | Embroidery | Title: Embroidery;\nEmbroidery is the art of d... | 0 |
4 | Consecutive integer-!0 | https://simple.wikipedia.org/wiki/Consecutive%... | Consecutive integer | Title: Consecutive integer;\nConsecutive numbe... | 0 |
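To see how the sentence-aware chunking behaves on a small input, the sketch below re-runs the same `chunks` logic with a stand-in tokenizer. `WhitespaceTokenizer` is a hypothetical helper introduced only for illustration (one token per word); the notebook itself uses the tiktoken `cl100k_base` encoding, whose tokens do not align with words.

```python
class WhitespaceTokenizer:
    """Hypothetical stand-in for the tiktoken encoder: one token per word."""

    def encode(self, text):
        return text.split(" ")

    def decode(self, tokens):
        return " ".join(tokens)


def chunks(text, n, tokenizer):
    """Same logic as the notebook's chunks(): prefer to end on a sentence boundary."""
    tokens = tokenizer.encode(text)
    i = 0
    while i < len(tokens):
        # Start from the largest allowed window (1.5 * n) and shrink it
        # until the decoded chunk ends with a full stop or newline
        j = min(i + int(1.5 * n), len(tokens))
        while j > i + int(0.5 * n):
            chunk = tokenizer.decode(tokens[i:j])
            if chunk.endswith(".") or chunk.endswith("\n"):
                break
            j -= 1
        # No sentence end found in the window: fall back to exactly n tokens
        if j == i + int(0.5 * n):
            j = min(i + n, len(tokens))
        yield tokens[i:j]
        i = j


text = ("Photons are particles of light. They have no mass. "
        "They always move at the speed of light.")
tok = WhitespaceTokenizer()
pieces = [tok.decode(c) for c in chunks(text, 6, tok)]
for piece in pieces:
    print(repr(piece))
# 'Photons are particles of light. They have no mass.'
# 'They always move at the speed of light.'
```

With a target size of 6 tokens, the first chunk grows to 9 tokens (the 1.5 * n limit) because that is where a sentence ends, which is why chunk lengths in the real data vary around `TEXT_EMBEDDING_CHUNK_SIZE`.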
Embedding Creation#
With the text broken up into chunks, we can create embeddings with the OpenAITextVectorizer. This provider uses the OpenAI API to create embeddings for the text. The code below shows how to create embeddings for the text chunks.
import os
import getpass
from redisvl.utils.vectorize import OpenAITextVectorizer
api_key = os.getenv("OPENAI_API_KEY") or getpass.getpass("Enter your OpenAI API key: ")
oaip = OpenAITextVectorizer(EMBEDDINGS_MODEL, api_config={"api_key": api_key})
chunked_data["embedding"] = oaip.embed_many(chunked_data["content"].tolist(), as_buffer=True, dtype="float32")
chunked_data
 | id | url | title | content | file_chunk_index | embedding |
---|---|---|---|---|---|---|
0 | Photon-!0 | https://simple.wikipedia.org/wiki/Photon | Photon | Title: Photon;\nPhotons (from Greek φως, mean... | 0 | b'\x9e\xbf\xc9;\xca\x8e\xfb;\x00\xf8P\xbc\xe5\... |
1 | Photon-!1 | https://simple.wikipedia.org/wiki/Photon | Photon | Title: Photon;\nElementary particles | 1 | b'd\xda#\xbc\xb7\xf1\x8c<\xea\xd0m\xbc\x13\x8b... |
2 | Thomas Dolby-!0 | https://simple.wikipedia.org/wiki/Thomas%20Dolby | Thomas Dolby | Title: Thomas Dolby;\nThomas Dolby (born Thoma... | 0 | b'NG\xce\xbck\xf0\xb2;\x81\xed\xd7\xbc\xb6\x94... |
3 | Embroidery-!0 | https://simple.wikipedia.org/wiki/Embroidery | Embroidery | Title: Embroidery;\nEmbroidery is the art of d... | 0 | b'\xa4\xba\xf5\xbcS\xf3\x02\xbc\xa1\x15O\xbc\x... |
4 | Consecutive integer-!0 | https://simple.wikipedia.org/wiki/Consecutive%... | Consecutive integer | Title: Consecutive integer;\nConsecutive numbe... | 0 | b'0(\xfa\xbb\x81\xd2\xd9;\xaf\x92\x9a;\xd3FL\x... |
... | ... | ... | ... | ... | ... | ... |
2688 | Alanis Morissette-!1 | https://simple.wikipedia.org/wiki/Alanis%20Mor... | Alanis Morissette | Title: Alanis Morissette;\nTwin people from Ca... | 1 | b'Ii4\xbc\x8e>\xe0\xbc\x18]\x07\xbb%\xa0\x92\x... |
2689 | Brontosaurus-!0 | https://simple.wikipedia.org/wiki/Brontosaurus | Brontosaurus | Title: Brontosaurus;\nBrontosaurus is a genus... | 0 | b'\xad\xa5\xdb\xbc\xa5\xa5\xba:\xb4"\x81\xbc\x... |
2690 | Work (physics)-!0 | https://simple.wikipedia.org/wiki/Work%20%28ph... | Work (physics) | Title: Work (physics);\nIn physics, a force do... | 0 | b'\x97\x82\xb9\xbbL\x90d\xbc\xb7G\x9c\xba\x94g... |
2691 | Syllable-!0 | https://simple.wikipedia.org/wiki/Syllable | Syllable | Title: Syllable;\nA syllable is a unit of pron... | 0 | b'\xe4\xa3\x1c:\x83g\x90<\x99=s;*[E\xbb\x10 "\... |
2692 | Syllable-!1 | https://simple.wikipedia.org/wiki/Syllable | Syllable | Title: Syllable;\nGrammar | 1 | b'T,-\xbbS\xe5\x87;\x1c\x0f\x9d:\xc4\xd4\xcd:\... |
2693 rows × 6 columns
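The `embedding` column above holds raw bytes rather than lists of floats because `embed_many` was called with `as_buffer=True` and `dtype="float32"`. The sketch below illustrates the idea using only the standard library, assuming the buffer is simply the vector's float32 values packed back to back; `to_buffer` and `from_buffer` are hypothetical helper names, not part of RedisVL.

```python
from array import array


def to_buffer(vector):
    """Pack a list of floats into float32 bytes (4 bytes per dimension)."""
    return array("f", vector).tobytes()


def from_buffer(buf):
    """Unpack float32 bytes back into a list of Python floats."""
    out = array("f")
    out.frombytes(buf)
    return list(out)


vec = [0.5, -1.25, 2.0]   # values chosen to be exactly representable in float32
buf = to_buffer(vec)

print(len(buf))           # 3 dimensions * 4 bytes
print(from_buffer(buf))   # round-trips to the original values
```

Under this assumption, a 1536-dimensional `text-embedding-ada-002` vector occupies 1536 * 4 = 6144 bytes per chunk.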
Construct the SearchIndex#
Now that we have the embeddings, we can create a SearchIndex to store them in Redis. We will use the SearchIndex to store the embeddings and metadata for each article.
Define the wikipedia IndexSchema#
%%writefile wiki_schema.yaml
version: '0.1.0'

index:
    name: wikipedia
    prefix: chunk

fields:
    - name: content
      type: text
    - name: title
      type: text
    - name: id
      type: tag
    - name: embedding
      type: vector
      attrs:
        dims: 1536
        distance_metric: cosine
        algorithm: flat
Overwriting wiki_schema.yaml
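The `embedding` field in the schema combines `distance_metric: cosine` with `algorithm: flat`, that is, an exact brute-force comparison of the query against every stored vector. As a rough illustration of what that means (a pure-Python sketch, not how Redis implements it), the example below ranks a few made-up 2-dimensional vectors by cosine distance. The keys and vectors are invented for the example.

```python
import math


def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity (0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def flat_search(query, docs, k=1):
    """Brute-force search: score every vector, return the k closest."""
    scored = sorted((cosine_distance(query, vec), key) for key, vec in docs.items())
    return [(key, dist) for dist, key in scored[:k]]


# Hypothetical keys and toy vectors, for illustration only
docs = {
    "chunk:a": [1.0, 0.0],
    "chunk:b": [0.0, 1.0],
    "chunk:c": [1.0, 1.0],
}

top = flat_search([2.0, 0.1], docs, k=2)
for key, dist in top:
    print(key, round(dist, 3))
```

A flat index scans every vector on each query, which is a reasonable fit for a few thousand chunks like this dataset; larger collections usually trade exactness for speed with an approximate index.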
import redis.asyncio as redis
from redisvl.index import AsyncSearchIndex
from redisvl.schema import IndexSchema
client = redis.Redis.from_url("redis://localhost:6379")
schema = IndexSchema.from_yaml("wiki_schema.yaml")
index = await AsyncSearchIndex(schema).set_client(client)
await index.create()
!rvl index listall
16:00:26 [RedisVL] INFO Indices:
16:00:26 [RedisVL] INFO 1. wikipedia
Load the wikipedia dataset#
keys = await index.load(chunked_data.to_dict(orient="records"))
Build a simple QnA System#
Now that we have the data and the embeddings, we can build the QnA system. The system will perform three actions
Embed the user question and search for the most similar content
Make a prompt with the query and retrieved content
Send the prompt to the OpenAI API and return the answer
import openai
from redisvl.query import VectorQuery
CHAT_MODEL = "gpt-3.5-turbo"
def make_prompt(query, content):
    retrieval_prompt = f'''Use the content to answer the search query the customer has sent.
If you can't answer the user's question, do not guess. If there is no content, respond with "I don't know".
Search query:
{query}
Content:
{content}
Answer:
'''
    return retrieval_prompt
async def retrieve_context(index: AsyncSearchIndex, query: str):
    # Embed the query
    query_embedding = await oaip.aembed(query)
    # Get the top result from the index
    vector_query = VectorQuery(
        vector=query_embedding,
        vector_field_name="embedding",
        return_fields=["content"],
        num_results=1
    )
    results = await index.query(vector_query)
    content = ""
    # num_results=1 returns at most one result, so check for any result
    # (the previous `len(results) > 1` check could never be true)
    if results:
        content = results[0]["content"]
    return content
async def answer_question(index: AsyncSearchIndex, query: str):
    # Retrieve the context
    content = await retrieve_context(index, query)
    prompt = make_prompt(query, content)
    retrieval = await openai.ChatCompletion.acreate(
        model=CHAT_MODEL,
        messages=[{'role':"user", 'content': prompt}],
        max_tokens=50
    )
    # Response provided by GPT-3.5
    return retrieval['choices'][0]['message']['content']
import textwrap
question = "What is a Brontosaurus?"
textwrap.wrap(await answer_question(index, question), width=80)
['A Brontosaurus, also known as Apatosaurus, is a type of large, long-necked',
'dinosaur that lived during the Late Jurassic Period, about 150 million years',
'ago. They were herbivores and belonged to the saurop']
# Question that makes no sense
question = "What is a trackiosamidon?"
await answer_question(index, question)
"I don't know."
question = "Tell me about the life of Alanis Morissette"
textwrap.wrap(await answer_question(index, question))
['Alanis Morissette is a Canadian-American singer-songwriter and',
'actress. She gained international fame with her third studio album,',
'"Jagged Little Pill," released in 1995. The album went on to become a',
'massive success, selling over']
Improve the QnA System with LLM caching#
The QnA system we built above is pretty good, but we can use the SemanticCache to improve its throughput and stability. The SemanticCache stores the results of previous queries and returns them when a new query is similar enough to a previous one. This reduces the number of round trips we need to make to the OpenAI API.
Note that this technique only pays off when we expect users to ask similar questions repeatedly.
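To make the idea concrete, here is a minimal in-memory sketch of a semantic cache with a distance threshold. It is not the RedisVL implementation: `ToySemanticCache` and `toy_embed` are hypothetical names, and the bag-of-words "embedding" stands in for a real embedding model purely so the example runs without any external service.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical embedding: word counts (a real cache would use a model)."""
    return Counter(text.lower().replace("?", "").split())


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


class ToySemanticCache:
    def __init__(self, distance_threshold=0.2):
        self.distance_threshold = distance_threshold
        self.entries = []  # list of (embedding, response) pairs

    def store(self, prompt, response):
        self.entries.append((toy_embed(prompt), response))

    def check(self, prompt):
        """Return the closest cached response within the threshold, else None."""
        query = toy_embed(prompt)
        best = None
        for embedding, response in self.entries:
            dist = cosine_distance(query, embedding)
            if dist <= self.distance_threshold and (best is None or dist < best[0]):
                best = (dist, response)
        return best[1] if best else None


toy_cache = ToySemanticCache(distance_threshold=0.2)
toy_cache.store("Who is Alanis Morissette?", "A singer-songwriter.")

hit = toy_cache.check("Who is Alanis Morissette really?")  # similar wording
miss = toy_cache.check("What is a photon?")                # unrelated question
print(hit, miss)
```

A smaller threshold makes the cache stricter (fewer hits, fewer wrong answers), while a larger one returns cached answers for looser matches.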
from redisvl.extensions.llmcache import SemanticCache
cache = SemanticCache(name="qna_cache", redis_url="redis://localhost:6379", distance_threshold=0.2)
async def answer_question(index: AsyncSearchIndex, query: str):
    # check the cache
    if result := cache.check(prompt=query):
        return result[0]['response']
    # Retrieve the context
    content = await retrieve_context(index, query)
    prompt = make_prompt(query, content)
    retrieval = await openai.ChatCompletion.acreate(
        model=CHAT_MODEL,
        messages=[{'role':"user", 'content': prompt}],
        max_tokens=500
    )
    # Response provided by GPT-3.5
    answer = retrieval['choices'][0]['message']['content']
    # cache the query_embedding and answer
    cache.store(query, answer)
    return answer
# ask a question to cache an answer
import time
start = time.time()
question = "Tell me about the life of Alanis Morissette"
answer = await answer_question(index, question)
print(f"Time taken: {time.time() - start}\n")
textwrap.wrap(answer, width=80)
Time taken: 6.253775119781494
['Alanis Morissette is a Canadian singer, songwriter, and actress. She was born on',
'June 1, 1974, in Ottawa, Ontario, Canada. Morissette began her career in the',
'music industry as a child, releasing her first album "Alanis" in 1991. However,',
'it was her third studio album, "Jagged Little Pill," released in 1995, that',
'brought her international fame and critical acclaim. The album sold over 33',
'million copies worldwide and produced hit singles such as "You Oughta Know,"',
'"Ironic," and "Hand in My Pocket." Throughout her career, Morissette has',
'continued to release successful albums and has received numerous awards,',
'including Grammy Awards, Juno Awards, and Billboard Music Awards. Her music',
'often explores themes of love, relationships, self-discovery, and spirituality.',
'Some of her other notable albums include "Supposed Former Infatuation Junkie,"',
'"Under Rug Swept," and "Flavors of Entanglement." In addition to her music',
'career, Alanis Morissette has also ventured into acting. She has appeared in',
'films such as "Dogma" and "Radio Free Albemuth," as well as on television shows',
'like "Weeds" and "Sex and the City." Offstage, Morissette has been open about',
'her struggles with mental health and has become an advocate for mental wellness.',
'She has also expressed her views on feminism and spirituality in her music and',
'interviews. Overall, Alanis Morissette has had a successful and influential',
'career in the music industry, with her powerful and emotional songs resonating',
'with audiences around the world.']
# Same question, return cached answer, save time, save money :)
start = time.time()
answer = await answer_question(index, question)
print(f"Time taken with cache: {time.time() - start}\n")
textwrap.wrap(answer, width=80)
Time taken with cache: 0.3175082206726074
['Alanis Morissette is a Canadian-American singer, songwriter, and actress. She',
'rose to fame in the 1990s with her breakthrough album "Jagged Little Pill,"',
'which became one of the best-selling albums of all time. Born on June 1, 1974,',
'in Ottawa, Ontario, Morissette began her career as a teen pop star in Canada',
'before transitioning to alternative rock. Throughout her career, Morissette has',
'released several successful albums and has won numerous awards, including',
'multiple Grammy Awards. Her music often explores themes of female empowerment,',
'personal introspection, and social commentary. Some of her notable songs include',
'"Ironic," "You Oughta Know," and "Hand in My Pocket." In addition to her music',
'career, Morissette has also acted in various films and television shows. She is',
'known for her roles in movies such as "Dogma" and "Jay and Silent Bob Strike',
'Back." Morissette has been transparent about her personal struggles, including',
'her experiences with eating disorders, depression, and postpartum depression.',
'She has used her platform to advocate for mental health awareness and has been',
'involved in various charitable causes. Overall, Alanis Morissette has had a',
'successful and influential career in the music industry while also making an',
'impact beyond music.']
# Asking a semantically similar (but not identical) question returns the same answer from the cache.
# In this case, the semantic distance between the two questions is within the
# distance_threshold of 0.2 that the cache is set to.
start = time.time()
question = "Who is Alanis Morissette?"
answer = await answer_question(index, question)
print(f"Time taken with the cache: {time.time() - start}\n")
textwrap.wrap(answer, width=80)
Time taken with the cache: 0.26262593269348145
['Alanis Morissette is a Canadian-American singer, songwriter, and actress. She',
'rose to fame in the 1990s with her breakthrough album "Jagged Little Pill,"',
'which became one of the best-selling albums of all time. Born on June 1, 1974,',
'in Ottawa, Ontario, Morissette began her career as a teen pop star in Canada',
'before transitioning to alternative rock. Throughout her career, Morissette has',
'released several successful albums and has won numerous awards, including',
'multiple Grammy Awards. Her music often explores themes of female empowerment,',
'personal introspection, and social commentary. Some of her notable songs include',
'"Ironic," "You Oughta Know," and "Hand in My Pocket." In addition to her music',
'career, Morissette has also acted in various films and television shows. She is',
'known for her roles in movies such as "Dogma" and "Jay and Silent Bob Strike',
'Back." Morissette has been transparent about her personal struggles, including',
'her experiences with eating disorders, depression, and postpartum depression.',
'She has used her platform to advocate for mental health awareness and has been',
'involved in various charitable causes. Overall, Alanis Morissette has had a',
'successful and influential career in the music industry while also making an',
'impact beyond music.']
# Cleanup
await index.delete()