{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Question and Answer with OpenAI and RedisVL\n", "\n", "This example shows how to use RedisVL to create a question and answer system using OpenAI's API.\n", "\n", "In this notebook we will\n", "1. Download a dataset of wikipedia articles (thanks to OpenAI's CDN)\n", "2. Create embeddings for each article\n", "3. Create a RedisVL index and store the embeddings with metadata\n", "4. Construct a simple QnA system using the index and GPT-3\n", "5. Improve the QnA system with LLM caching\n", "\n", "\n", "The image below shows the architecture of the system we will create in this notebook.\n", "\n", "![Diagram](https://github.com/RedisVentures/redis-openai-qna/raw/main/app/assets/RedisOpenAI-QnA-Architecture.drawio.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup\n", "\n", "In order to run this example, you will need to have a Redis Stack running locally (or spin up for free on [Redis Cloud](https://redis.com/try-free)). You can do this by running the following command in your terminal:\n", "\n", "```bash\n", "docker run --name redis -d -p 6379:6379 -p 8001:8001 redis/redis-stack:latest\n", "```\n", "\n", "This will also provide the RedisInsight GUI at http://localhost:8001\n", "\n", "Next, we will install the dependencies for this notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# first we need to install a few things\n", "\n", "%pip install pandas wget tenacity tiktoken openai==0.28.1" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import wget\n", "import pandas as pd\n", "\n", "embeddings_url = 'https://cdn.openai.com/API/examples/data/wikipedia_articles_2000.csv'\n", "\n", "wget.download(embeddings_url)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idurltitletext
03661https://simple.wikipedia.org/wiki/PhotonPhotonPhotons (from Greek φως, meaning light), in m...
17796https://simple.wikipedia.org/wiki/Thomas%20DolbyThomas DolbyThomas Dolby (born Thomas Morgan Robertson; 14...
267912https://simple.wikipedia.org/wiki/EmbroideryEmbroideryEmbroidery is the art of decorating fabric or ...
344309https://simple.wikipedia.org/wiki/Consecutive%...Consecutive integerConsecutive numbers are numbers that follow ea...
441741https://simple.wikipedia.org/wiki/German%20EmpireGerman EmpireThe German Empire (\"Deutsches Reich\" or \"Deuts...
\n", "
" ], "text/plain": [ " id url \\\n", "0 3661 https://simple.wikipedia.org/wiki/Photon \n", "1 7796 https://simple.wikipedia.org/wiki/Thomas%20Dolby \n", "2 67912 https://simple.wikipedia.org/wiki/Embroidery \n", "3 44309 https://simple.wikipedia.org/wiki/Consecutive%... \n", "4 41741 https://simple.wikipedia.org/wiki/German%20Empire \n", "\n", " title text \n", "0 Photon Photons (from Greek φως, meaning light), in m... \n", "1 Thomas Dolby Thomas Dolby (born Thomas Morgan Robertson; 14... \n", "2 Embroidery Embroidery is the art of decorating fabric or ... \n", "3 Consecutive integer Consecutive numbers are numbers that follow ea... \n", "4 German Empire The German Empire (\"Deutsches Reich\" or \"Deuts... " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('wikipedia_articles_2000.csv')\n", "df = df.drop(columns=['Unnamed: 0'])\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Preparation\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Text Chunking\n", "\n", "In order to create embeddings for the articles, we will need to chunk the text into smaller pieces. This is because there is a maximum length of text that can be sent to the OpenAI API. The code that follows pulls heavily from this [notebook](https://github.com/openai/openai-cookbook/blob/main/apps/enterprise-knowledge-retrieval/enterprise_knowledge_retrieval.ipynb) by OpenAI\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "TEXT_EMBEDDING_CHUNK_SIZE = 1000\n", "EMBEDDINGS_MODEL = \"text-embedding-ada-002\"\n", "\n", "\n", "def chunks(text, n, tokenizer):\n", " tokens = tokenizer.encode(text)\n", " \"\"\"Yield successive n-sized chunks from text.\n", "\n", " Split a text into smaller chunks of size n, preferably ending at the end of a sentence\n", " \"\"\"\n", " i = 0\n", " while i < len(tokens):\n", " # Find the nearest end of sentence within a range of 0.5 * n and 1.5 * n tokens\n", " j = min(i + int(1.5 * n), len(tokens))\n", " while j > i + int(0.5 * n):\n", " # Decode the tokens and check for full stop or newline\n", " chunk = tokenizer.decode(tokens[i:j])\n", " if chunk.endswith(\".\") or chunk.endswith(\"\\n\"):\n", " break\n", " j -= 1\n", " # If no end of sentence found, use n tokens as the chunk size\n", " if j == i + int(0.5 * n):\n", " j = min(i + n, len(tokens))\n", " yield tokens[i:j]\n", " i = j\n", "\n", "def get_unique_id_for_file_chunk(title, chunk_index):\n", " return str(title+\"-!\"+str(chunk_index))\n", "\n", "def chunk_text(record, tokenizer):\n", " chunked_records = []\n", "\n", " url = record['url']\n", " title = record['title']\n", " file_body_string = record['text']\n", "\n", " \"\"\"Return a list of tuples (text_chunk, embedding) for a text.\"\"\"\n", " token_chunks = list(chunks(file_body_string, TEXT_EMBEDDING_CHUNK_SIZE, tokenizer))\n", " text_chunks = [f'Title: {title};\\n'+ tokenizer.decode(chunk) for chunk in token_chunks]\n", "\n", " for i, text_chunk in enumerate(text_chunks):\n", " doc_id = get_unique_id_for_file_chunk(title, i)\n", " chunked_records.append(({\"id\": doc_id,\n", " \"url\": url,\n", " \"title\": title,\n", " \"content\": text_chunk,\n", " \"file_chunk_index\": i}))\n", " return chunked_records" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Initialise tokenizer\n", "import tiktoken\n", "oai_tokenizer = tiktoken.get_encoding(\"cl100k_base\")\n", "\n", "records = []\n", 
"for _, record in df.iterrows():\n", " records.extend(chunk_text(record, oai_tokenizer))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idurltitlecontentfile_chunk_index
0Photon-!0https://simple.wikipedia.org/wiki/PhotonPhotonTitle: Photon;\\nPhotons (from Greek φως, mean...0
1Photon-!1https://simple.wikipedia.org/wiki/PhotonPhotonTitle: Photon;\\nElementary particles1
2Thomas Dolby-!0https://simple.wikipedia.org/wiki/Thomas%20DolbyThomas DolbyTitle: Thomas Dolby;\\nThomas Dolby (born Thoma...0
3Embroidery-!0https://simple.wikipedia.org/wiki/EmbroideryEmbroideryTitle: Embroidery;\\nEmbroidery is the art of d...0
4Consecutive integer-!0https://simple.wikipedia.org/wiki/Consecutive%...Consecutive integerTitle: Consecutive integer;\\nConsecutive numbe...0
\n", "
" ], "text/plain": [ " id url \\\n", "0 Photon-!0 https://simple.wikipedia.org/wiki/Photon \n", "1 Photon-!1 https://simple.wikipedia.org/wiki/Photon \n", "2 Thomas Dolby-!0 https://simple.wikipedia.org/wiki/Thomas%20Dolby \n", "3 Embroidery-!0 https://simple.wikipedia.org/wiki/Embroidery \n", "4 Consecutive integer-!0 https://simple.wikipedia.org/wiki/Consecutive%... \n", "\n", " title content \\\n", "0 Photon Title: Photon;\\nPhotons (from Greek φως, mean... \n", "1 Photon Title: Photon;\\nElementary particles \n", "2 Thomas Dolby Title: Thomas Dolby;\\nThomas Dolby (born Thoma... \n", "3 Embroidery Title: Embroidery;\\nEmbroidery is the art of d... \n", "4 Consecutive integer Title: Consecutive integer;\\nConsecutive numbe... \n", "\n", " file_chunk_index \n", "0 0 \n", "1 1 \n", "2 0 \n", "3 0 \n", "4 0 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "chunked_data = pd.DataFrame(records)\n", "chunked_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Embedding Creation\n", "\n", "With the text broken up into chunks, we can create embeddings with the [`OpenAITextVectorizer`](https://www.redisvl.com/user_guide/vectorizers_04.html#openai). This provider uses the OpenAI API to create embeddings for the text. The code below shows how to create embeddings for the text chunks." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idurltitlecontentfile_chunk_indexembedding
0Photon-!0https://simple.wikipedia.org/wiki/PhotonPhotonTitle: Photon;\\nPhotons (from Greek φως, mean...0b'\\x9e\\xbf\\xc9;\\xca\\x8e\\xfb;\\x00\\xf8P\\xbc\\xe5\\...
1Photon-!1https://simple.wikipedia.org/wiki/PhotonPhotonTitle: Photon;\\nElementary particles1b'd\\xda#\\xbc\\xb7\\xf1\\x8c<\\xea\\xd0m\\xbc\\x13\\x8b...
2Thomas Dolby-!0https://simple.wikipedia.org/wiki/Thomas%20DolbyThomas DolbyTitle: Thomas Dolby;\\nThomas Dolby (born Thoma...0b'NG\\xce\\xbck\\xf0\\xb2;\\x81\\xed\\xd7\\xbc\\xb6\\x94...
3Embroidery-!0https://simple.wikipedia.org/wiki/EmbroideryEmbroideryTitle: Embroidery;\\nEmbroidery is the art of d...0b'\\xa4\\xba\\xf5\\xbcS\\xf3\\x02\\xbc\\xa1\\x15O\\xbc\\x...
4Consecutive integer-!0https://simple.wikipedia.org/wiki/Consecutive%...Consecutive integerTitle: Consecutive integer;\\nConsecutive numbe...0b'0(\\xfa\\xbb\\x81\\xd2\\xd9;\\xaf\\x92\\x9a;\\xd3FL\\x...
.....................
2688Alanis Morissette-!1https://simple.wikipedia.org/wiki/Alanis%20Mor...Alanis MorissetteTitle: Alanis Morissette;\\nTwin people from Ca...1b'Ii4\\xbc\\x8e>\\xe0\\xbc\\x18]\\x07\\xbb%\\xa0\\x92\\x...
2689Brontosaurus-!0https://simple.wikipedia.org/wiki/BrontosaurusBrontosaurusTitle: Brontosaurus;\\nBrontosaurus is a genus...0b'\\xad\\xa5\\xdb\\xbc\\xa5\\xa5\\xba:\\xb4\"\\x81\\xbc\\x...
2690Work (physics)-!0https://simple.wikipedia.org/wiki/Work%20%28ph...Work (physics)Title: Work (physics);\\nIn physics, a force do...0b'\\x97\\x82\\xb9\\xbbL\\x90d\\xbc\\xb7G\\x9c\\xba\\x94g...
2691Syllable-!0https://simple.wikipedia.org/wiki/SyllableSyllableTitle: Syllable;\\nA syllable is a unit of pron...0b'\\xe4\\xa3\\x1c:\\x83g\\x90<\\x99=s;*[E\\xbb\\x10 \"\\...
2692Syllable-!1https://simple.wikipedia.org/wiki/SyllableSyllableTitle: Syllable;\\nGrammar1b'T,-\\xbbS\\xe5\\x87;\\x1c\\x0f\\x9d:\\xc4\\xd4\\xcd:\\...
\n", "

2693 rows × 6 columns

\n", "
" ], "text/plain": [ " id \\\n", "0 Photon-!0 \n", "1 Photon-!1 \n", "2 Thomas Dolby-!0 \n", "3 Embroidery-!0 \n", "4 Consecutive integer-!0 \n", "... ... \n", "2688 Alanis Morissette-!1 \n", "2689 Brontosaurus-!0 \n", "2690 Work (physics)-!0 \n", "2691 Syllable-!0 \n", "2692 Syllable-!1 \n", "\n", " url title \\\n", "0 https://simple.wikipedia.org/wiki/Photon Photon \n", "1 https://simple.wikipedia.org/wiki/Photon Photon \n", "2 https://simple.wikipedia.org/wiki/Thomas%20Dolby Thomas Dolby \n", "3 https://simple.wikipedia.org/wiki/Embroidery Embroidery \n", "4 https://simple.wikipedia.org/wiki/Consecutive%... Consecutive integer \n", "... ... ... \n", "2688 https://simple.wikipedia.org/wiki/Alanis%20Mor... Alanis Morissette \n", "2689 https://simple.wikipedia.org/wiki/Brontosaurus Brontosaurus \n", "2690 https://simple.wikipedia.org/wiki/Work%20%28ph... Work (physics) \n", "2691 https://simple.wikipedia.org/wiki/Syllable Syllable \n", "2692 https://simple.wikipedia.org/wiki/Syllable Syllable \n", "\n", " content file_chunk_index \\\n", "0 Title: Photon;\\nPhotons (from Greek φως, mean... 0 \n", "1 Title: Photon;\\nElementary particles 1 \n", "2 Title: Thomas Dolby;\\nThomas Dolby (born Thoma... 0 \n", "3 Title: Embroidery;\\nEmbroidery is the art of d... 0 \n", "4 Title: Consecutive integer;\\nConsecutive numbe... 0 \n", "... ... ... \n", "2688 Title: Alanis Morissette;\\nTwin people from Ca... 1 \n", "2689 Title: Brontosaurus;\\nBrontosaurus is a genus... 0 \n", "2690 Title: Work (physics);\\nIn physics, a force do... 0 \n", "2691 Title: Syllable;\\nA syllable is a unit of pron... 0 \n", "2692 Title: Syllable;\\nGrammar 1 \n", "\n", " embedding \n", "0 b'\\x9e\\xbf\\xc9;\\xca\\x8e\\xfb;\\x00\\xf8P\\xbc\\xe5\\... \n", "1 b'd\\xda#\\xbc\\xb7\\xf1\\x8c<\\xea\\xd0m\\xbc\\x13\\x8b... \n", "2 b'NG\\xce\\xbck\\xf0\\xb2;\\x81\\xed\\xd7\\xbc\\xb6\\x94... \n", "3 b'\\xa4\\xba\\xf5\\xbcS\\xf3\\x02\\xbc\\xa1\\x15O\\xbc\\x... \n", "4 b'0(\\xfa\\xbb\\x81\\xd2\\xd9;\\xaf\\x92\\x9a;\\xd3FL\\x... \n", "... ... \n", "2688 b'Ii4\\xbc\\x8e>\\xe0\\xbc\\x18]\\x07\\xbb%\\xa0\\x92\\x... \n", "2689 b'\\xad\\xa5\\xdb\\xbc\\xa5\\xa5\\xba:\\xb4\"\\x81\\xbc\\x... \n", "2690 b'\\x97\\x82\\xb9\\xbbL\\x90d\\xbc\\xb7G\\x9c\\xba\\x94g... \n", "2691 b'\\xe4\\xa3\\x1c:\\x83g\\x90<\\x99=s;*[E\\xbb\\x10 \"\\... \n", "2692 b'T,-\\xbbS\\xe5\\x87;\\x1c\\x0f\\x9d:\\xc4\\xd4\\xcd:\\... \n", "\n", "[2693 rows x 6 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import os\n", "import getpass\n", "\n", "from redisvl.utils.vectorize import OpenAITextVectorizer\n", "\n", "api_key = os.getenv(\"OPENAI_API_KEY\") or getpass.getpass(\"Enter your OpenAI API key: \")\n", "oaip = OpenAITextVectorizer(EMBEDDINGS_MODEL, api_config={\"api_key\": api_key})\n", "\n", "chunked_data[\"embedding\"] = oaip.embed_many(chunked_data[\"content\"].tolist(), as_buffer=True)\n", "chunked_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Construct the ``SearchIndex``\n", "\n", "Now that we have the embeddings, we can create a ``SearchIndex`` to store them in Redis. We will use the ``SearchIndex`` to store the embeddings and metadata for each article." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Define the wikipedia `IndexSchema`" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Overwriting wiki_schema.yaml\n" ] } ], "source": [ "%%writefile wiki_schema.yaml\n", "\n", "version: '0.1.0'\n", "\n", "index:\n", " name: wikipedia\n", " prefix: chunk\n", "\n", "fields:\n", " - name: content\n", " type: text\n", " - name: title\n", " type: text\n", " - name: id\n", " type: tag\n", " - name: embedding\n", " type: vector\n", " attrs:\n", " dims: 1536\n", " distance_metric: cosine\n", " algorithm: flat" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "import redis.asyncio as redis\n", "\n", "from redisvl.index import AsyncSearchIndex\n", "from redisvl.schema import IndexSchema\n", "\n", "\n", "client = redis.Redis.from_url(\"redis://localhost:6379\")\n", "schema = IndexSchema.from_yaml(\"wiki_schema.yaml\")\n", "\n", "index = await AsyncSearchIndex(schema).set_client(client)\n", "\n", "await index.create()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[32m16:00:26\u001b[0m \u001b[34m[RedisVL]\u001b[0m \u001b[1;30mINFO\u001b[0m Indices:\n", "\u001b[32m16:00:26\u001b[0m \u001b[34m[RedisVL]\u001b[0m \u001b[1;30mINFO\u001b[0m 1. wikipedia\n" ] } ], "source": [ "!rvl index listall" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load the wikipedia dataset" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "keys = await index.load(chunked_data.to_dict(orient=\"records\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Build a simple QnA System\n", "\n", "Now that we have the data and the embeddings, we can build the QnA system. The system will perform three actions\n", "\n", "1. Embed the user question and search for the most similar content\n", "2. Make a prompt with the query and retrieved content\n", "3. Send the prompt to the OpenAI API and return the answer\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "import openai\n", "\n", "from redisvl.query import VectorQuery" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "CHAT_MODEL = \"gpt-3.5-turbo\"\n", "\n", "def make_prompt(query, content):\n", " retrieval_prompt = f'''Use the content to answer the search query the customer has sent.\n", " If you can't answer the user's question, do not guess. 
If there is no content, respond with \"I don't know\".\n", "\n", "    Search query:\n", "\n", "    {query}\n", "\n", "    Content:\n", "\n", "    {content}\n", "\n", "    Answer:\n", "    '''\n", "    return retrieval_prompt\n", "\n", "async def retrieve_context(index: AsyncSearchIndex, query: str):\n", "    # Embed the query\n", "    query_embedding = await oaip.aembed(query)\n", "\n", "    # Get the top result from the index\n", "    vector_query = VectorQuery(\n", "        vector=query_embedding,\n", "        vector_field_name=\"embedding\",\n", "        return_fields=[\"content\"],\n", "        num_results=1\n", "    )\n", "\n", "    results = await index.query(vector_query)\n", "    content = \"\"\n", "    if results:\n", "        content = results[0][\"content\"]\n", "    return content\n", "\n", "async def answer_question(index: AsyncSearchIndex, query: str):\n", "    # Retrieve the context\n", "    content = await retrieve_context(index, query)\n", "\n", "    prompt = make_prompt(query, content)\n", "    retrieval = await openai.ChatCompletion.acreate(\n", "        model=CHAT_MODEL,\n", "        messages=[{'role':\"user\", 'content': prompt}],\n", "        max_tokens=50\n", "    )\n", "\n", "    # Response provided by GPT-3.5\n", "    return retrieval['choices'][0]['message']['content']" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['A Brontosaurus, also known as Apatosaurus, is a type of large, long-necked',\n", " 'dinosaur that lived during the Late Jurassic Period, about 150 million years',\n", " 'ago. They were herbivores and belonged to the saurop']" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import textwrap\n", "\n", "question = \"What is a Brontosaurus?\"\n", "textwrap.wrap(await answer_question(index, question), width=80)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"I don't know.\"" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Question that makes no sense\n", "question = \"What is a trackiosamidon?\"\n", "await answer_question(index, question)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Alanis Morissette is a Canadian-American singer-songwriter and',\n", " 'actress. She gained international fame with her third studio album,',\n", " '\"Jagged Little Pill,\" released in 1995. The album went on to become a',\n", " 'massive success, selling over']" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "question = \"Tell me about the life of Alanis Morissette\"\n", "textwrap.wrap(await answer_question(index, question))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Improve the QnA System with LLM caching\n", "\n", "The QnA system we built above works well, but we can use the ``SemanticCache`` to improve throughput and stability. The ``SemanticCache`` stores the results of previous queries and returns them whenever a new query is similar enough to one it has already answered, which reduces the number of round-trip calls we need to send to the OpenAI API.\n", "\n", "> Note: this technique works best when we expect users to ask a similar profile of questions."
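, "\n", "\n", "The cache is keyed on the embedding of the prompt: `check()` returns any previously stored responses whose prompt lies within `distance_threshold` (a vector distance, so lower values are stricter) of the new prompt, and `store()` records a new prompt/response pair. Below is a minimal sketch of that hit/miss pattern using the same calls as the cells that follow; the placeholder answer string stands in for a real LLM call.\n", "\n", "```python\n", "from redisvl.extensions.llmcache import SemanticCache\n", "\n", "cache = SemanticCache(redis_url=\"redis://localhost:6379\", distance_threshold=0.2)\n", "\n", "question = \"Tell me about the life of Alanis Morissette\"\n", "if hits := cache.check(prompt=question):\n", "    answer = hits[0][\"response\"]  # cache hit: reuse the stored answer\n", "else:\n", "    answer = \"...generated by the LLM...\"  # cache miss: call the model, then cache the pair\n", "    cache.store(question, answer)\n", "```"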
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from redisvl.extensions.llmcache import SemanticCache\n", "\n", "cache = SemanticCache(redis_url=\"redis://localhost:6379\", distance_threshold=0.2)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "async def answer_question(index: AsyncSearchIndex, query: str):\n", "\n", " # check the cache\n", " if result := cache.check(prompt=query):\n", " return result[0]['response']\n", "\n", " # Retrieve the context\n", " content = await retrieve_context(index, query)\n", "\n", " prompt = make_prompt(query, content)\n", " retrieval = await openai.ChatCompletion.acreate(\n", " model=CHAT_MODEL,\n", " messages=[{'role':\"user\", 'content': prompt}],\n", " max_tokens=500\n", " )\n", "\n", " # Response provided by GPT-3.5\n", " answer = retrieval['choices'][0]['message']['content']\n", "\n", " # cache the query_embedding and answer\n", " cache.store(query, answer)\n", " return answer" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Time taken: 6.253775119781494\n", "\n" ] }, { "data": { "text/plain": [ "['Alanis Morissette is a Canadian singer, songwriter, and actress. She was born on',\n", " 'June 1, 1974, in Ottawa, Ontario, Canada. Morissette began her career in the',\n", " 'music industry as a child, releasing her first album \"Alanis\" in 1991. However,',\n", " 'it was her third studio album, \"Jagged Little Pill,\" released in 1995, that',\n", " 'brought her international fame and critical acclaim. The album sold over 33',\n", " 'million copies worldwide and produced hit singles such as \"You Oughta Know,\"',\n", " '\"Ironic,\" and \"Hand in My Pocket.\" Throughout her career, Morissette has',\n", " 'continued to release successful albums and has received numerous awards,',\n", " 'including Grammy Awards, Juno Awards, and Billboard Music Awards. Her music',\n", " 'often explores themes of love, relationships, self-discovery, and spirituality.',\n", " 'Some of her other notable albums include \"Supposed Former Infatuation Junkie,\"',\n", " '\"Under Rug Swept,\" and \"Flavors of Entanglement.\" In addition to her music',\n", " 'career, Alanis Morissette has also ventured into acting. She has appeared in',\n", " 'films such as \"Dogma\" and \"Radio Free Albemuth,\" as well as on television shows',\n", " 'like \"Weeds\" and \"Sex and the City.\" Offstage, Morissette has been open about',\n", " 'her struggles with mental health and has become an advocate for mental wellness.',\n", " 'She has also expressed her views on feminism and spirituality in her music and',\n", " 'interviews. 
Overall, Alanis Morissette has had a successful and influential',\n", " 'career in the music industry, with her powerful and emotional songs resonating',\n", " 'with audiences around the world.']" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# ask a question to cache an answer\n", "import time\n", "start = time.time()\n", "question = \"Tell me about the life of Alanis Morissette\"\n", "answer = await answer_question(index, question)\n", "print(f\"Time taken: {time.time() - start}\\n\")\n", "textwrap.wrap(answer, width=80)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Time taken with cache: 0.3175082206726074\n", "\n" ] }, { "data": { "text/plain": [ "['Alanis Morissette is a Canadian-American singer, songwriter, and actress. She',\n", " 'rose to fame in the 1990s with her breakthrough album \"Jagged Little Pill,\"',\n", " 'which became one of the best-selling albums of all time. Born on June 1, 1974,',\n", " 'in Ottawa, Ontario, Morissette began her career as a teen pop star in Canada',\n", " 'before transitioning to alternative rock. Throughout her career, Morissette has',\n", " 'released several successful albums and has won numerous awards, including',\n", " 'multiple Grammy Awards. Her music often explores themes of female empowerment,',\n", " 'personal introspection, and social commentary. Some of her notable songs include',\n", " '\"Ironic,\" \"You Oughta Know,\" and \"Hand in My Pocket.\" In addition to her music',\n", " 'career, Morissette has also acted in various films and television shows. She is',\n", " 'known for her roles in movies such as \"Dogma\" and \"Jay and Silent Bob Strike',\n", " 'Back.\" Morissette has been transparent about her personal struggles, including',\n", " 'her experiences with eating disorders, depression, and postpartum depression.',\n", " 'She has used her platform to advocate for mental health awareness and has been',\n", " 'involved in various charitable causes. Overall, Alanis Morissette has had a',\n", " 'successful and influential career in the music industry while also making an',\n", " 'impact beyond music.']" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Same question, return cached answer, save time, save money :)\n", "start = time.time()\n", "answer = await answer_question(index, question)\n", "print(f\"Time taken with cache: {time.time() - start}\\n\")\n", "textwrap.wrap(answer, width=80)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Time taken with the cache: 0.26262593269348145\n", "\n" ] }, { "data": { "text/plain": [ "['Alanis Morissette is a Canadian-American singer, songwriter, and actress. She',\n", " 'rose to fame in the 1990s with her breakthrough album \"Jagged Little Pill,\"',\n", " 'which became one of the best-selling albums of all time. Born on June 1, 1974,',\n", " 'in Ottawa, Ontario, Morissette began her career as a teen pop star in Canada',\n", " 'before transitioning to alternative rock. Throughout her career, Morissette has',\n", " 'released several successful albums and has won numerous awards, including',\n", " 'multiple Grammy Awards. Her music often explores themes of female empowerment,',\n", " 'personal introspection, and social commentary. 
Some of her notable songs include',\n", " '\"Ironic,\" \"You Oughta Know,\" and \"Hand in My Pocket.\" In addition to her music',\n", " 'career, Morissette has also acted in various films and television shows. She is',\n", " 'known for her roles in movies such as \"Dogma\" and \"Jay and Silent Bob Strike',\n", " 'Back.\" Morissette has been transparent about her personal struggles, including',\n", " 'her experiences with eating disorders, depression, and postpartum depression.',\n", " 'She has used her platform to advocate for mental health awareness and has been',\n", " 'involved in various charitable causes. Overall, Alanis Morissette has had a',\n", " 'successful and influential career in the music industry while also making an',\n", " 'impact beyond music.']" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Asking a semantically similar (but not identical) question returns the same answer\n", "# from the cache. Here the vector distance between the two questions is below the\n", "# distance_threshold of 0.2 that the cache was configured with.\n", "start = time.time()\n", "question = \"Who is Alanis Morissette?\"\n", "answer = await answer_question(index, question)\n", "print(f\"Time taken with the cache: {time.time() - start}\\n\")\n", "textwrap.wrap(answer, width=80)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "# Cleanup\n", "await index.delete()" ] } ], "metadata": { "kernelspec": { "display_name": "rvl", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }