{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Hash vs JSON Storage\n", "\n", "\n", "Out of the box, Redis provides a [variety of data structures](https://redis.com/redis-enterprise/data-structures/) that can adapt to your domain-specific applications and use cases.\n", "In this notebook, we will demonstrate how to use RedisVL with both [Hash](https://redis.io/docs/data-types/hashes/) and [JSON](https://redis.io/docs/data-types/json/) data.\n", "\n", "\n", "Before running this notebook, be sure to:\n", "1. Install ``redisvl`` and have that environment active for this notebook.\n", "2. Have a running Redis Stack or Redis Enterprise instance with RediSearch > 2.4 activated.\n", "\n", "For example, you can run [Redis Stack](https://redis.io/docs/install/install-stack/) locally with Docker:\n", "\n", "```bash\n", "docker run -d -p 6379:6379 -p 8001:8001 redis/redis-stack:latest\n", "```\n", "\n", "Or create a [FREE Redis Cloud](https://redis.com/try-free) database." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# import necessary modules\n", "import pickle\n", "\n", "from redisvl.redis.utils import buffer_to_array\n", "from redisvl.index import SearchIndex\n", "\n", "\n", "# load in the example data (the context manager closes the file afterwards)\n", "with open(\"hybrid_example_data.pkl\", \"rb\") as f:\n", "    data = pickle.load(f)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
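<table><tr><th>user</th><th>age</th><th>job</th><th>credit_score</th><th>office_location</th><th>user_embedding</th></tr>
<tr><td>john</td><td>18</td><td>engineer</td><td>high</td><td>-122.4194,37.7749</td><td>b'\\xcd\\xcc\\xcc=\\xcd\\xcc\\xcc=\\x00\\x00\\x00?'</td></tr>
<tr><td>derrick</td><td>14</td><td>doctor</td><td>low</td><td>-122.4194,37.7749</td><td>b'\\xcd\\xcc\\xcc=\\xcd\\xcc\\xcc=\\x00\\x00\\x00?'</td></tr>
<tr><td>nancy</td><td>94</td><td>doctor</td><td>high</td><td>-122.4194,37.7749</td><td>b'333?\\xcd\\xcc\\xcc=\\x00\\x00\\x00?'</td></tr>
<tr><td>tyler</td><td>100</td><td>engineer</td><td>high</td><td>-122.0839,37.3861</td><td>b'\\xcd\\xcc\\xcc=\\xcd\\xcc\\xcc>\\x00\\x00\\x00?'</td></tr>
<tr><td>tim</td><td>12</td><td>dermatologist</td><td>high</td><td>-122.0839,37.3861</td><td>b'\\xcd\\xcc\\xcc>\\xcd\\xcc\\xcc>\\x00\\x00\\x00?'</td></tr>
<tr><td>taimur</td><td>15</td><td>CEO</td><td>low</td><td>-122.0839,37.3861</td><td>b'\\x9a\\x99\\x19?\\xcd\\xcc\\xcc=\\x00\\x00\\x00?'</td></tr>
<tr><td>joe</td><td>35</td><td>dentist</td><td>medium</td><td>-122.0839,37.3861</td><td>b'fff?fff?\\xcd\\xcc\\xcc='</td></tr></table>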
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from jupyterutils import result_print, table_print\n", "\n", "table_print(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hash or JSON -- how to choose?\n", "Both storage options offer a variety of features and tradeoffs. Below we will work through a dummy dataset to learn when and how to use both." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Working with Hashes\n", "Hashes in Redis are simple collections of field-value pairs. Think of it like a mutable single-level dictionary contains multiple \"rows\":\n", "\n", "\n", "```python\n", "{\n", " \"model\": \"Deimos\",\n", " \"brand\": \"Ergonom\",\n", " \"type\": \"Enduro bikes\",\n", " \"price\": 4972,\n", "}\n", "```\n", "\n", "Hashes are best suited for use cases with the following characteristics:\n", "- Performance (speed) and storage space (memory consumption) are top concerns\n", "- Data can be easily normalized and modeled as a single-level dict\n", "\n", "> Hashes are typically the default recommendation." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# define the hash index schema\n", "hash_schema = {\n", " \"index\": {\n", " \"name\": \"user-hash\",\n", " \"prefix\": \"user-hash-docs\",\n", " \"storage_type\": \"hash\", # default setting -- HASH\n", " },\n", " \"fields\": [\n", " {\"name\": \"user\", \"type\": \"tag\"},\n", " {\"name\": \"credit_score\", \"type\": \"tag\"},\n", " {\"name\": \"job\", \"type\": \"text\"},\n", " {\"name\": \"age\", \"type\": \"numeric\"},\n", " {\"name\": \"office_location\", \"type\": \"geo\"},\n", " {\n", " \"name\": \"user_embedding\",\n", " \"type\": \"vector\",\n", " \"attrs\": {\n", " \"dims\": 3,\n", " \"distance_metric\": \"cosine\",\n", " \"algorithm\": \"flat\",\n", " \"datatype\": \"float32\"\n", " }\n", "\n", " }\n", " ],\n", "}" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# construct a search index from the hash schema\n", "hindex = SearchIndex.from_dict(hash_schema)\n", "\n", "# connect to local redis instance\n", "hindex.connect(\"redis://localhost:6379\")\n", "\n", "# create the index (no data yet)\n", "hindex.create(overwrite=True)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# show the underlying storage type\n", "hindex.storage_type" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Vectors as byte strings\n", "One nuance when working with Hashes in Redis is that all vectorized data must be passed as a byte string (for efficient storage, indexing, and processing). 
An example of that can be seen below:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'user': 'john',\n", " 'age': 18,\n", " 'job': 'engineer',\n", " 'credit_score': 'high',\n", " 'office_location': '-122.4194,37.7749',\n", " 'user_embedding': b'\\xcd\\xcc\\xcc=\\xcd\\xcc\\xcc=\\x00\\x00\\x00?'}" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# show a single entry from the data that will be loaded\n", "data[0]" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# load hash data\n", "keys = hindex.load(data)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Statistics:\n", "╭─────────────────────────────┬─────────────╮\n", "│ Stat Key │ Value │\n", "├─────────────────────────────┼─────────────┤\n", "│ num_docs │ 7 │\n", "│ num_terms │ 6 │\n", "│ max_doc_id │ 7 │\n", "│ num_records │ 44 │\n", "│ percent_indexed │ 1 │\n", "│ hash_indexing_failures │ 0 │\n", "│ number_of_uses │ 1 │\n", "│ bytes_per_record_avg │ 3.40909 │\n", "│ doc_table_size_mb │ 0.000767708 │\n", "│ inverted_sz_mb │ 0.000143051 │\n", "│ key_table_size_mb │ 0.000248909 │\n", "│ offset_bits_per_record_avg │ 8 │\n", "│ offset_vectors_sz_mb │ 8.58307e-06 │\n", "│ offsets_per_term_avg │ 0.204545 │\n", "│ records_per_doc_avg │ 6.28571 │\n", "│ sortable_values_size_mb │ 0 │\n", "│ total_indexing_time │ 1.053 │\n", "│ total_inverted_index_blocks │ 18 │\n", "│ vector_index_sz_mb │ 0.0202332 │\n", "╰─────────────────────────────┴─────────────╯\n" ] } ], "source": [ "!rvl stats -i user-hash" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Performing Queries\n", "Once our index is created and data is loaded into the right format, we can run queries against the index with RedisVL:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": 
[ { "data": { "text/html": [ "
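<table><tr><th>vector_distance</th><th>user</th><th>credit_score</th><th>age</th><th>job</th><th>office_location</th></tr>
<tr><td>0</td><td>john</td><td>high</td><td>18</td><td>engineer</td><td>-122.4194,37.7749</td></tr>
<tr><td>0.109129190445</td><td>tyler</td><td>high</td><td>100</td><td>engineer</td><td>-122.0839,37.3861</td></tr></table>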
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from redisvl.query import VectorQuery\n", "from redisvl.query.filter import Tag, Text, Num\n", "\n", "t = (Tag(\"credit_score\") == \"high\") & (Text(\"job\") % \"enginee*\") & (Num(\"age\") > 17)\n", "\n", "v = VectorQuery([0.1, 0.1, 0.5],\n", " \"user_embedding\",\n", " return_fields=[\"user\", \"credit_score\", \"age\", \"job\", \"office_location\"],\n", " filter_expression=t)\n", "\n", "\n", "results = hindex.query(v)\n", "result_print(results)\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# clean up\n", "hindex.delete()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Working with JSON\n", "Redis also supports native **JSON** objects. These can be multi-level (nested) objects, with full JSONPath support for updating/retrieving sub elements:\n", "\n", "```python\n", "{\n", " \"name\": \"bike\",\n", " \"metadata\": {\n", " \"model\": \"Deimos\",\n", " \"brand\": \"Ergonom\",\n", " \"type\": \"Enduro bikes\",\n", " \"price\": 4972,\n", " }\n", "}\n", "```\n", "\n", "JSON is best suited for use cases with the following characteristics:\n", "- Ease of use and data model flexibility are top concerns\n", "- Application data is already native JSON\n", "- Replacing another document storage/db solution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Full JSON Path support\n", "Because Redis enables full JSON path support, when creating an index schema, elements need to be indexed and selected by their path with the desired `name` AND `path` that points to where the data is located within the objects.\n", "\n", "> By default, RedisVL will assume the path as `$.{name}` if not provided in JSON fields schema." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# define the json index schema\n", "json_schema = {\n", " \"index\": {\n", " \"name\": \"user-json\",\n", " \"prefix\": \"user-json-docs\",\n", " \"storage_type\": \"json\", # JSON storage type\n", " },\n", " \"fields\": [\n", " {\"name\": \"user\", \"type\": \"tag\"},\n", " {\"name\": \"credit_score\", \"type\": \"tag\"},\n", " {\"name\": \"job\", \"type\": \"text\"},\n", " {\"name\": \"age\", \"type\": \"numeric\"},\n", " {\"name\": \"office_location\", \"type\": \"geo\"},\n", " {\n", " \"name\": \"user_embedding\",\n", " \"type\": \"vector\",\n", " \"attrs\": {\n", " \"dims\": 3,\n", " \"distance_metric\": \"cosine\",\n", " \"algorithm\": \"flat\",\n", " \"datatype\": \"float32\"\n", " }\n", "\n", " }\n", " ],\n", "}" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# construct a search index from the json schema\n", "jindex = SearchIndex.from_dict(json_schema)\n", "\n", "# connect to local redis instance\n", "jindex.connect(\"redis://localhost:6379\")\n", "\n", "# create the index (no data yet)\n", "jindex.create(overwrite=True)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[32m11:54:18\u001b[0m \u001b[34m[RedisVL]\u001b[0m \u001b[1;30mINFO\u001b[0m Indices:\n", "\u001b[32m11:54:18\u001b[0m \u001b[34m[RedisVL]\u001b[0m \u001b[1;30mINFO\u001b[0m 1. user-json\n" ] } ], "source": [ "# list the indices in the database (the hash index was deleted above)\n", "!rvl index listall" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Vectors as float arrays\n", "Vectorized data in JSON must be stored as a pure array (Python list) of floats. 
We will modify our sample data to account for this below:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# copy each record so the original hash data (byte-string vectors) is left unchanged\n", "json_data = [dict(d) for d in data]\n", "\n", "for d in json_data:\n", "    d['user_embedding'] = buffer_to_array(d['user_embedding'], dtype=np.float32)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'user': 'john',\n", " 'age': 18,\n", " 'job': 'engineer',\n", " 'credit_score': 'high',\n", " 'office_location': '-122.4194,37.7749',\n", " 'user_embedding': [0.10000000149011612, 0.10000000149011612, 0.5]}" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# inspect a single JSON record\n", "json_data[0]" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "keys = jindex.load(json_data)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
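<table><tr><th>vector_distance</th><th>user</th><th>credit_score</th><th>age</th><th>job</th><th>office_location</th></tr>
<tr><td>0</td><td>john</td><td>high</td><td>18</td><td>engineer</td><td>-122.4194,37.7749</td></tr>
<tr><td>0.109129190445</td><td>tyler</td><td>high</td><td>100</td><td>engineer</td><td>-122.0839,37.3861</td></tr></table>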
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# we can now run the exact same query as above\n", "result_print(jindex.query(v))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cleanup" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "jindex.delete()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.13 ('redisvl2')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "9b1e6e9c2967143209c2f955cb869d1d3234f92dc4787f49f155f3abbdfb1316" } } }, "nbformat": 4, "nbformat_minor": 2 }