Hash vs JSON Storage#
Out of the box, Redis provides a variety of data structures that can adapt to your domain-specific applications and use cases. In this notebook, we will demonstrate how to use RedisVL with both Hash and JSON data.
Before running this notebook, be sure to:
- Have installed redisvl and have that environment active for this notebook.
- Have a running Redis Stack or Redis Enterprise instance with RediSearch > 2.4 activated.
For example, you can run Redis Stack locally with Docker:
docker run -d -p 6379:6379 -p 8001:8001 redis/redis-stack:latest
Or create a free Redis Cloud database.
# import necessary modules
import pickle
from redisvl.redis.utils import buffer_to_array
from redisvl.index import SearchIndex
# load in the example data and printing utils
from jupyterutils import result_print, table_print
# use a context manager so the file handle is closed after loading
with open("hybrid_example_data.pkl", "rb") as f:
    data = pickle.load(f)
table_print(data)
user | age | job | credit_score | office_location | user_embedding |
---|---|---|---|---|---|
john | 18 | engineer | high | -122.4194,37.7749 | b'\xcd\xcc\xcc=\xcd\xcc\xcc=\x00\x00\x00?' |
derrick | 14 | doctor | low | -122.4194,37.7749 | b'\xcd\xcc\xcc=\xcd\xcc\xcc=\x00\x00\x00?' |
nancy | 94 | doctor | high | -122.4194,37.7749 | b'333?\xcd\xcc\xcc=\x00\x00\x00?' |
tyler | 100 | engineer | high | -122.0839,37.3861 | b'\xcd\xcc\xcc=\xcd\xcc\xcc>\x00\x00\x00?' |
tim | 12 | dermatologist | high | -122.0839,37.3861 | b'\xcd\xcc\xcc>\xcd\xcc\xcc>\x00\x00\x00?' |
taimur | 15 | CEO | low | -122.0839,37.3861 | b'\x9a\x99\x19?\xcd\xcc\xcc=\x00\x00\x00?' |
joe | 35 | dentist | medium | -122.0839,37.3861 | b'fff?fff?\xcd\xcc\xcc=' |
Hash or JSON – how to choose?#
Both storage options offer a variety of features and tradeoffs. Below we will work through a dummy dataset to learn when and how to use both.
Working with Hashes#
Hashes in Redis are simple collections of field-value pairs. Think of a hash as a mutable, single-level dictionary that contains multiple “rows”:
{
"model": "Deimos",
"brand": "Ergonom",
"type": "Enduro bikes",
"price": 4972,
}
Hashes are best suited for use cases with the following characteristics:
- Performance (speed) and storage space (memory consumption) are top concerns
- Data can be easily normalized and modeled as a single-level dict
Hashes are typically the default recommendation.
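When source records are nested, they need to be flattened before they fit the single-level shape a hash expects. The sketch below shows one possible approach. The `flatten` helper and the underscore separator are illustrative assumptions, not part of RedisVL.

```python
def flatten(record, parent_key="", sep="_"):
    """Collapse a nested dict into a single-level dict (hypothetical helper)."""
    flat = {}
    for key, value in record.items():
        # build a compound key such as "metadata_model" for nested values
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


nested = {
    "name": "bike",
    "metadata": {"model": "Deimos", "brand": "Ergonom", "price": 4972},
}
flat = flatten(nested)
print(flat)
```

Every value in the result sits at the top level, so each key can map directly to one hash field.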
# define the hash index schema
hash_schema = {
"index": {
"name": "user-hash",
"prefix": "user-hash-docs",
"storage_type": "hash", # default setting -- HASH
},
"fields": [
{"name": "user", "type": "tag"},
{"name": "credit_score", "type": "tag"},
{"name": "job", "type": "text"},
{"name": "age", "type": "numeric"},
{"name": "office_location", "type": "geo"},
{
"name": "user_embedding",
"type": "vector",
"attrs": {
"dims": 3,
"distance_metric": "cosine",
"algorithm": "flat",
"datatype": "float32"
}
}
],
}
# construct a search index from the hash schema
hindex = SearchIndex.from_dict(hash_schema)
# connect to local redis instance
hindex.connect("redis://localhost:6379")
# create the index (no data yet)
hindex.create(overwrite=True)
# show the underlying storage type
hindex.storage_type
<StorageType.HASH: 'hash'>
Vectors as byte strings#
One nuance when working with Hashes in Redis is that all vector data must be passed as a byte string (for efficient storage, indexing, and processing). An example can be seen below:
# show a single entry from the data that will be loaded
data[0]
{'user': 'john',
'age': 18,
'job': 'engineer',
'credit_score': 'high',
'office_location': '-122.4194,37.7749',
'user_embedding': b'\xcd\xcc\xcc=\xcd\xcc\xcc=\x00\x00\x00?'}
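To make the byte-string format concrete, here is a minimal sketch of how a list of floats could be packed into the same 12 bytes shown in the record above. It uses only the standard library `struct` module and assumes little-endian float32 values, which matches the bytes in the sample data. With NumPy, `np.array(vector, dtype=np.float32).tobytes()` gives the same bytes on a little-endian machine.

```python
import struct

vector = [0.1, 0.1, 0.5]

# "<3f" means: little-endian, three 32-bit floats (4 bytes each)
as_bytes = struct.pack("<3f", *vector)

print(len(as_bytes))  # 12 bytes for a 3-dimensional float32 vector
print(as_bytes)
```

The byte length must equal dims × 4 for a float32 field, which is a quick sanity check before loading data.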
# load hash data
keys = hindex.load(data)
!rvl stats -i user-hash
Statistics:
╭─────────────────────────────┬─────────────╮
│ Stat Key │ Value │
├─────────────────────────────┼─────────────┤
│ num_docs │ 7 │
│ num_terms │ 6 │
│ max_doc_id │ 7 │
│ num_records │ 44 │
│ percent_indexed │ 1 │
│ hash_indexing_failures │ 0 │
│ number_of_uses │ 1 │
│ bytes_per_record_avg │ 3.40909 │
│ doc_table_size_mb │ 0.000767708 │
│ inverted_sz_mb │ 0.000143051 │
│ key_table_size_mb │ 0.000248909 │
│ offset_bits_per_record_avg │ 8 │
│ offset_vectors_sz_mb │ 8.58307e-06 │
│ offsets_per_term_avg │ 0.204545 │
│ records_per_doc_avg │ 6.28571 │
│ sortable_values_size_mb │ 0 │
│ total_indexing_time │ 1.053 │
│ total_inverted_index_blocks │ 18 │
│ vector_index_sz_mb │ 0.0202332 │
╰─────────────────────────────┴─────────────╯
Performing Queries#
Once our index is created and data is loaded into the right format, we can run queries against the index with RedisVL:
from redisvl.query import VectorQuery
from redisvl.query.filter import Tag, Text, Num
t = (Tag("credit_score") == "high") & (Text("job") % "enginee*") & (Num("age") > 17)
v = VectorQuery([0.1, 0.1, 0.5],
"user_embedding",
return_fields=["user", "credit_score", "age", "job", "office_location"],
filter_expression=t)
results = hindex.query(v)
result_print(results)
vector_distance | user | credit_score | age | job | office_location |
---|---|---|---|---|---|
0 | john | high | 18 | engineer | -122.4194,37.7749 |
0.109129190445 | tyler | high | 100 | engineer | -122.0839,37.3861 |
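The vector_distance column can be checked by hand. With the cosine metric, the reported distance is 1 minus the cosine similarity between the query vector and the stored vector. The sketch below recomputes it in plain Python for the two returned users. The `cosine_distance` helper is written for illustration only, and the stored vectors are decoded from the byte strings in the sample table, assuming little-endian float32.

```python
import math
import struct


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [0.1, 0.1, 0.5]

# stored embeddings for john and tyler, taken from the sample data
john = list(struct.unpack("<3f", b'\xcd\xcc\xcc=\xcd\xcc\xcc=\x00\x00\x00?'))
tyler = list(struct.unpack("<3f", b'\xcd\xcc\xcc=\xcd\xcc\xcc>\x00\x00\x00?'))

d_john = cosine_distance(query, john)
d_tyler = cosine_distance(query, tyler)
print(d_john, d_tyler)
```

john's embedding is effectively the query vector itself, so his distance is approximately 0, while tyler's comes out at roughly 0.1091, in line with the table above.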
# clean up
hindex.delete()
Working with JSON#
Redis also supports native JSON objects. These can be multi-level (nested) objects, with full JSONPath support for updating/retrieving sub-elements:
{
"name": "bike",
"metadata": {
"model": "Deimos",
"brand": "Ergonom",
"type": "Enduro bikes",
"price": 4972
}
}
JSON is best suited for use cases with the following characteristics:
- Ease of use and data model flexibility are top concerns
- Application data is already native JSON
- Replacing another document storage/db solution
Full JSON Path support#
Because Redis enables full JSON path support, when creating an index schema, elements need to be indexed and selected by their path, with the desired name AND path that points to where the data is located within the objects.
By default, RedisVL will assume the path as $.{name} if not provided in the JSON fields schema.
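The default-path rule can be sketched in a few lines. The `resolve_path` helper below is hypothetical and only mirrors the behavior described above. Placing a "path" key directly on the field definition is also an assumption here, so check it against the RedisVL schema documentation for your version.

```python
def resolve_path(field):
    """Return the explicit path if given, otherwise fall back to $.{name}."""
    return field.get("path", f"$.{field['name']}")


fields = [
    # no path given: the default $.user is assumed
    {"name": "user", "type": "tag"},
    # explicit path pointing at a nested element (assumed key placement)
    {"name": "model", "type": "tag", "path": "$.metadata.model"},
]

paths = {f["name"]: resolve_path(f) for f in fields}
print(paths)
```

A top-level field such as user needs no path, while a nested value has to name the full JSONPath to where it lives in the object.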
# define the json index schema
json_schema = {
"index": {
"name": "user-json",
"prefix": "user-json-docs",
"storage_type": "json", # JSON storage type
},
"fields": [
{"name": "user", "type": "tag"},
{"name": "credit_score", "type": "tag"},
{"name": "job", "type": "text"},
{"name": "age", "type": "numeric"},
{"name": "office_location", "type": "geo"},
{
"name": "user_embedding",
"type": "vector",
"attrs": {
"dims": 3,
"distance_metric": "cosine",
"algorithm": "flat",
"datatype": "float32"
}
}
],
}
# construct a search index from the json schema
jindex = SearchIndex.from_dict(json_schema)
# connect to local redis instance
jindex.connect("redis://localhost:6379")
# create the index (no data yet)
jindex.create(overwrite=True)
# list the indices in the database (the hash index was deleted above, so only user-json remains)
!rvl index listall
11:54:18 [RedisVL] INFO Indices:
11:54:18 [RedisVL] INFO 1. user-json
Vectors as float arrays#
Vector data in JSON must be stored as a plain array (Python list) of floats. We will modify our sample data to account for this below:
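# copy each record so the original hash-formatted data is left untouched
# (a plain data.copy() is shallow and would overwrite the byte strings in data)
json_data = [dict(d) for d in data]
for d in json_data:
    d['user_embedding'] = buffer_to_array(d['user_embedding'], dtype='float32')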
# inspect a single JSON record
json_data[0]
{'user': 'john',
'age': 18,
'job': 'engineer',
'credit_score': 'high',
'office_location': '-122.4194,37.7749',
'user_embedding': [0.10000000149011612, 0.10000000149011612, 0.5]}
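The conversion from byte string to float list can be illustrated with the standard library alone. This is a stand-in sketch, not RedisVL's implementation of buffer_to_array, and it assumes the buffer holds little-endian float32 values.

```python
import struct

raw = b'\xcd\xcc\xcc=\xcd\xcc\xcc=\x00\x00\x00?'

# each float32 takes 4 bytes, so the vector length follows from the buffer size
count = len(raw) // 4
floats = list(struct.unpack(f"<{count}f", raw))
print(floats)
```

The values print as 0.10000000149011612 rather than 0.1 because float32 cannot represent 0.1 exactly, and Python shows the widened 64-bit value.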
keys = jindex.load(json_data)
# we can now run the exact same query as above
result_print(jindex.query(v))
vector_distance | user | credit_score | age | job | office_location |
---|---|---|---|---|---|
0 | john | high | 18 | engineer | -122.4194,37.7749 |
0.109129190445 | tyler | high | 100 | engineer | -122.0839,37.3861 |
Cleanup#
jindex.delete()