
Preparing and Loading the Data

00:00 Your fictional car dealership needs some reviews. You’ll be sourcing them from the Edmunds Consumer Car Ratings and Reviews dataset on Kaggle.

00:09 Navigate to the page, select Download in the top right, and then Download Dataset as Zip. You’ll get the dataset as a zip file full of CSVs. Unzip the dataset and place the CSVs in your project under data/archive/.

00:24 Then make sure these files from the course resources are in your project: car_data_etl.py and chroma_utils.py. And now your project directory should look like this: the car reviews under data/archive/, car_data_etl.py, which will handle preparing the data for ChromaDB, and chroma_utils.py, which builds the ChromaDB collection.

00:48 Let’s take a closer look at these two files.

00:52 I won’t go through the file line by line, but here are the highlights. The prepare_car_reviews_data() function is the heart of the file. It takes a path and a list of vehicle years, defaulting to a list with only the value 2017. The dataset is loaded into Polars using the scan_csv() function.

01:10 If you’re familiar with pandas, Polars is a little different because it processes datasets lazily, queuing up instructions until you explicitly ask for the processed data.

01:19 This pattern opens up opportunities for increased speed and efficiency. And the real action here is from lines 28 to 53. The vehicle years are extracted into their own column, and then the data is filtered to include only the years in the list of vehicle years passed into the function.

01:36 Then the DataFrame is sorted, first by vehicle model, then rating. And the call to collect() executes all of the previously queued instructions, returning the processed dataset.

01:47 Towards the end of the function, IDs are created based on the index positions of entries, documents are created by converting the reviews into a list, and all of the other columns are turned into metadata. Finally, the function returns a dictionary of IDs, documents, and metadata, all ready for ChromaDB.

02:04 If you’re interested in having a closer look at the Polars library, I recommend this tutorial, Python Polars: A Lightning-Fast DataFrame Library.

02:13 The next file to look at is chroma_utils.py. Again, this file is mainly this one function, build_chroma_collection(). It takes a chroma_path string, a collection_name string, an embedding_func_name string, lists of IDs, documents, and metadata, and an optional distance_func_name that defaults to cosine. This middle part of the function probably looks familiar, since it’s the same pattern you followed a few lessons back.

02:41 First, it creates a Chroma client using the supplied chroma_path argument. Then it instantiates the embedding function, creates the collection, and builds a list of indices.

02:51 Finally, it uses the batched() function from more-itertools to split the documents into chunks for insertion into the ChromaDB collection.

02:58 The reason for this batching is that there’s a limit to the number of documents that can be inserted into a ChromaDB collection at once. It used to be 166.

03:08 Newer versions of ChromaDB have larger limits based on system resources, but for safety and backwards compatibility, 166 is a good choice. And the function returns nothing, since its purpose is just to build the collection.

03:20 Speaking of, you’ve got a dataset to build, so open the REPL and get started. Begin with some familiar imports. import chromadb, import embedding_functions from chromadb.utils,

03:35 import prepare_car_reviews_data from car_data_etl,

03:41 and import build_chroma_collection from chroma_utils. Define some constants. DATA_PATH equals the string "data/archive/*", and that star at the end will select all of the CSVs in the archive.

03:57 CHROMA_PATH as the string "car_review_embeddings",

04:02 EMBEDDING_FUNC_NAME as the string "multi-qa-MiniLM-L6-cos-v1",

04:09 and you might notice this is a different model. This one was trained specifically for question and answer type semantic search tasks. And COLLECTION_NAME as "car_reviews".

04:21 Go ahead and run prepare_car_reviews_data(), passing in DATA_PATH, and store the result in the variable chroma_car_reviews_dict.

04:32 And now you have everything you need to run build_chroma_collection(). build_chroma_collection(), passing in CHROMA_PATH, COLLECTION_NAME, EMBEDDING_FUNC_NAME,

04:48 chroma_car_reviews_dict at the key "ids",

04:52 chroma_car_reviews_dict at the key "documents",

04:58 and chroma_car_reviews_dict at the key "metadatas".

05:04 Ignore those warnings if you get them and instantiate the ChromaDB client. client equals chromadb.PersistentClient(), passing in CHROMA_PATH, and to make a query you’ll need to instantiate the embedding function again.

05:20 embedding_func equals embedding_functions.SentenceTransformerEmbeddingFunction(),

05:27 passing in model_name equals EMBEDDING_FUNC_NAME,

05:32 and load the collection this time. collection equals client.get_collection(), passing in name equals COLLECTION_NAME, embedding_function equals embedding_func.

05:44 This is how you load a pre-existing collection. And to validate that everything’s worked, let’s run a query. great_reviews equals collection.query(), passing in query_texts, a list with the string "Find me some positive reviews that discuss the car's performance." Set n_results to 5, and use the include parameter to tell ChromaDB you only want documents, distances, and metadata.

06:16 And check that first great review. great_reviews at the key "documents" at index 0 at index 0 says "Great all-around car with great balance of performance and comfort. Terrific technology too." Nice. All you asked for were positive reviews discussing performance, and that’s exactly what you got.

06:36 To me, that sure looks like a success for ChromaDB. And it looks like you’re all set up and ready for the final piece of the puzzle, connecting to an LLM.

06:44 See you in the next lesson.
