Introducing Vector Databases

00:00 So where can you store your embeddings? In a vector database. Vector databases are designed for efficient vector storage and retrieval. You can use them to store embeddings representing text, images, or any other kind of unstructured data.

00:14 They allow fast similarity search across large datasets. And instead of returning exact matches, these searches find the closest vectors using similarity metrics.

00:24 And that’s how vector databases enable applications like semantic search, recommendation systems, and of course, providing relevant context to LLMs.

00:34 Vector databases generally share some main characteristics. They use embedding functions to convert raw data into embeddings, and they use a similarity metric, such as cosine similarity, to compare embeddings.
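To make the idea of a similarity metric concrete, here's a minimal pure-Python sketch of cosine similarity. The function name and the example vectors are made up for illustration; real vector databases compute this over high-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    # Dot product of the two vectors, divided by the product of their magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing in the same direction score close to 1,
# orthogonal vectors score 0
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # 0.0
```

A similarity search then amounts to computing this score between a query vector and the stored vectors, and returning the closest ones.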

00:46 They’ll use indexing to group together similar embeddings and speed up searches. They also store metadata about embeddings to filter and contextualize search results.

00:57 As databases, they need a storage location to keep embeddings and metadata. That can be in memory for speed, on disk for persistence, or a combination of both.

01:07 And they use CRUD operations to create, read, update, and delete embeddings, just like a traditional database. It’s finally time to use ChromaDB. ChromaDB is an open-source vector database made to work with LLMs.

01:24 It provides a developer-friendly API for working with embeddings and is optimized for fast similarity search and semantic retrieval. Install it in your virtual environment by running the command python -m pip install chromadb.

01:38 And if you encounter any issues with the installation, review the link below.

01:44 And here are a couple more useful links: the getting started guide and the general API reference. Make sure you’re installing ChromaDB in the same virtual environment where you installed the packages from the previous modules, then open up the REPL.

01:59 First, import chromadb. And from chromadb.utils, import embedding_functions. And assign a few constants. CHROMA_DATA_PATH equals the string "chroma_data/".

02:14 This is where Chroma will create a SQLite database for persistent storage. EMBED_MODEL equals "all-MiniLM-L6-v2". And this is the embedding model you’ll be using, which will be pulled from sentence-transformers under the hood. COLLECTION_NAME equals the string "demo_docs".

02:35 And this will be the name of the collection you’ll create in Chroma. And you’re ready to instantiate a client by running chromadb.PersistentClient(), passing in CHROMA_DATA_PATH.

02:52 Instantiate the embedding function by calling embedding_functions.SentenceTransformerEmbeddingFunction()

03:01 and setting model_name to EMBED_MODEL.

03:05 Again, there are a couple of warnings here, but you can ignore them. And create your collection. collection equals client.create_collection(),

03:16 passing in name equals COLLECTION_NAME, embedding_function equals embedding_func,

03:23 metadata equals a dictionary with the key "hnsw:space" and the value "cosine". You’re using the client that you created a few lines above and the collection name, "demo_docs", which is like a table name in a traditional database. The embedding function came from sentence-transformers, and the metadata specifies that you want Chroma to use the Hierarchical Navigable Small World (HNSW) algorithm for building an index, with cosine distance as the similarity metric for comparing vectors. Now you need some documents to populate your database.

04:00 And you might want to just copy and paste these from the course resources.

04:17 As I mentioned, vector databases can also store metadata. So create a list of genres that apply to each of these documents. In a real production system, you’d probably use a different model to classify your texts into genres, but for our purposes, you can label them by hand.

04:33 genres equals a list of strings, "technology", "travel",

04:39 "science", "food", "history", "fitness", "art", "climate change", "business", and "music".

04:51 And add everything to the collection. Call collection.add(), passing in documents equals documents,

05:01 ids equals a list comprehension using f-strings to create an ID for each document based on its index. And metadatas is another list comprehension, creating a dictionary for each entry where "genre" is the key and the genre string is the value.
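The ID and metadata comprehensions can be sketched in plain Python like this. The documents below are hypothetical placeholders standing in for the ten texts from the course resources, and the "id{i}" format is one possible naming scheme for the IDs.

```python
genres = [
    "technology", "travel", "science", "food", "history",
    "fitness", "art", "climate change", "business", "music",
]

# Hypothetical placeholders; in the lesson the actual texts
# come from the course resources
documents = [f"A short text about {genre}." for genre in genres]

# One string ID per document, based on its index
ids = [f"id{i}" for i in range(len(documents))]

# One metadata dictionary per document, keyed by "genre"
metadatas = [{"genre": genre} for genre in genres]

print(ids[:3])       # ['id0', 'id1', 'id2']
print(metadatas[0])  # {'genre': 'technology'}

# With the collection created earlier in the lesson,
# everything is added in one call:
# collection.add(documents=documents, ids=ids, metadatas=metadatas)
```

Chroma then runs each document through the collection's embedding function and stores the resulting vectors alongside the IDs and metadata.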
