
Tackling Text Embeddings

00:00 It’s time to talk about text embeddings. You can use text embeddings to represent entire documents as vectors. Each embedding can represent a few words, or even thousands of words or more.

00:11 They capture broader, document-level characteristics of texts. You can think of each dimension as encoding a feature of the document as a whole, but just like with word embeddings, dimensions aren’t directly interpretable by humans.

00:25 Naturally, related documents can be found using vector similarity, and thankfully, like word embeddings, pretrained text embeddings are widely available. But remember, and this is something that confuses a lot of students at first, for any one embedding model, all embeddings produced will be the same length, regardless of the size of their inputs.

00:45 This is a requirement, since vectors of different dimensionality can’t be compared with each other. So in a sense, you can also think of these embeddings as numerical summaries of texts.

00:57 You’ll be using the SentenceTransformers library for working with text embeddings. SentenceTransformers is a great source of pretrained models that can generate text embeddings.

01:06 It’s built on top of state-of-the-art transformer-based language models, which is where it gets its name, and it was created to make embeddings of sentences and larger texts accessible to everyone.

01:16 It has a very flexible API, meaning different pretrained models can be swapped in and out easily, depending on needs, such as speed, accuracy, or domain specificity.

01:26 In your virtual environment, install sentence-transformers with the command python -m pip install sentence-transformers, and join me in the REPL.

01:37 Start by importing the SentenceTransformer class from sentence_transformers.

01:44 Now load the model. model equals SentenceTransformer("all-MiniLM-L6-v2"). One of the smaller models, but quite versatile, as it was trained on a wide variety of online texts.

01:58 And as a heads up, it might take a minute the first time you try to use a model with sentence_transformers, since it needs to be downloaded and saved locally.

02:06 Also, you can see I got this unexpected message showing up, but as it says, it can be ignored. Next, create a list of strings, texts to create embeddings for.

02:16 texts equals a list. "The cat stared at me passive-aggressively."

02:25 "The feline looked at me with attitude."

02:31 "Her dinner was late, and she knew it."

02:36 "She was aware that her supper was late."

02:41 Two pairs of similar sentences, well, similar to English-speaking humans, anyway. So let’s see how the embeddings fare. To create embeddings, all you have to do is pass the texts into the model’s .encode() method.

02:55 text_embeddings equals model.encode(texts). You can check the type of the output. type(text_embeddings), and yep, it’s a NumPy array.

03:06 This means you can check the .shape attribute. text_embeddings.shape, and see that you’ve got four embeddings, each with 384 dimensions.

03:16 I’ll remind you one more time that the dimensionality of the embeddings is based on the model, and not the size of the text. To compare your embeddings, import that handy compute_cosine_similarity() function.
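The `compute_cosine_similarity()` helper comes from an earlier lesson, so its body isn’t shown here. A minimal sketch of how such a helper is typically written, assuming it takes two vectors and returns their cosine similarity:

```python
import numpy as np

def compute_cosine_similarity(vector1, vector2):
    """Cosine of the angle between two vectors: dot product over norms."""
    return np.dot(vector1, vector2) / (
        np.linalg.norm(vector1) * np.linalg.norm(vector2)
    )

# Parallel vectors score 1.0; orthogonal vectors score 0.0.
print(compute_cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0
print(compute_cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # 0.0
```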

03:30 from cosine_similarity import compute_cosine_similarity. And to make it a little clearer what you’re working with, create a dictionary where the texts are keys and the embeddings are values, using dict() and zip().

03:45 text_embeddings_dict equals dict(), passing in the result of calling zip() on texts and text_embeddings.

03:53 And look at the keys. text_embeddings_dict.keys().

03:59 These keys are the same four sentences we started with, so let’s assign the first two as variables.

04:06 text1 equals "The cat stared at me passive-aggressively." text2 equals "The feline looked at me with attitude." Now you can access embeddings on text_embeddings_dict by using text1 and text2 as keys.

04:25 So compare the two embeddings.

04:31 compute_cosine_similarity(), text_embeddings_dict at text1, text_embeddings_dict at text2,

04:40 and get a similarity score of about 0.64. Both statements about my cat, I mean a cat in a bad mood, and the similarity is 0.64. If you reflect on the single word scores you saw in the previous lesson, I think that’s pretty good, considering these sentences are using a bunch of different words.

04:59 Now assign the next two sentences to variables. text3 equals "Her dinner was late, and she knew it." text4 equals "She was aware that her supper was late." And compare.

05:18 compute_cosine_similarity(), passing in text_embeddings_dict at text3, and text_embeddings_dict at text4.

05:27 Wow, 0.85 similarity. That’s really high, even though the sentences are structured very differently and a few different words were used as well. Just like before, I encourage you to play around with this.

05:41 The more practice you get creating and comparing embeddings, the better your intuition for these comparisons will get. Let’s look at one more. Compare "The cat stared at me passive-aggressively" with "She was aware that her supper was late." compute_cosine_similarity(), text_embeddings_dict at text1, text_embeddings_dict at text4, and yes, this is more than a little autobiographical on my part, and the result is 0.16.

06:08 Low, which I think you’d expect. The sentences flow together, sure, but individually they’re completely different, as reflected by the similarity score.

06:18 And that pretty much covers the foundations of embeddings. You know where they come from and what you can do with them. So the next step is, where do you put them?

06:28 We’ll answer that question in the next lesson.
