Working With Word Embeddings

Resource mentioned in this lesson: Natural Language Processing With spaCy in Python

00:00 Now that you’ve got your head around vectors and cosine similarity, it’s time to examine embeddings, starting with word embeddings, which represent individual words as vectors. With these embeddings, each word is mapped to a vector with many dimensions.

00:15 Each of those dimensions represents a feature learned from word usage in the training data, things like frequency and patterns of word co-occurrence. There can be many dimensions, hundreds even, but they’re not directly interpretable by humans.

00:28 They really only have meaning when you look at them all together. Since each word is mapped to a vector, we can consider the semantic similarity between words to be related to the vector similarity between word vectors.

00:41 And embeddings are typically created by algorithms that are trained by consuming large amounts of text. But you don’t need to create your own embeddings.

00:49 Many pretrained embeddings exist; they're widely available, and pretty effective, too.

00:55 Here’s an example of word vectors projected down to two dimensions, for visualization’s sake. Dimension 1 and Dimension 2 aren’t directly interpretable. But you see the word vectors, shown here as points in space, form clusters of similarity.

01:10 A fruit cluster, an animal cluster, a cluster of positive adjectives, and a cluster of vehicles. As a human being who speaks English, I find these groupings pretty reasonable.

01:21 What’s really cool is that these are created by unsupervised algorithms based on ingesting tons of text, no pre-existing understanding of English required.

01:31 And remember, this is just a limited 2D representation. The actual word embeddings would have many more dimensions. The important thing to understand is that by using these vector representations, you can objectively measure similarity between words, which is key for your ultimate goal of retrieving relevant documents for an LLM later in this course.

01:52 To explore word embeddings, you’ll use the spaCy library. spaCy is a natural language processing, or NLP, library in Python. It includes pretrained word embeddings in its built-in models, it provides model vocabularies that map words to vectors, and it has an intuitive API, making it easy to explore and compare embeddings with minimal setup.

02:15 You can install spaCy using pip, with the command python -m pip install spacy. And you’ll also need to download some word embeddings to work with.

02:25 I’ll be using en_core_web_lg, which has over 300,000 unique word embeddings. Download it with python -m spacy download en_core_web_lg.

02:37 spaCy has a ton of useful tools beyond what we’ll explore here. So if you have a more general interest in NLP, here’s another course for your bookmarks, Natural Language Processing With spaCy in Python.

02:48 Now go ahead and install spaCy and open up the REPL.

02:53 First, create a file cosine_similarity.py and define the function you see here. Or just grab it from the downloadable course resources. If it looks familiar, good.

03:03 It’s a function version of the formula for deriving cosine similarity that you used at the end of the previous lesson.
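The function described here might look like the following sketch, built directly from the cosine similarity formula using NumPy. The exact version in the downloadable course resources may differ slightly:

```python
# cosine_similarity.py
import numpy as np

def compute_cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Return the cosine of the angle between vectors u and v."""
    # Dot product of the vectors divided by the product of their magnitudes
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```

As a sanity check, a vector compared with itself scores 1.0, and two orthogonal vectors score 0.0.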

03:10 Okay, REPL time. Start by importing spacy. Import your new compute_cosine_similarity() function from cosine_similarity.

03:25 Load the embedding model into a variable called nlp. nlp equals spacy.load("en_core_web_lg"). And you can access the embedding for a given word like this.

03:37 cat = nlp.vocab["cat"].vector. The nlp.vocab object behaves like a dictionary, so you index it with the string key "cat" and then access the embedding through its .vector attribute.

03:47 And if you check the type() of cat, you’ll see it’s a NumPy array. To confirm the dimensionality of the embedding, you can check its .shape attribute.

03:56 300 dimensions, wow. Take a quick peek at a few of the values with a slice. cat[:10] returns a mix of fairly small negative and positive numbers.

04:07 Remember, these aren’t individually interpretable, but in aggregate, they should represent the abstract concept of the word “cat.” To make some comparisons, get some more embeddings.

04:19 dog, apple, tasty,

04:30 delicious, and spaceship. Use your compute_cosine_similarity() function to compare cat and dog.

04:41 compute_cosine_similarity() passing in cat and dog returns 0.8.

04:46 Remember, the highest score is 1, so that’s pretty high. How about cat and spaceship?

04:57 compute_cosine_similarity() passing in cat and spaceship.

05:01 0.13, pretty low, and reasonable. I can’t think of any significant similarities between cats and spaceships. How about two words that are pretty much synonyms?

05:13 compute_cosine_similarity() tasty and delicious.

05:17 0.83, very high, and a clear indicator that the embedding understands these are very similar words. What if you try comparing tasty and apple?

05:30 0.43, so somewhere in the middle: more related than cat and spaceship, but far from synonymous. Just for fun, a couple more. cat, apple.

05:40 0.28. delicious, spaceship.

05:46 0.04, the lowest score you’ve seen so far. Let me just remind you one more time that the individual 300 dimensions of each embedding can’t be interpreted. But it is fun to try to figure out the contributing factors for why there’s such a spread of cosine similarity scores seen amongst these words.

06:05 I encourage you to play around with even more words and compare the kinds of scores you get. You might be surprised by some words that you think are similar that can have fairly low scores.

06:16 And while this is super interesting, there’s still a major shortcoming to word embeddings. Words are judged on their own, without the context of their surrounding sentence or paragraph.

06:26 If you really want to retrieve related documents to feed the LLMs, you’re going to need to go one better. So next lesson, text embeddings.
