Understanding Vectors
00:00 As you might have expected, a grasp of vectors is really helpful for working with vector databases. In the broader fields of math and physics, vectors can be defined in many different ways.
00:10 But for our purposes, we can consider a vector to be a finite sequence of numbers.
00:15 Elements in vectors are ordered and represent what we would call features or dimensions, meaning vectors can be used to represent multidimensional data, like geographic coordinates (think latitude and longitude) stored in a two-element vector.
00:29 Or you could represent the pixels in an image, all in a single vector. And when working with spreadsheets or tabular data, a vector can represent records in a dataset, with each value representing a feature of the data.
00:42 And one of the main reasons vectors are used so widely is because, with simple math, vectors can turn fuzzy concepts like similarity and relevance into a precise, scalable computation.
00:54 There are some important properties of vectors that you should know. First, on the right side of this slide, you have three vectors in a two-dimensional coordinate system using the common representation of vectors as arrows extending from the point 0, 0, also called the origin.
01:09 Each vector has two values. The first value, which defines its position on the horizontal axis, and the second value, defining its position on the vertical.
01:18 These are the dimensions of the vector, meaning that these vectors have two dimensions. Next, we have magnitude. That’s the size or length of the vector, represented as a non-negative number.
01:31 Another characteristic of vectors is that they have a direction. You can think of that as an angle or where the vector points. And crucially, you can measure the similarity between vectors.
01:43 Vectors can be compared quantitatively. And it’s this key fact that enables embeddings, vector databases, and everything you’ll be doing in the rest of the course. So let’s drill down a little bit closer on this.
01:55 There are actually a bunch of ways to compare vectors, but one of the most common and straightforward to compute is cosine similarity. To understand cosine similarity, you’ll need to see a few formulas. They’re small, and they’re all going to fit on this slide. And that’s it for the math. So the first is the vector norm, also known as its magnitude.
02:15 This formula is normally written as double bars around the vector. So the left-hand side is the norm of the vector a, and on the right is the formula, specifically for the Euclidean norm, also called the 2-norm, which is the square root of the sum of the squares of the elements in the vector.
02:33 So consider the vector 1, 3. In Python, to calculate its magnitude, you would first square each element in the vector, then pass their sum to the sqrt() function of the math module.
02:43 The result is that the magnitude of this vector is roughly 3.16.
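That calculation might look like the following sketch in plain Python, using the vector 1, 3 from the slide:

```python
import math

a = [1, 3]

# Square each element, sum the squares, then take the square root.
a_norm = math.sqrt(sum(x**2 for x in a))
print(a_norm)  # roughly 3.16
```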
02:49 Next, we have something called the dot product, or scalar product, of two vectors. This is a single number that is proportional to the similarity between two vectors.
02:59 It can be positive, negative, or zero, based on the interior angle of the two vectors. That’s what that symbol on the graphic on the right represents. It looks a lot like a zero, but it’s actually the Greek letter theta.
03:11 The dot product of vectors a and b is written a · b. And the formula is the magnitude of a multiplied by the magnitude of b, multiplied by the cosine of the angle theta.
03:23 But conveniently, it’s also equal to the sum of the element-wise products of a and b. So for vectors a and b, 1, 3 and 2, 2, respectively, you can calculate the dot product by multiplying the first element of a with the first element of b, multiplying the second element of a with the second element of b, and then summing those two products. The result is 8, a positive number because the vectors point in roughly the same direction.
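A quick sketch of that element-wise calculation in Python, with the vectors from the slide:

```python
a = [1, 3]
b = [2, 2]

# Dot product as the sum of the element-wise products: 1*2 + 3*2.
a_dot_b = sum(x * y for x, y in zip(a, b))
print(a_dot_b)  # 8
```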
03:49 And while you might think you can stop here, since the dot product does reflect similarity, it’s not the best measurement because it’s also based on the size of the vectors. Longer vectors have larger values, and even if they’re only barely pointing in the same direction, they’ll have a larger dot product.
04:06 This is where cosine similarity comes in. Cosine similarity is the normalized dot product of two vectors. It isn’t influenced by their scale, only their direction.
04:16 It’s often written as capital S for similarity, sub C for cosine, and it’s simply the cosine of the interior angle of the two vectors. Something cool about the cosine function is that its bounds are minus 1 and 1.
04:31 If the angle between the vectors is 0 degrees, cosine 0 is 1. If the angle is 180, meaning they’re pointing in exactly opposite directions, the cosine is negative 1.
04:43 And if they’re at a right angle, also called orthogonal to each other, then the cosine is 0. That makes it a really easy number to work with. And it can be calculated in Python pretty quickly using the rightmost side of this expression: the dot product of a and b divided by the product of the magnitudes of a and b. Since you know how to calculate dot products and how to calculate magnitudes, you have everything you need to calculate cosine similarity.
05:10 In pure Python, for the same vectors a and b from before, calculate a_dot_b by summing the element-wise products of a and b, get a_norm as the square root of the sum of squares of the elements in a, derive b_norm in the same fashion, and divide a_dot_b by a_norm times b_norm.
05:30 With these two vectors, it yields a cosine similarity of 0.89, indicating that they are very similar.
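Putting the steps just described together, the pure-Python version might look like this:

```python
import math

a = [1, 3]
b = [2, 2]

# Dot product: sum of the element-wise products.
a_dot_b = sum(x * y for x, y in zip(a, b))

# Euclidean norms of a and b.
a_norm = math.sqrt(sum(x**2 for x in a))
b_norm = math.sqrt(sum(y**2 for y in b))

# Cosine similarity: normalized dot product.
cosine_similarity = a_dot_b / (a_norm * b_norm)
print(round(cosine_similarity, 2))  # 0.89
```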
05:37 If you change the vectors so they have a wider spread, a wider interior angle like this, with a as 1, 3 and b as 3, 1, you’ll see the cosine similarity is lower, at 0.59 repeating, which is effectively 0.6.
05:50 And if you instead make a -1, 3 and b 3, 1, they’re orthogonal vectors. This results in a cosine similarity of 0.
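Wrapping the calculation in a small helper function (a hypothetical name for illustration) makes it easy to check both variants:

```python
import math

def cosine_similarity(a, b):
    """Normalized dot product of two equal-length vectors."""
    a_dot_b = sum(x * y for x, y in zip(a, b))
    a_norm = math.sqrt(sum(x**2 for x in a))
    b_norm = math.sqrt(sum(y**2 for y in b))
    return a_dot_b / (a_norm * b_norm)

print(cosine_similarity([1, 3], [3, 1]))   # roughly 0.6 (0.59 repeating)
print(cosine_similarity([-1, 3], [3, 1]))  # 0.0 (orthogonal)
```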
06:00 Something important to note is that all of these operations work on vectors with more than two dimensions as well. As long as the vectors being compared have the same dimensionality, it doesn’t matter if they have 2, 200, or 2 million dimensions.
06:14 Of course, it would be unwieldy to write out all of these element-wise operations one at a time like you just saw.
06:21 Thankfully, you can use the NumPy library to help you out with that. Let’s take a quick look at NumPy’s vector operations. If you haven’t used NumPy before, this tutorial will help you out.
06:32 NumPy Tutorial: Your First Steps Into Data Science in Python. Make sure you’re in a virtual environment with NumPy installed, and when you’re ready, pop open the REPL.
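As a preview of what those NumPy operations look like, the same cosine similarity can be computed with np.dot and np.linalg.norm instead of hand-written loops (assuming NumPy is installed in your environment):

```python
import numpy as np

a = np.array([1, 3])
b = np.array([2, 2])

# np.dot computes the dot product; np.linalg.norm the Euclidean norm.
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(float(similarity), 2))  # 0.89
```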
