The NumPy where() function is a powerful tool for filtering array elements in lists, tuples, and NumPy arrays. It works by using a conditional predicate, similar to the logic used in the WHERE or HAVING clauses in SQL queries. It’s okay if you’re not familiar with SQL. You don’t need to know it to follow along with this tutorial.
You would typically use np.where() when you have an array and need to analyze its elements differently depending on their values. For example, you might need to replace negative numbers with zeros or replace missing values such as None or np.nan with something more meaningful. When you run where(), you’ll produce a new array containing the results of your analysis.
You generally supply three parameters when using where(). First, you provide a condition against which each element of your original array is matched. Then, you provide two additional parameters: the first defines what you want to do if an element matches your condition, while the second defines what you want to do if it doesn’t.
If you think this all sounds similar to Python’s ternary operator, you’re correct. The logic is the same.
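To make the comparison concrete, here’s a minimal sketch that sets a conditional expression inside a list comprehension next to the equivalent where() call. The readings and gappy arrays and their values are invented for this illustration and aren’t part of the tutorial’s data.

```python
import numpy as np

# Hypothetical sample data, used only for this illustration.
readings = np.array([4.0, -1.5, 0.0, -7.25, 2.5])

# Plain Python: the conditional expression handles one element at a time.
with_loop = [0.0 if value < 0 else value for value in readings]

# NumPy: where() applies the same "x if condition else y" logic to every
# element at once and returns a new array.
with_where = np.where(readings < 0, 0.0, readings)

print(with_loop)
print(with_where)

# Missing values work the same way when the condition is np.isnan().
gappy = np.array([1.0, np.nan, 3.0])
filled = np.where(np.isnan(gappy), 0.0, gappy)
print(filled)
```

One difference is worth keeping in mind: both value arguments are evaluated in full before where() selects between them, whereas a conditional expression evaluates only one branch.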
Note: In this tutorial, you’ll work with two-dimensional arrays. However, the same principles can be applied to arrays of any dimension.
Before you start, you should familiarize yourself with NumPy arrays and how to use them. It will also be helpful if you understand the subject of broadcasting, particularly for the latter part of this tutorial.
In addition, you may want to use the data analysis tool Jupyter Notebook as you work through the examples in this tutorial. Alternatively, JupyterLab will give you an enhanced notebook experience, but feel free to use any Python environment.
The NumPy library is not part of core Python, so you’ll need to install it. If you’re using a Jupyter Notebook, create a new code cell and type !python -m pip install numpy into it. When you run the cell, the library will install. If you’re working at the command line, use the same command, only without the exclamation point (!).
With these preliminaries out of the way, you’re now good to go.
How to Write Conditional Expressions With NumPy where()
One of the most common scenarios for using where() is when you need to replace certain elements in a NumPy array with other values depending on some condition.
Consider the following array:
>>> import numpy as np
>>> test_array = np.array(
... [
... [3.1688358, 3.9091694, 1.66405549, -3.61976783],
... [7.33400434, -3.25797286, -9.65148913, -0.76115911],
... [2.71053173, -6.02410179, 7.46355805, 1.30949485],
... ]
... )
To begin with, you need to import the NumPy library into your program. It’s standard practice to do so using the alias np, which allows you to refer to the library using this abbreviated form.
The resulting array has a shape of three rows and four columns, and every element is a floating-point number.
Now suppose you wanted to replace all the negative numbers with their positive equivalents:
>>> np.where(
... test_array < 0,
... test_array * -1,
... test_array,
... )
array([[3.1688358 , 3.9091694 , 1.66405549, 3.61976783],
[7.33400434, 3.25797286, 9.65148913, 0.76115911],
[2.71053173, 6.02410179, 7.46355805, 1.30949485]])
The result is a new NumPy array with the negative numbers replaced by positives. Look carefully at the original test_array and then at the corresponding elements of the new array, and you’ll see that the result is exactly what you wanted.
Note: The above example gives you an idea of how the where() function works. If you were doing this in practice, you’d most likely use either the np.abs() or np.absolute() function instead. Both do the same thing because the former is shorthand for the latter:
>>> np.abs(test_array)
array([[3.1688358 , 3.9091694 , 1.66405549, 3.61976783],
[7.33400434, 3.25797286, 9.65148913, 0.76115911],
[2.71053173, 6.02410179, 7.46355805, 1.30949485]])
Once more, all negative values have been replaced with their positive equivalents.
Before moving on to other use cases of where(), you’ll take a closer look at how this all works. To achieve your aim in the previous example, you passed in test_array < 0 as the condition. In NumPy, this creates a Boolean array that where() uses:
>>> test_array < 0
array([[False, False, False, True],
[False, True, True, True],
[False, True, False, False]])
The Boolean array, often called the mask, consists only of elements that are either True or False. If an element matches the condition, the corresponding element in the Boolean array will be True. Otherwise, it’ll be False.
This mask array is always the same shape as the original array it’s based on, producing a one-to-one correspondence between the two. This means that the elements in the mask can be matched against the corresponding elements in test_array to determine how the conditions are applied to each test_array element.
To see this, take a look at the top-left element of test_array, which is 3.1688358. Since this is not less than zero, the top-left element in the Boolean array is False. Conversely, the final element in the top row of test_array does match the condition because -3.61976783 is indeed less than zero. The final element in the top row of the Boolean array is, therefore, True.
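If you’d like to check this correspondence yourself, the following sketch rebuilds the mask from the tutorial’s test_array and inspects the two elements discussed above. The variable name mask is chosen here for illustration.

```python
import numpy as np

test_array = np.array(
    [
        [3.1688358, 3.9091694, 1.66405549, -3.61976783],
        [7.33400434, -3.25797286, -9.65148913, -0.76115911],
        [2.71053173, -6.02410179, 7.46355805, 1.30949485],
    ]
)

mask = test_array < 0

# The mask has the same shape as the array it was built from.
print(mask.shape == test_array.shape)  # True

# Top-left element: 3.1688358 is not negative, so the mask holds False.
print(test_array[0, 0], mask[0, 0])

# Last element of the top row: -3.61976783 is negative, so the mask holds True.
print(test_array[0, 3], mask[0, 3])
```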
In this example, you want to apply test_array * -1 to each element whose corresponding value in this Boolean array is True. Conversely, if the original element is zero or more, the original test_array element will be used instead. In other words, it’ll remain unchanged.
Take a careful look back at the original test_array and the resulting array. You’ll see that all negative elements from test_array have been replaced with their positive counterparts, while the original positive elements haven’t been changed. Had an element been 0, it wouldn’t have changed either.
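The zero case is easy to confirm with a small sketch. The sample array below is hypothetical and was chosen only because it contains a zero.

```python
import numpy as np

# Hypothetical values chosen to include zero.
sample = np.array([-2.5, 0.0, 2.5])

# 0.0 < 0 is False, so where() takes that element from the third argument.
result = np.where(sample < 0, sample * -1, sample)
print(result)  # [2.5 0.  2.5]
```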
Note: In this tutorial, you’ll work with pre-defined arrays to make sure your results match those shown. If you want to experiment with the where() function and generate arrays using random numbers, you can do so by using a built-in random number generator that comes as part of NumPy.
For example, you could have created your own randomized version of test_array using the following:
>>> test_array = np.random.uniform(low=-10, high=10, size=(3, 4))
>>> test_array
array([[-2.71697178, -2.49701546, -7.57662054, -9.41817892],
[ 2.43095102, 0.7143025 , 0.25938839, -4.78215376],
[-7.13802191, -5.47446998, -0.47173589, -0.36727671]])
When you run this code, you’ll produce your own three-row, four-column array containing random numbers that are different from those shown above. The minimum possible number you may see will be -10, but the maximum number will be just less than 10.
The where() function will work with this array in the same way as the more controlled version you used earlier, but your results will vary each time you re-generate test_array.
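If you want random data that stays the same between runs, one option is to seed a generator. The sketch below uses np.random.default_rng(), which is a different interface from the np.random.uniform() call shown above, and the seed value 2024 is an arbitrary choice made for this illustration.

```python
import numpy as np

# The seed value 2024 is an arbitrary choice made for this sketch.
rng = np.random.default_rng(seed=2024)
test_array = rng.uniform(low=-10, high=10, size=(3, 4))

# A second generator built from the same seed reproduces the same numbers.
same_again = np.random.default_rng(seed=2024).uniform(
    low=-10, high=10, size=(3, 4)
)
print(np.array_equal(test_array, same_again))  # True

# where() works on the seeded array exactly as it does on any other.
print(np.where(test_array < 0, test_array * -1, test_array))
```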
Congratulations! You’ve now written some code that demonstrates the basic use case of the where() function. If you’re ready for more, read on to learn how to use more complex conditions.
How to Use Multiple Conditional Expressions
In the previous section, you successfully replaced all negative numbers with their positive counterparts. Suppose you wanted to do this only for values between -2 and 3, while leaving all others unchanged. To do this, you need to apply a more complex condition.
With your existing knowledge of if-else, you might be tempted to try something like this:
>>> import numpy as np
>>> test_array = np.array(
... [
... [3.1688358, 3.9091694, 1.66405549, -3.61976783],
... [7.33400434, -3.25797286, -9.65148913, -0.76115911],
... [2.71053173, -6.02410179, 7.46355805, 1.30949485],
... ]
... )
>>> np.where(
... (test_array > -2) and (test_array < 3),
... test_array * -1,
... test_array,
... )
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
ValueError: The truth value of an array with more than one element is ambiguous.
Instead of giving you an answer, your code has raised a ValueError exception and crashed. Not exactly what you’d hoped for.
This happened because the and operator can only work with individual truth values. It doesn’t understand arrays of values. When you write code such as test_array > -2, NumPy creates a Boolean array behind the scenes. While the where() function can cope with this as its condition parameter, using it with and raises the error.
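You can reproduce the underlying problem without where() at all. The sketch below assumes a small, hypothetical Boolean array named mask and shows that asking it for a single truth value, which is what the and operator does to its operands, triggers the same exception.

```python
import numpy as np

# Hypothetical mask, used only for this illustration.
mask = np.array([True, False, True])

# The `and` operator first asks each operand for a single truth value,
# which is what bool() does. An array with several elements can't answer.
try:
    bool(mask)
except ValueError as error:
    message = str(error)
    print(f"ValueError: {message}")

# Reducing the array to one value removes the ambiguity.
print(mask.any())  # True, because at least one element is True
print(mask.all())  # False, because not every element is True
```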
The solution is to use the bitwise AND operator (&) instead. In NumPy, this operator is overloaded to do elementwise AND operations. It compares the values of both Boolean arrays element by element and returns a single Boolean array of the result. This result can then be understood by the where() function and safely applied to the original array.
The code below shows the steps required to produce such an array:
>>> test_array > -2
array([[ True, True, True, False],
[ True, False, False, True],
[ True, False, True, True]])
>>> test_array < 3
array([[False, False, True, True],
[False, True, True, True],
[ True, True, False, True]])
>>> (test_array > -2) & (test_array < 3)
array([[False, False, True, False],
[False, False, False, True],
[ True, False, False, True]])
As you already know, running test_array > -2 and test_array < 3 produces two Boolean arrays. This time, to compute the logical conjunction of both, you used the & operator, which has successfully created a third Boolean array based on the result of applying & against each pair of elements. The resulting array will contain True values if the corresponding elements in both Boolean arrays are True. All other values will be False.
When this Boolean array is passed into where(), the result is far more palatable:
>>> np.where(
... (test_array > -2) & (test_array < 3),
... test_array * -1,
... test_array,
... )
array([[ 3.1688358 , 3.9091694 , -1.66405549, -3.61976783],
[ 7.33400434, -3.25797286, -9.65148913, 0.76115911],
[-2.71053173, -6.02410179, 7.46355805, -1.30949485]])
This time, only values between -2 and 3 have changed their sign. In other words, those values that are both greater than -2 and less than 3 are replaced.
Knowing how to use multiple conditions and understanding how to use parentheses to control operator precedence allows you to create some really complex analysis conditions and unleash the real power of where().
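To see why the parentheses matter, consider the following sketch. The values array is hypothetical. It also shows np.logical_and(), a NumPy function that isn’t used elsewhere in this tutorial but gives the same result as & on Boolean arrays.

```python
import numpy as np

# Hypothetical sample values, used only for this illustration.
values = np.array([-3.5, -1.0, 0.5, 2.0, 4.5])

# With parentheses, each comparison is evaluated first, then combined.
mask = (values > -2) & (values < 3)
print(mask)  # [False  True  True  True False]

# np.logical_and() is an alternative spelling that avoids the precedence trap.
same_mask = np.logical_and(values > -2, values < 3)
print(np.array_equal(mask, same_mask))  # True

# Without parentheses, & binds more tightly than the comparisons, so Python
# tries to evaluate -2 & values first. That fails for a float array.
try:
    values > -2 & values < 3
    outcome = "no error"
except (TypeError, ValueError) as error:
    outcome = type(error).__name__
print(outcome)
```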
As another example, suppose you wanted to flip the signs in your original test_array, but only if the number is less than or equal to -2 or greater than or equal to 3:
>>> np.where(
... (test_array <= -2) | (test_array >= 3),
... test_array * -1,
... test_array,
... )
array([[-3.1688358 , -3.9091694 , 1.66405549, 3.61976783],
[-7.33400434, 3.25797286, 9.65148913, -0.76115911],
[ 2.71053173, 6.02410179, -7.46355805, 1.30949485]])
Here, you’ve used the bitwise OR operator (|). Similarly to &, this operator has been overloaded to do elementwise OR operations. The expression (test_array <= -2) | (test_array >= 3) once again produces two Boolean arrays before using the | operator to combine them into one. This time, True will appear in the resulting Boolean array if, and only if, at least one of the corresponding elements is True. Attempting this with the or operator will again produce a ValueError, for the same reason that the and operator did earlier.
Take a careful look at both your original test_array and the resulting array, and you’ll see that the conversion has only been applied to numbers that fall outside the (-2, 3) interval.
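Because this condition selects exactly the elements that the earlier & condition left out, the two masks are complements of each other. The sketch below checks this with the ~ operator, which inverts a Boolean array element by element. The names inside and outside are chosen here for illustration, and the equivalence assumes the array contains no np.nan values.

```python
import numpy as np

test_array = np.array(
    [
        [3.1688358, 3.9091694, 1.66405549, -3.61976783],
        [7.33400434, -3.25797286, -9.65148913, -0.76115911],
        [2.71053173, -6.02410179, 7.46355805, 1.30949485],
    ]
)

inside = (test_array > -2) & (test_array < 3)
outside = (test_array <= -2) | (test_array >= 3)

# Inverting the "inside" mask gives the "outside" mask.
print(np.array_equal(outside, ~inside))  # True

# Four elements fall inside the interval, so eight fall outside it.
print(inside.sum(), outside.sum())  # 4 8
```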
It’s time to consolidate your learning with an exercise. Have a try at this:
Create a five-row by four-column array using the following code:
>>> question_1 = np.arange(-10, 10).reshape(5, 4)
Your array will contain all numbers from -10 to 9 in a sequence. Now, use it to solve the following challenges:
- Use where() to create an array that replaces the elements in question_1 with the number 9 if they are either negative or even. Before you run your code, see if you can predict how many nines there will be in the new array. Were you correct?
- Next, use where() to create an array that has squared each negative odd number in question_1.
- Finally, use where() to create an array to replace all elements in question_1 that are between 3 and 7 or equal to 1, with -10. For all other elements, subtract one from them. Oh, and do take care with operator precedence.
One possible solution for the first question is:
>>> np.where(
... (question_1 < 0) | (question_1 % 2 == 0),
... 9,
... question_1,
... )
array([[9, 9, 9, 9],
[9, 9, 9, 9],
[9, 9, 9, 1],
[9, 3, 9, 5],
[9, 7, 9, 9]])
Here, you used the less than operator (<) to select negative numbers and the modulo operator (%) to select each even number. By using the elementwise | operator, you selected values that matched either condition.
If you said there would be sixteen nines in the result, well done. If you said there would only be fifteen, well done on understanding how where() works. Unfortunately, you forgot to count the existing 9.
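If you’d rather let NumPy do the counting, here’s one way to check both numbers. It uses np.count_nonzero(), and the variable names condition and answer are chosen here for illustration.

```python
import numpy as np

question_1 = np.arange(-10, 10).reshape(5, 4)
condition = (question_1 < 0) | (question_1 % 2 == 0)
answer = np.where(condition, 9, question_1)

# Elements that where() replaced with 9.
print(np.count_nonzero(condition))  # 15

# All nines in the result, including the 9 that was already there.
print(np.count_nonzero(answer == 9))  # 16
```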
One possible solution for the second question is:
>>> np.where(
... (question_1 < 0) & (question_1 % 2 != 0),
... np.square(question_1),
... question_1,
... )
array([[-10, 81, -8, 49],
[ -6, 25, -4, 9],
[ -2, 1, 0, 1],
[ 2, 3, 4, 5],
[ 6, 7, 8, 9]])
This time, you used the less than operator (<) to select negative numbers and the modulo operator (%) to select each odd number. By using the & operator, you selected values that matched both conditions. The np.square() function did the squaring of the selected elements for you.
One possible solution for the third question is:
>>> np.where(
... ((question_1 > 3) & (question_1 < 7)) | (question_1 == 1),
... -10,
... question_1 - 1,
... )
array([[-11, -10, -9, -8],
[ -7, -6, -5, -4],
[ -3, -2, -1, -10],
[ 1, 2, -10, -10],
[-10, 6, 7, 8]])
You used the & operator to select values between 3 and 7. Then, you took the resulting Boolean array and used the | operator to include values that equal 1.
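The warning about operator precedence can be explored with a short sketch. The names explicit, implicit, and regrouped are chosen here for illustration.

```python
import numpy as np

question_1 = np.arange(-10, 10).reshape(5, 4)

explicit = ((question_1 > 3) & (question_1 < 7)) | (question_1 == 1)

# & binds more tightly than |, so dropping the outer parentheses gives the
# same grouping. The parentheses around each comparison are still required.
implicit = (question_1 > 3) & (question_1 < 7) | (question_1 == 1)
print(np.array_equal(explicit, implicit))  # True

# Grouping the | first changes the meaning: 1 is no longer selected,
# because it would then also have to be greater than 3.
regrouped = (question_1 > 3) & ((question_1 < 7) | (question_1 == 1))
print(np.array_equal(explicit, regrouped))  # False
```

Even though the outer parentheses are optional here, keeping them makes the intended grouping obvious to the next reader.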
With that workout complete, it’s time for you to move on and learn how to perform array broadcasting conditionally.
How to Use Array Broadcasting in Conditional Expressions
In the examples you’ve seen so far, the conditions have performed a calculation on the existing array’s elements to produce a new value. While this is a very common use case for where(), you can also use the where() function to replace elements in an array with those from other arrays, depending on the result of your condition.
To make this possible, the arrays that supply the replacement values must be broadcast compatible with the original array whose values you want to replace. Broadcasting allows you to perform operations between arrays with different shapes without having to write complicated loops.
Two arrays are broadcast compatible if, when you compare their shapes dimension by dimension starting from the rightmost one, each pair of dimensions is either identical or contains a 1. Once your arrays are broadcast compatible, you can use them together with the where() function.
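You can test shapes for compatibility without creating any arrays. The sketch below uses np.broadcast_shapes(), which isn’t covered in this tutorial and is available from NumPy 1.20 onward. The shapes are examples picked for illustration.

```python
import numpy as np

# Shapes are compared from the rightmost dimension leftward.
print(np.broadcast_shapes((4, 3), (1, 3)))  # (4, 3)
print(np.broadcast_shapes((4, 3), (3,)))    # (4, 3)
print(np.broadcast_shapes((4, 3), (4, 1)))  # (4, 3)

# Rightmost dimensions of 3 and 4 differ and neither is 1, so this fails.
try:
    np.broadcast_shapes((4, 3), (4,))
    compatible = True
except ValueError:
    compatible = False
print(compatible)  # False
```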
As an example, suppose you have the following array:
>>> booking_data = np.array(
... [
... [np.nan, np.nan, 1],
... [1, 1, np.nan],
... [1, np.nan, 1],
... [1, 1, 1],
... ]
... )
Next, imagine that your booking_data array contains the details of meal reservations for a hotel. Each row represents a separate guest, while each column represents menu requirements. You use a 1 in the leftmost column to represent a breakfast request, a 1 in the center column to represent a lunch request, and a 1 in the rightmost column to represent an evening meal request. np.nan indicates that the meal hasn’t been requested.
Your array contains four rows and three columns. You can confirm this with the booking_data array’s .shape attribute:
>>> booking_data.shape
(4, 3)
Now consider this array:
>>> meal_prices = np.array([5.1, 8.2, 20.3]).reshape(1, 3)
>>> no_charge = 0
The meal_prices array contains one row and three columns of price information. The price of breakfast is $5.10, lunch is $8.20, and an evening meal is $20.30. The array shape this time is (1, 3).
The point to note here is that booking_data and meal_prices are broadcast compatible because their rightmost dimensions of 3 are identical. This allows you to replace elements in one array with those from the other.
You’ve also created a no_charge variable and assigned it a value of 0. Although this is a single number and not an array, single numbers are broadcastable across any size of array. In other words, they are always broadcast compatible.
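To picture what broadcasting does with these two values, you can stretch them to the shape of booking_data explicitly. This sketch uses np.broadcast_to(), which isn’t needed when you call where() because the stretching happens automatically. The names stretched_prices and stretched_zero are chosen here for illustration.

```python
import numpy as np

meal_prices = np.array([5.1, 8.2, 20.3]).reshape(1, 3)
no_charge = 0

# Stretch the single row of prices across four rows, as broadcasting does.
stretched_prices = np.broadcast_to(meal_prices, (4, 3))
print(stretched_prices)

# A single number stretches to any shape.
stretched_zero = np.broadcast_to(no_charge, (4, 3))
print(stretched_zero)
```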
Now, suppose you want to clean up your booking_data array by creating a new booking_prices array that replaces each 1 with its corresponding price and each np.nan with a 0. The where() function can do this for you:
>>> booking_prices = np.where(booking_data == 1, meal_prices, no_charge)
>>> booking_prices
array([[ 0. , 0. , 20.3],
[ 5.1, 8.2, 0. ],
[ 5.1, 0. , 20.3],
[ 5.1, 8.2, 20.3]])
As you can see, wherever booking_data == 1 is True, the corresponding element from meal_prices has been inserted into booking_prices. Otherwise, a 0 has been inserted.
Although this is certainly a powerful use of where(), the principles here are the same as in earlier use cases. The booking_data == 1 parameter created a Boolean array. In cases where an element in this Boolean array is True, the corresponding element from the meal_prices array is used in the result. Where an element is False, the value of no_charge, or 0, is used instead.
You may have noticed that the inserted value of no_charge is a float. This is because there are already floats in the array, so any integers are automatically upcast to float types to keep the array homogeneous.
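You can watch this conversion happen by checking the .dtype attribute of the result. The sketch below assumes a small, hypothetical mask and compares an all-integer call with a mixed one.

```python
import numpy as np

# Hypothetical mask, used only for this illustration.
mask = np.array([True, False, True])

# Two integers: the result is an integer array.
ints_only = np.where(mask, 1, 0)
print(ints_only.dtype)

# A float and an integer: the integer is converted so that every element
# shares one floating-point type.
mixed = np.where(mask, 5.1, 0)
print(mixed.dtype)  # float64
print(mixed)        # [5.1 0.  5.1]
```

The exact integer type printed by the first call can vary by platform and NumPy version, which is why no expected output is shown for it.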
Time for another workout:
Create a question_2 array using this code:
>>> question_2 = np.arange(12).reshape(3, 4)
Next, create two variables, high and low, and assign them strings as shown:
>>> high = "HIGH"
>>> low = "LOW"
Now, use the where() function to replace all numbers greater than 6 with the string “HIGH”, and everything else with “LOW” using the following three techniques:
- Use the question_2, high, and low variables as defined above.
- Assign new high and low variables with arrays that contain the strings “HIGH” and “LOW”, respectively.
- As an extra challenge, see if you can make both of these arrays different shapes, but still broadcast compatible with question_2.
In each case, the result should be identical.
One possible solution for the first question is:
>>> question_2 = np.arange(12).reshape(3, 4) # Shape (3, 4)
>>> high = "HIGH"
>>> low = "LOW"
>>> np.where(question_2 > 6, high, low)
array([['LOW', 'LOW', 'LOW', 'LOW'],
['LOW', 'LOW', 'LOW', 'HIGH'],
['HIGH', 'HIGH', 'HIGH', 'HIGH']], dtype='<U4')
One possible solution for the second question is:
>>> question_2 = np.arange(12).reshape(3, 4) # Shape (3, 4)
>>> high = np.array(["HIGH"]) # Shape (1,)
>>> low = np.array(["LOW"]) # Shape (1,)
>>> np.where(question_2 > 6, high, low)
array([['LOW', 'LOW', 'LOW', 'LOW'],
['LOW', 'LOW', 'LOW', 'HIGH'],
['HIGH', 'HIGH', 'HIGH', 'HIGH']], dtype='<U4')
One possible solution for the third question is:
>>> question_2 = np.arange(12).reshape(3, 4) # Shape (3, 4)
>>> high = np.array(["HIGH", "HIGH", "HIGH", "HIGH"]) # Shape (4,)
>>> low = np.array(["LOW"]) # Shape (1,)
>>> np.where(question_2 > 6, high, low)
array([['LOW', 'LOW', 'LOW', 'LOW'],
['LOW', 'LOW', 'LOW', 'HIGH'],
['HIGH', 'HIGH', 'HIGH', 'HIGH']], dtype='<U4')
In this last solution, you could have swapped the shapes of high and low around.
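Other shape combinations work too. As one more possibility, this sketch makes high a single column and low a single row by using np.full(), a function not otherwise used in this tutorial. The name labels is chosen here for illustration.

```python
import numpy as np

question_2 = np.arange(12).reshape(3, 4)  # Shape (3, 4)
high = np.full((3, 1), "HIGH")  # Shape (3, 1): one column
low = np.full((1, 4), "LOW")    # Shape (1, 4): one row

# Both arrays broadcast to (3, 4), so the result matches the earlier ones.
labels = np.where(question_2 > 6, high, low)
print(labels)
```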
To finish off, you’ll see what is effectively the simplest use case of where(). You’ll also learn the importance of reading documentation carefully to highlight such use cases.
How Not to Use np.where() - A Final Quirk
When you read the official documentation for where(), the definition of the function may make it look a little more complicated than it is:
numpy.where(condition, [x, y, ]/)
As with all Python documentation, it’s tempting to skip over this information and look at some examples instead. However, if you take some time to read it, you’ll gain a better understanding of the different ways the function can be used.
First of all, the definition ends with a forward slash (/) character. You might think this represents a division or line continuation symbol, but it’s neither. By placing the forward slash special parameter at the end, the documentation is telling you that each parameter passed must be passed by position and not by keyword.
The first parameter is the condition the elements are tested against, while the second and third parameters, formally documented as x and y, supply the values to use when the condition is true or false, respectively. However, you can’t pass them by these names as keyword arguments.
You may also notice that the x and y parameters are encased in square brackets. You could be forgiven for thinking this is telling you to supply these parameters as a Python list. In fact, the square brackets here indicate that both x and y are optional. You should also note that you can’t pass only one of them.
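Both restrictions can be checked with a short experiment. The values array below is hypothetical, and the sketch records only the exception types rather than the full error messages, since their wording may differ between NumPy versions.

```python
import numpy as np

# Hypothetical sample data, used only for this illustration.
values = np.array([1, -2, 3])

# Positional arguments work as expected.
print(np.where(values > 0, values, 0))  # [1 0 3]

# Keyword arguments are rejected.
try:
    np.where(condition=values > 0, x=values, y=0)
    keyword_error = None
except TypeError as error:
    keyword_error = type(error).__name__
print(keyword_error)  # TypeError

# Supplying x without y is rejected too.
try:
    np.where(values > 0, values)
    single_error = None
except ValueError as error:
    single_error = type(error).__name__
print(single_error)  # ValueError
```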
In this tutorial, you’ve always used three parameters because this is the most common approach. However, now that you know only the first parameter is mandatory, you may be wondering what happens if you omit the other two. To find out, take a look at the code shown below:
>>> import numpy as np
>>> mostly_zeroes = np.array(
... [[9, 0, 0],
... [0, 8, 5],
... [0, 0, 7]])
>>> np.where(mostly_zeroes != 0)
(array([0, 1, 1, 2]), array([0, 1, 2, 2]))
If you provide the where() function with only a condition parameter, it’ll return a Python tuple containing arrays of the indices of those elements for which the condition is True. In this example, those are the elements whose values are non-zero. There will be one array for each dimension. This is why two arrays are returned in the above example: mostly_zeroes is two-dimensional, with a shape of (3, 3).
This somewhat confusing output tells you that the elements at positions (0, 0), (1, 1), (1, 2), and (2, 2) are all non-zero. In other words, they correspond to True values in the underlying Boolean array produced by the condition. The other elements are zero.
This is useful for locating the non-zero elements in a data analysis, because you can use the returned index arrays to select those elements from the original array.
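As a sketch of how you might work with the returned tuple, the code below pairs up the row and column indices into coordinates and then uses the tuple to index the original array. The names indices and coordinates are chosen here for illustration.

```python
import numpy as np

mostly_zeroes = np.array(
    [[9, 0, 0],
     [0, 8, 5],
     [0, 0, 7]]
)

indices = np.where(mostly_zeroes != 0)

# Pair up the row and column indices to get readable coordinates.
coordinates = [(int(row), int(col)) for row, col in zip(*indices)]
print(coordinates)  # [(0, 0), (1, 1), (1, 2), (2, 2)]

# The tuple of index arrays can be used directly to pull out the values.
print(mostly_zeroes[indices])  # [9 8 5 7]
```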
The documentation doesn’t recommend using where() this way, but instead advises you to use the nonzero() function directly:
>>> np.nonzero(mostly_zeroes)
(array([0, 1, 1, 2]), array([0, 1, 2, 2]))
The result is identical to the previous example because passing only a condition argument to where() results in a call to nonzero() behind the scenes. There’s little point in using where() to do this because it only adds overhead to the nonzero() call. You can also use nonzero() to find the indices of other conditions:
>>> np.nonzero(mostly_zeroes == 5)
(array([1]), array([2]))
In this case, only the element at (1, 2) is equal to five. This works because, as you’ve seen earlier, the condition mostly_zeroes == 5 produces a Boolean array. In that mask, True is interpreted as 1 and False as 0. In other words, all elements satisfying the condition are non-zero.
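To make the True-as-1 interpretation visible, you can convert the mask to integers. This is only an illustrative sketch; nonzero() doesn’t need the conversion, and the names mask, rows, and cols are chosen here for illustration.

```python
import numpy as np

mostly_zeroes = np.array(
    [[9, 0, 0],
     [0, 8, 5],
     [0, 0, 7]]
)

mask = mostly_zeroes == 5

# Viewed as integers, True becomes 1 and False becomes 0.
print(mask.astype(int))

# nonzero() therefore reports exactly the positions where the mask is True.
rows, cols = np.nonzero(mask)
print(rows, cols)  # [1] [2]
```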
Conclusion
You now have a comprehensive understanding of how to use NumPy’s where() function, its parameters, and how they’re used to perform tasks on array elements depending on the value of those elements.
Congratulations on completing this tutorial, and enjoy applying these newfound skills to your future data analysis projects!