Summer isn’t all holidays and lazy days at the beach. Over the last month, two important players in the data science ecosystem released new major versions. NumPy published version 2.0, which comes with several improvements but also some breaking changes. At the same time, Polars reached its version 1.0 milestone and is now considered production-ready.
PyCon US was hosted in Pittsburgh, Pennsylvania in May. The conference is an important meeting spot for the community and sparked some new ideas and discussions. You can read about some of these in PSF’s coverage of the Python Language Summit, and watch some of the videos posted from the conference.
Dive in to learn more about the most important Python news from the last month.
NumPy Version 2.0
NumPy is a foundational package in the data science space. The library provides in-memory N-dimensional arrays and many functions for fast operations on those arrays.
Many libraries in the ecosystem use NumPy under the hood, including pandas, SciPy, and scikit-learn. The NumPy package has been around for close to twenty years and has played an important role in the rising popularity of Python among data scientists.
The new version 2.0 of NumPy is an important milestone, which adds an improved string type, cleans up the library, and improves performance. However, it comes with some changes that may affect your code.
The biggest breaking changes happen in the C-API of NumPy. Typically, this won’t affect you directly, but it can affect other libraries that you rely on. The community has rallied strongly and most of the bigger packages already support NumPy 2.0. You can check NumPy’s table of ecosystem support for details.
One of the main reasons for using NumPy is that the library can do fast and convenient array operations. For a simple example, the following code calculates square numbers:
>>> numbers = range(10)
>>> [number**2 for number in numbers]
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
>>> import numpy as np
>>> numbers = np.arange(10)
>>> numbers**2
array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])
First, you use range() and a list comprehension to calculate the first ten square numbers in pure Python. Then, you repeat the calculation with NumPy. Note that you don’t need to explicitly spell out the loop. NumPy handles that for you under the hood.
Furthermore, the NumPy version will be considerably faster, especially for bigger arrays of numbers. One of the secrets to this speed is that NumPy arrays are limited to having one data type, while a Python list can be heterogeneous. One list can contain elements as different as integers, floats, strings, and even nested lists. That’s not possible in a NumPy array.
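You can see this single-type rule in action when you create an array from mixed values. As a small illustration, NumPy silently coerces integers and floats to one common type:

>>> [1, 2.5, "three"]
[1, 2.5, 'three']
>>> np.array([1, 2.5])
array([1. , 2.5])
>>> np.array([1, 2.5]).dtype
dtype('float64')

The Python list happily mixes types, while NumPy upcasts the integer so that every element is a 64-bit float.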
Improved String Handling
By requiring all elements to be of the same type, each taking up the same number of bytes in memory, NumPy can quickly find and work with individual elements. One downside of this has been that strings can be awkward to work with:
>>> words = np.array(["numpy", "python"])
>>> words
array(['numpy', 'python'], dtype='<U6')
>>> words[1] = "monty python"
>>> words
array(['numpy', 'monty '], dtype='<U6')
You first create an array consisting of two strings. Note that NumPy automatically detects that the longest string is six characters long, so it sets aside six characters of space for each string. The 6 in the data type string, <U6, indicates this.
Next, you try to replace the second string with a longer string. Unfortunately, only the first six characters are stored since that’s how much space NumPy has set aside for each string in this array. There are ways to work around these limitations, but in NumPy 2.0, you can take advantage of variable-length strings instead:
>>> words = np.array(["numpy", "python"], dtype=np.dtypes.StringDType())
>>> words
array(['numpy', 'python'], dtype=StringDType())
>>> words[1] = "monty python"
>>> words
array(['numpy', 'monty python'], dtype=StringDType())
To use variable-length strings, you need to explicitly specify the StringDType. These arrays are implemented by storing pointers in the array itself, where each element points to the location in memory where the value of that string is stored. Check out the NumPy documentation for more information.
String handling has been further improved by adding a dedicated np.strings module that includes many functions for manipulating strings. For example, you can change both strings to title case by calling title():
>>> np.strings.title(words)
array(['Numpy', 'Monty Python'], dtype=StringDType())
All np.strings functions are so-called universal functions that operate on the whole array at once.
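For example, the other functions in the module work the same way. Assuming the words array from above, you can uppercase every element or count the characters in each string with a single call:

>>> np.strings.upper(words)
array(['NUMPY', 'MONTY PYTHON'], dtype=StringDType())
>>> np.strings.str_len(words)
array([ 5, 12])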
Two improvements in NumPy 2.0 that break backward compatibility are simpler rules for type promotion and a clean-up of the np namespace.
Simpler Type Promotion
When you’re doing a binary operation on two values of different types, it’s not always clear what the value of the result should be. In earlier versions of NumPy, the result type could even depend on the values of the operands. Consider the following example of adding a NumPy array to a scalar number:
>>> numbers = np.array([10, -20, 123], dtype=np.int8)
>>> numbers
array([ 10, -20, 123], dtype=int8)
>>> numbers + 20
array([  30,    0, -113], dtype=int8)
>>> numbers + 200
array([210, 180, 323], dtype=int16)
You create an 8-bit integer array with three numbers. Such an 8-bit integer type can only represent the numbers from -128 to 127, inclusive. When you add 20 to each element in the array, the last element should be 123 + 20 = 143. However, this number can’t be represented as an int8, so it wraps around and becomes 143 - 2⁸ = -113 instead. This is the natural behavior of 8-bit integers.
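If you want to see the wraparound explicitly, you can reproduce it with a cast, since casting with .astype() wraps out-of-range values instead of raising an error:

>>> np.array([143], dtype=np.int16).astype(np.int8)
array([-113], dtype=int8)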
In the third example, you do a similar operation but add 200 instead of 20. Now, your array is first promoted to a 16-bit integer before the calculation takes place, and all additions work without needing to wrap around.
In NumPy 2.0, implicit type promotion never depends on the values of the operands. If you run the same example as above in the new version of the library, you’ll see the following:
>>> numbers = np.array([10, -20, 123], dtype=np.int8)
>>> numbers
array([ 10, -20, 123], dtype=int8)
>>> numbers + 20
array([  30,    0, -113], dtype=int8)
>>> numbers + 200
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OverflowError: Python integer 200 out of bounds for int8
Instead of promoting numbers to a 16-bit integer array, you’ll get an error saying that 200 can’t be represented as an 8-bit integer. To perform the calculation, you should cast the data type yourself. You can, for example, do the following:
>>> np.array(numbers, dtype=np.int16) + 200
array([210, 180, 323], dtype=int16)
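Equivalently, you can call .astype() on the existing array before doing the addition:

>>> numbers.astype(np.int16) + 200
array([210, 180, 323], dtype=int16)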
These changes to type promotion make the logic simpler and more consistent with other similar libraries. You can read more about the motivation and the details of the change in NEP 50: Promotion Rules for Python Scalars.
Cleaner Namespace
In this new version, the main namespace, conventionally referred to as np, gets a facelift. Many aliases and deprecated functions have been removed. For example, in older versions of NumPy, you could reference infinity (∞) in several different ways:
>>> np.inf
inf
>>> np.Infinity
inf
>>> np.Inf
inf
>>> np.infty
inf
>>> np.PINF
inf
In NumPy 2.0, you can only use np.inf to refer to ∞:
>>> np.inf
inf
>>> np.Inf
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: `np.Inf` was removed in the NumPy 2.0 release.
Use `np.inf` instead.. Did you mean: 'inf'?
In most cases, you’ll get a helpful warning or error message when you try to access an attribute that has been removed.
You can find more information in NEP 52: Python API cleanup for NumPy 2.0.
NumPy 2.0 is a great step forward for the community. If you’re able to, you should test your code with the new version. With a bit of luck, you won’t see any issues. If your code breaks when you run it against NumPy 2.0, you can use the Ruff linter to help you upgrade your code.
Assume that you have the following code in a file named old_numpy.py:
old_numpy.py
import numpy as np
numbers = np.array([3.14, -42, np.infty, np.NaN, 2024])
mask = numbers < np.Infinity
print(f"{mask.sum()} numbers are finite")
In this code, you create a mask of True and False values indicating whether the numbers in an array are finite. Since you’re using some of the removed attributes, this code won’t work on NumPy 2.0. You can use Ruff to help you identify and fix these issues by selecting rule NPY201: numpy2-deprecation:
$ ruff check old_numpy.py --select NPY201
old_numpy.py:3:32: NPY201 [*] `np.infty` will be removed in NumPy 2.0. Use `numpy.inf` instead.
old_numpy.py:3:42: NPY201 [*] `np.NaN` will be removed in NumPy 2.0. Use `numpy.nan` instead.
old_numpy.py:4:18: NPY201 [*] `np.Infinity` will be removed in NumPy 2.0. Use `numpy.inf` instead.
Found 3 errors.
[*] 3 fixable with the `--fix` option.
If you want, you can even ask Ruff to automatically fix these issues for you by rerunning the command with --fix added.
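For reference, this is what the file looks like once the removed aliases are replaced as the messages above suggest. It only uses np.inf and np.nan, so it runs on both NumPy 1.x and 2.0:

old_numpy.py
import numpy as np

numbers = np.array([3.14, -42, np.inf, np.nan, 2024])
mask = numbers < np.inf
print(f"{mask.sum()} numbers are finite")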
For more help on updating your code to be compatible with NumPy 2.0, check out the official migration guide.
Polars Version 1.0
Polars is a lightning-fast DataFrame library, offering much of the same functionality as pandas but often with better performance. Since its first release in 2021, the library has gained a lot of popularity because of its speed and consistent syntax.
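If you haven’t tried Polars yet, here’s a minimal, illustrative sketch of its expression-based syntax, with made-up data for the example:

import polars as pl

df = pl.DataFrame({
    "library": ["numpy", "pandas", "polars"],
    "first_release": [2006, 2008, 2021],
})

# Filter rows with a column expression instead of boolean indexing
print(df.filter(pl.col("first_release") > 2010))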
On July 1, 2024, version 1.0 of Polars was released. With the new release, the development team signals that the library is production-ready:
We are convinced that Polars is in a state where it is one of the best open-source choices for fast data modeling that focuses on vertical scaling. We are confident that the core of our API is solid and offers a strong base for further improvements to Polars. Another driving factor in this conviction is that the project is now backed by the Polars company, which can guarantee continuous effort and support. (Source)
While Polars has reached a mature version 1.0, there are many plans for further improvements, such as:
- A new design for the streaming engine
- GPU acceleration
- Polars Cloud
You can read about these and other plans for the future on the Polars blog. If you’ve already used Polars, then you should check out the upgrade guide to make sure your code will work with Polars 1.0.
PyCon US Videos
The annual PyCon US conference took place in Pittsburgh in May. As always, the talks and presentations will be made available in a playlist on the PyCon US YouTube channel.
Currently, all keynotes, tutorials, lightning talks, and sponsor presentations are available. The regular presentations should also be posted on the channel shortly. As always, the keynotes were great, bringing fresh perspectives to the table:
- Jay Miller talked about the Black Python Devs community.
- Simon Willison talked about large language models (LLMs), which he calls imitation intelligence.
- Kate Chapman talked about collaboration in sociotechnical systems, or how people and technology work together.
- Sumana Harihareswara talked about Python packaging and shared stories from her years of working on PyPI and pip.
Additionally, the Python Steering Council talked about their activities over the last year, including describing some of the changes that are coming in Python 3.13. They also announced that Velda Kiara has been offered the position of secretary to the steering council. With this new role, the steering council hopes to be even more transparent about their discussions and activities going forward.
There are already many hours of great content available from the conference. Keep an eye out for the rest of the videos to drop.
Coverage of the 2024 Python Language Summit
The Python Language Summit is an annual meeting of core developers and others held during the week of PyCon US. While the talks and discussions aren’t recorded, the Python Software Foundation (PSF) publishes a write-up of the proceedings. This year, Seth Larson covered the summit in a series of blog posts.
As usual, the presentations covered a wide range of topics, including Python’s C-API, the new REPL, and Python on mobile. Have a look at the PSF blog to learn more about these and other interesting developments in your favorite language.
PEP 2026 - Calendar Versioning
One of the discussion topics at the language summit was whether Python should change its current versioning scheme and adopt calendar versioning (CalVer) instead. Since the meeting, the proposal has been written up in an official enhancement proposal: PEP 2026 - Calendar Versioning for Python.
The suggestion was presented by Hugo van Kemenade, who’s also the release manager for the upcoming 3.14 and 3.15 versions. The idea is to include the release year in the version number, which will make it easier to track the support lifecycle and calculate when a version reaches end-of-life.
The PEP proposes that instead of releasing Python 3.15 in 2026, the version number should be changed to 3.26 so the last number coincides with the release year. In practice, this skips eleven regular version numbers. However, it also gives Python’s version numbers additional meaning, as it’ll be easier to keep track of release years.
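As a toy illustration of the bookkeeping, under CalVer the minor version encodes the release year, and, assuming CPython’s current five-year support window, the end-of-life year can be read straight off the version number:

>>> version = "3.26"
>>> release_year = 2000 + int(version.split(".")[1])
>>> release_year, release_year + 5
(2026, 2031)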
The suggestion sparked some discussion at the language summit, and the PEP’s discourse thread has also seen a flurry of activity. Many people are positive about the proposed change, but there are also several concerns. Whether Python will adopt a new, calendar-based versioning scheme remains to be seen.
What’s Next for Python?
Python is very much a living language, with new changes and improvements constantly being discussed and implemented. While the final preparations for the next version of Python are well underway, it’s great to see that important developments are also happening in the broader ecosystem.
What’s your favorite Python news story from the last month? Let us know in the comments. Happy Pythoning!