When working with data types in Python, I keep getting surprised by their intricate dynamic nature. In many other programming languages we have to declare the precision of the types that we are using. In the case of integers, there is a wide variety of ranges to choose between:

Type    Also known as        From                          To
int8    char                 -128                          127
uint8   unsigned char        0                             255
int16   short                -32,768                       32,767
uint16  unsigned short       0                             65,535
int32   long                 -2,147,483,648                2,147,483,647
uint32  unsigned long        0                             4,294,967,295
int64   long long            -9,223,372,036,854,775,808    9,223,372,036,854,775,807
uint64  unsigned long long   0                             18,446,744,073,709,551,615
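
The ranges in the table follow directly from the number of bits in each type. As a small sketch (the helper name integer_range is my own, purely for illustration), they can be recomputed in plain Python:

```python
def integer_range(bits, signed=True):
    """Return the (smallest, largest) value of a fixed-width integer type."""
    if signed:
        # One bit is used for the sign, leaving bits - 1 for the magnitude.
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


for bits in (8, 16, 32, 64):
    low, high = integer_range(bits)
    print(f"int{bits}: {low:,} to {high:,}")
    low, high = integer_range(bits, signed=False)
    print(f"uint{bits}: {low:,} to {high:,}")
```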

So what happens when we use the int type in Python? I originally guessed that it was simply an alias for one of the above, but that turned out to be wrong in general. Let’s look at the implications with a short, and very real, example.
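
To see that Python's own int is not one of the fixed-width types, here is a quick check, assuming Python 3. The built-in int grows as needed, so a value just past the uint64 limit is no problem:

```python
# Largest value that fits in an unsigned 64-bit integer.
uint64_max = 2 ** 64 - 1

# A Python 3 int simply keeps growing instead of overflowing.
bigger = uint64_max + 1

print(bigger)               # 18446744073709551616
print(bigger.bit_length())  # 65 bits are needed to represent it
```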

When dealing with Twitter data we often work with the IDs of tweets, as the Twitter terms of use state that we’re not allowed to share tweets directly. Instead we can share the IDs, from which the tweets can be “rehydrated” (unless the user deleted them).

Here are some examples of tweet IDs:

  • 1496894936372813825
  • 1378982003966685186
  • 1321053468723941376

Since all of these are merely integers, it would feel natural to deal with them as such in Python. Sometimes we receive these IDs from REST APIs, which output string data, so we might find ourselves writing out the following piece of code:

>>> import numpy as np
>>> tweet_ids = get_tweet_ids_from_twitter()
>>> tweet_ids = np.asarray(tweet_ids, dtype=int)
>>> rehydrate_tweets(tweet_ids=tweet_ids)

Happy days, we got some integer tweet IDs! Let’s ship this to production, what could go wrong?

This piece of code will work on Unix-based operating systems. Indeed, on those systems we will see the following:

>>> tweet_ids
array([1496894936372813825, 1378982003966685186, 1321053468723941376])

All good. But on a Windows machine (no matter if the Windows distribution is 32-bit or 64-bit), we will suddenly see the following, at least with NumPy versions before 2.0:

>>> tweet_ids
array([-2070601727, -1843974142,  1821806592])
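
The negative numbers are not random: they are what is left when each ID is squeezed into 32 bits. The sketch below reproduces them in plain Python. The helper wrap_to_int32 is hypothetical and only mimics the truncation; it is not how NumPy is implemented.

```python
def wrap_to_int32(value):
    """Mimic storing an integer in a signed 32-bit slot (two's complement)."""
    value &= 0xFFFFFFFF          # keep only the lowest 32 bits
    if value >= 2 ** 31:         # the top bit set means a negative number
        value -= 2 ** 32
    return value


tweet_ids = [1496894936372813825, 1378982003966685186, 1321053468723941376]
print([wrap_to_int32(tweet_id) for tweet_id in tweet_ids])
# [-2070601727, -1843974142, 1821806592]
```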

Oh dear! Suddenly our script is trying to rehydrate negative tweet IDs and we face some very obscure error messages. The culprit is that NumPy decides what dtype=int means based on the platform, and on Windows that has traditionally been a 32-bit integer, which is too small for these IDs. We can fix this by being more specific in our typing and asking for the np.int64 type explicitly, so that our code snippet above becomes:

>>> import numpy as np
>>> tweet_ids = get_tweet_ids_from_twitter()
>>> tweet_ids = np.asarray(tweet_ids, dtype=np.int64)
>>> rehydrate_tweets(tweet_ids=tweet_ids)

And hooray, this will now work on Windows as well!
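
If NumPy arrays are not strictly needed, another option is to stay with Python's built-in int and check the range ourselves. This is only a sketch: parse_tweet_ids is a made-up helper, and the assumption that a valid ID is a non-negative value fitting in a signed 64-bit integer is mine.

```python
INT64_MAX = 2 ** 63 - 1


def parse_tweet_ids(raw_ids):
    """Convert string IDs to Python ints, failing loudly on suspicious values."""
    parsed = []
    for raw_id in raw_ids:
        tweet_id = int(raw_id)  # Python 3 ints never overflow silently
        if not 0 <= tweet_id <= INT64_MAX:
            raise ValueError(f"Tweet ID out of range: {raw_id!r}")
        parsed.append(tweet_id)
    return parsed


print(parse_tweet_ids(["1496894936372813825", "1378982003966685186"]))
```

A check like this turns a silently corrupted ID into an immediate, readable error.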

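As a little bonus, the native int type in Python2 was also backed by a C long, so on a Windows machine its largest value (sys.maxint) was 2,147,483,647. As far as I can tell, Python2 did not wrap around silently the way NumPy does here. Instead it promoted larger values to a separate arbitrary-precision long type, which is what the trailing L indicates:

>>> int("1496894936372813825")
1496894936372813825L

Thankfully, this has been simplified in Python3, where the two types are merged into a single arbitrary-precision int. Phew!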