For the Love of God, Stop Using iterrows()

[Screenshot: code search results for iterrows] 366,348 code results? Seems a bit high.

It is always wrong to use the iterrows method in Pandas.

If I were on the Pandas dev team, I would have no hesitation deprecating it and then deleting it out of existence.

There are two problems with iterrows:

Problem 1. Loops in Pandas are a sin.

The first and most important problem is that, 99.999% of the time, you should not be iterating over rows in a DataFrame.

Iteration defeats the whole purpose of using Pandas. If we wanted to iterate over a list, we would just store our data as a list of tuples. The point of using something more sophisticated than naive data storage (such as Pandas and R and Spark and SAS and Stata and MATLAB and even Excel) is the ability to perform vectorized operations. When you’re working with millions or billions of rows, the difference in performance is staggering. Many modern computing systems, especially those utilizing GPUs, are designed specifically to take advantage of the speed of vectorized operations by computing things in parallel, which you can’t do with iterative processes. If you are not going to take advantage of vectorization with Pandas, save yourself some memory and processing power by storing your data as a naive list of lists, which will work faster and more efficiently than a Pandas DataFrame that’s basically being used as a glorified list.

Even if performance literally doesn’t matter because you’re working with a mere 1,000 rows on a supercomputer, it’s still good form to vectorize your transformations: it’s a better way to think about doing data science (ML and econometrics are just applied linear algebra + multivariate calculus), it keeps your code much cleaner and easier to read, vectorized functions are less likely to mutate state, and it’s a good habit to build for when performance does matter. In other words, performance optimization is not the only reason vectorization is the correct way to use Pandas, even though it would be a pretty good reason on its own.
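To make this concrete, here is a minimal sketch of the same transformation written both ways (the column name and arithmetic are made up purely for illustration):

import pandas as pd

df = pd.DataFrame({'price': [10.0, 20.0, 30.0]})

# Iterative: Python touches each element one at a time
taxed_slow = [p * 1.08 for p in df['price']]

# Vectorized: one operation over the entire column at once
df['taxed'] = df['price'] * 1.08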

If you have brainstormed about a problem and you’re about to give up on a vectorized solution, make sure you have also exhausted any avenues that use the apply method. Very often, would-be iterrows or itertuples solutions should be apply solutions. Prefer Pandas built-in methods over apply where they exist, because apply is still quite slow, but it’s not per se wrong to use the apply method, because Pandas doesn’t have a built-in for everything. There are some instances in which you can use apply to emulate an iterative process. It is still very possible to overdo this, and on more than one occasion I’ve had to clean up someone else’s Jupyter notebook on a collaborative project that was littered with apply()‘s that shouldn’t have been there. But it is less sinful than iterrows and other iterative approaches, both aesthetically and performance-wise. (There’s a slight caveat to the relative performance of the apply method, which I’ll cover later in this article.)
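As a sketch of what I mean (the parsing logic here is hypothetical), this is the kind of row-level transformation that has no Pandas built-in, where apply is a reasonable fallback:

import pandas as pd

df = pd.DataFrame({'raw': ['a=1;b=2', 'a=3', 'b=9;c=4']})

# Hypothetical custom logic with no vectorized built-in equivalent
def count_fields(text):
    return len(text.split(';'))

df['n_fields'] = df['raw'].apply(count_fields)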

Problem 2. There are better ways to do basically the same thing.

The second problem with iterrows is that even when you hit the 0.001% of cases where it is OK to iterate, or you’re just feeling especially lazy and unwilling to work out a vectorized solution, iterrows has worse performance than any of the other ways of interacting with the data row-by-row (each sketched briefly after this list):

  • df.to_dict() (returns a dict object)
  • df.to_records() (returns a np.recarray object)
  • df.itertuples() (returns an iterator of namedtuples)
  • df.to_numpy() (returns a np.ndarray object)
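For reference, here’s roughly what each of those looks like in a loop, assuming a DataFrame with a single column named my_col (as in the benchmark below):

import pandas as pd

df = pd.DataFrame({'my_col': [1, 2, 3]})

for value in df.to_dict()['my_col'].values():  # dict of {index: value}
    print(value)
for rec in df.to_records():                    # one numpy record per row
    print(rec['my_col'])
for row in df.itertuples():                    # one namedtuple per row
    print(row.my_col)
for arr in df.to_numpy():                      # one ndarray per row
    print(arr[0])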

Feel free to run the following code to test how fast and slow these various solutions are. Adjust the ROWS variable near the top to change how many rows you are working with. I use the “bare bones” solution as a Pandas-less benchmark; every other solution utilizes the Pandas DataFrame in some capacity.

import datetime
import numpy as np
import pandas as pd
from pprint import pprint

ROWS = 50000

#=======================================

my_data = list(range(1, ROWS + 1))
my_data_df = pd.DataFrame({'my_col': my_data})

calcs = {
    'bare bones': lambda li, df: [i * 2 + 1 for i in li],
    'vectorized': lambda li, df: df['my_col'] * 2 + 1,
    'apply': lambda li, df: df['my_col'].apply(lambda x: x * 2 + 1),
    'iterrows': lambda li, df: [row['my_col'] * 2 + 1 for index, row in df.iterrows()],
    'itertuples': lambda li, df: [i.my_col * 2 + 1 for i in df.itertuples()],
    'to_records': lambda li, df: [i['my_col'] * 2 + 1 for i in df.to_records()],
    'to_numpy (vectorized)': lambda li, df: df.to_numpy()[:, 0] * 2 + 1,
    'to_numpy (iterative)': lambda li, df: [i[0] * 2 + 1 for i in df.to_numpy()],
    'to_dict': lambda li, df: [i * 2 + 1 for i in df.to_dict()['my_col'].values()],
    'iloc loop': lambda li, df: [df.iloc[i, 0] * 2 + 1 for i in range(len(df))]
}

diagnostics = {}

for test, func in calcs.items():
    start_time = datetime.datetime.now()
    calc = func(li=my_data, df=my_data_df)
    end_time = datetime.datetime.now()
    # total_seconds() avoids .microseconds wrapping around past one second
    diagnostics[test] = int((end_time - start_time).total_seconds() * 1_000_000)

pprint(diagnostics)

If everything I’ve said so far is correct, you might expect that the vectorized solution will be the best, followed by the “bare bones” iterative solution, followed by everything else, with iterrows last or close to last. (The reason the bare bones solution should be the best non-vectorized solution is that every other non-vectorized solution is just the bare bones solution with extra steps, e.g. converting data in a DataFrame into tuples or a dict.)

By running the above code on my personal laptop, I get the following results (times are in microseconds):

>>> pprint(diagnostics)
{'apply': 33554,
 'bare bones': 11996,
 'iloc loop': 811999,
 'iterrows': 689649,
 'itertuples': 83192,
 'to_dict': 37311,
 'to_numpy (iterative)': 22984,
 'to_numpy (vectorized)': 4164,
 'to_records': 343681,
 'vectorized': 9842}

Every iterative approach is worse than the two vectorized solutions, whether via NumPy or Pandas. iterrows is the second worst approach, behind only the iloc loop, and the two worst approaches are slower than the vectorized approaches by roughly two orders of magnitude.

Even in the most dire of situations, the itertuples method works perfectly fine as a drop-in for iterrows, and I’ve not seen any odd behavior from it. So I genuinely do not see a use for iterrows, ever. Don’t use it.

One Weird Use Case Where apply Sucks

I have so far found exactly zero acceptable use cases for iterrows, but I do know of a generic use case where itertuples outperforms apply, at least on my computer. Specifically, it is faster to iterate over the tuples of an entire DataFrame than to apply over that DataFrame with axis=1. Earlier, when I recommended apply for both performance and aesthetic purposes, I was only looking at a single column, i.e. a Series object. Here, there is a trade-off between aesthetics and performance.

Feel free to test out what I’m saying with the following code:

import datetime
import pandas as pd
from pprint import pprint

ROWS = 50000

df = pd.DataFrame({
    'a': list(range(1, ROWS + 1))
})

df['b'] = df['a'] * 2
df['c'] = (df['a'] - 100) * 1.5

def func_apply(df):
    return df.apply(
        lambda x: x['a'] + x['b'] - x['c'],
        axis=1
    )

def func_itertuples(df):
    return pd.Series(
        [x.a + x.b - x.c
         for x
         in df.itertuples()]
    )

calcs = {
    'apply': func_apply,
    'itertuples': func_itertuples
}

diagnostics = {}

for test, func in calcs.items():
    start_time = datetime.datetime.now()
    calc = func(df=df)
    end_time = datetime.datetime.now()
    diagnostics[test] = int((end_time - start_time).total_seconds() * 1_000_000)

pprint(diagnostics)

Unfortunately, there are a good number of cases where this is relevant.

The most common general instance of this is that many Pandas built-in methods only accept a scalar argument (e.g. a single string), so when you want that argument to vary row-by-row, you are forced into an iterative approach over multiple columns.

I’ll give an example so this is in less abstract terms. Suppose you have a table of user IDs and the UTC timestamp of some generic “interaction”: maybe when the user registered for the website, or clicked on a link, or whatever. (It doesn’t matter what the interaction is.) You also have a table of data about each user ID, including the user’s timezone, and you’ve merged the two into a single DataFrame. Your boss hypothesizes that people love doing the “interaction” in the morning, and wants to know whether each person did the “interaction” before or after 12:00pm (noon) in the user’s local time on a given day. Of course, to answer this, you need to offset each timestamp by the user’s timezone.

To make sure you’re doing everything correctly, you write the following code and run it:

import pandas as pd

df = pd.DataFrame(
    [(23920, 'America/Chicago', '2019-02-23 13:23:10'),
     (19729, 'America/New_York', '2019-02-25 21:17:24'),
     (12830, 'America/New_York', '2019-02-23 17:11:52'),
     (21683, 'America/Los_Angeles', '2019-02-23 09:31:22'),
     (84027, 'Europe/London', '2019-02-21 18:55:38')],
    columns=['user_id', 'timezone', 'datetime']
)
df['datetime'] = pd.to_datetime(df['datetime'])

# This should adjust the times by -5 hours
# (or -4 when DST is in effect)
print(
    df['datetime']
    .dt.tz_localize('UTC')
    .dt.tz_convert('America/New_York') # <- testing this!
    .dt.tz_localize(None)
)

So the solution to the actual problem should just be as simple as replacing .dt.tz_convert('America/New_York') with .dt.tz_convert(df['timezone']), right? Unfortunately not, and you get this error:

>>> print(
...     df['datetime']
...     .dt.tz_localize('UTC')
...     .dt.tz_convert(df['timezone']) # <- for real this time!
...     .dt.tz_localize(None)
... )
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "~\Python\Python37\lib\site-packages\pandas\core\accessor.py", line 91, in f
    return self._delegate_method(name, *args, **kwargs)
  File "~\Python\Python37\lib\site-packages\pandas\core\indexes\accessors.py", line 94, in _delegate_method
    result = method(*args, **kwargs)
  File "~\Python\Python37\lib\site-packages\pandas\core\accessor.py", line 91, in f
    return self._delegate_method(name, *args, **kwargs)
  File "~\Python\Python37\lib\site-packages\pandas\core\indexes\datetimelike.py", line 721, in _delegate_method
    result = operator.methodcaller(name, *args, **kwargs)(self._data)
  File "~\Python\Python37\lib\site-packages\pandas\core\arrays\datetimes.py", line 868, in tz_convert
    dtype = tz_to_dtype(tz)
  File "~\Python\Python37\lib\site-packages\pandas\core\arrays\datetimes.py", line 69, in tz_to_dtype
    return DatetimeTZDtype(tz=tz)
  File "~\Python\Python37\lib\site-packages\pandas\core\dtypes\dtypes.py", line 640, in __init__
    if tz:
  File "~\Python\Python37\lib\site-packages\pandas\core\generic.py", line 1478, in __nonzero__
    .format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Basically, the issue here is that the Pandas method tz_convert only accepts a single timezone (e.g. one string), but in this case you want to pass an array input, specifically an array of timezone strings that varies row-by-row.
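Continuing with the small DataFrame above, a minimal itertuples workaround looks like this (the before_noon column name is just my own; the benchmarked version of this conversion appears at the end of this article):

# Convert each timestamp to that row's own timezone, row by row
local_times = pd.Series(
    [row.datetime.tz_localize('UTC').tz_convert(row.timezone).tz_localize(None)
     for row in df.itertuples()],
    index=df.index
)

# The boss's actual question: was the interaction before noon, local time?
df['before_noon'] = local_times.dt.hour < 12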

There are other situations where this general problem arises. Let’s say you have a DataFrame with three columns: “string”, “replace_this”, and “replace_with”. Naturally, you want to replace all instances of the replace_this value in the string column with the text in the replace_with column:

import pandas as pd

input_data = [
    ('hello world', 'hello', 'goodbye'),
    ('how are you', 'are', 'were'),
    ('i am hungry', 'am', 'am not')
]

df = pd.DataFrame(
    input_data,
    columns=['string', 'replace_this', 'replace_with']
)

# This will raise an exception!
df['string'].str.replace(df['replace_this'], df['replace_with'])

One way you can solve this is:

>>> df.apply(lambda x: x['string'].replace(x['replace_this'], x['replace_with']), axis=1)
0      goodbye world
1       how were you
2    i am not hungry
dtype: object

The question, though, is whether this is better than an itertuples approach. In this case the answer is no, and it’s not even close, because (as stated before) apply over a DataFrame with axis=1 is slower than itertuples. I’m not entirely sure why this is the case, but yet again it holds true. Try it yourself:

import datetime
import pandas as pd
from pprint import pprint

input_data = [
    ('hello world', 'hello', 'goodbye'),
    ('how are you', 'are', 'were'),
    ('i am hungry', 'am', 'am not')
]

df = pd.DataFrame(
    input_data * 1000,
    columns=['string', 'replace_this', 'replace_with']
)

def func_apply(df):
    return df.apply(
        lambda x: x['string'].replace(x['replace_this'], x['replace_with']),
        axis=1
    )

def func_itertuples(df):
    return pd.Series(
        [x.string.replace(x.replace_this, x.replace_with)
         for x
         in df.itertuples()]
    )

calcs = {
    'apply': func_apply,
    'itertuples': func_itertuples
}

diagnostics = {}

for test, func in calcs.items():
    start_time = datetime.datetime.now()
    calc = func(df=df)
    end_time = datetime.datetime.now()
    diagnostics[test] = int((end_time - start_time).total_seconds() * 1_000_000)

pprint(diagnostics)

Here is a version of the diagnostics test that uses the timezone conversion example:

import datetime
import pandas as pd
from pprint import pprint

input_data = [
    (23920, 'America/Chicago', '2019-02-23 13:23:10'),
    (19729, 'America/New_York', '2019-02-25 21:17:24'),
    (12830, 'America/New_York', '2019-02-23 17:11:52'),
    (21683, 'America/Los_Angeles', '2019-02-23 09:31:22'),
    (84027, 'Europe/London', '2019-02-21 18:55:38')
]

df = pd.DataFrame(
    input_data * 1000,
    columns=['user_id', 'timezone', 'datetime']
)
df['datetime'] = pd.to_datetime(df['datetime'])

def func_apply(df):
    return df.apply(
        lambda x: x['datetime']
                  .tz_localize('UTC')
                  .tz_convert(x['timezone'])
                  .tz_localize(None),
        axis=1
    )

def func_itertuples(df):
    return pd.Series(
        [x.datetime
         .tz_localize('UTC')
         .tz_convert(x.timezone)
         .tz_localize(None)
         for x
         in df.itertuples()]
    )

calcs = {
    'apply': func_apply,
    'itertuples': func_itertuples
}

diagnostics = {}

for test, func in calcs.items():
    start_time = datetime.datetime.now()
    calc = func(df=df)
    end_time = datetime.datetime.now()
    diagnostics[test] = int((end_time - start_time).total_seconds() * 1_000_000)

pprint(diagnostics)