Making Good Code Great

Let’s say you wrote a really basic data import function that finds the latest .csv file in a directory (sorted alphanumerically, where each file name is a date), imports it into a Pandas dataframe, and then does some light processing of the data. The processing is done in two steps: first, it formats the dates; second, it removes all user IDs that start with “admin”.

Here is what your code looks like, stored in a file called data_mgmt.py:

"""Data management module v1"""
import os
import re
import pandas as pd

DATA_DIR = '//ABC_CORP/client_work/xyz_inc/project_thermos/data'

def import_client_data():
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # 1. Get latest version of file
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    client_file = sorted([
        os.path.join(root, file)
        for root, dirs, files in os.walk(DATA_DIR)
        for file in files
        if re.match('[0-9]{4}_[0-1][0-9]_[0-3][0-9][.]csv', file)
    ])[-1]
    
    #~~~~~~~~~~~~~~~~
    # 2. Read in file
    #~~~~~~~~~~~~~~~~
    df = pd.read_csv(client_file)
        
    #~~~~~~~~~~~~~~~~
    # 3. Process data
    #~~~~~~~~~~~~~~~~
    df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
    df = df.loc[~df['id'].str.startswith('admin')]
    
    return df

You open up Jupyter Notebook or Spyder or PyCharm and you type the following:

"""My analysis"""
import pandas as pd
import data_mgmt

df = data_mgmt.import_client_data()

Your code works. No problems. You test it a bit more. It still works.

The code is readable. It’s better than things you’ve seen other people build. It works perfectly as long as the csv files adhere to the strict format you’ve specified in the regex, but then again, you don’t actually want it to work on files that break that format, so that’s fine. It is PEP8 compliant.

Your boss compliments you on writing readable, effective, and simple code. Your boss is especially happy that you organized things using the comments to separate out the business logic of each step, and praises your effective use of regex and a list comprehension.

Everything is fine, the code is good, and you can move on to other tasks… right?

Obviously this code isn’t perfect, otherwise I wouldn’t be writing this post. The thing is, this is good code, but it’s not great code.

The reason why this code isn’t great has nothing to do with its expected behavior: the code does exactly what you want it to do when you type df = import_client_data(). And by the end of this blog post, the updated code is still going to do the exact same thing when you type df = import_client_data().

Breaking out the steps

One reason why this code isn’t great is because it’s not flexible.

Hopefully everything works out as planned, but over the course of the project you will run into some unexpected issues, and this structure cannot handle them at all.

The biggest difference between bright-eyed, fresh-out-of-college juniors and the hardened, cynical seniors with a couple of years of experience is that the seniors have done a few business projects in their lifetime and know how things go astray. It’s hard to explain all the ways in which errors happen, or what sorts of ad hoc requests you might expect. But there are a few good tricks to set up your work in advance to accommodate these things.

For example, let’s say your boss emails you: “Hey, there was an issue with the report on Sep 15, 2019. Can you analyze that?” The file named 2019_09_15.csv, which you need to import to do this task, is an older file. But your code always returns the latest file.

How would you use your existing code to start working on the problem? At the moment, you’d have to copy and paste the import code into your analysis and remove the part that searches for the latest dated file. That’s not a good solution!

There is a better way:

"""Data management module v2"""
import os
import re
import pandas as pd

DATA_DIR = '//ABC_CORP/client_work/xyz_inc/project_thermos/data'

def import_client_data():
    return process_client_file(
        find_latest_client_file()
    )

def find_latest_client_file():
    return sorted([
        os.path.join(root, file)
        for root, dirs, files in os.walk(DATA_DIR)
        for file in files
        if re.match('[0-9]{4}_[0-1][0-9]_[0-3][0-9][.]csv', file)
    ])[-1]
    
def process_client_file(client_file):
    # Read in file
    df = pd.read_csv(client_file)
    
    # Process data
    df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
    df = df.loc[~df['id'].str.startswith('admin')]
    
    return df

By breaking out and exposing the underlying steps in the process of importing the data, you can now write the following code in your analysis file:

"""My analysis of Sep 15 data"""
import os
import pandas as pd
import data_mgmt

data_file = os.path.join(data_mgmt.DATA_DIR, '2019_09_15.csv')
df = data_mgmt.process_client_file(data_file)

(Note: See Appendix A for an alternative approach and why I didn’t use it in this example.)

Parameterization

It’s not enough just to separate out the steps into their own functions. We also want to add keyword arguments with sensible defaults to those functions.

One reason why people tend to get lazy when it comes to parameterization is that it seems really cognitively challenging and work-intensive. I think many of these people are approaching parameterization incorrectly. Instead of trying to write parameterized code from the get-go, it’s actually preferable to start with everything as hard-coded and un-abstracted as possible, and work your way toward abstracting things.

Watch how easy it is to abstract things. Here is the old version of find_latest_client_file:

def find_latest_client_file():
    return sorted([
        os.path.join(root, file)
        for root, dirs, files in os.walk(DATA_DIR)
        for file in files
        if re.match('[0-9]{4}_[0-1][0-9]_[0-3][0-9][.]csv', file)
    ])[-1]

Here it is with the data directory parameterized:

def find_latest_client_file(data_dir=DATA_DIR):
    return sorted([
        os.path.join(root, file)
        for root, dirs, files in os.walk(data_dir)
        for file in files
        if re.match('[0-9]{4}_[0-1][0-9]_[0-3][0-9][.]csv', file)
    ])[-1]

See how easy that was to change? Just two lines. Now our code is far more flexible, yet it is just as readable as before and the default behavior didn’t change.

What are practical examples where we might want to use something other than the default?

  1. If the client starts sending us data from two environments, e.g. “dev” and “prod”, we might want to separate out those two sources of data.
  2. Unit-tests. Now our code can parameterize data_dir to be a temp directory. Here is what a unit test for this might look like:
"""unit test for data_mgmt"""
import os
import pytest
import data_mgmt
from conftest import fixture_dir

def test_find_latest_client_file(fixture_dir):
    test_val = os.path.abspath(data_mgmt.find_latest_client_file(data_dir=fixture_dir))
    target_val = os.path.abspath(os.path.join(fixture_dir, '2019_08_24.csv'))
    assert test_val == target_val

But overall, the best reason to parameterize this is because it’s trivial to do so. This parameterization did not add any complexity to the code and took a couple seconds of work, but it lets us handle potential issues down the road. If something has virtually zero cost but some upside, then it passes a cost-benefit analysis and you should do it.

Can we parameterize this more? Yes, but there is a caveat: use good judgment when doing so.

For example, what if we wanted to add a kwarg that specified whether we want to look in subdirectories for the latest file? By default, we look in subdirectories, but sometimes we might want to turn that off and look in only the root directory for a latest file, right? Here is what that code might look like:

# a bit excessive?
def find_latest_client_file(data_dir=DATA_DIR, subdirectories=True):
    return sorted([
        os.path.join(root, file)
        for root, dirs, files in os.walk(data_dir)
        for file in files
        if (
            re.match('[0-9]{4}_[0-1][0-9]_[0-3][0-9][.]csv', file)
            and (subdirectories or (root == data_dir))
        )
    ])[-1]

This code works the same as before if you never touch the subdirectories kwarg, but it has extra functionality for cases when you need it. It sounds like the code is now better because it does more, but is it actually better? Maybe, but not necessarily.

Planning for everything by adding things like this takes up engineering time, mental space with juggling all your kwargs, and textual real estate in both the source code and the docs. If you can’t think of many worlds where you might need this functionality, consider skipping it. Accept the fact that you can’t (and perhaps shouldn’t) plan for everything. (If there’s a reason why updating your data_mgmt.py would be extremely difficult down the line once you commit it, adding stuff like this becomes more and more appealing, but I haven’t given you a reason to believe it would be hard to do this as needed.)

In any event, this is how I would parameterize everything, in a way that balances work effort, code complexity, and future flexibility:

"""Data management module v3"""
import os
import re
import pandas as pd

DATA_DIR = '//ABC_CORP/client_work/xyz_inc/project_thermos/data'

def import_client_data(data_dir=DATA_DIR, print_output=False, remove_admin=True):
    return process_client_file(
        find_latest_client_file(
            data_dir=data_dir,
            print_output=print_output
        ),
        remove_admin=remove_admin
    )

def find_latest_client_file(data_dir=DATA_DIR, print_output=False):
    fn = sorted([
        os.path.join(root, file)
        for root, dirs, files in os.walk(data_dir)
        for file in files
        if re.match('[0-9]{4}_[0-1][0-9]_[0-3][0-9][.]csv', file)
    ])[-1]
    if print_output: print(fn)
    return fn
    
def process_client_file(client_file, remove_admin=True):
    # Read in file
    df = pd.read_csv(client_file)
    
    # Process data
    df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
    if remove_admin:
        df = df.loc[~df['id'].str.startswith('admin')]
    
    return df

A few more changes

Adding a decorator to find_latest_client_file

One unfortunate side effect of adding the print functionality is that our really nice find_latest_client_file function became uglier, going from a 1-liner to a 3-liner. There is a way to add the print functionality while retaining the pretty 1-line solution, pushing the “state” created by the function outside of the function itself:

def print_output(func):
    def _wrap(*args, print_output=False, **kwargs):
        res = func(*args, print_output=print_output, **kwargs)
        if print_output: print(res)
        return res
    return _wrap

@print_output
def find_latest_client_file(data_dir=DATA_DIR, print_output=False):
    return sorted([
        os.path.join(root, file)
        for root, dir, files in os.walk(data_dir)
        for file in files
        if re.match('[0-9]{4}_[0-1][0-9]_[0-3][0-9][.]csv', file)
    ])[-1]

(If you’re not familiar with decorators, here is an introduction. How decorators work is another post for another time.)

The reasons why this is a good use of a decorator are simple.

First, printing the output has nothing to do with the relationship between the input space and the output space. Ideally, you want your functions to represent these sorts of input-output pairings; concerns such as caching or where you print the result can be decorated away.

Second, @print_output is a portable decorator that can be applied to more things down the line. It’s not unreasonable to believe that as the code develops, there may be more import functions that look like this. Instead of constantly putting that clunky if print_output: print(res) code in all your functions, just slap it in a decorator.
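For instance, if a second import function showed up later, the same decorator drops right in. Here is a self-contained sketch (the vendor function and its path are made up for illustration, and this variant of the wrapper consumes the print_output kwarg itself rather than passing it through):

```python
# A portable @print_output decorator reused on a second, hypothetical function.
def print_output(func):
    def _wrap(*args, print_output=False, **kwargs):
        res = func(*args, **kwargs)  # this variant consumes the kwarg itself
        if print_output:
            print(res)
        return res
    return _wrap

@print_output
def find_latest_vendor_file(vendor_dir='vendor_data'):
    # Stand-in for a real directory search; the path is illustrative only
    return vendor_dir + '/2019_09_15.csv'
```

Calling `find_latest_vendor_file(print_output=True)` prints the path and returns it; with the default, it stays silent.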

Better DATA_DIR directory declaration

Right now the directory is declared at the top of the document as a constant. That’s not bad! Please do things like this in your code; put finicky file paths at the top of your code, not buried deep inside of it where it’s harder to edit and diagnose.

With that said, you can still do much better than what is currently there. For starters, this is better:

ROOT_DIR = '//ABC_CORP/client_work/xyz_inc/project_thermos'
DATA_DIR = os.path.join(ROOT_DIR, 'data')

Do you see what problem this solves? Let’s say you add more directories over time, but you never define the dirs relative to one another, so your code looks like this:

ROOT_DIR = '//ABC_CORP/client_work/xyz_inc/project_thermos'
SRC_DIR = '//ABC_CORP/client_work/xyz_inc/project_thermos/src'
DATA_DIR = '//ABC_CORP/client_work/xyz_inc/project_thermos/data'
OUTPUT_DIR = '//ABC_CORP/client_work/xyz_inc/project_thermos/output'
STATIC_DIR = '//ABC_CORP/client_work/xyz_inc/project_thermos/static'

What happens if you need to migrate your root directory to another drive? You need to change every one of those dirs. Boo! That’s more tedious than the version where everything is defined relative to the project’s root directory (this is true even if you’re using a bash terminal and fixing it with a sed pipe, because it’s easier to find and replace ROOT_DIR than it is to search for the whole path).

An even better approach is to avoid typing out '//ABC_CORP/client_work/xyz_inc/project_thermos' entirely. There are a couple of ways to do this. Your project should either be on the PYTHONPATH environment variable or installed via setup.py. In that case, you should create a file called definitions.py in your src/ folder that defines these constants:

"""definitions.py"""
import os

SRC_DIR = os.path.dirname(os.path.abspath(__file__))
ROOT_DIR = os.path.abspath(os.path.join(SRC_DIR, '..'))
DATA_DIR = os.path.join(ROOT_DIR, 'data')
OUTPUT_DIR = os.path.join(ROOT_DIR, 'output')
STATIC_DIR = os.path.join(ROOT_DIR, 'static')

If you don’t like this approach for whatever reason, you can still define the root directory relative to the file’s directory. Let’s say the data_mgmt.py file is located inside //ABC_CORP/client_work/xyz_inc/project_thermos/src. Then you might do the exact same thing as above, just in the data_mgmt.py file itself.

# this is inside your `data_mgmt.py` file
import os

SRC_DIR = os.path.dirname(os.path.abspath(__file__))
ROOT_DIR = os.path.abspath(os.path.join(SRC_DIR, '..'))
DATA_DIR = os.path.join(ROOT_DIR, 'data')
# etc. ...

Docstrings

These functions aren’t exactly the most complicated functions in the world, but you should still add docstrings to them now.

The reason you do this is that once you get over the mental barrier of adding a docstring right now, it becomes easier down the road to maintain the function when it actually needs the docstring.

I’m a fan of Sphinx docstrings (which I’ll implement below), but Google docstrings work too. Just add docstrings. Please!
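For comparison, here is what the same find_latest_client_file docstring would look like in the Google style (a sketch: the body is elided and DATA_DIR is a stand-in here):

```python
DATA_DIR = 'data'  # placeholder for the real constant

def find_latest_client_file(data_dir=DATA_DIR, print_output=False):
    """Finds the latest version of the client's file.

    Args:
        data_dir: Directory (str) to search in.
        print_output: If true, prints the csv file that was found.

    Returns:
        str: Path to the latest csv file.
    """
    ...  # body unchanged from the Sphinx version
```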

Type hints … ?

This is a personal preference, but I don’t typically like fussing with type hints for arguments because they make the args look cluttered and ugly, and they require importing stuff to handle things like functions that return functions and generic iterables.

I realize a lot of people are going to get mad at me for this. Oh well. Maybe one day I’ll change my mind and look back on this post and cringe, but type hints aren’t widespread and fleshed out enough in the Python lexicon for me to care too much about whether they’re there or not.

I don’t mind return type hints because they feel less cluttered there, and I include them in my final code.

My take is you should do whatever you want with type hints. They’re not so bad, and we only have two more months of caring about Python 2.x compatibility.
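For reference, here is what a fully annotated signature would look like, argument hints and all (a sketch of what I’m opting out of; the body here is a stand-in and DATA_DIR is a placeholder):

```python
DATA_DIR = 'data'  # placeholder for the real constant

def find_latest_client_file(data_dir: str = DATA_DIR,
                            print_output: bool = False) -> str:
    # Stand-in body; only the annotated signature matters for this sketch
    return data_dir + '/2019_09_15.csv'
```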

Final version of the code

Well, here it is, the final version of the code!

Remember, this code works exactly the same as version #1 when you type data_mgmt.import_client_data(). We did not fundamentally alter what the code does in the base case; we added to it to accommodate the situations where we might need some changes:

"""Data management module vFINAL"""
import os
import re
import pandas as pd
from definitions import DATA_DIR

def import_client_data(data_dir=DATA_DIR,
                       print_output=False,
                       remove_admin=True) -> pd.DataFrame:
    """
    Finds the latest version of the client's data, then imports and processes
    it.
    
    :param data_dir: Dir (str) to find the file in.
    :param print_output: If true, prints the csv file that was imported.
    :param remove_admin: If true, removes all rows identified as admin rows.
    :returns: Pandas Dataframe
    """
    return process_client_file(
        find_latest_client_file(
            data_dir=data_dir,
            print_output=print_output
        ),
        remove_admin=remove_admin
    )

def print_output(func):
    """
    Wraps a function to take a print_output kwarg, which prints the output if
    True.
    
    :param func: Function being wrapped.
    :returns: Function
    """
    def _wrap(*args, print_output=False, **kwargs):
        res = func(*args, print_output=print_output, **kwargs)
        if print_output: print(res)
        return res
    return _wrap

@print_output
def find_latest_client_file(data_dir=DATA_DIR, print_output=False) -> str:
    """
    Finds the latest version of the client's file.
    
    :param data_dir: Dir (str) to find the file in.
    :param print_output: If true, prints the csv file that was imported.
    :returns: csv file name
    """
    return sorted([
        os.path.join(root, file)
        for root, dirs, files in os.walk(data_dir)
        for file in files
        if re.match('[0-9]{4}_[0-1][0-9]_[0-3][0-9][.]csv', file)
    ])[-1]
    
def process_client_file(client_file, remove_admin=True) -> pd.DataFrame:
    """
    Imports and processes the client csv file.
    
    :param client_file: File name (str) of client's csv.
    :param remove_admin: If true, removes all rows identified as admin rows.
    :returns: Pandas Dataframe
    """
    df = pd.read_csv(client_file)
    df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
    if remove_admin:
        df = df.loc[~df['id'].str.startswith('admin')]
    return df

The point of this exercise is that most coding tutorials start with either obviously bad code or no code at all, and end with good or great code. It’s less often the case that these sorts of tutorials start with perfectly serviceable, readable, error-free, non-amateurish code and make it even better than before.

One of the biggest barriers I faced, back when I considered myself familiar with Python but not quite excellent at it or a professional-level coder, was that I didn’t understand the many ways in which my decent code could still be improved upon. Hopefully this post helps you spot some of those shortcomings in your own code.


Appendix A – Alternate approach to breaking out the steps

In our code, we combined the two steps of reading in the file and processing the raw version. In many real world situations, you want to break these out into two functions: a read_client_file function that returns an unprocessed dataframe, and process_client_data, which takes a raw dataframe and processes it. Something like this:

def read_client_file(client_file):
    return pd.read_csv(client_file)

def process_client_data(df):
    df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
    df = df.loc[~df['id'].str.startswith('admin')]
    return df

def process_client_file(client_file):
    return process_client_data(read_client_file(client_file))

I think this is a bit excessive in this particular use case because pd.read_csv with defaults is a natural way to read in csv files, and it is easy to remember that the defaults work. I think once it becomes a little bit more complicated than this, it makes sense to break it out into a separate function.
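For instance, once the read step accumulates options of its own, the dedicated function starts paying for itself. Here is a sketch with hypothetical settings; the sep, encoding, and dtype choices are illustrations I made up, not requirements from the project:

```python
import pandas as pd

def read_client_file(client_file):
    # Hypothetical future version: once options like these pile up,
    # the separate read function earns its keep.
    return pd.read_csv(
        client_file,
        sep=',',
        encoding='utf-8',
        dtype={'id': str},  # keep IDs as strings even if they look numeric
    )
```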

Also note the naming convention in the structure outlined above. It’s clear from the name that process_client_data takes “data” as an input, and it’s clear from the name that process_client_file takes a “file” as an input. This is subtle but important. I see a lot of code in the wild that plays loosey-goosey with name-related things like this, but a function that takes a filename as an input is very different from a function that takes a dataframe as an input, and the names should make this clear whenever possible.

Appendix B – To use or not use **kwargs

One thing I could have done (but did not do) would have been to pass our options through **kwargs in import_client_data.

Look how easy everything gets! Tempting, right?

"""Data management module with **kwargs"""
import os
import re
import pandas as pd

DATA_DIR = '//ABC_CORP/client_work/xyz_inc/project_thermos/data'

def import_client_data(**kwargs):
    return process_client_file(
        find_latest_client_file(**kwargs),
        **kwargs
    )

def print_output(func):
    def _wrap(*args, print_output=False, **kwargs):
        res = func(*args, print_output=print_output, **kwargs)
        if print_output: print(res)
        return res
    return _wrap

@print_output
def find_latest_client_file(data_dir=DATA_DIR, print_output=False, **kwargs):
    return sorted([
        os.path.join(root, file)
        for root, dirs, files in os.walk(data_dir)
        for file in files
        if re.match('[0-9]{4}_[0-1][0-9]_[0-3][0-9][.]csv', file)
    ])[-1]
    
def process_client_file(client_file, remove_admin=True, **kwargs):
    # Read in file
    df = pd.read_csv(client_file)
    
    # Process data
    df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
    if remove_admin:
        df = df.loc[~df['id'].str.startswith('admin')]
    
    return df

The benefit of this approach is that we no longer have to wire every keyword argument through import_client_data by hand. You’ve gotta admit, it is a lot cleaner and easier to read.

Unfortunately, this is not a good approach!!! This is the argument space equivalent of wrapping all your code in a try: <code> except: pass. **kwargs is not a catchall for all your bad and lazy use of keyword arguments!

**kwargs should be used to pass arbitrary keyword arguments (as we do, correctly, in the decorator) when you genuinely want arbitrary keyword arguments. If you want some good examples of how to use and not use **kwargs, check this post out.

The thing is, we want our code to raise exceptions on bad keyword arguments. It is often better to let your code fail loudly when given a bad input than to silently try to “fix” it.

A good test for whether **kwargs is appropriate or not is this: If I typed (verbatim) asdfjkl="hello world" into my function, would that be an unambiguously bad input, or would it depend on the context? If the answer to that question is that it would be bad always, do not use **kwargs.
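To see the difference concretely, here are two toy functions (the names are made up for illustration), one with an explicit signature and one with a **kwargs catchall:

```python
def strict(remove_admin=True):
    """Explicit keyword arguments: a typo'd kwarg raises TypeError."""
    return remove_admin

def lenient(remove_admin=True, **kwargs):
    """**kwargs catchall: a typo'd kwarg is silently swallowed."""
    return remove_admin
```

Calling strict(remove_admn=False) (note the typo) raises a TypeError immediately, so the mistake is caught on the spot. Calling lenient(remove_admn=False) happily returns the default True, and the bug hides until it bites you in the analysis.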
