Claire Duvallet

Thank you for 18 years of DVDs, Netflix

2023-04-22T00:00:00-07:00

Soon, Netflix will be canceling its DVD-by-mail program, the original service that helped Netflix crush Blockbuster and got us used to watching movies on-demand from the comfort of our homes before streaming was a thing. Perhaps not coincidentally, my dad cancelled my family’s subscription to the DVD service this winter. As my brother wisely put it upon hearing my dad’s news, “Netflix can finally stop buying physical DVDs now that their last customer cancelled!”

Of course, when I saw that Netflix had kept my parents’ entire DVD history I knew I had to look at the data. According to the history, my family signed up for Netflix in 2004 - that’s almost 20 years of DVDs! For most of that time, we were on a plan that let us have 3 DVDs concurrently. At some point while I was in high school, I was given full control over one of the 3 DVDs on our plan. Looking at the DVD history, though, this must have been before Netflix even had the concept of separate accounts - my Netflix account on my family’s plan only shows ~10 DVD rentals, but I distinctly remember years of freedom to discover indie movies and curl up in our game room watching movies on my own in high school. It was such a treat to have my own stream of movies that I had full control over. In fact, I still really miss watching indie movies and discovering other excellent movies from their trailers - Netflix’s algorithm really hasn’t figured me out as well as those trailers had.

Getting the data

Anyway, onto the data. When you log in to dvd.netflix.com (a separate website from netflix.com, lol), the history is very simply shown in a table.

I didn’t see any easy way to just export this table, and I didn’t really want to try too hard to find a legit way to scrape the site (especially since I figured there’d be complex auth to get around), so I started with the good ol’ “inspect page” method. (Actually, I started with copy-paste but that didn’t work.) Turns out the information was easily accessible in the HTML page itself, so I went ahead and just downloaded the HTML pages for my mom and dad’s account histories. My parents each started using their own account to rent DVDs once the concept of accounts was implemented, and my dad’s account had about 500 entries on it, so I figured it might have different movies on it.

With a combination of BeautifulSoup’s documentation, poking around via Chrome’s inspect tool, and good ol’ ctrl-F, I was able to pretty easily figure out how to extract all the information I needed.

from bs4 import BeautifulSoup
import pandas as pd

import calmap

import matplotlib.pyplot as plt

def extract_one_movie_info(m):
    """m is a BeautifulSoup object with one movie's row of info"""
    position = m.find('div', 'position').text

    title = m.find('a', 'title').string

    # year, rating, and duration
    meta = m.find('p', 'metadata')
    year, movie_rating, duration = [x.text for x in meta.find_all('span')]

    # get the ratings
    user_rating = m.find('span',  attrs={'data-userrating': True}).attrs['data-userrating']
    avg_rating = m.find('span',  attrs={'data-userrating': True}).attrs['data-rating']

    ship_date = m.find('div', 'shipped').text
    return_date = m.find('div', 'returned').text

    return (
        position,
        {
            'title': title,
            'year_or_season': year,
            'movie_rating': movie_rating,
            'disc_or_duration': duration,
            'user_rating': user_rating,
            'avg_rating': avg_rating,
            'ship_date': ship_date,
            'return_date': return_date
        }
    )

def extract_movie_dict(soup):
    """soup is the parsed html containing the full DVD history.
    Returns a dict where the key is the position of the movie (i.e. row number)
    """

    # There should only be two tags with these, one with the full table of info and
    # one with some placeholder code that I'm assuming populates the front-end somehow
    hist = soup.find_all('div', id='historyList')[0]

    # Get all of the movie elements, they're in <li> blocks woohoo!
    movies = hist.find_all('li', id=True)

    movie_dict = dict([extract_one_movie_info(m) for m in movies])

    return movie_dict

with open('DVD Netflix-Alain.html', 'r') as f1:
    soup1 = BeautifulSoup(f1, 'html.parser')

dad_dict = extract_movie_dict(soup1)

with open('DVD Netflix-Nadine.html', 'r') as f2:
    soup2 = BeautifulSoup(f2, 'html.parser')

mom_dict = extract_movie_dict(soup2)

assert len(dad_dict) == 508
assert len(mom_dict) == 1598

# | operator is new in python 3.9: https://docs.python.org/3/library/stdtypes.html#mapping-types-dict
movie_dict = {'mom_' + k: v for k,v in mom_dict.items()} | {'dad_' + k: v for k,v in dad_dict.items()}

# make sure we didn't drop any keys
assert len(movie_dict) == len(dad_dict) + len(mom_dict)

df = pd.DataFrame(movie_dict).T
df.head()

	title	year_or_season	movie_rating	disc_or_duration	avg_rating	ship_date	return_date
mom_1	Daughters of the Dust	1991	NR	1h 53m	3	12/05/22	Returned 12/29/22
mom_2	I Vitelloni	1953	NR	1h 43m	4	11/23/22	Returned 12/05/22
mom_3	Mutiny on the Bounty	1935	NR	2h 12m	3.9	11/18/22	Returned 11/23/22
mom_4	The Pervert's Guide to Ideology	2012	NR	2h 16m	3.7	10/04/22	Returned 11/17/22
mom_5	Diabolically Yours / The Widow Couderc	1967	NR	3h 2m	3.3	09/27/22	Returned 10/04/22

df['user_rating'] = df['user_rating'].astype(float)
df['avg_rating'] = df['avg_rating'].astype(float)

# Some more parsing to get the dates right
df['ship_date'] = pd.to_datetime(df['ship_date'], format='%m/%d/%y')
df['return_date'] = pd.to_datetime(df['return_date'].str.split(' ').str[1], format='%m/%d/%y')

df.head()

	title	year_or_season	movie_rating	disc_or_duration	avg_rating	ship_date	return_date
mom_1	Daughters of the Dust	1991	NR	1h 53m	3.0	2022-12-05	2022-12-29
mom_2	I Vitelloni	1953	NR	1h 43m	4.0	2022-11-23	2022-12-05
mom_3	Mutiny on the Bounty	1935	NR	2h 12m	3.9	2022-11-18	2022-11-23
mom_4	The Pervert's Guide to Ideology	2012	NR	2h 16m	3.7	2022-10-04	2022-11-17
mom_5	Diabolically Yours / The Widow Couderc	1967	NR	3h 2m	3.3	2022-09-27	2022-10-04
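The return-date parsing above relies on the raw strings looking like “Returned 12/29/22”. Here’s a minimal standard-library sketch of the same idea on a single string, in case you want to sanity-check the format without pandas. The helper name parse_return_date is made up for illustration, and it assumes the date is always the second space-separated token, as in the rows shown above.

```python
from datetime import date, datetime


def parse_return_date(raw: str) -> date:
    """Parse a string shaped like 'Returned 12/29/22' into a date.

    Assumption: the date is the second space-separated token and uses
    a two-digit year, as in the rows shown above.
    """
    date_token = raw.split(" ")[1]
    return datetime.strptime(date_token, "%m/%d/%y").date()


print(parse_return_date("Returned 12/29/22"))  # 2022-12-29
```

A row with a different status (say, a disc that was never returned) would need its own handling; this sketch doesn’t try to cover that.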

How many movies did we rent? (But first: a lot of data cleaning)

First, let’s get some summary statistics about how many movies we rented, and if any of those rentals were of movies we had already rented.

But before I can do that, I need to make sure that all the movies in my dataset are unique.

# Are all movies unique?
df.groupby('title').filter(lambda x: len(x['title']) > 1).sort_values(by='title').head(20)

	title	year_or_season	movie_rating	disc_or_duration	user_rating	avg_rating	ship_date	return_date
dad_461	1 Giant Leap	2002	NR	2h 35m	2.0	3.1	2007-10-19	2007-10-30
mom_1296	1 Giant Leap	2002	NR	2h 35m	0.0	3.7	2007-10-19	2007-10-30
dad_463	10 mph	2007	NR	1h 32m	3.0	3.5	2007-10-02	2007-10-10
mom_1300	10 mph	2007	NR	1h 32m	0.0	2.8	2007-10-02	2007-10-10
dad_361	127 Hours	2010	R	1h 34m	3.0	4.2	2011-07-06	2011-07-19
mom_974	127 Hours	2010	R	1h 34m	0.0	3.8	2011-07-06	2011-07-19
dad_68	1917	2019	R	1h 59m	0.0	4.7	2020-09-22	2020-10-07
mom_179	1917	2019	R	1h 59m	0.0	4.3	2020-09-22	2020-10-07
dad_221	45 Years	2015	R	1h 35m	5.0	3.8	2016-07-18	2016-08-02
mom_571	45 Years	2015	R	1h 35m	0.0	3.4	2016-07-18	2016-08-02
mom_148	A Bad Moms Christmas	2017	R	1h 44m	0.0	2.9	2021-02-09	2021-02-16
dad_56	A Bad Moms Christmas	2017	R	1h 44m	0.0	3.7	2021-02-09	2021-02-16
mom_888	A Better Life	2011	PG-13	1h 37m	0.0	3.8	2012-07-17	2012-07-31
dad_325	A Better Life	2011	PG-13	1h 37m	4.0	4.4	2012-07-17	2012-07-31
dad_422	A Collection of 2007 Academy Award Nominated S...	2007	NR	3h 15m	0.0	3.9	2009-09-01	2009-09-18
mom_1133	A Collection of 2007 Academy Award Nominated S...	2007	NR	3h 15m	0.0	3.5	2009-09-01	2009-09-18
mom_293	A French Village: The Complete Collection	Collection All	NR	Disc 9	0.0	4.6	2019-08-29	2019-09-09
mom_292	A French Village: The Complete Collection	Collection All	NR	Disc 10	0.0	4.6	2019-09-04	2019-09-10
mom_296	A French Village: The Complete Collection	Collection All	NR	Disc 7	0.0	4.6	2019-08-19	2019-08-29
mom_291	A French Village: The Complete Collection	Collection All	NR	Disc 12	0.0	4.6	2019-09-10	2019-09-17

Answer: definitely not.

Looks like there are two ways for a movie to be duplicated: (1) it’s part of a TV series collection, or (2) it shows up on both mom and dad’s histories. For #2, I initially thought it would only be when they both rated the movie differently, but you can see above that there’s one example (“A Bad Moms Christmas”) where neither of them rated it, but for some reason they have different average ratings. It’s very possible that my assumption of what the “data-rating” field means is wrong (maybe it’s the average rating for the type of user they’ve categorized each account in?), or perhaps there’s something about when the data was last updated (though that would be strange because I downloaded both of these histories on the same day).

Anyway, let’s keep going and worry about figuring out the ratings later, if at all. For now, it looks like what I care about as a unique rental is a unique combination of title and “duration” column. Did we ever rent the same movie twice?

df.groupby(['title', 'disc_or_duration']).filter(
    lambda x: len(x['ship_date'].unique()) > 1
).sort_values(by=['title', 'disc_or_duration']).head(20)

	title	year_or_season	movie_rating	disc_or_duration	user_rating	avg_rating	ship_date	return_date
mom_667	A Most Wanted Man	2014	R	2h 1m	0.0	3.9	2015-05-05	2015-05-12
mom_693	A Most Wanted Man	2014	R	2h 1m	0.0	3.9	2015-01-21	2015-02-02
dad_262	A Most Wanted Man	2014	R	2h 1m	4.0	4.2	2015-01-21	2015-02-02
mom_797	Anchorman: The Legend of Ron Burgundy	2004	UR	1h 34m	2.0	2.4	2013-08-20	2013-08-24
mom_1382	Anchorman: The Legend of Ron Burgundy	2004	UR	1h 34m	2.0	2.4	2006-11-27	2006-12-05
dad_297	Anchorman: The Legend of Ron Burgundy	2004	UR	1h 34m	2.0	2.9	2013-08-20	2013-08-24
dad_491	Anchorman: The Legend of Ron Burgundy	2004	UR	1h 34m	2.0	2.9	2006-11-27	2006-12-05
mom_175	Awakenings	1990	PG-13	2h 0m	0.0	4.1	2020-10-06	2020-10-15
mom_642	Awakenings	1990	PG-13	2h 0m	0.0	4.1	2015-08-24	2015-09-02
mom_795	Bad Education	2004	NC-17	1h 46m	4.0	4.1	2013-08-24	2013-08-31
mom_1229	Bad Education	2004	NC-17	1h 46m	4.0	4.1	2008-06-20	2008-07-15
dad_296	Bad Education	2004	NC-17	1h 46m	0.0	4.1	2013-08-24	2013-08-31
mom_312	Before Sunset	2004	R	1h 20m	2.0	3.8	2019-04-24	2019-05-13
mom_1371	Before Sunset	2004	R	1h 20m	2.0	3.8	2007-01-08	2007-01-22
mom_1476	Before Sunset	2004	R	1h 20m	2.0	3.8	2005-10-25	2005-11-09
dad_126	Before Sunset	2004	R	1h 20m	4.0	3.4	2019-04-24	2019-05-13
mom_258	Big Little Lies	Season 2	TV-MA	Disc 1	0.0	4.4	2020-01-24	2020-02-11
mom_429	Big Little Lies	Season 1	TV-MA	Disc 1	0.0	4.4	2018-01-23	2018-02-05
dad_109	Big Little Lies	Season 2	TV-MA	Disc 1	5.0	5.0	2020-01-24	2020-02-11
dad_167	Big Little Lies	Season 1	TV-MA	Disc 1	5.0	5.0	2018-01-23	2018-02-05

Ok, so yes there are definitely movies that were rented multiple times! Interestingly, some of these are duplicated between both mom and dad’s accounts but some aren’t. I’m assuming this has to do with how Netflix handled the transition between not having accounts, having distinct accounts in the same family plan, and perhaps also how the DVDs were shared across accounts vs. assigned to individual accounts.

I don’t really care about those intricacies, since I’m just taking this data as the holistic family plan. Let’s get rid of these fake duplicates, and consider coming back later to do some “mom vs. dad” analyses.

Actually, before we move on let’s check that the two lists are actually unique…

[k for k in dad_dict if k not in mom_dict]

[]

D’oh! The history I downloaded from my dad’s account is a subset of the history in my mom’s. So I could have simplified this whole thing by just looking at her history export, though I suppose that would have removed any of the mom vs. dad rating discrepancies. Anyway, I don’t care about those so let’s get on with the plan.
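One caveat on the check above: it compares dict keys, and per the docstring those keys are row positions rather than movie identities, so an empty list may only mean that the position numbers overlap. Comparing the rental contents is one way to cross-check. Here’s a minimal sketch on toy data built from rows shown earlier; rental_key and the two toy dicts are hypothetical names, not part of the original notebook, and the choice of fields in the key is an assumption.

```python
# Toy stand-ins for dad_dict and mom_dict (keys are row positions).
dad_toy = {
    "461": {"title": "1 Giant Leap", "disc_or_duration": "2h 35m", "ship_date": "10/19/07"},
    "463": {"title": "10 mph", "disc_or_duration": "1h 32m", "ship_date": "10/02/07"},
}
mom_toy = {
    "1": {"title": "Daughters of the Dust", "disc_or_duration": "1h 53m", "ship_date": "12/05/22"},
    "1296": {"title": "1 Giant Leap", "disc_or_duration": "2h 35m", "ship_date": "10/19/07"},
    "1300": {"title": "10 mph", "disc_or_duration": "1h 32m", "ship_date": "10/02/07"},
}


def rental_key(row):
    """Identify a rental by what was rented and when it shipped (assumed key)."""
    return (row["title"], row["disc_or_duration"], row["ship_date"])


# Compare by content instead of by row position.
mom_rentals = {rental_key(row) for row in mom_toy.values()}
only_in_dad = [row for row in dad_toy.values() if rental_key(row) not in mom_rentals]

print(only_in_dad)  # []
```

In this toy case the position keys don’t match at all, yet every rental in the smaller history also appears in the larger one, which is the property the subset claim actually needs.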

print(f'With duplicates: {df.shape}, without: {df.drop_duplicates().shape}')
df = df.drop_duplicates()

With duplicates: (2106, 8), without: (2090, 8)

# Ok now how many movies did we rent twice?
print(f"Total rentals = {df[['title', 'year_or_season', 'disc_or_duration', 'ship_date']].drop_duplicates().shape[0]}")

# There must be a way to use transform instead of all these reset_index...
(df[['title', 'year_or_season', 'disc_or_duration', 'ship_date']]
 .drop_duplicates()
 .groupby(['title', 'year_or_season', 'disc_or_duration'])
 .size()
 .reset_index(name='n_rentals')
 ['n_rentals'].value_counts()
).reset_index().sort_values('n_rentals')

Total rentals = 1596

	n_rentals	count
0	1	1441
1	2	76
2	3	1

In total, my parents made 1596 unique rentals. Of these, 1441 were for movies they rented only once. They rented 76 movies twice, and one movie three times. Let’s see what the lucky movie was!
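As a quick sanity check, the counts in the table should add back up to the total number of rentals. This little sketch just redoes that arithmetic with the numbers printed above; the variable names are mine.

```python
# {times rented: number of distinct movies/discs}, copied from the table above
rental_counts = {1: 1441, 2: 76, 3: 1}

# Each movie contributes as many rentals as the number of times it was rented.
total_rentals = sum(times * n_movies for times, n_movies in rental_counts.items())
distinct_movies = sum(rental_counts.values())

print(total_rentals)    # 1596
print(distinct_movies)  # 1518
```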

(df[['title', 'year_or_season', 'disc_or_duration', 'ship_date']]
 .drop_duplicates()
 .groupby(['title', 'year_or_season', 'disc_or_duration'])
 .filter(lambda x: len(x['title']) >= 3)
)

	title	year_or_season	disc_or_duration	ship_date
mom_312	Before Sunset	2004	1h 20m	2019-04-24
mom_1371	Before Sunset	2004	1h 20m	2007-01-08
mom_1476	Before Sunset	2004	1h 20m	2005-10-25

Looks like it’s Before Sunset, which makes sense - it came out in 2004, and was probably a movie that my parents and I both rented separately while I was in high school, and that I guess my parents re-watched in 2019.

Let’s see how big of a gap there was between the two rentals for the movies we rented twice.

double_rentals = (df[['title', 'year_or_season', 'disc_or_duration', 'ship_date']]
 .drop_duplicates()
 .groupby(['title', 'year_or_season', 'disc_or_duration'])
 .filter(lambda x: len(x['title']) == 2)
)

delta_rental = (
    double_rentals.groupby('title').apply(
        lambda x: x['ship_date'].max() - x['ship_date'].min()
    ).reset_index(name='time_between_rentals')
)

delta_rental['days_between_rentals'] = delta_rental['time_between_rentals'].apply(lambda x: x.days)
delta_rental['years_between_rentals'] = delta_rental['days_between_rentals'] / 365.

ax = delta_rental['years_between_rentals'].plot(kind='hist')
ax.set_title('Gaps between renting the same movie twice')
ax.set_xlabel('Years between rentals')
ax.set_ylabel('Number of movies')


delta_rental.sort_values(by='years_between_rentals', ascending=False)[['title', 'time_between_rentals', 'years_between_rentals']]

	title	time_between_rentals	years_between_rentals
66	Wasabi	4891 days	13.400000
23	Heat	4850 days	13.287671
65	War Dance	4672 days	12.800000
50	The Lady Vanishes	4613 days	12.638356
38	Raise the Red Lantern	3828 days	10.487671
...	...	...	...
55	The Miseducation of Cameron Post	15 days	0.041096
24	Homeland	14 days	0.038356
34	Once Upon a Time in Hollywood	4 days	0.010959
45	The Curious Case of Benjamin Button	1 days	0.002740
60	The Soloist	1 days	0.002740

69 rows × 3 columns

df.query('title == "The Curious Case of Benjamin Button"')

	title	year_or_season	movie_rating	disc_or_duration	user_rating	avg_rating	ship_date	return_date
mom_1155	The Curious Case of Benjamin Button	2008	PG-13	2h 46m	4.0	4.0	2009-06-16	2009-06-23
mom_1156	The Curious Case of Benjamin Button	2008	PG-13	2h 46m	4.0	4.0	2009-06-15	2009-06-19
dad_428	The Curious Case of Benjamin Button	2008	PG-13	2h 46m	5.0	4.1	2009-06-16	2009-06-23

There are a handful of movies that we rented twice ten years apart, and some that we rented one day apart! The ones that we rented one day apart might be a fluke, or perhaps they were movies that we had on the waitlist with both of our accounts and didn’t coordinate to not duplicate them. That makes sense.
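For anyone who wants to check a single gap by hand, here’s a small standard-library sketch of the same max-minus-min calculation, using the two Benjamin Button ship dates from the table above. The function name gap_between_rentals is hypothetical, and it uses the same 365-day year as the pandas code.

```python
from datetime import date


def gap_between_rentals(ship_dates):
    """Return (days, years) between the earliest and latest ship date."""
    delta = max(ship_dates) - min(ship_dates)
    return delta.days, delta.days / 365.0


days, years = gap_between_rentals([date(2009, 6, 16), date(2009, 6, 15)])
print(days)  # 1

# The longest gap in the table is 4891 days; converting with the same divisor:
print(round(4891 / 365.0, 1))  # 13.4
```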

Anyway, now that I understand the data much better, let’s dig in some more to all the non-rating-related information.

df = df[['title', 'year_or_season', 'disc_or_duration', 'ship_date', 'return_date']].drop_duplicates()

# first and last movie rental?
# how many movies per month?
# average duration of keeping movies

(df['return_date'].max() - df['ship_date'].min()).days / 365.

18.284931506849315

Wow, we were signed up for Netflix’s DVD service for over 18 years! That’s pretty amazing, and probably outlasts every commitment my parents made apart from maybe their longest jobs and homes and, oh right, their kids.

1596 rentals over 18 years is a little over 88 movies per year, which is about 1.7 movies per week for 18 years. Considering that we were subscribed to a plan with 3 DVDs for the majority of that time, that’s a pretty impressive utilization rate.
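Here’s that arithmetic spelled out as a quick sketch. It rounds the subscription length down to 18 years, as the paragraph does, and assumes 52 weeks per year; the variable names are just for illustration.

```python
total_rentals = 1596
years_subscribed = 18   # rounded down from the ~18.28 years computed above
weeks_per_year = 52     # assumption: ignore the extra day or two per year

rentals_per_year = total_rentals / years_subscribed
rentals_per_week = rentals_per_year / weeks_per_year

print(round(rentals_per_year, 1))  # 88.7
print(round(rentals_per_week, 1))  # 1.7
```

Using the unrounded 18.28 years instead gives about 87 rentals per year, which still works out to roughly 1.7 per week.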

Let’s see if we can visualize this data nicely. I’ll use the return date as a proxy for when the movie was watched, since we were usually pretty prompt about returning the movies after watching them.

Side note for the folks interested in data wrangling: I realized that I have 1596 unique rentals when considering the ship date, but 1598 unique rows in the data when including the return date for this analysis. It looks like there are two movies with the same ship date but different return dates; I’m assuming that’s a bug in Netflix’s data, unless my parents and I both rented the same DVD on the exact same day and returned them both exactly three days apart. Given that I’ve still never seen The Big Sick and don’t know what The Harvey Girls is, I’m betting on dirty data.

df.groupby(
    ['title', 'year_or_season', 'disc_or_duration', 'ship_date']
).size().reset_index(name='size').sort_values(by='size')

	title	year_or_season	disc_or_duration	ship_date	size
0	1 Giant Leap	2002	2h 35m	2007-10-19	1
1067	Summer Hours	2008	1h 43m	2015-09-02	1
1066	Suits	Season 1	Disc 1	2012-05-22	1
1065	Suffragette	2015	1h 46m	2016-12-06	1
1064	Stranger than Paradise	1984	1h 29m	2010-09-27	1
...	...	...	...	...	...
525	God Grew Tired of Us	2006	1h 29m	2011-08-23	1
534	GoodFellas	1990	2h 25m	2007-03-12	1
1595	Zero Dark Thirty	2012	2h 37m	2013-04-26	1
1140	The Big Sick	2017	1h 59m	2017-10-24	2
1244	The Harvey Girls	1946	1h 41m	2021-12-13	2

1596 rows × 5 columns

df.query('title == "The Big Sick"')

	title	year_or_season	disc_or_duration	ship_date	return_date
mom_445	The Big Sick	2017	1h 59m	2017-10-24	2017-10-30
mom_446	The Big Sick	2017	1h 59m	2017-10-24	2017-10-27

df.query('title == "The Harvey Girls"')

	title	year_or_season	disc_or_duration	ship_date	return_date
mom_68	The Harvey Girls	1946	1h 41m	2021-12-13	2021-12-20
mom_69	The Harvey Girls	1946	1h 41m	2021-12-13	2021-12-17

# Remove the extra rows
df = df.drop(['mom_69', 'mom_446'])

Ok, back to your regularly programmed visualization. I did a quick google and stumbled across a library called calmap which seems to make Github-style calendars super easily. Heck yes, let’s give it a try!

18 years of rentals

fig, ax = calmap.calendarplot(
    data=df.groupby('return_date').size(),
    vmin=0,
    ncols=3,
    fig_kws={'figsize': (15, 10)},
    yearlabel_kws={'fontsize': 14, 'color': 'gray'}
)

Daily rental patterns

First off, a quick guide to reading this sort of plot (which, despite staring at so many on Github, I’ve never really understood super well). Each plot shows 365 boxes, where each box is a day. Each row is a day of the week and each column is a week. The boxes are colored by how many DVDs were returned on that day (darker means more DVDs). So if you see a row with lots of filled-in boxes like a horizontal line, that means that we returned DVDs on the same day of the week across multiple weeks. Let’s say it’s the second row from the top, that would mean that Tuesdays are a frequent return day. Seeing a column of filled-in boxes would mean that we returned DVDs every day on a given week.

Ok, now that we’re oriented we can start to pick out some patterns. The first sort of pattern that sticks out to me is about which days we returned the DVDs. First, it looks like we rarely returned movies on two different days per week - you can see that there are very few columns with two filled-in boxes. Second, we never returned DVDs on Saturday or Sunday (there are no filled-in boxes in the bottom two rows). It would make sense for Netflix’s DVD receiving department to be closed on weekends, so that checks out. Finally, it looks like our most frequent return day is Tuesday – that makes perfect sense! My parents tend to watch movies over the weekend, which means they would get picked up by USPS on Monday and received by Netflix the next day, on Tuesday.
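Eyeballing rows in a calendar plot is fun, but the same question can be answered by just counting returns per day of the week. Here’s a minimal sketch of that tally using only the five return dates visible in the first table of this post. The function name count_return_weekdays is hypothetical, and five rows are far too few to show the Tuesday pattern, so treat it as an illustration of the method only.

```python
from collections import Counter
from datetime import date

# date.weekday() returns 0 for Monday through 6 for Sunday
WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]


def count_return_weekdays(return_dates):
    """Count how many returns fall on each day of the week."""
    return Counter(WEEKDAY_NAMES[d.weekday()] for d in return_dates)


# The five return dates shown in the first df.head() output
sample_returns = [
    date(2022, 12, 29),
    date(2022, 12, 5),
    date(2022, 11, 23),
    date(2022, 11, 17),
    date(2022, 10, 4),
]

weekday_counts = count_return_weekdays(sample_returns)
for name in WEEKDAY_NAMES:
    print(f"{name:<9} {weekday_counts.get(name, 0)}")
```

Running the same tally over the full return_date column would put a number on the “Tuesday is the most frequent return day” impression from the plot.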

Another observation is that we would sometimes go months without returning any DVDs - you can see this as areas where there are multiple columns in a row of empty boxes. My guess is that these likely correspond to periods when my parents were on vacation, out of town, or otherwise busy. You can see examples of these gaps in June and July of a handful of years, which is what tipped me off to this vacation hypothesis. But there are a lot of gaps, so I don’t think I’ll ask them to corroborate this hypothesis with the most recent dates of their big RV trips.

Return day consistency seems informative

Finally, the consistency of which days of the week we returned DVDs is intriguing - there are some years where it’s really consistent (the filled-in boxes are all on the same two-ish rows) and others where it’s not. From just looking at the plot, it seems that 2005-2009 didn’t have super consistent return days of the week - that makes sense, this was the period where I lived at home and had my own dedicated DVD (before the days of password sharing, this is how we shared accounts!). My guess is that I either watched movies on weekdays sometimes or, more likely, was less prompt at returning them after I watched them, which would explain the variety of return days.

The following years, 2010-2015, had a much more consistent return day pattern, with most returned on Mondays or Tuesdays. This also makes sense, as this was the period where my parents were both empty nesters but still working: during this time, they would have been more likely to watch movies on the weekend than during the week, thus mailing them back on Mondays and Netflix receiving them on Tuesdays.

Then 2016 and 2017 are less consistent again - my guess is that this is right around when my mom retired. When I was talking to my parents about this analysis, my mom mentioned that when she first retired she watched a lot of TV series on Netflix DVDs. I also know that it took my parents a while to start paying for all the streaming services, so it would make sense that in these first few years after retirement you see a lot less consistency in the return days, as my mom was likely burning through TV shows via Netflix’s DVD service!

2018 and onwards gets decently consistent again. My hypothesis here is that 2018 is around when my parents started paying for and using streaming services, so they stopped watching as many movies and TV shows on Netflix DVDs. That would leave Netflix DVDs only for the more obscure foreign films or recently released movies not yet available on streaming that they wanted to watch, and everything else would have been watched via streaming. In this scenario, it makes sense that the behavior would revert to a consistent early-week return date: my dad was still working, and so I assume that they watched the movies that they ordered from Netflix on the weekends, and my mom watched other things during the week via streaming services.

Finally, 2021 and 2022 are slightly less consistent and much more sparse than any of the other years. My dad retired in December 2021, which is when my parents started taking a lot of trips in their RV. But I don’t think that’s what explains the sparseness - my guess is that they switched their plan from 3 DVDs to 1 sometime in 2021, which led to the slow death of their usage of the service.

Bring in the parents: putting my hypotheses to the test

I texted my parents to see if I could confirm some of these hypotheses. First off, my mom retired in December 2015 – huzzah, I was right! Pretty cool that you can see her retirement just in the distribution of return days of the week.

Then, my dad told me that they had access to Netflix’s streaming as soon as it started in 2007, but my mom doesn’t think they started using it regularly until around 2015. My parents also got Apple TV Box in November 2020, which made streaming very easy across the various services. So that doesn’t check out with my “2018 is when they started streaming regularly” hypothesis - something else must have happened in 2018 that got them back to a more consistent DVD viewing pattern. Maybe my mom ran out of TV shows that Netflix had on DVD?

Finally, they switched to the plan with only 1 DVD in August 2022 - way after the 2021 sparseness started! So it must have gone the other way: their utilization was going down, and so they downgraded their plan.

Movie quantity over time

Next up, I want to look at a more high-level summary of the number of movies we watched. My guess is that we watched way more while I was still living at home, and then that it spiked again after my mom retired. I might also guess that my parents watched more movies in 2020 and 2021 during Covid, but I’m not sure if that would be reflected in the number of DVDs since that’s also when they were using streaming services.

# Let's look at it monthly
df['return_month'] = pd.to_datetime(df['return_date'].dt.strftime('%Y-%m'))

movies_per_month = df.groupby('return_month').size().reset_index(name='n_movies')
movies_per_month.head()

	return_month	n_movies
0	2004-09-01	3
1	2004-10-01	8
2	2004-11-01	10
3	2004-12-01	10
4	2005-01-01	10

fig, ax = plt.subplots(figsize=(10, 5))
movies_per_month.plot(kind='scatter', x='return_month', y='n_movies', ax=ax)

ax.set_title('Number of movies per month')
ax.set_ylabel('Movies')
ax.set_xlabel('')


Welp, nope - doesn’t look like there’s any discernible pattern in terms of the number of movies we watched over the years. It’s very interesting to me that you don’t see any obvious decreases when I moved out or even when my parents bumped their plan down to one DVD at a time (but maybe that’s because there isn’t enough data to see that).

I wonder how this compares to the maximum possible number of movies per month. Let’s do some back-of-the-envelope math!

Assuming:

we have a plan that lets us have 3 DVDs at a time
we can only watch one movie per day
it takes Netflix one day to process a returned movie and ship out the next one
they send it with 2 day shipping to get to us
and when we mail it back, it goes with overnight return shipping

That means that each movie takes up a total of 5 days (1 day to be processed by Netflix + 2 days in the mail to get to us + 1 day to be watched + 1 day to return to Netflix). So each of the 3 DVDs can go through 6 full rental cycles per month, meaning that the max number of movies we could watch in a month is 18. On average, my family watched 7.3 movies per month – a little less than half of the possible rentals. But there were some months when we went through 14 movies, a 78% utilization rate! For a working family who definitely does not watch movies every day, not bad.
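Here’s the back-of-the-envelope math as a short sketch, so the assumptions are easy to tweak. The constants come straight from the list above, except the 30-day month, which is my own simplifying assumption; the names are made up for illustration.

```python
DVDS_ON_PLAN = 3
DAYS_PER_MONTH = 30  # assumption: a round 30-day month

# One full rental cycle, in days, per the assumptions listed above
PROCESSING_DAYS = 1   # Netflix processes the return and ships the next disc
SHIPPING_DAYS = 2     # disc travels to us
WATCHING_DAYS = 1     # we watch it
RETURN_DAYS = 1       # overnight return shipping

days_per_cycle = PROCESSING_DAYS + SHIPPING_DAYS + WATCHING_DAYS + RETURN_DAYS
cycles_per_dvd = DAYS_PER_MONTH // days_per_cycle
max_per_month = DVDS_ON_PLAN * cycles_per_dvd

avg_utilization = 7.3 / max_per_month   # average month from the data
peak_utilization = 14 / max_per_month   # busiest months from the data

print(days_per_cycle, cycles_per_dvd, max_per_month)  # 5 6 18
print(f"{avg_utilization:.0%} {peak_utilization:.0%}")  # 41% 78%
```

Changing any single assumption (say, 3-day shipping) shifts the ceiling quickly, so the utilization figures are only as good as these guesses about turnaround time.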

With that, thanks for joining me on this journey down Netflix memory lane! RIP Netflix DVD service, you were a true trailblazer ahead of your times, and those of us who were loyal fans for over 15 years thank you.

Early startup employee lessons learned, part 4: adapting to your changing role

2023-02-05T00:00:00-08:00

In Part 3 of this series, I wrote about strategies to build a team with a positive organizational culture. That post was about the team; this post is about you: as an early employee, how does your role change with the changing company and how can you gracefully ride that wave? When I joined my company as the fifth person, I knew that if all went well and our company succeeded, I’d have a very interesting path within the organization. What I didn’t know was any of the details of what that meant, and importantly what skills I’d need to gracefully walk that path. This post is about the things I’ve learned over the past two years as we’ve grown from 5 to 100 employees, and as my role has undergone countless transformations in the process of growing our team and responding to the company’s needs.

In the past two years, I’ve done some aspect of almost every job that we currently have full-time employees for at my company, except for some of the sales and finance roles. Software engineering, data science, operations, customer success, people management, strategy, marketing - you name it, I probably touched on it at some point in our period of hypergrowth. It’s been an amazing experience, but also a big challenge to grow and adapt my own role in the company as its needs have changed and as we’ve hired people to fill those needs on a full-time basis. One of the most important aspects of being an early employee is recognizing your place within the company and adapting gracefully as that changes.

Letting go of your legos

A blog post about scaling a startup wouldn’t be complete without linking to the famous “Letting go of your legos” article. But seriously, hiring people to take over work that you can no longer sustain doing isn’t enough – it’s critical to be intentional about giving those people you hire the space to take over the things that used to be your job.

When I first read that “give away your legos” article, I thought it’d be a piece of cake because of course I wanted people to take my jobs, I was doing way too many of them! But actually, I’ve learned that there’s more to it than that. In my mind, there are two parts to giving away your legos: the first is giving them away, which is fairly easy to do if you’re experiencing burnout or you’re not a high-ego person. The second is to let them make it their own, which is harder as an early employee because it means they might do a “worse” job of it than you. They won’t have the full context you do or the years of experience you have building the processes or projects from scratch. But you have to let them take your legos, and also give them space to make it their own.

One thing that helps here is realizing that there is often no “best” way to do something. It’s possible that the way they decide to do something is worse, or maybe it’s just different from how you’d do it. It’s also possible that they take your hacked-together solution and make it better, since it’s now their actual job to do this thing that used to be one of your millions of jobs. Another is that even if they do end up doing a “less good” job of something than you were doing, the company will still experience a net gain simply by having an improved bus factor. For example, in my case, it was way worth having a slightly less accurate QC process because it meant that we could hire junior data analysts to do it instead of having multiple scientific PhDs spending hours of their time going through relatively rote QC.

Don’t grow too fast

One of the benefits of joining a startup as an early employee is that it’s a great opportunity to supercharge your career growth and quickly get promoted. If you’re the first data scientist like I was, it’s really enticing to shoot for growing into the head of your department quickly. But it might be more complicated than that: you might find yourself, as I did, in dire need of someone with more experience than you to help guide your work, so that you’re no longer learning by doing (and making mistakes) but instead having some seasoned perspective guiding your team. You might also find out that managing teams isn’t what you want to do: the diplomacy and people management might not be your thing, and you might prefer to stay on an IC track and focus on technical problems. That’s why it’s important to give yourself space to discover what you actually like, rather than rushing into a director-type role.

In my experience, I discovered that I actually don’t really like the internal diplomacy and relationship-leveraging needed to be an effective team lead. I found myself much preferring wielding influence in my sphere, focusing on how my team can better work together rather than addressing broader strategy issues that require a lot of negotiation and mind-changing across teams. I did end up in a team lead and management role for a lot of the past two years, but because I wasn’t ever officially promoted into a “Head of Data Science” or formal team lead role, I still had the flexibility to figure out where I wanted to end up. I’m really grateful that my founder and various managers took this approach, because it made it really easy to end up where I am now: as a technical lead, wielding influence as a high-level IC but not as a people manager. The other benefit is that we got to hire a great VP of Data Science who I’ve gotten to learn a lot from!

Common inflection points

If I look back on the past two years of growth and change in my role, I can identify a handful of inflection points.

First, I was alone: I was the only full-time data scientist and data engineer. We hired interns and contractors and full-time junior folks to support me, but I was still alone. There was no one to bounce ideas off of, nobody to review my architecture decisions, no one at my level to commiserate with about the growing pains we were experiencing. It was exciting because I had so much influence, but it was also really lonely.

Then, we started building out our team: I wasn’t alone anymore, but I was in charge of everyone so I was still basically alone. Our interns and contractors evolved into full-time folks, but I was the one managing them all, the only point between our data scientists and our founder. This was less lonely, because I had other people to help do the work with me. But there were still a lot of things I couldn’t engage with these colleagues on because I was their manager, and it was still up to me to listen to their concerns and try to figure out something to do about them even if I was struggling with similar issues.

The third inflection point came when we hired even more people and re-organized our team – suddenly I had peers! In our case, two of the three senior folks I had been managing were promoted into management roles at my level and the third moved out of my reporting line. This was the biggest inflection point for me: suddenly, I had peers on both technical IC and leadership work who I wasn’t managing. I also finally had people I could commiserate with without censorship - my fellow group leads. At this point, maybe you’ll have hired a boss or maybe not. For us, we didn’t have a boss yet and that was fine - we were in constant communication and were able to present a united front to our leadership when needed. But the inflection point wasn’t about having a boss or not, it was really about finally having peers.

Finally, the fourth inflection point is when I become just another employee (which is still ongoing). Now, my team is getting to the point where it’s larger than what I can directly control or even have influence over, there are people at the company who I don’t know, and there are multiple layers between me and the executives or primary decision-makers. I think that if you had told me about this inflection point three years ago when I joined, I would have feared it or been bummed about the idea of getting here. But now that I’m living through it, I love it. The fact that there are people at our company who don’t know me means that we are successfully growing and scaling, as I discussed in part 2 of this series.

Remaining a leader

Even after this fourth inflection point when you’re finally just another employee, you’re not really just another employee. There’s nothing that will change the fact that you were there when it started: you know the context, you have a lot of the scoops, you see the big picture in a way that many others may not. Many folks will be excited to learn from you regardless of your actual role.

Your opinion will probably still matter a lot, at least to some people. You should be careful with it. Interestingly, as our company has grown, I’ve made much more use of DM’s rather than public messages, despite being the queen of surfacing! But as my role has changed, knowing that my words and actions carry a certain amount of weight is important. So I’ll go to private messages first to share feedback, strategize responses to tricky situations, and make sure I’m not overstepping on somebody else’s work or opportunity to respond.

Additionally, even when you do become another employee and most new hires have no idea who you are, the people who were there when you were central to the company will still view you that way, even if you no longer are. So even as you grow and the majority of new folks don’t know who you are, you still have to recognize your potential impact, especially on folks from other teams who haven’t followed along with your changing role as closely - they have no idea what your job is now, but they still remember what your job was back then. One concrete way this manifests is that you may still get tagged into questions and issues by those folks randomly here and there, since they may not know that entire teams have been hired to fulfill the role that you used to play. When that happens, it’s your job to redirect them to the people and teams whose job it is to actually do those things now.

Finally, it’s important to recognize when it does make sense for you to step back up and play a larger part in issues than your new role may call for. For me, this has come up in two ways: first, stepping up to coordinate large cross-functional or high-risk projects that need someone with a big picture view of the many different components of the business. Second, when meeting new hires who have been brought on to tackle technical debt-related projects or other cross-functional work like program management, I find it important to make myself available to them and very explicitly offer up sharing the scoops. Like my boss says, I know where the bodies are buried and that can be really useful to make sure we’re not making the same mistakes we’ve made in the past. Most new hires don’t need to know about the bodies, but some do - and I make sure that they know I will happily give them the scoop any time they ask. Equally important is that I put the ball in their court - if they think that knowing the historical context will help them with their job, then they’ll reach out. But it’s possible that the historical context actually isn’t all that helpful to or wanted by them, in which case it’s important to respect that and let them do things their own way. Like many things about growing and changing with your company, it’s all about balance.

If you liked this post, check out the rest of the series on being an early startup employee:

Early startup employee lessons learned, part 3: building culture

2022-12-26T00:00:00-08:00

A core function of being an early employee at a rapidly growing startup is helping it grow! At my current job, I’ve helped the data science team grow from being just one person doing a little bit of everything (me!) to a team of over 15 people working in three focused sub-teams. Growing the data science team has been one of the best parts of my hyper-growth experience, and I’ve learned a lot of really useful lessons from it.

As I mentioned in Part 1, building an open and collaborative team culture on my team is one of my proudest achievements. Because I care a lot about organizational culture, a lot of this culture came about organically through my hiring decisions, early management role, and general positioning as an influential teammate. But as I reflect on the past two years, I can identify some specific things we did that contributed to our positive outcome.

Communicating about how we communicate

Talking about how we communicate and then adapting our communication behaviors, processes, and norms as the team grew has helped us maintain a functional working culture despite our rapid growth. I think this is also especially important given that we’re primarily remote: communication doesn’t come for free, it has to be an active effort. These efforts can be grouped into three categories: talking about communication, hiring for it, and implementing processes that encourage the meta-conversations.

Talk about it!

First and foremost, our team talks about how we talk to and work with each other. When someone posts a message and also asks “is this the right channel to post this in?”, folks on my team actually answer their question. It’s been interesting to notice this, because we’re one of the only teams at our company that I’ve seen actually engage with the meta-question, which I think stems from the norm we have of talking about how we talk to each other.

We also encourage holding each other accountable to the shared communication norms we’ve set - for example, a core value of our team is that we share unfinished work early and often. That means that when someone posts an unfinished analysis in one of our private channels, we encourage that person to re-post in a public channel where more of our team can engage with their work.

My favorite way that our team talks about how we communicate is by giving cute names to specific communication strategies we see others use effectively. So far, my favorites are:

“pulling a Claire” which means “asking someone to surface a private message in a public channel.”
“pulling a Scott” which means “bringing up an issue by making statements that nobody can disagree with and naively asking questions with the hopes of sparking the change you’d like to see.”
“pulling a Nadia” which means “addressing a vague request by calmly asking for more details and links to documentation if it exists.”

As we see our colleagues find ways to effectively communicate with us and others, we talk about what it is that makes them effective and learn from them.

Hire for communication

We also explicitly center communication in our hiring processes. One of my favorite things we ask our data science candidates is about sharing unfinished work. For example, we ask “how comfortable would you be sharing an analysis that isn’t fully polished to the CEO?” or “how do you know when your work is ready to share internally vs. externally?” With these questions, we’re gauging candidates’ approach toward sharing unfinished work and trying to understand their ability to make decisions on partial information, which are both critical parts of our culture and key to being successful data scientists at our company.

Also, our technical interviews usually consist of some sort of pairing exercise or hypothetical scenarios - when we walk candidates through these, we emphasize over and over that we’re less interested in their answers and more in hearing their questions and thought processes. If candidates jump straight to answers, we’ll explicitly reorient them to questions, asking them point-blank what sorts of questions they’d need to ask their users or stakeholders. At the end of the day, candidates who don’t ask us any questions never get hired by our panels, and I find that we get much better signal on their technical ability from the questions they ask than the answers they give.

Implement processes to encourage meta-conversations

Finally, our team has explicit processes that encourage the conversations about communication. Some of my favorite examples of these are:

Sprint retros: we started using a baby version of agile when we had zero project management expertise at the company, and we basically just made it up as we went along. One thing that sticks out to me from those early agile days was the retros: it was the first time we’d had a structured place to talk not just about the work that we did, but also how we did the work that we did. Importantly, because it was a semi-formal environment, it was all intentionally constructive: when we talked about what went well and not so well, we brainstormed together about how to improve the way that we do work. I think this really set the stage for our team’s culture of interrogating and then collaborating on solving our organizational problems.

Intentional slack channels: for a long time, we basically had two channels: one called #data for all public data-related things, and one #data-team-internal private channel for internal banter and existential “omg is all of our data wrong” conversations. We also had legacy product-specific #data-XX channels that nobody really knew how to use. As we grew, that wasn’t working for us anymore - the #data channel was cluttered with inbound requests from non-data science team members, and our posts with results and analyses weren’t getting enough engagement because the channel was just too busy. We were also worried about posting jargon-heavy or unpolished analyses in #data because we knew there were so many non-technical eyes on it, and so a lot of our work had started going to our private channel, which ran counter to our values of transparency and collaboration.

At that point, three sub-teams within the data science team had started to form, and so we decided to create intentional public channels for each team. We talked about it extensively within the data science team, and then made an announcement to the company: what channels we were deprecating, what channels we were creating, and - importantly - what each channel was for. We were very clear: our team-specific channels would be just for us, with unpolished and sometimes incorrect work or analyses - lurk at your own risk! This re-organization and intention-setting freed us from the paralysis of not knowing where to post, improved our ability to collaborate by removing our fear of unintended consequences, and also helped the broader organization learn how to engage with us more effectively.

Intention-setting disclaimers on documents: this one is the brainchild of a former colleague, but the majority of our team has since adopted it. Whenever we share a document in a public channel, we add a disclaimer to the top indicating where this document is in its lifecycle (WIP, draft, ready for review, etc), what we want from folks who look at the doc (hold your comments, comments welcome), and whether we are comfortable with others circulating the document (circulation OK, do not circulate). These disclaimers are especially helpful when you’re working on something that you know a lot of people will have feelings about, but you don’t want to keep it a secret until it’s finished. It helps us act in the transparent way that we value while minimizing potential negative consequences from other teams who aren’t necessarily used to working this way. As a consumer of documents, I also find it extremely helpful to know what the author wants from me: should I hold my tongue, or are they ready for feedback? Can I share this broadly, or are they not ready for this to be disseminated yet?

Intentional onboarding

Another easy way to build a positive and collaborative culture is to bake your team’s values into standard onboarding tasks. Preparing a plan for new hires’ first 30, 60, and 90 days of their job is great, but realistically you only need a plan for the first 2-4 weeks. After that, things will have probably changed enough either with the company or with the new hire figuring out where they fit in their new role that a new path will have become clear. Instead, putting in effort to create standardized and intentional onboarding tasks that immediately ask new hires to put your team’s core values into practice is a great use of energy that may have more impact than individualized long-term planning.

On our team, onboarding involves two core activities: meeting a lot of people and making a plot. When we set up intro meetings for new hires, we intentionally go for a broader circle than the individuals they’ll be working with directly. As a team, we value collaboration and compassion, which means that we want our data scientists to understand how their work fits into the broader company and how they can help support others beyond the data science team. So during onboarding, we make sure to encourage meetings with not just close colleagues, but also nearby teams and individuals from unrelated teams who they would benefit from having met at least once.

The other part of onboarding, which I love, is to make a plot. We give every new hire the same ticket their first sprint: make a plot, any plot! Just get access to our data somehow, make a plot that shows anything at all, and post it in our public team channel. This activity emphasizes our values of transparency and collaboration. If folks post it in our private channel, we remind them that the task was to post it publicly; if they take a long time to post it because they haven’t found anything “interesting,” we remind them that the task is to make literally any plot at all and post it. By emphasizing the public channel and the literally any plot at all, we get new hires comfortable with sharing unfinished work publicly, which is core to how we want to work together. Importantly, we ask everybody to do this task - even our head of data science! By doing so, we encourage new hires to put our values immediately into practice, and show that the values are team-wide and that we’re serious about them.

Be the broken record

Finally, something I didn’t realize I was doing at the time but which I think has been very helpful in shaping our culture is to constantly give my team the rationale behind what I’m doing and what I see our company doing. I think that explaining the “why” behind decisions that we’re making helps build an engaged team. For example, helping individual contributors understand why the company is making certain decisions or prioritizing certain projects can help provide additional motivation and context for the work they’re being asked to do. And as an early employee and leader, explaining my thought processes behind my decisions can teach and empower others to learn how to make those decisions themselves in the future, thus scaling my influence without stretching me thinner.

For example, I was originally responsible for the team QC’ing our data. Part of that was to work with our QC analysts and customer success teams to decide how to approach tricky situations, for example if a customer’s results seemed a little wonky and we didn’t know whether to just release the results or also send along a pre-emptive explanatory note. Rather than just telling my team of junior analysts what I thought the right thing to do was, I would walk them through my thought process. After a few months of this, instead of tagging me in to make the decisions, they started applying the same thought process themselves and tagging me in to just confirm the answer they’d gotten to themselves. There’s nothing better than messages like that, I can tell you!

Because I repeated myself to the point of being the voice in their heads, I’ve been able to step away from these day-to-day decisions with basically no impact to the quality of their work. Of course, explicit training and documentation would probably be a better way to scale my knowledge, but at rapidly growing startups there sometimes isn’t time for that - and simply being a broken record is often a good substitute.

When in doubt, I’ve found that clearly spelling out the “why” behind what we’re doing can be a great substitute for process and documentation, and is a good way to help others connect the dots themselves and understand the bigger picture behind what they’re being asked to do.

If you liked this post, check out the rest of the series on being an early startup employee:

The Boston morning commute time warp

2022-12-11T00:00:00-08:00

Like many young adults our age, my partner and I did the classic pandemic move of fleeing the city and moving in with his parents. That’s how I discovered I actually really enjoy living in rural New Hampshire, and last December we officially moved in to our own place in southern New Hampshire.

While we work from home the majority of the time, there are still some days here and there where we drive in to Boston. One thing I’ve noticed is that there seems to be a time warp during Boston morning traffic, where it is physically impossible to arrive in Boston during a certain time period. And that time period happens to be around when you’d usually want to get in, roughly when work starts in the morning.

After finding myself puzzling over the optimal time to leave to get to Boston early but without spending too much unnecessary extra time in the car, I decided to look into it - with data!

I was hoping that there would be a google maps API or something I could use to programmatically generate a bunch of travel time estimates for the route between my house and Central Square, where I work. Unfortunately, it turns out that (1) the google maps API isn’t free (though there is a “free tier” up to a certain number of queries) and (2) using it to grab data without showing an accompanying map violates the API terms of service (section 3.2.3).

So instead, I just manually “generated” the data by inputting my destination and modifying the departure date and time. I collected data on the two primary routes I can take, one via I-93 S and the other via route 3, and covered times between 5 am and 10 am, which matches my intuition for when the Boston traffic time warp is. Because this was very manual data collection, I only did this for 5 days, from 9/12 to 9/16. For each date and time of departure, I tracked the google maps estimate of the shortest & longest duration, as well as what color those estimates were (like when the google maps estimate is red and you know you’re in for miserable traffic, that would be “red”).

The data I gathered looked like this:

import pandas as pd

import matplotlib.pyplot as plt
import matplotlib.dates as mdates

import seaborn as sns

from datetime import timedelta

df = pd.read_csv('Commute - NH - Boston.csv')
df.head()

	date	depart_time	travel_time_min	travel_time_max	color	route
0	2022-09-12	5:00 AM	1h25	1h50	green	route 3
1	2022-09-12	5:00 AM	1h40	2h10	green	93
2	2022-09-13	5:00 AM	1h40	2h20	green	93
3	2022-09-13	5:00 AM	1h30	2h20	green	route 3
4	2022-09-14	5:00 AM	1h30	2h	green	route 3

Without even doing any data analysis, the first thing that struck me was that the estimates all seemed quite low. From personal anecdotal experience these maximum travel times feel more like optimistic estimates - it often takes 30-45 min longer than the initial expected arrival time, sometimes up to 60 min longer. The minimum time feels right - without any traffic, it’s about an hour and a half. But even just the other day I drove into Boston and found myself stuck at Boston’s worst intersection - the one where you’re getting off I-90 east to get into Cambridge, that starts with a stressful left exit and goes into the terrible confusing traffic light intersection across the bridge onto River St. To be fair, I used to live right by that intersection so I really should have known better than to be swayed by the supposedly shorter route 3 way, but alas. Anyway, that traffic alone added at least 20 minutes to my commute, and all at the very end of my trip so it felt like I was in a time warp with my estimated travel time staying the same while the minutes passed by.

The coloring also feels quite off, with the vast majority being green or orange and only a couple of commutes in the red. I would have expected many more of these time periods to be “red”. But maybe that’s reserved only for when Google knows that there’s currently an accident or other blockage? Because every single day I’ve driven in to Boston, the time estimate has turned red for at least part of my trip (if not the entire last third).

Anyway, let’s see what the data says! First, I have to do some wrangling to get all the dates and times processed in a way that will be amenable to plotting:

# Convert travel times to minutes
def convert_to_minutes(s):
    s = s.split('h')
    mins = float(s[0])*60
    if s[1]:
        mins += float(s[1])
    return mins

df['travel_time_min_minutes'] = df['travel_time_min'].apply(convert_to_minutes)
df['travel_time_max_minutes'] = df['travel_time_max'].apply(convert_to_minutes)

# Calculate estimated arrivals
df['depart_datetime'] = pd.to_datetime(df['date'] + ' ' + df['depart_time'])
df['arrival_time_min'] = df.apply(
    lambda row: row['depart_datetime'] + timedelta(minutes=row['travel_time_min_minutes']),
    axis=1
)
df['arrival_time_max'] = df.apply(
    lambda row: row['depart_datetime'] + timedelta(minutes=row['travel_time_max_minutes']),
    axis=1
)

# Give times a dummy date, since I want to just compare times regardless of day of the week
df['depart_datetime_nodate'] = df['depart_datetime'].apply(lambda d: d.replace(year=2022, month=9, day=8))
df['arrival_time_max_nodate'] = df['arrival_time_max'].apply(lambda d: d.replace(year=2022, month=9, day=8))
df['arrival_time_min_nodate'] = df['arrival_time_min'].apply(lambda d: d.replace(year=2022, month=9, day=8))
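
As a quick sanity check on the wrangling above, here is a minimal standalone sketch of the same three steps (parse the duration, add it to the departure time, swap in a dummy date) using only the standard library instead of pandas. The helper names `estimated_arrival` and `strip_date` are made up for this illustration, and the input values are copied from the first row of the table above.

```python
from datetime import datetime, timedelta

def convert_to_minutes(s):
    """Turn a duration string like '1h25' or '2h' into minutes."""
    hours, _, minutes = s.partition('h')
    total = float(hours) * 60
    if minutes:  # '2h' has nothing after the 'h'
        total += float(minutes)
    return total

def estimated_arrival(date, depart_time, travel_time):
    """Hypothetical helper: departure datetime plus the parsed travel time."""
    depart = datetime.strptime(f'{date} {depart_time}', '%Y-%m-%d %I:%M %p')
    return depart + timedelta(minutes=convert_to_minutes(travel_time))

def strip_date(dt, dummy=(2022, 9, 8)):
    """Hypothetical helper: put a timestamp on the dummy day so only time-of-day differs."""
    year, month, day = dummy
    return dt.replace(year=year, month=month, day=day)

# Values copied from the first row of the table above
arrival = estimated_arrival('2022-09-12', '5:00 AM', '1h50')
print(convert_to_minutes('1h50'), arrival, strip_date(arrival))
```

One thing to keep in mind with the dummy-date trick: it only works because none of these trips cross midnight.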

Let’s start with the simplest possible thing - what’s the relationship between the time I leave and the duration of the trip?

ax = sns.scatterplot(data=df, x='depart_datetime_nodate', y='travel_time_max_minutes', alpha=0.5)
ax.set_xlim([pd.to_datetime('2022-09-08 04:30:00'), pd.to_datetime('2022-09-08 10:00:00')])

ax.set_xlabel('Departure time (A.M.)')
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))

ax.set_ylabel('Max total travel time (minutes)')

Ok, so anywhere from two to three hours - that checks out. Already here you can see one of the key points of this whole thing: the range of possible durations for the same departure time is huge! For example, leaving at 6:30 am can take anywhere from 130 to 180 minutes - that’s a 50-minute difference for the same departure time!

Next up, let’s see which route seems to be faster:

route_df = df.pivot(
    index='depart_datetime', columns='route', values='travel_time_max_minutes'
)

fig, ax = plt.subplots()
ax.plot([110, 180], [110, 180], color='gray', linestyle='--', alpha=0.25)
ax = route_df.plot(kind='scatter', x='route 3', y='93', ax=ax)

ax.set_title('Route comparisons')
ax.set_xlabel('Route 3 travel time (min)')
ax.set_ylabel('I-93 S travel time (min)')

Interesting, this data seems to indicate that the routes are roughly equivalent but the I-93 route often takes longer than going on Route 3. While I see how a computer would think this, as a human it really doesn’t check out.

What I think might be going on here is that I-93 has more predictable traffic than Route 3, both in terms of locations and amount, and so its estimates are taking into account that traffic while the Route 3 estimates aren’t able to. Perhaps it would make sense for Google’s algorithm to incorporate its confidence in the amount & location of traffic when it gives you estimates for travel in the future. For a computer, the training data is likely very clear: 93 south has traffic in the same spots every single day. It’s easy to measure and very consistent, and therefore very very predictable. Route 3, on the other hand, theoretically should have less traffic because it’s not the main thoroughfare into Boston - the route goes through Nashua and then veers west to go around Boston before taking I-90 East back into Cambridge. Of course, though, there’s always traffic or accidents on this route - it’s just that maybe the traffic isn’t always in the exact same spot and so the algorithm isn’t confident enough in it to incorporate it in its predictions. (Though if you ask me, that gnarly intersection has always been predictably awful and Google always underestimates how much time it adds - it should have been incorporated into the algorithm by now! Come on neural nets, get it together!)

Anyway, let’s get directly to our question: is it possible to reliably arrive at a reasonable morning working time, or does the Boston traffic time warp make that a physical impossibility?

g = sns.FacetGrid(data=df, col='route', aspect=1.5)

g.map(sns.scatterplot, 'depart_datetime_nodate', 'arrival_time_max_nodate')

for ax in g.axes.flatten():
    ax.set_xlim([pd.to_datetime('2022-09-08 04:30:00'), pd.to_datetime('2022-09-08 10:00:00')])
    ax.set_ylim([pd.to_datetime('2022-09-08 06:30:00'), pd.to_datetime('2022-09-08 12:00:00')])
#    ax.legend(loc='lower right')

    ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
    ax.yaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))

    ax.set_xlabel('Departure time')
    if ax.get_ylabel():
        ax.set_ylabel('Arrival time')

Looks like the two routes are basically the same. Given that I’ve learned my lesson about route 3 from my recent experience, and that I-93 is a much more pleasant drive, I’ll focus on that for the rest of this deep dive into the time warp. I’m also only looking at the maximum estimated time provided by Google, since we know from personal experience that even that maximum is often an underestimate.

def format_time_axes(ax):
    ax.set_xlim([pd.to_datetime('2022-09-08 04:30:00'), pd.to_datetime('2022-09-08 10:00:00')])
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
    ax.set_xlabel('Departure time')
    ax.set_ylim([pd.to_datetime('2022-09-08 06:30:00'), pd.to_datetime('2022-09-08 12:00:00')])
    ax.yaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
    ax.set_ylabel('Arrival time')
    return None        

def basic_scatter(
    df, x='depart_datetime_nodate', y='arrival_time_max_nodate',
    format_y=True, format_x=True
):

    ax = sns.scatterplot(
        data=df.query('route == "93"'), x=x, y=y,
        hue='color', alpha=0.5,
        palette={'red': 'red', 'green': 'green', 'orange': 'orange'}
    )

    ax.legend(loc='lower right', title="Google's\nestimate\ncolor")

    format_time_axes(ax)

    return ax

ax = basic_scatter(df)

ax.fill_between(
    x=[pd.to_datetime('2022-09-08 05:30:00'), pd.to_datetime('2022-09-08 08:30:00')],
    y1=[pd.to_datetime('2022-09-08 12:00:00'), pd.to_datetime('2022-09-08 12:00:00')],
    alpha=0.1, color='orange'
)

Basically, leaving home any time between 5:30 am and 8:30 am puts me in the Boston traffic time warp: a period of unpredictable and highly variable traffic, when the trip can take a full hour more on a bad day than a good one. And despite Google’s conservative estimates of the badness of traffic (at least based on the color of the arrival estimates they provide), you can see that the worst times for traffic also fall within the time warp period.

We can look at the day-to-day variability using my favorite statistical method, eyeballing it (since each point is a day, it’s the vertical spread between points), or by calculating it directly:

# Difference in max travel time between days
(df.groupby('depart_time')['travel_time_max_minutes'].max()
 - df.groupby('depart_time')['travel_time_max_minutes'].min()
).reset_index(name='max_and_min_days_delta')

	depart_time	max_and_min_days_delta
0	5:00 AM	30.0
1	5:30 AM	40.0
2	6:00 AM	60.0
3	6:30 AM	50.0
4	7:00 AM	40.0
5	7:30 AM	40.0
6	8:00 AM	30.0
7	8:30 AM	20.0
8	9:00 AM	10.0
9	9:30 AM	10.0

During the Boston traffic time warp, your commute can differ by up to an hour depending on which day you leave. Outside of the time warp, though, it’s pretty consistent.
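
To make the "max minus min" calculation concrete, here is a small sketch in plain Python that applies it to just the five rows printed in the `df.head()` output earlier. Those rows are only part of the 5:00 AM data, so this is an illustration of the method rather than a re-run of the full analysis, and `to_minutes` is a stand-in for the parsing helper above.

```python
from collections import defaultdict

# The five rows shown in the df.head() output: (depart_time, travel_time_max)
rows = [
    ('5:00 AM', '1h50'),
    ('5:00 AM', '2h10'),
    ('5:00 AM', '2h20'),
    ('5:00 AM', '2h20'),
    ('5:00 AM', '2h'),
]

def to_minutes(s):
    """Stand-in parser: '2h10' -> 130, '2h' -> 120."""
    hours, _, minutes = s.partition('h')
    return int(hours) * 60 + (int(minutes) if minutes else 0)

# Collect every max travel time estimate under its departure time
by_departure = defaultdict(list)
for depart_time, travel_time in rows:
    by_departure[depart_time].append(to_minutes(travel_time))

# Spread = slowest estimate minus fastest estimate for each departure time
spread = {t: max(m) - min(m) for t, m in by_departure.items()}
print(spread)  # {'5:00 AM': 30}
```

The 30 minutes here lines up with the 5:00 AM row in the table above.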

Taking this further - leaving on a bad day might get you to Boston at the same time as leaving a full hour later on a good day. (I’ll say it again: leaving at 7:30 am on a bad day means you arrive at the same time as leaving at 8:30 on a good day :sob: - think of the extra hour of sleep you could have had!!) You can see this because the worst arrival time for a given departure time is the same arrival time as the best arrival time for a departure time an hour later - in other words, the highest dot for a given departure time is at the same vertical level as the lowest dot for a departure time that’s an hour later.

Outside of the time warp, in contrast, travel time to Boston is quite stable at around max two hours. But within the time warp period, the max travel time to Boston can get up to 3 hours depending on the day of the week. And that’s not even counting accidents, road work, or whatever else Google can’t predict!
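
The arithmetic behind the lost hour of sleep is simple enough to sketch. This assumes round numbers taken from the text (about three hours on a bad day, about two on a good one) rather than specific rows of the data, and `arrival` is a made-up helper for the illustration.

```python
from datetime import datetime, timedelta

def arrival(depart, travel_minutes):
    """Arrival clock time for a departure like '7:30' and a duration in minutes."""
    start = datetime.strptime(depart, '%H:%M')
    return (start + timedelta(minutes=travel_minutes)).strftime('%H:%M')

# Round numbers from the text: ~3 hours on a bad day, ~2 hours on a good one
bad_day = arrival('7:30', 180)
good_day = arrival('8:30', 120)

print(bad_day, good_day)  # both 10:30
```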

Let’s see if Google’s own estimates recapitulate the high variance during the time warp.

df['delta_min_max'] = df['arrival_time_max_nodate'] - df['arrival_time_min_nodate']
df.query('route == "93"').groupby('depart_datetime_nodate')['delta_min_max'].describe()['mean']

depart_datetime_nodate
2022-09-08 05:00:00   00:36:00
2022-09-08 05:30:00   00:42:00
2022-09-08 06:00:00   00:54:00
2022-09-08 06:30:00   00:58:00
2022-09-08 07:00:00   00:52:00
2022-09-08 07:30:00   00:44:00
2022-09-08 08:00:00   00:40:00
2022-09-08 08:30:00   00:32:00
2022-09-08 09:00:00   00:30:00
2022-09-08 09:30:00   00:28:00
Name: mean, dtype: timedelta64[ns]

df['delta_min_max_float'] = df['delta_min_max'] / pd.Timedelta(minutes=1)

ax = sns.scatterplot(data=df, x='depart_datetime_nodate', y='delta_min_max_float', alpha=0.5)
ax.set_xlim([pd.to_datetime('2022-09-08 04:30:00'), pd.to_datetime('2022-09-08 10:00:00')])

ax.set_xlabel('Departure time (A.M.)')
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))

ax.set_ylabel('Difference between earliest\nand latest arrival times (minutes)')

Yep, you can see that the range between the earliest and latest estimated arrival times that Google provides reflects the range we see when we look at the day-to-day variability in the latest arrival time. Google seems to have a narrower time warp though, with things really only getting dicey from 5:30 to 8 am. I’ve been caught by this before, flipping back and forth between different departure times the night before I have to head into Boston, simultaneously weighing how few hours of sleep I’m gonna get against the likelihood of there being traffic and my own confidence in Google’s optimistic estimates.

Ok great so the time warp exists, but that doesn’t solve my problem of still needing to drive into Boston sometimes. Let’s say I’d like to arrive to work between 9:30 and 10:30 am, what does my commute look like then?

ax = basic_scatter(df)

ax.fill_between(
    x=[pd.to_datetime('2022-09-08 04:30:00'), pd.to_datetime('2022-09-08 10:00:00')],
    y1=[pd.to_datetime('2022-09-08 09:30:00'), pd.to_datetime('2022-09-08 09:30:00')],
    y2=[pd.to_datetime('2022-09-08 10:30:00'), pd.to_datetime('2022-09-08 10:30:00')],
    alpha=0.3,
    color='orange'
)

Woof - on the absolute worst day, getting to Boston at 9:30 am means I’d have to leave at 6:30 am. But on the best day, I could leave at 7:30 am. To make the best use of my time and not get stuck in the time warp, I should really try to leave home at 8:30 or 9, which means I shouldn’t schedule anything important until after 11 am.
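The same window check can be done without a plot. This is a rough sketch, assuming each departure time has been reduced to an earliest and a latest arrival time. The rows in `trips` are invented for illustration, and `classify` is a hypothetical helper, not something from the analysis above.

```python
from datetime import time

# Hypothetical (departure, earliest arrival, latest arrival) rows, for illustration only
trips = [
    (time(6, 30), time(8, 20), time(9, 30)),
    (time(7, 30), time(9, 30), time(10, 30)),
    (time(8, 30), time(10, 20), time(10, 50)),
    (time(9, 30), time(11, 10), time(11, 30)),
]

# Target arrival window (the shaded band in the plot)
window_start, window_end = time(9, 30), time(10, 30)

def classify(earliest: time, latest: time) -> str:
    """Label a trip by how its arrival range sits relative to the target window."""
    if window_start <= earliest and latest <= window_end:
        return 'always inside'
    if latest < window_start or earliest > window_end:
        return 'always outside'
    return 'sometimes inside'

labels = {depart.strftime('%H:%M'): classify(e, l) for depart, e, l in trips}
print(labels)
```

Departure times labeled "sometimes inside" are the gambles: whether you land in the window depends on the day.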

However, I usually end up leaving around 8 am because arriving at 11 am is a little bit too late and too disruptive to my workday. Leaving at 8 am is sort of the balance point for me where I’m comfortable gambling on it being a good day (and thus getting to Boston early enough to enjoy a leisurely coffee before my 11 am meetings), but not so early that if I get stuck in traffic I’ll be very annoyed at all the time I wasted. Also, two and a half hours doesn’t feel too too bad for a commute in, but only because I don’t do it very often. From this analysis, though, it does seem like leaving at 8:30 am is probably a better bet - I don’t get to Boston that much later, but the day-to-day variability in my commute will be lower, thus leading to hopefully less frustration.

Anyway, I already mostly knew this - scheduling anything in Boston before 11 am is a gamble unless I’m willing to leave super early. I didn’t really realize just how early I’d need to leave - I would have guessed 6 am was fine but no, it’s 5 am or bust.

In conclusion, Boston morning traffic sucks and feels like a time warp, which the data confirms is a valid feeling to have. Google’s estimates are surprisingly optimistic, with the maximum arrival time corresponding best with my lived experience. (Note to self: ignore the earliest estimated time from now on). Also, Google thinks that Route 3 and I-93 are basically the same, but my experience shows that to not be true. Maybe Google needs to incorporate an “emotional frustration” parameter into their recommendation algorithm, which includes some weights related to the daily variability in the traffic as well as how well Google’s estimates actually perform. Finally, leaving between 5:30 and 9 am means that my commute is unpredictable and could potentially suck a lot. That means that trying to get into Boston between 7:30 and 10:30 am sucks, and I shouldn’t schedule any important meetings during that time period if I don’t want to have to leave super early. So long as my first meetings are after 11 am, 8:30 am seems to be the optimal time to leave, balancing the potential benefit of getting into Boston early enough to grab a coffee with the slight chance of hitting a bad traffic day and getting a little stuck in the time warp.

Next time, I’d love to do this analysis but stop just north of Boston and see what proportion of all this chaos is caused by the last 10 miles of the trip vs the other 60.

Early startup employee lessons learned, part 2: coping with the coaster

2022-11-25T00:00:00-08:00

Sometime last year, an advisor told me that one of the most useful tidbits he’d gotten about being in a startup was that the “roller coaster” description of startups isn’t quite right - it’s not really that you go up and down and up and down. When a startup is growing, it’s that the highs are higher and the lows are lower. It’s less of a roller coaster, and more of a sine wave with increasing amplitude. Somehow, this comforted me.

Coping with the coaster sine wave

As I wrote in Part 1 of this series, the most important thing that helped me cope was recognizing that unless you’re the founder, you can’t change the company. I encourage you to read Part 1 for more on this, but it’s worth reiterating because acknowledging one’s place within a growing company is a pre-requisite for any other coping strategy. Acknowledging this truth helped me let go of the ways I had envisioned myself influencing our company and come to peace with our new growth, culture, and my role in it. That said, I did pick up a few other strategies to help me cope with the emotional roller coaster/sine wave of being an early employee at an early-stage startup. These strategies are what I’ll focus on in this post.

Focus on your timeline

At the beginning of joining a startup, there are what feels like a million possible paths for the company to take. But as the company grows, the options necessarily narrow as you strive to find product-market fit and investors ask for focus and demonstrated impact. It’s important to recognize these changes and change your perspective with them. Otherwise, it’s easy to get stuck wishing that things would go a different way, but that route branched off a long time ago on the company’s timeline and is no longer available as a realistic option. It’s easy to give in to the FOMO and want to still be involved in or informed about every project, but that’s become physically impossible as projects have multiplied (for more on this, I highly recommend this popular “Letting go of your legos” post). Or it’s easy to find yourself wanting to do a job that simply doesn’t exist anymore in this version of the company. Just like founders and CEOs have to change their jobs every six months or so as the company needs different things from them, so do companies change what they need from early employees. It’s important to recognize that and be ready to adapt.

One of the best parts of being an early employee is that you get to be part of exciting conversations full of wild ideas and big dreams of where the company could go. But as the company grows, the people involved in those conversations change, and it’s very possible that you won’t be part of them anymore. This was especially true for me, as my role became consumed by day-to-day operations as we experienced our hyper-growth in a newly-remote working world. It was hard to suddenly find myself so far from the big picture, big dreams conversations I’d been invited to before. But it’s the founders’ jobs to dream - and like we mentioned in Part 1, if they don’t want or facilitate having you in there dreaming with them then there’s not much you can do about it. That means that if the company’s growth goes in a direction you don’t love or you start to see some changes in mission that you don’t agree with, it may be really difficult to process emotionally. In general, early employees likely feel the pain of the company’s shortcomings more acutely than other employees, because we were there from the start. The pain of the delta between what could be and what is hurts us more, because we are more acutely aware of all the “what could have been’s.”

In the early stages of a startup’s journey, all of the opportunities and timelines are still valid and available. But as the company grows, both it and its early employees must choose between opportunities, picking individual timelines and starting to travel down them. It’s important to not get stuck looking to your side at all the other parallel timelines you and your company aren’t on, because they’re not available - they branched off a long time ago, or were never available in this reality to begin with. I’ve found that the best thing to do is to focus on the timeline you’re currently on, recognize its merits in addition to its shortcomings, and put efforts toward making that timeline the best it can possibly be.

Divest from the mission (a little)

One useful way I’ve found to cope with getting farther from the early stage dreams is to actually divest myself from them. As startups grow, early employees go from being an integral contributor across all aspects of the company to just regular old employees. From the company’s perspective, you go from being a key and core element of its survival to just another cog in the machine. So it helps to do the same, and change your attitude toward the company: from a key and core element of your life, to just another part of your participation in capitalism. Recognizing that the job is just a job can be an important way to gain perspective and emotional distance from the sine wave. Of course - it’s still a really great job, ideally with meaningful impact and opportunities for growth, but still a job nonetheless.

Update your comparators

When I’m feeling burned out or stressed, or if startup chaos is generally getting me down, I’ve found that recognizing that it almost certainly could be worse is actually a helpful coping mechanism. Of course, the coping strategy assumes that you do like your work and the work your company is doing, that you enjoy working with most of your colleagues, and have a non-toxic relationship with your supervisor (or at least some combination of those features) - basically, that you do want to stick with this job but just have to figure out how to make it less emotionally draining.

As you grow and hire more people, ask about their horror stories! (This is especially important if, like me, being an early employee at this company is one of your first jobs.) When you vent to them about what’s going on, ask them to compare this situation to their prior experiences. You might realize that your problems aren’t that unique after all, and hearing that can be really validating. Or you might realize that what you’re experiencing is much worse than you thought, which could shed some important clarity on your situation. You might also be surprised - this might be the best job they’ve ever had! In any case, sharing horror stories can provide you with the perspective that while, yes, what you’re dealing with is difficult, it likely could be worse. I found that being constantly reminded of this made it much easier to deal with the hard stuff.

Focus on the baby steps

One concrete strategy I’ve picked up which has helped me deal with situations where I’m frustrated about something that’s out of my control is to focus on the baby step, not the toddler step. When something happens and I get grumpy because that thing could have been done so much better, instead of focusing on how it could be better (which is how things would be, if we were a toddler company), I try to focus instead on the fact that the thing was done at all (which is the first “baby” step we’re at). For example, if the communication of an important announcement is botched, rather than focusing on how poor the communication was, I try to reframe and emphasize that the announcement was communicated at all! Communicating well is the toddler step; communicating at all is the baby step I’m celebrating instead.

Avoiding burnout

Burnout is real, and avoiding or managing it is critical to coping with the early employee startup rollercoaster. For me, burnout hit early and it hit hard - I was operating at above 100% for over a year, and it’s taken me about the same amount of time to get back to a stable relationship with work. As an early employee in a company experiencing hyper growth, I was a critical piece of so many different aspects of our business operations powering our growth, which I imagine is an experience shared by many early employees. If there’s a foolproof way to avoid burnout, I don’t know it - but I have learned some strategies to process and recover.

First, recognize that you can only do so many jobs. One of my colleagues had a mantra that I found extremely helpful: when she said no to something, or intentionally dropped the ball on something, or explicitly decided not to address a problem at work, she’d say: “I could do that, but then it would be a full-time job and I would die.” When you put it that way, the choice is easy: don’t die.

Second, name all the jobs you’re doing, early and often. As I emerged from my most severe burnout, I found that naming all the jobs I was doing (and very much half-assing) was really helpful. For example, for most of 2022 I was a part-time group lead for one of our data science subteams in addition to being a technical lead, my “actual” job. That meant that when I felt bad for not being good at my group lead role, I would remind myself: it’s just half of my job. So as long as I’m doing it at least half as well as my colleagues, then that’s all we can ask for. Early on, this strategy of naming my jobs actually backfired - I would start writing down everything I did and then become overwhelmed with how many things there were and how impossible it all felt, and end up feeling worse than when I started. But I think that if I had started naming my jobs before they piled up, it would have helped me keep tabs on my growing responsibilities, recognizing which jobs I was letting slide and which ones were critical and therefore needed to be hired for. I think it would have also led to more productive conversations with my manager, helping me advocate for myself in more concrete ways than “I’m stressed, overworked, and burning out.”

This strategy feels obvious in hindsight but it took a while for it to set in for me: work intentionally and on your terms. I’ve had slack off of my phone and the red notifications turned off on my desktop for about a year now, and it’s been life-changing for gaining back control of how I interact with work. Next time I’m in a position where I feel that my company is heading into hypergrowth or I’m creeping toward burnout, I will immediately delete all of my after-work notifications and most of my in-work notifications too. Being principled and intentional about whether and when to work after hours and working on your own terms within working hours is key for managing burnout as an early employee. Of course, it’s important to communicate any changes in availability with colleagues so that they know how to reach you after hours if needed, since startups are unpredictable. But otherwise, focus on strategies to hold yourself accountable to respecting after-work hours. I wish I had removed my notifications much earlier, in part because I think it would have helped me get my mental health back much faster. But equally importantly, I think it would have also made it much more clear to my executive leadership just how much work and tricky troubleshooting we were doing after-hours to keep our daily operations running.

I recognize that it’s really difficult to cut the cord in this way, and in fact it felt impossible for me at the time when I needed it most. A huge part of the excitement of being an early employee is being involved in everything. When things are going well at work, it’s fun to operate with a super flexible schedule like back in grad school. And also like grad school, our identities are wrapped up in the work. Furthermore, hustle culture tells us that we should be working all the time, which resonates even more strongly as an early employee. And in the case of a startup undergoing rapid growth, it’s so exciting and nobody wants to miss a thing. But to grow sustainably, startups have to improve their bus factors to not just rely on a handful of passionate early employees and instead grow to have fully functioning and appropriately sized teams. Early employees play a huge role in this transformation, but it only works if we set boundaries and hold ourselves to them. In fact, I believe that a key company milestone that nobody really talks about is achieving redundancy for early employees - the sooner you can get in the habit of not being critical, the sooner your company will get there. So take a vacation, uninstall slack from your phone, close your laptop: remember, you’re actually doing your company a favor.

I hope these strategies and mantras are helpful if you are an early employee struggling with the emotional roller coaster of your experience. If you liked this post, check out the rest of the series on being an early startup employee:

Early startup employee lessons learned, part 1: affecting change

2022-11-19T00:00:00-08:00

The past two years have been a wild ride. The startup I work for went from 5 employees to over 100, one small customer to a multi-million dollar contract with the CDC, one product to over 10, and one customer report per month to hundreds per day - it’s been a lot!

And I’ve learned a lot, but most of it has come the hard way. The past two years have been among the most difficult in my life - yes, a large part of it was the pandemic and the acute experience of our public health systems failing us, but a large part of it was also work. In order to support our company’s growth, a lot was put on my shoulders - getting out from under that weight, and learning to function in the new company we’ve become, has been a huge challenge.

Now that I’m in a good spot and able to reflect on the past two years, I’ve realized that there aren’t a whole lot of resources out there to support early employees on their startup journeys. If you’re a founder, there are fellowships, accelerators, and communities that you can participate in that’ll teach you the nuts and bolts of founding a company and also give you a peek into the emotional rollercoaster you’re lining up for. These networks provide you with a community of peers you can reach out to for support, and perhaps even equip you with some strategies to navigate the founder journey and make it less draining.

But I haven’t seen similar resources targeted for early employees and their experiences. Googling “early employee startup” brings up a handful of blog posts, but they’re primarily focused on strategies to maximize work output and how joining an early startup is a great way to superboost your career growth. If emotional management is mentioned at all, it’s as an aside: “oh, being an early employee is an emotional rollercoaster so make sure you’re ready to handle it. But also think of all the potential career growth!” How to handle the journey isn’t discussed - and there’s even fewer resources if you don’t subscribe to the “work your ass off overtime, make your startup job your whole life” mentality.

Three years ago when I joined a startup as the fifth person on the team, I was naive and excited to have an outsized impact on an exciting woman-led company doing amazing work. And, of course, intrigued by the opportunity to superboost my career growth. But I now understand that I went into my job completely uninformed, swayed by all of the “joining a startup as an early employee is hard, but <1000 word blog post about all the ways it can be amazing for your career>” rhetoric. As things got really hard as we scaled and went through deep growing pains, it hit home for me how little I’d known about what I was signing up for. And I found that if you’re an early employee at a startup struggling to figure out how to scale with your company while maintaining your sanity and work-life balance, or if you’re already so burned out that you can’t keep hustling but you also don’t want to quit just yet, there doesn’t seem to be much out there for you. (If there is, please point me to it!!) So I wanted to write down some of the lessons I’ve learned.

I’ll caveat all of this with the requisite disclaimer that these are my personal experiences, and that things that work for me may not work for you. Even within my own company, each early employee has taken a unique path, and likely learned different things from their journeys. It’s also important to note that this is my first job out of grad school, and my founders were also fresh out of academia when they started the company. So it’s possible that what I’m about to share is obvious for anybody who’s had a job before, but I also know there are a lot of people in my shoes - excited to take a chance and join a startup they’re passionate about right after finishing grad school.

This will be a multi-part series, starting with part 1 here which focuses on affecting change within an organization. I’ll also touch on coping strategies for the roller coaster, things I’ve learned about building a team, and strategies I’ve picked up for hiring well and hiring fast.

Keep your arms and legs inside the ride at all times, folks - it’s gonna be bumpy!

You’re an employee, not a founder

Building organizations is not something that humans have figured out yet. Unless you have exceptional founders (and even if you do), organizational and systemic failures will abound as your company grows. This is especially true if you’re experiencing hyper growth - there’s just no way to grow that fast without dropping some balls related to company health. My company’s dropped balls with respect to organizational health and culture hit me especially hard.

What’s helped me cope is the realization that unless you’re the founder, you can’t change the company. This has been the most important thing for me to internalize as I’ve navigated my company’s growth. As the company grows, the founders decide everything, including how much impact non-founders can have on the company itself. If they want to bring you in to large decisions where you have a seat at the table, great. But if they don’t, there’s nothing you can do about it. And that’s ok! There are many valid phenotypes of founder, and at the end of the day the founders are the ones who decide what type of company they’re building. You’re just an employee.

That said, you don’t have zero ability to affect change within your company, in fact you have quite a lot! You just can’t make fundamental changes to the company as a whole, unless the founders are also actively on board. I joined my startup in part because I was really excited to help shape the type of company we’d become. When I realized that I wasn’t going to be able to exert influence on company-wide organizational culture, I really struggled. If I’d known going into my job that “shaping the culture” is just as much of a gamble as “cash out big when we go public,” I think I would have struggled a lot less.

Focus on your sphere of influence

Being an early employee at a growing startup puts you at a really interesting nexus of influence. On the one hand, your opinions have more weight than the average employee because of your long tenure and broad context. On the other hand, folks with more seniority and different expertises are being hired above and around you, increasing the number of layers between you and the founders. So making change goes from requiring just swiveling your chair to chat with the CEO sitting next to you to navigating a burgeoning hierarchy strategically and diplomatically. For the first 6-12 months of our hypergrowth, I really struggled with this - I felt like I was wailing into the void about all the things that were wrong and that we needed to change, to no effect. But as we’ve grown our team and hired some colleagues who are much more skilled diplomats than I am, I’ve picked up a couple of strategies to make change effectively in a growing organization.

Most importantly, the change you make must begin within your sphere of influence - the people and teams over whom you have influence, and not the ones outside your reach. As an early employee, your sphere of influence is often the whole company. But as the company grows, that changes - it becomes just your team and maybe also the team adjacent to yours, plus a few additional colleagues who you have strong relationships with. It can be painful to see this dynamic and feel like your sphere of influence is shrinking - but it’s not! Yes, you may go from having influence over 100% of the company to, say, 20% - which is a large number becoming smaller. But actually, it’s highly likely that your sphere of influence goes from 5 people to 25 - a 5x increase!

That’s what happened to me - in the early phases of our hypergrowth I maintained my influence over the majority of our growing company because I was handling so many aspects of our day-to-day operations. The founders were the first to leave my sphere of influence, as they focused on capitalizing on this moment to supercharge our company’s growth. As we grew, I had opinions on how our non-technical teams were growing, our market strategy, and so much more that I couldn’t do anything about - they were all things which I had no authority over and more importantly, were all under the purview of people outside my sphere of influence. In contrast, our data science team is well within my sphere of influence. Because of that, it was very easy for me to substantially shape our team’s culture, despite growing to almost 20 people. In fact, our transparent, collaborative, and positive culture is my proudest professional achievement so far. :D

Change starts at home

So does that make changing things outside your sphere of influence a mostly hopeless endeavor? Well, yes and no - you probably can’t change big things directly, but you aren’t powerless to influence your organization. That’s because grassroots efforts can lead to organizational impact. Even though you can’t change how the whole organization works, doing something really well within your own little world can resonate more broadly. It’s possible that other teams will become ready to tackle an issue that you’ve already solved, and come to you for inspiration or advice. Alternatively, folks may notice aspects of your team functioning better than theirs, and reach out to learn how. It can be less satisfying than directly wielding influence because you have to wait for other teams to be ready and in many cases to reach out, but that’s fine if it’s the best you can do. You can’t force anyone to change who isn’t ready to, or convince anyone to listen to you who doesn’t want to.

My favorite grassroots effort that’s led to company-wide adoption is the data science team’s onboarding document, which has become the template for other teams’ onboarding. And our document was initially inspired by the simple existence of the software team’s onboarding document. Our team also hosted a key cross-team training, which has become the model for inspiring other teams to think about formalizing their own cross-team interactions.

Influencing teams outside your sphere

When it comes to other teams that I’m not explicitly on, I’ve learned that making change is all about personal relationships. Even if you’re not on a given team, having strong relationships with key individuals can put them in your sphere of influence. And if they then have influence over their team, then you can indirectly have influence through them. Before we had siloed teams, my closest relationships were with folks who had started around the same time as me and my technical colleagues in software and data. Now that we’ve grown to 100 employees with siloed teams, those relationships still carry the most impact and are the primary - and sometimes only - way I can influence other teams.

A concrete example of how I’ve learned to adapt my influence is our culture around async communication: I’m a big proponent of frequent public communication in slack channels, and of sharing unfinished work early and often. (In fact, there’s a :surfacing: slackmoji made just for me!) Other teams at my company have a different culture, and that used to frustrate me so much. But I’m not in their team meetings, I’m not involved in hiring, and I don’t get a say in the culture they’re building - so it’s pointless to get upset or try to change it, when it’s so far outside my sphere of influence. All I can do is slowly nudge folks in the direction of transparency through one on one conversations. Over time, other teams have started making slack channels where they discuss their work publicly and open agenda notes that anyone can look to for async updates. It’s been a slow process and very much a team effort, but I like to think that my one on one conversations have contributed slightly to that cultural change. On my team, in contrast, I explicitly ask about communication in interviews, collaboration is one of our team values, and we force folks to share unfinished work as part of onboarding. As a consequence, the data science team is one of the most open and collaborative at our company. But that’s because the data science team is as much in my sphere of influence as you can get - in fact, I helped write our team values and design our onboarding!

Timing is everything

Finally, timing is everything when it comes to making change in a growing organization. Just because you have all the best ideas for how to grow a team or take your product to market doesn’t mean anything if the timing isn’t right. Maybe there isn’t enough personnel and bandwidth to implement your idea, or you haven’t built up enough conviction for your idea among the right stakeholders, or maybe there’s just some external forces you don’t see holding progress back - if you keep hammering away at your idea in an unreceptive environment, you won’t get what you want and in the process you’ll drive yourself mad and likely frustrate your colleagues too. This is especially important to recognize if you’re in a period of hypergrowth - there will be so many things that could be handled so much better, but it’s likely just so chaotic that folks are already at their max and doing the best they can. It’s hard and it sucks, but you just have to be patient. Luckily, there are so many things you can be doing right now - so it’s important to recognize when the timing isn’t right, and refocus your energy on things that you CAN achieve in the current moment.

So basically, figuring out how to make change at a growing organization is also a lot about all the ways you can’t make change in the organization. But knowing what you can’t do is important to stop wasting energy on Sisyphean tasks and instead refocus towards approaches that have a chance of achieving impact: starting small and biding your time, with the hopes that your local impact will ripple outwards and upwards to the rest of your organization.

If you liked this post, check out the rest of the series on being an early startup employee:

Finding free places to camp on a US road trip

2021-09-26T00:00:00-07:00

My parents just bought an RV and asked me to help them find places to camp as they travel. I’m all about finding free campsites not just because it’s a cheap way to see the US, but more importantly because our public lands are a national treasure. Dispersed camping is an amazing way to truly get out in nature, explore beautiful places off the beaten track, and truly benefit from and bask in the natural beauty of the United States.

Dispersed camping is allowed on basically all BLM and National Forest land, unless otherwise specified. When I was on my road trip, I learned how to find campsites. My general process was:

use google maps to find national forests, national monuments, or other conservation areas near where I was going
google the name of the conservation area to get to its BLM or USFS site
look to see if the site mentioned any actual campgrounds or specific areas for dispersed camping
poke around on the site to find and download a geospatial PDF map of the area
if that doesn’t work, look to other sources like freecampsites.net or apps (I used freeroam during my trip, which was decent)

The geospatial PDFs are especially useful to have, since they’ll often show more detail about roads than google maps has and they work with GPS even when there isn’t service so you know where you are and where you’re going.

Example 1: Sedona, AZ

Let’s do an example! My parents are looking to spend two nights someplace near Sedona, AZ.

First step: google maps. Great news! Sedona has lots of green space nearby, an excellent sign.

Looks like Coconino National Forest is the closest, so let’s start with that one. Googling it takes me to the forest’s site, after which I can go find camping info under the “Recreation” section in the left side-bar.

Funnily enough, clicking on this gets me to a page that has information about a “Digital Travel Map,” which sounds intriguing. But clicking on the “Maps and publications” sidebar gets me to an empty page, womp womp. But sure enough, looks like that Digital Travel Map link takes you to a very useful page where you can download GPS-enabled pdf maps of the forest! Huzzah!

Actually, this map is one of the best outcomes of this type of search. The map itself is huge and has a lot of detail, including specific “dispersed camping” indications. I never really understood these because technically dispersed camping is allowed in all National Forests, but I always felt more comfortable camping along roads that were explicitly indicated for dispersed camping. From reading the FAQ on the page where we got the Motor Vehicle Use Map (MVUM), it sounds like these roads are where you are allowed to drive off the road up to 300 feet in order to camp. I’m guessing that other roads allow dispersed camping, but that you just can’t drive off of them to go to your campsite. Given that my parents are gonna be in an RV and not a tent, these are probably their best bet.

Does this restrict where I may camp?
The MVUM does not restrict where visitors may camp on National Forest System lands. However, it does restrict where motor vehicles may be used for the purpose of camping. Use of motor vehicles away from designated roads for the sole purpose of camping is permitted on National Forest System lands up to 300 feet from the edge of a designated road where indicated by the MVUM’s “dispersed camping” symbol. Also, visitors may park alongside any designated road’s edge and walk to their campsite anywhere on National Forest System lands, except where specifically prohibited as indicated in closure orders. When parking along a designated road, drivers must pull off the travelled portion of the roadway to permit the safe passage of traffic.

Anyway, honestly at this point the map is more than sufficient. If you wanted to be extra safe, you could do some cross-checking with google maps satellite view to pick the nicest spot but generally any of the forest service roads marked for dispersed camping will likely be good options.

Just for completeness, let’s also go see what the forest service has to say about campgrounds in this forest. Wow - this forest is well-described! The camping page has so much information about campsites as well as dispersed camping. I especially appreciate the “Sedona Dispersed Camping Guide” pdf – it always made me feel so much better to see dispersed camping explicitly called out (though, of course, I always knew it was allowed).

Example 2: Quartzsite, AZ

My mom told me they also need to find a place to stop for the night around (or east of) Blythe or Quartzsite, AZ.

Again, first stop google maps. This one looks like it might be a bit harder to find spots – I see some green area around the river and then that brown box that’s the Kofa Wildlife Refuge. Let’s check both out and see if we can find more info or maps.

First try, the Colorado River Reservation. Seems to be Indian Reservation Land, so probably won’t have any camping available. Let’s poke around their site for just a few minutes and confirm though. Sure enough.

Ok, back to google maps - looks like there’s some green below Blythe. Seems to be the Cibola National Wildlife Refuge. Nothing obvious about camping on their website, so I next googled the name of the refuge plus “camping”. I saw some websites saying that there should be dispersed camping, but I tend to poke around until I find an actual BLM or Forest Service page or map about the place.

So let’s put that on the backburner and keep going east, to the Kofa Wildlife Refuge. Same story here, no clear indication of campgrounds or dispersed camping in this area.

This is a situation where I turn to the aggregator websites. Looks like freecampsites.net has a few options in this area, so let’s go through them and see if any look good for my parents.

Clicking around the different free sites, the first thing I look for is if there’s anywhere that’s a BLM campground. Again, these just feel more legit and like less of a wildcard. Looks like there might be one, “Hi Jolly BLM”. So my next step here is to google the campground name itself and see if I can find it on the BLM site. I didn’t find anything on the BLM site, but the next best thing is there! Looks like there are photos of this site (e.g. on this website), with clear BLM signposts indicating that it’s a legit campsite. It doesn’t look like it’ll be particularly scenic, just a flat dirt parking lot with a bunch of RV’s, but given that my parents will just be passing through this is more than sufficient!

At this point, I would google this campsite and add a star on my google maps, zoom in to the map to see if I can tell if there are roads, and read up on the reviews and descriptions of the camp so I know where to go. That said, I always like to have a couple of options when I’m not equipped with a map of the area I’m going to, so let’s look for at least one more.

Back to the freecampsites.net site, looks like there’s a spot off of I-10 on Gold Nugget Road. The trick to finding these sorts of sites is to just google the road name that they’re talking about, read reviews, and look at pictures to get a sense for how sketchy or not it might be. In this case, it looks like Gold Nugget Road goes just off of I-10 and then meets up with it again. There are also a handful of reviews on the internet for this road, so I feel like it’s pretty legit.

If you wanted to be EXTRA sure, you could also always try to find the BLM map of the area directly. Avenza has a marketplace of maps, and while it can be a bit painful to find what you need there is often a BLM map of each area that’s available for download. Some maps on Avenza cost money, but all the BLM and Forest Service ones should be free. A search for BLM maps near Quartzsite, AZ yields a few hits that seem to be what you’d need. The goal with these maps is to (1) have a map that works even if you don’t have service and (2) get some more information about the land you’re on. Some BLM land is interspersed with private land, which you don’t want to camp on. So having a map that clearly shows the public land is a great comfort and tool to have!

Resources for pivoting to public health data science

2021-01-09T00:00:00-08:00

About two years into my PhD, I realized that the field I actually wanted to be in was public health, not necessarily biological engineering. Around the same time, I also fell in love with coding and data science. That’s when I realized that combining public health and data science could be an ideal career path for my technical abilities and interests and my desire to have social impact. But immersed in the world of academia, at an institution without a school of public health, and with mentors who had all chosen routes in biotech or academia, it was really hard to learn more about my options for pivoting to a career in public health.

So I started to scrounge around and look for opportunities for a highly-trained technical person like me to pivot into public health and social impact. I never actually ended up pursuing any of these opportunities because my former labmate started Biobot Analytics, which was an obvious career fit for me. But I know not everyone is lucky enough to have a unique opportunity like Biobot, and I’ve had a few folks over the years ask me about transitioning to public health and/or data science from a PhD. So I’ll excavate the list of links and resources I’ve accumulated, in the hopes that they are one day useful to someone. (Again, I didn’t actually apply to any of these (except the Luce), so I can’t speak to what they actually are.)

Public Health Fellowships

CDC EIS

https://www.cdc.gov/eis/index.html

Epidemic Intelligence Service, 2-year fellowship for doctors (healthcare professionals and PhDs). You get placed in a CDC office (you don’t get a say in where you get placed, I think) and work on the front lines of epidemiological response.

From informational interviews and asking around, it seems that EIS is pretty legit and prestigious in public health, and is a very common way for non-MPH’s to break into a career path at the CDC.

CDC Fellowships

https://www.cdc.gov/fellowships/full-time/index.html

CDC has a list of many fellowships for bachelor’s, master’s, and PhD-level candidates. Actually, now that I look through this again this is probably the best place to start.

ORISE Fellowships

https://orise.orau.gov/internships-fellowships/index.html

Fellowships across a broad variety of government agencies (including the CDC), available at the undergraduate, graduate, and postdoctoral levels.

APHL-CDC Fellowships

https://www.aphl.org/fellowships/Pages/About-the-Fellowship-Program.aspx

A few different types of fellowships available with the Association of Public Health Laboratories, including a mix of lab and computational fellowships. Looks like the fellowship descriptions vary slightly, but all involve some placement in a state, local, or federal public health laboratory to do real-world public health work.

Postdocs with a public or global health focus

Big Data-Scientist Training Enhancement Program (BD-STEP)

https://www.va.gov/oaa/specialfellows/programs/sf_bdstep.asp

Looks like this is a VA-sponsored fellowship that you can apply for at multiple locations (and presumably projects). Seems like a semi-generic postdoc, except that I imagine you have excellent access to cool VA data.

Fulbright-Fogarty Fellows in Public Health

https://us.fulbrightonline.org/about/types-of-awards/fulbright-fogarty-fellowships-in-public-health

Looks like this is a public health-specific Fulbright.

Global Health Program for Fellows and Scholars

https://www.fic.nih.gov/Programs/Pages/scholars-fellows-global-health.aspx

12-month research fellowships in low- and middle-income countries, administered through Harvard, UC Berkeley, University of California Global Health Institute, UNC, University of Washington, and Vanderbilt. I’m guessing you apply directly through one of the participating institutions, and I imagine it’s a fairly generic postdoc fellowship.

International Research Scientist Development Award (IRSDA)

https://www.fic.nih.gov/Programs/Pages/research-scientists.aspx

Funding for a postdoc or junior faculty to do research in a low- or middle-income country.

International fellowships

These aren’t public health-specific, but you can swing a pivot into a new field with these.

Luce Scholars

https://www.hluce.org/programs/luce-scholars/

Not public health-specific, but how could I not include the best fellowship in the entire world? :)

The Luce Scholars program places you in a job in an Asian country for a year. No requirements beyond being smart and driven, having a degree from a qualified institution, being a US citizen under 30, and having had little to no exposure to Asia. If you get it, you can go work at a local public health agency or public health-focused NGO.

Princeton in Asia

https://piaweb.princeton.edu/about-us

Like the Luce but less competitive, more participants, and less well-paid. But also has a lot more options than the Luce: the Princeton in Asia program basically funds a bunch of internships all across Asia. Lots of public health options here.

From my experience in Cambodia, PIA is far less structured than the Luce (you’re basically just in an internship on your own) but you have more co-fellows in your country so it’s easier to find community.

Gates Foundation Global Health Fellows

https://www.gatesfoundation.org/Careers/Gates-Fellowships-FAQ

Looks like the fellowship is currently on pause, but intended to relaunch by 2022.

Global health corps

https://ghcorps.org/

Seems similar to Princeton in Asia in that there are many placements to choose from (only in Rwanda, Uganda, Malawi, and Zambia).

Fulbright

https://us.fulbrightonline.org/

Of course, there’s always a Fulbright. I wasn’t ever interested in pursuing a Fulbright because you have to come up with your own project (and I’ve heard you get basically zero support while in-country), but it’s definitely an option for folks who already have a clear idea of what they want to do.

Data science fellowships and postdoc funding

Data Science for Social Good (DSSG) Fellowship

https://www.dssgfellowship.org/

One of the first and most well-regarded data science for social good fellowships. Spend a summer working closely with governments and non-profits to apply data science to have real-world social impact. Not directly public health, but I’m sure many projects are health-related!

IBM Social Good Fellowship

https://www.ibm.com/ibm/responsibility/initiatives/IBMSocialGoodFellowship.html

From the website: “The IBM Social Good Fellowship is an opportunity for graduate students and postdoctoral scholars to develop their skills and develop data science solutions that benefit humanity.”

BIDS Data Science Fellows Program

https://bids.berkeley.edu/call-data-science-fellow-applications

Funding for a 2-year fellowship at the Berkeley Institute for Data Science (BIDS).

Columbia Data Science Institute postdoctoral fellowships

https://datascience.columbia.edu/research/postdoctoral-fellows/

Looks like a generic postdoc fellowship to work with Columbia DSI faculty.

Also in general, lots of institutions are starting up data science institutes which usually come with data science-specific opportunities.

Schmidt Science Fellows

https://schmidtsciencefellows.org/

Kind of like the Luce, but if it were a postdoc and for scientists wanting to pivot to a different field.

Policy fellowships

AAAS

https://www.aaas.org/programs/science-technology-policy-fellowships

The classic. AAAS Science Policy Fellowships are an excellent way to get hands-on experience working on science policy issues in the federal government. The congressional (legislative) fellowship supports two people each year to go work in the office of a member of Congress (my friend just finished it, she worked with Ed Markey!). The executive fellowship has a lot more openings, and there you can work in basically any federal agency.

The Christine Mirzayan Science & Technology Policy Graduate Fellowship Program

https://mirzayanfellow.nas.edu/Default.asp

I actually know very little about this one, though from the website looks like it’s at the National Academies of Sciences, Engineering, and Medicine in DC and for only 12 weeks.

Other federal opportunities

Honestly, the best thing you can do is to sign yourself up for as many emails from federal agencies as possible.

I think once you enter into one agency’s email subscription management service, it’ll give you options to sign up to other agencies’ emails as well. I find it easiest to start with the CDC emails and go from there. NIAID and NIH emails have been mostly useful, but there is a whole treasure trove of federal agencies with intriguing sounding names!

These emails are a good way not only to get notified of potential cool opportunities (including data science!), but also just to better understand the landscape of federal agencies beyond the CDC.

Public Health Service Corps

https://www.usphs.gov/

Did you know that there is an official uniformed service corps in the US Public Health Service? This is the group that’s led by the Surgeon General. I have no idea what applying or working for the public health service corps entails, but it’s good to know it exists!

Data science at the NIH

https://datascience.nih.gov/workforce-development/fellowship-job-opportunities

Looks like there’s a handful of data science-related opportunities available through the NIH’s Office of Data Science Strategy.

18F

https://18f.gsa.gov/join/

18F works with government at many levels to modernize their software development.

US Digital Service

https://www.usds.gov/apply

Seems similar to 18F, but with a bit more emphasis on transforming existing tools and processes. This blog post describes 18F as “build it / buy it” and USDS as “fix it”.

Presidential Innovation Fellows

https://presidentialinnovationfellows.gov/

“Embedded within agencies as “entrepreneurs in residence” for one year, our fellows bring the best of data science, design, engineering, product, and systems thinking into government.” The blog post above describes this as “Try it”.

Job boards

These job boards are for a mix of GovTech, political tech, public health, computational biology, and related.

USAJobs (https://www.usajobs.gov/): federal government’s official employment site.
US of Tech (https://www.usoftech.org/): more IT and software development focused, US of Tech is trying to get skilled technical folks into government. Job and internship opportunities across a variety of agencies.
Outer Join (https://outerjoin.us/): recently found this one. Not focused on anything government or public health, just generic data science postings. All are supposed to be remote-friendly.
Progressive Data Jobs (https://www.progressivedatajobs.org/): data science jobs in progressive and Democratic campaigns and organizations.
Higher Ground Labs (https://jobs.highergroundlabs.com/): another job board for progressive tech
Jobs That Are Left (https://groups.google.com/g/jobsthatareleft?pli=1): Google group email list for a bunch of jobs in progressive spaces. Majority of the jobs are to work on campaigns, but I’ve also seen quite a few interesting data science jobs come through this list.
All Hands (https://www.all-hands.us/): looks to be less of a job board and more of a recruiting site. You submit your resume to join the talent pool, and then they connect you to jobs? No idea how effective this is.
Coding it Forward (https://www.codingitforward.com/): has an email list with weekly job drops. Focused a little more broadly than the other political tech ones, they focus on social impact and civic technology.
Fast Forward Tech (https://www.ffwd.org/tech-nonprofit-jobs/): focused on tech nonprofits.
IDD Jobs (https://iddjobs.org/): job board for fields related to infectious disease dynamics, which overlaps considerably with epidemiology and public health-relevant worlds.
Code for America job board (https://jobs.codeforamerica.org/search): job board for Code for America, opportunities in public interest tech.

Racism as a public health crisis: how wastewater epidemiology fits in

2020-06-10T00:00:00-07:00

Today is the Strike for Black Lives and a day to #ShutDownSTEM. For white people like me, today is about recognizing and reflecting on the anti-Black racism in our society, and committing to specific actions toward ending white supremacy. One of my actions for today is to publicly reflect on how our work at Biobot Analytics contributes to addressing – and potentially perpetuating – racism in public health.

I’ll be going through this excellent Washington Post opinion piece by Dr. Michelle Williams (Dean of the faculty at the Harvard School of Public Health) and Jeffrey Sánchez (former MA State Rep and fellow at HSPH):

Racism is killing black people. It’s sickening them, too.

I read this the other day and saw that wastewater epidemiology has a role to play in essentially every issue brought up in the piece. I’ve been thinking about many of these issues for a while now, but haven’t ever written them down. Hopefully in doing so, I can plant the seed for new ideas or encourage existing ones to grow, sparking conversations within my own company and the broader wastewater epidemiology community.

Across the country, black Americans suffer from higher rates of diabetes, hypertension, asthma and heart disease than white Americans. They are more likely to be obese and get insufficient sleep, which can contribute to such health issues. The role of racism in these underlying conditions cannot be denied.

A growing body of literature shows that social determinants — otherwise known as the conditions in which we’re born and in which we live, work and play — are key drivers of health inequities. For generations, communities of color have faced vast disparities in job opportunities, income and inherited family wealth. They are less likely to have housing security and access to quality schools, healthy food and green spaces. All these factors undoubtedly undermine mental and physical well-being.

One of the most impactful aspects of looking to sewage as a source of health information is that everybody pees. Regardless of access to healthcare, economic opportunity, education level, or anything else – everybody pees. And if you’re one of the majority of Americans who are connected to sewage infrastructure, then the health information you flush down the toilet is accessible through city sewers. That means that we can use sewage to monitor the health of people who might not have access to healthcare for any variety of reasons, and who therefore aren’t traditionally captured in clinical statistics.

Many crucial social determinants of health are difficult to quantify and therefore study. One of the other things that I find so exciting about wastewater epidemiology is that you could use it to measure these factors and open up new avenues of research and evidence-based policy making. For example, using wastewater to monitor community-level nutritional intake could change the way we identify and study food deserts, and directly quantify the impact of fresh food programs on the local communities who are intended beneficiaries.

Racism-associated stress and its biological consequences

In addition to the consequences of structural racism, it is well-documented that racism itself is hard on a person’s health. Chronic stress caused by discrimination can trigger a cascade of adverse health outcomes, from high blood pressure and heart disease to immunodeficiency and accelerated aging. Evidence even suggests that the racism endured by black mothers contributes to the alarmingly high maternal and infant mortality rate.

As a bioengineer, it’s wild that my training never covered the biological effects of racism-induced stress compounded over a lifetime. There is certainly a large body of research on health issues linked to racism-related stress, but a disproportionate amount of biomedical science is focused on finding genetic markers to explain different rates of disease in sub-populations like racial groups. That had always annoyed me as a scientist uninterested in human genetics, but even more so when I realized that there was this whole other body of research that our field could have been prioritizing instead. And when you zoom in on how these stressors affect health outcomes for Black mothers in this country, the tragedy really crystallizes.

What if we measured biological markers of stress at a community-level through sewage? We could use wastewater epidemiology to show the extent of the biological impact of racism, for example by comparing stress markers in heavily policed communities vs. those with community-led neighborhood watches. Maybe sewage could open up a whole new field of research, directly measuring the biological effects of our racist and unjust society, and paving the way for improvements that rectify and reverse these negative impacts.

Essential workers and unequal access to economic opportunity and public health prevention

Black and brown Americans make up a disproportionate number of essential workers who have stayed on the job through lockdowns, and thus are at higher risk of contracting the disease. And when they do fall ill, they are more likely to receive worse care than white Americans do. That’s true even when controlling for socioeconomic factors such as income and education.

The burden of COVID-19 is not evenly distributed, and neither is the ability to implement preventative measures like staying home from work. That’s resulted in extremely disparate impacts, with Black people and other communities of color bearing a much greater share of COVID-19 deaths than their share of the population.

Here again, wastewater epidemiology could provide a quantitative and direct way to measure and monitor these disparities. By moving measurements upstream and into city manholes, we could identify new surges of COVID-19 on a community-by-community basis, mobilize testing centers to the areas where they are most needed, and make sure that even if certain communities aren’t being tested, they are being counted and served.

And it’s not just COVID-19 where this line of reasoning applies: with opioids, we’ve also realized that wastewater monitoring could be leveraged to identify communities who are experiencing high levels of opioid use and even overdoses (determined by measuring Narcan, the overdose reversal treatment) but who aren’t calling first responders and therefore have very low overdose numbers captured in official statistics. Thinking about quantifying these sorts of “treatment gaps” through wastewater could provide public health and city officials with yet another tool to address disparities within their local communities.

Environmental racism

Environmental racism is another topic that I’m baffled was never covered in any of my scientific training. Across the country, low-income communities of color are more likely to have factories and other sources of pollution built near them, further exacerbating health disparities. The disproportionate exposures to pollution faced by low-income communities of color are not passive mistakes, but rather the result of a systemically racist society.

This is another area where I’m excited by the potential of wastewater epidemiology to contribute to how these issues are studied, monitored, and improved. For example, measuring biomarkers of exposures to pollutants could complement associative studies linking toxic exposures to long-term health outcomes in individuals living in or from communities most affected by environmental racism. Imagine if the EPA’s metrics controlling what factories are and aren’t allowed to dump in the water weren’t about how much the factories were dumping, but rather about the direct health effects they were having on nearby populations.

Wastewater epidemiology as a potential tool of oppression

I’m excited about the prospect of sewage-based monitoring as a tool for quantifying health inequities by directly measuring the biological impacts of racist systems on individuals and communities. But I recognize that as with all other emerging technologies, this one is not without risks.

Yes, we could use wastewater epidemiology to shine a brighter light on social determinants of health and establish direct links between socioeconomic conditions and health outcomes. But we could also use wastewater epidemiology to entrench stigma and justify inequitable policies. For example, you could imagine insurance companies using sewage-based indicators as “objective” measures of community health, and varying premiums based on which neighborhood you live in. I could absolutely see an argument being made that such sewage-derived metrics are “objective” measures free from bias and therefore legitimate to act on. But it is clear that such measures would just be thinly veiled proxies for existing inequities.

Yes, wastewater epidemiology could be used to highlight the shockingly high rates of COVID-19 in communities with many low-income, non-white essential workers. And it could be used as an early warning for reemergence of COVID-19 cases on college campuses, thus providing administrators with finer and more responsive control over when to implement control measures. Or it could be used to justify unsafe return-to-campus or return-to-work policies, wherein the absence of COVID-19 in the sewers would justify the “safety” of forcing workers back to work even if they do not feel safe doing so.

And finally, even though sewage-based monitoring has the potential to revolutionize how we monitor the health of the majority of Americans, there’s still a non-negligible portion of the population that is not serviced by sewers. As we advocate for additional federal funding to integrate wastewater-based monitoring into standard public health practice, we must recognize which populations will be excluded. Even in the US, sanitation is not a solved issue. Those without access to sewer systems may also be those with the least access to public health services. Whether our work serves to increase these inequities or decrease them is up to us.

Making change goes beyond sewage

Which brings me to my last point. At the end of the day, wastewater epidemiology isn’t going to solve any of these societal issues. Sewage isn’t going to tell us anything we don’t know: we don’t need wastewater epidemiology to know that racism is bad and that it contributes to health disparities. At its best, wastewater epidemiology will provide additional concrete evidence to motivate change and actionable metrics to quantify improvements. At worst, it will be deployed thoughtlessly and in ways that further entrench existing disparities. It’s up to us, the technology leaders and entrepreneurs working to integrate wastewater epidemiology into standard public health practice, to make sure that doesn’t happen.

Mapping my cross-country road trip with Python

2020-02-09T00:00:00-08:00

When I was on my cross-country road trip last year, I kept track of a lot of things in the hopes of doing amazing analyses when I got back. Turns out having a job as a data scientist makes it a lot harder to find time to do data science on the side, and so I’ve only really gotten a chance to look into my expenses. One of the things I had really wanted to do was make a map of all the places I went, but I didn’t actually know how to work with geospatial data in Python yet. I had this grand idea that I’d learn spatial data techniques while on my trip, but turns out hiking, drinking beer while watching the sunset, and going to bed early were way more compelling ways to spend my time. Luckily for me, one of the most fun parts of my new job has been learning geospatial coding and plotting techniques. I’ve been having fun with it in my job, and I realized that it also means I get to finally make my map!

In this post, I’ll go over geocoding (which I’ve never done before!), plotting points and lines on a map, and adding a background map. I will note that I’m still very early in my geospatial analysis days and most of these coding tricks are things I’ve picked up in the last couple of months, so I’m sure future-me will revisit this post in a few years and cringe at some of the coding choices I’ve made. But who cares, I wanna make the map! Let’s do it!

import pandas as pd
import geopandas as gpd

import contextily as ctx
import shapely

from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

from time import sleep # to limit geocoding rate in for loops

import matplotlib.pyplot as plt
from matplotlib.colors import to_hex
import seaborn as sns
%matplotlib inline

For this map, we’ll use the dataset where I tracked my mileage. Because I am a human and not a robot, I wasn’t super consistent about it and didn’t always manage to get the mileage at each place I stayed at, for example. I also sometimes wrote down mileage multiple times a day, especially on days when I drove a long time. But I made up for it whenever I could, and I think I got a pretty decent coverage over the whole trip, with at least the majority of my camps tracked. Here’s what the data looks like:

df = pd.read_excel('mileage.xlsx')
# The last two columns are extra, drop those
df = df.iloc[:, :-2]
print(df[['city', 'state']].drop_duplicates().shape)
df.head()

	day	city	state	mileage	time	part
0	2019-02-17	san diego	ca	41491.0	midday	1
1	2019-02-17	sonoran desert national monument	az	41889.0	night	1
2	2019-02-18	mesa	az	41948.0	midday	1
3	2019-02-18	tucson	az	42067.0	evening	1
4	2019-02-18	indian bread rocks blm	az	42179.0	night	1

Because I tried to track my mileage every night but sometimes forgot and got it in the morning, I included a column indicating roughly what time of day I recorded that mileage. That’s what’s in the time column. The part column on the right indicates whether it was part 1 (San Diego to Boston) or 2 (Boston to San Diego) of my road trip, and everything else is pretty self-explanatory.

Geocoding locations

Before I can make a map, I need to get the actual locations of the places I wrote down. From some poking around, I found that the geopy library has wrappers for many different geocoders. I went ahead and tried the one called Nominatim, because it seems to be fairly standard and doesn’t require an API key. If I wanted to geocode more locations, I’d be much more careful about reading up on the terms of each provider and making sure I wasn’t abusing their service. But since I only have about 70 places to geolocate, I’ll just stick with this one.
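To give a sense of what "not abusing the service" can look like, here is a minimal sketch of a wrapper that spaces out calls and caches repeated queries. It is not geopy's implementation: the names polite and fake_geocode and the delay value are made up for illustration, and the stand-in geocoder never touches the network.

```python
import time
from functools import wraps


def polite(min_delay_seconds):
    """Decorator: wait between real calls and cache repeated queries."""
    def decorator(func):
        cache = {}
        last_call = [None]  # time of the most recent real call

        @wraps(func)
        def wrapper(query):
            if query in cache:
                return cache[query]  # repeated query: no new request
            if last_call[0] is not None:
                wait = min_delay_seconds - (time.monotonic() - last_call[0])
                if wait > 0:
                    time.sleep(wait)
            result = func(query)
            last_call[0] = time.monotonic()
            cache[query] = result
            return result
        return wrapper
    return decorator


calls = []  # records every query that reaches the stand-in geocoder


@polite(min_delay_seconds=0.05)
def fake_geocode(query):
    # Hypothetical stand-in for a real geocoder; returns made-up coordinates
    calls.append(query)
    return (0.0, 0.0)


start = time.monotonic()
for q in ['san diego ca', 'mesa az', 'san diego ca']:
    fake_geocode(q)
elapsed = time.monotonic() - start
```

The third query is a repeat, so it is answered from the cache and only two calls reach the geocoder. With a real provider the delay would be closer to a second per request.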

First, I need to set up my geolocator object. Then, I’ll get just the unique cities that I wrote down in my spreadsheet, geolocate them, and put them back into my original dataframe.

# Set up geolocator object
# Increase the timeout so I don't get timeout errors as much
geolocator = Nominatim(user_agent="road-trip-map", timeout=10)
# Make sure to limit requests to abide by terms
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

# Get the unique cities in my dataframe
cities = df[['city', 'state']].drop_duplicates()
# Convert the two columns city and state into one string
cities['to_geolocate'] = cities['city'] + ' ' + cities['state']

# Geolocate the cities - this takes a couple of minutes
cities['location'] = cities['to_geolocate'].apply(geocode)

Let’s take a look at what this looks like:

cities.head()

	city	state	to_geolocate	location
0	san diego	ca	san diego ca	(San Diego, San Diego County, California, Unit...
1	sonoran desert national monument	az	sonoran desert national monument az	(Sonoran Desert National Monument, Riggs Road,...
2	mesa	az	mesa az	(Mesa, Maricopa County, Arizona, United States...
3	tucson	az	tucson az	(Tucson, Pima County, Arizona, United States o...
4	indian bread rocks blm	az	indian bread rocks blm az	(Indian Bread Rocks Picnic Area, Happy Camp Ca...

I’m super impressed at this geocoder, it even got the random BLM campground I stayed at in Arizona (Indian Bread Rocks)! I remember this one well - I was super worried about it because I’d be getting in after dark and there weren’t too many backup options in this stretch of road, and the first random BLM campground I’d stayed at was quite difficult to find (Google maps took me down some dirt road straight to a “no trespassing” fenced-in compound, and then when I got back on the correct dirt road I took a wrong turn and got myself into some trees. Sorry for the scratches on the car, Dad!) Anyway, this camp was way better - after some delicious tacos in Tucson, I remember that it was super easy to find and well-signed, and there was an open spot for me, a toilet, and even some cell service! I snuggled up with a beer and a funny family group chat, and when I woke up the next morning everything was covered in snow. I think it was the first morning where I truly felt like I was on a road trip, I knew what I was doing, and it was gonna be amazing.

All that to say, it’s cool that the geocoder found it. Let’s see if there are any places it couldn’t find, and if I can fix those manually:

cities[cities['location'].isnull()]

	city	state	to_geolocate	location
6	big bend hot springs	tx	big bend hot springs tx	None
19	great smoky mountains national park	tn	great smoky mountains national park tn	None
24	mississippi river state park	tn	mississippi river state park tn	None
26	haw creek falls camp	ar	haw creek falls camp ar	None
31	lake clayton state park	nm	lake clayton state park nm	None
33	joe skeen blm campground	nm	joe skeen blm campground nm	None
38	south rim grand canyon national park	az	south rim grand canyon national park az	None
50	grand staircase escalante	ut	grand staircase escalante ut	None
55	ken's lake campground	ut	ken's lake campground ut	None
57	rabbit valley jouflas campground	ut	rabbit valley jouflas campground ut	None
77	norris junction yellowstone national park	wy	norris junction yellowstone national park wy	None
78	mammoth campground yellowstone national park	wy	mammoth campground yellowstone national park wy	None
79	old faithful yellowstone national park	wy	old faithful yellowstone national park wy	None
80	turpin meadow camp	wy	turpin meadow camp wy	None
81	grand tetons national park	wy	grand tetons national park wy	None
86	catnip reservoir sheldon wildlife refuge	nv	catnip reservoir sheldon wildlife refuge nv	None

Interesting, lots of campgrounds within parks here that I suppose make sense are hard to find, but others which are more surprising (e.g. “Grand Tetons National Park” – what’s going on there?).

I’ll just go through each of these one by one and try to figure out what’s causing the issue. In some cases, I imagine just zooming back out to the park will be enough. I’ll just try variations of each unfound string in the geocoder, and then check if the location it returns is the one I wanted. I won’t include all the code with my back-and-forths here, but I will track what the problem was in the comments in the code block below:

# This dictionary maps "current location name": "correct location name"
fixgeolocate = {
    'big bend hot springs tx': 'big bend national park tx', # too specific
    'great smoky mountains national park tn': 'great smoky mountains national park', # the coder has this in NC
    'mississippi river state park tn': 'st francis national forest', # this place had two names, v confusing
    'haw creek falls camp ar': 'haw creek falls', # this is the one I almost got flooded in, in the Ozarks
    'lake clayton state park nm': 'clayton lake state park nm', # messed this up every time, beautiful place tho
    'south rim grand canyon national park az': 'grand canyon national park az', # too specific
    'grand staircase escalante ut': 'Grand Staircase-Escalante National Monument', # too lazy to write it all out
    'rabbit valley jouflas campground ut': 'rabbit valley co',
    'norris junction yellowstone national park wy': 'norris geyser wy', # too specific
    'mammoth campground yellowstone national park wy': 'mammoth campground wy',
    'old faithful yellowstone national park wy': 'old faithful wy',
    'turpin meadow camp wy': 'turpin meadow campground wy',
    'grand tetons national park wy': 'grand teton national park wy', # typo, guess there's only one teton
    'catnip reservoir sheldon wildlife refuge nv': 'catnip reservoir nv',
    'joe skeen blm campground nm': 'BLM El Malpais',
    "ken's lake campground ut": "Ken's Lake", # camp that took forever to find outside of moab...

}

So turns out there’s only one Teton in the park name (even though there are multiple mountains!). The geocoder also had trouble finding the Joe Skeen campground at El Malpais, because it’s apparently got a typo in whatever database it’s pulling from (it has it listed as “Jpe Skeen”, womp womp). Honestly it was one of the best free campgrounds I found – I stayed there two nights, after I unsuccessfully tried to find a spot at the much smaller (and more popular) campground at El Morro monument. Also I chuckled a bit that the geocoder couldn’t find the Ken’s Lake campground, because I also had a lot of trouble finding it! It was a campground outside of Moab, which I paid something like $20 for after driving around for hours trying to find a place to camp. Note to future travellers: when it comes to Moab, book ahead!
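The back-and-forth of trying variations could be partly automated. Here is a rough sketch, assuming a helper (first_match, a made-up name) that walks through candidate strings and keeps the first one the geocoder recognizes. The fake_geocode function and its single known entry are stand-ins for illustration, with approximate coordinates, so nothing here queries Nominatim.

```python
def first_match(candidates, geocode_fn):
    """Return (query, result) for the first candidate that geocodes, else (None, None)."""
    for query in candidates:
        result = geocode_fn(query)
        if result is not None:
            return query, result
    return None, None


# Hypothetical stand-in for the geocoder: it only knows one spelling.
# Coordinates are approximate and only here for illustration.
KNOWN = {'grand teton national park wy': (43.79, -110.68)}


def fake_geocode(query):
    return KNOWN.get(query)


candidates = [
    'grand tetons national park wy',  # the string from the spreadsheet
    'grand teton national park wy',   # singular spelling
    'grand teton wy',                 # shorter fallback
]
query, result = first_match(candidates, fake_geocode)
```

Even with a helper like this, each hit still needs a manual look, since the first string the geocoder accepts is not necessarily the right place.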

Anyway, let’s geocode these updated locations and merge them back with the rest of the location results.

# Get location for each of these places
new_locations = {}
for key, val in fixgeolocate.items():
    # Geocode the updated location
    loc = geocode(val)
    # Build a map of old location --> geocoded location
    new_locations[key] = loc

    # Add a delay to not overload the geocoder
    sleep(1)

# And update the dataframe

# Get the indices of the rows that have "None" for the location
null_locs = cities[cities['location'].isnull()].index

# Replace the "location" value in those rows with the new location
# in the new_locations dict
cities.loc[null_locs, 'location'] = cities.loc[null_locs, 'to_geolocate'].map(new_locations)

Now that I have everything geocoded, I’ll merge the locations back onto my original dataframe, which contains the day, part of the trip, and mileage.

# Merge geocoded locations with the original dataframe
gdf = pd.merge(
    df, cities[['city', 'state', 'location']],
    on=['city', 'state'],
    how='left'
)
gdf.head()

	day	city	state	mileage	time	part	location
0	2019-02-17	san diego	ca	41491.0	midday	1	(San Diego, San Diego County, California, Unit...
1	2019-02-17	sonoran desert national monument	az	41889.0	night	1	(Sonoran Desert National Monument, Riggs Road,...
2	2019-02-18	mesa	az	41948.0	midday	1	(Mesa, Maricopa County, Arizona, United States...
3	2019-02-18	tucson	az	42067.0	evening	1	(Tucson, Pima County, Arizona, United States o...
4	2019-02-18	indian bread rocks blm	az	42179.0	night	1	(Indian Bread Rocks Picnic Area, Happy Camp Ca...

Getting ready to plot, plus some more wrangling

Before I can plot these locations, I need to convert them to shapely objects and put them in the geometry column of the geopandas dataframe. I’ll do this using the points_from_xy() function in geopandas, after grabbing the latitude and longitude from the locations returned by geopy. The geopy location class has latitude and longitude as two properties, so it’s super easy to do this!

# query each location's latitude and longitude
gdf['latitude'] = gdf['location'].apply(lambda x: x.latitude)
gdf['longitude'] = gdf['location'].apply(lambda x: x.longitude)

# Add the shapely Point in the 'geometry' column
gdf['geometry'] = gpd.points_from_xy(gdf['longitude'], gdf['latitude'])

# And convert the dataframe to a geopandas dataframe
gdf = gpd.GeoDataFrame(gdf)

Almost ready to plot! I want to also add an easy USA background to my plot, otherwise it’ll just be points in a blank axis. Let’s use some of the built-in geopandas maps to do this.

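# Read in the world - this returns just a normal geodataframe
# Note: gpd.datasets was removed in geopandas 1.0; on newer versions,
# download the Natural Earth file separately and pass its path to read_file
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))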

# Get just the US
usa = world.query('name == "United States of America"')

Ok, now I’m ready to plot where I went!

# Plot the USA
ax = usa.boundary.plot()
# Overlay my point locations
gdf.plot(ax=ax)

Uh oh. There’s one point that’s suuuper not right (and one that’s in … Canada?). Wonder what’s happening here…

# Fix the one in the bottom right first
gdf.sort_values(by=['latitude', 'longitude']).head(1)

	day	city	state	mileage	time	part	location	latitude	longitude	geometry
27	2019-04-18	hot springs	ar	NaN	afternoon	2	(aguas termales, La Peña Azul, Municipio de Ge...	-24.238173	-64.159278	POINT (-64.15928 -24.23817)

Ah, seems that Hot Springs, Arkansas got geocoded somewhere in Argentina… Let’s fix that manually and re-plot!

# I manually checked that this returns the Hot Springs in Arkansas
hotsprings = geocode('hot springs')
gdf.at[27, 'location'] = hotsprings # need to use at to avoid error bc "hotsprings" is tuple-like
gdf.loc[27, 'latitude'] = hotsprings.latitude
gdf.loc[27, 'longitude'] = hotsprings.longitude
# Re-code the lat/lon into a point
gdf.loc[27, 'geometry'] = shapely.geometry.Point(gdf.loc[27, 'longitude'], gdf.loc[27, 'latitude'])

# Fix the one in Canada next
gdf.sort_values(by=['longitude'], ascending=False).head(3)['location']

  (Oakland, Municipality of the District of Lune...
  (Oakland, Municipality of the District of Lune...
  (Boston, Suffolk County, Massachusetts, United...
Name: location, dtype: object

And it found Oakland as the Oakland in Nova Scotia… Let’s fix that too.

# I manually checked that this returns the Oakland in California
oakland = geocode('oakland ca usa')
for ix in [93, 94]:
    gdf.at[ix, 'location'] = oakland # need to use at to avoid error bc "oakland" is tuple-like
    gdf.loc[ix, 'latitude'] = oakland.latitude
    gdf.loc[ix, 'longitude'] = oakland.longitude
    # Re-code the lat/lon into a point
    gdf.loc[ix, 'geometry'] = shapely.geometry.Point(gdf.loc[ix, 'longitude'], gdf.loc[ix, 'latitude'])
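Outliers like these two can also be flagged without plotting, using a crude bounding-box check. This is only a sketch: the box limits are approximate values picked for illustration, the helper name outside_contiguous_us is made up, and the San Diego coordinates are rough.

```python
# Rough bounding box for the contiguous US (approximate, assumed for illustration)
LAT_MIN, LAT_MAX = 24.5, 49.5
LON_MIN, LON_MAX = -125.0, -66.9


def outside_contiguous_us(lat, lon):
    """True if the point falls outside the rough bounding box."""
    return not (LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX)


points = {
    # Coordinates the geocoder first returned for Hot Springs
    'hot springs ar (as first geocoded)': (-24.238173, -64.159278),
    # Approximate coordinates, for comparison
    'san diego ca': (32.72, -117.16),
}
suspects = [name for name, (lat, lon) in points.items()
            if outside_contiguous_us(lat, lon)]
```

A point in Nova Scotia at around longitude -64 would be flagged too, since it sits east of the box. A check like this can't catch places that were miscoded to somewhere else inside the US, though.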

There’s a couple other mistakes that I found after plotting, which I’ll just fix here without explaining step by step:

def update_loc(ix, newloc):
    gdf.at[ix, 'location'] = newloc
    gdf.loc[ix, 'latitude'] = newloc.latitude
    gdf.loc[ix, 'longitude'] = newloc.longitude
    # Re-code the lat/lon into a point
    gdf.loc[ix, 'geometry'] = shapely.geometry.Point(gdf.loc[ix, 'longitude'], gdf.loc[ix, 'latitude'])

# My original Joplin campground coded somewhere in West Virginia
ix = 28
newloc = geocode('joplin campground')
update_loc(ix, newloc)

# Wyoming ('wy') got read in as "way", so these two are in the wrong place
ix = 73
newloc = geocode('sundance wyoming')
update_loc(ix, newloc)

ix = 76
newloc = geocode('cody wyoming')
update_loc(ix, newloc)

# Re-try plotting!
# Plot the USA
ax = usa.boundary.plot()
# Overlay my point locations
gdf.plot(ax=ax)

Yes! No more side-trips to Argentina (alas). Okay, now that I have everything set up to make this map, let’s work on making it prettier! I have two main goals here, beyond simple aesthetics:

1. Add a more informative background map
2. Connect the dots with lines, showing my actual trajectory

I have a pretty good idea of how to do #1, since I’ve been recently playing around with contextily for adding basemaps. It was a bit of pain to install (I needed to play around with setting conda channel priorities), but now that I have it working it’s extremely useful.

For #2, I’ll need to do some significant wrangling with shapely to convert subsequent points into lines.

Adding a background map

Let’s start with #1! I’ll first need to tell geopandas that my current data is in WGS84 (EPSG:4326), the coordinate system that latitudes and longitudes are usually encoded in. For those of y’all not used to working with maps, projections are basically different ways of encoding 3D information (i.e. the Earth) onto a 2D space (i.e. a computer screen). You’ve certainly heard of this before: the Mercator projection is the one that makes Greenland look huge and Africa look tiny. There is lots of scholarship and internet debate on different projections, I’m sure, so I won’t even start down that rabbit hole! (Though if you are gonna start down this rabbit hole, I recommend looking into how different map projections change how we view the world, and contribute to colonialism by emphasizing some geographies over others.)

Anyway. Mostly what I know is how to work with projections in order to get the maps that I need out of my code.
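For the curious, here is a small sketch of what converting from latitude/longitude (EPSG:4326) to Web Mercator (EPSG:3857) does under the hood, using the standard spherical Mercator formula. In practice geopandas handles this for you; the function name lonlat_to_web_mercator is made up for this example, and the San Diego coordinates are approximate.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 semi-major axis, the sphere radius used by Web Mercator


def lonlat_to_web_mercator(lon, lat):
    """Convert degrees of longitude/latitude to EPSG:3857 x/y in meters."""
    x = EARTH_RADIUS_M * math.radians(lon)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y


# Roughly San Diego: x comes out negative (west of Greenwich), y positive (north of the equator)
x, y = lonlat_to_web_mercator(-117.16, 32.72)
```

This is why coordinates in this projection are numbers in the millions rather than degrees: they are distances in meters from the point where the equator meets the prime meridian.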

# Set the coordinate reference system to WGS84 (lat/lon)
# (set_crs replaces the deprecated {'init': 'epsg:4326'} syntax)
gdf = gdf.set_crs(epsg=4326)
# Modify projection to match what contextily uses
gdf = gdf.to_crs(epsg=3857)

# Plot the points
ax = gdf.plot(figsize=(10, 5))

# Add basemap
ctx.add_basemap(ax=ax)

Wowza, our first map!! Though I’d like to have the state outlines on this map as well. It’ll give some structure to the map, and also expand it to the whole US rather than just the band that I visited. From a quick google, seems that this file should do the trick.

states = gpd.read_file('shapefiles/states.shp')
# Keep only the contiguous states (drop Hawaii and Alaska)
dropstates = ['Hawaii', 'Alaska']
states = states.query('STATE_NAME not in @dropstates')

# Convert states to the right projection for contextily
states = states.to_crs(epsg=3857)
states.head()

	STATE_NAME	DRAWSEQ	STATE_FIPS	SUB_REGION	STATE_ABBR	geometry
1	Washington	2	53	Pacific	WA	MULTIPOLYGON (((-13625730.016 6144404.934, -13...
2	Montana	3	30	Mountain	MT	POLYGON ((-12409387.580 5574754.285, -12409986...
3	Maine	4	23	New England	ME	MULTIPOLYGON (((-7767570.862 5476923.993, -777...
4	North Dakota	5	38	West North Central	ND	POLYGON ((-10990622.005 5770462.676, -11021390...
5	South Dakota	6	46	West North Central	SD	POLYGON ((-11442350.599 5311256.999, -11466561...

# Re-plot everything, adding on the state outlines
fig, ax = plt.subplots(figsize=(10, 5))

# Plot the outlines
states.boundary.plot(color='black', linewidth=0.5, ax=ax)

# Plot the points
gdf.plot(ax=ax)

# Add basemap
ctx.add_basemap(ax=ax)

# Remove axes
ax.set_axis_off()

Ah! There it is!!! It’s my road trip!!!

Okay, before I dive into making trajectories, can I just color each point by their day on my trip? I sometimes have multiple entries per day, so I’ll need to create a map from day to number. I’ll just increment each day that I wrote down, regardless of whether they’re consecutive days or not. Since I’m just going to visualize the day using colors, all I care about is the relative days.

gdf = gdf.sort_values(by='day')

days = gdf['day'].unique()
daysdict = dict(zip(days, range(len(days))))

# Use dict to map into column
gdf['day_number'] = gdf['day'].map(daysdict)

# Re-plot everything, adding on the state outlines
fig, ax = plt.subplots(figsize=(10, 5))

# Plot the outlines
states.boundary.plot(color='black', linewidth=0.5, ax=ax)

# Plot the points
gdf.plot(ax=ax, column='day_number')

# Add basemap
ctx.add_basemap(ax=ax)

# Remove axes
ax.set_axis_off()

Converting points to trajectories

Ok, now to convert these points into actual lines corresponding to my travels. I’ll make two lines: one for the first part of my trip (San Diego to Boston), and one for the second (Boston to San Diego, with lots of fun detours in between).

First, I need to sort my trip by day. Because on some days I wrote down multiple mileages, I also need to sort by time of day. I’ll do this by setting my time column as an ordered categorical, since I only tracked the rough time of day (e.g. “morning” and “night”).

# Set the column as categorical type
gdf['time'] = gdf['time'].astype('category')
# Manually specify the order
order = ['morning', 'midday', 'afternoon', 'evening', 'night', 'allday']

# Set categories as ordered dtype
# (assign the result back: the inplace argument was removed in pandas 2.0)
gdf['time'] = gdf['time'].cat.set_categories(order, ordered=True)
gdf['time'].head()

   midday
    night
   midday
  evening
    night
Name: time, dtype: category
Categories (6, object): [morning < midday < afternoon < evening < night < allday]

You can see above that the data type of this column is now a category, ordered like this: [morning < midday < afternoon < evening < night < allday].
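To see why the explicit order matters, here is the same idea in plain Python, as a sketch: sort by day first, then by each label's position in the custom order. The rows are the first five entries of the mileage table, shuffled on purpose; rank is a made-up helper name.

```python
order = ['morning', 'midday', 'afternoon', 'evening', 'night', 'allday']
rank = {label: i for i, label in enumerate(order)}

# A few rows from the mileage table (day, city, time), deliberately shuffled
rows = [
    ('2019-02-18', 'indian bread rocks blm', 'night'),
    ('2019-02-17', 'sonoran desert national monument', 'night'),
    ('2019-02-18', 'mesa', 'midday'),
    ('2019-02-18', 'tucson', 'evening'),
    ('2019-02-17', 'san diego', 'midday'),
]

# Sort by day first, then by position in the custom time-of-day order
rows_sorted = sorted(rows, key=lambda r: (r[0], rank[r[2]]))
cities_in_order = [city for _, city, _ in rows_sorted]

# For comparison: a plain alphabetical sort on the time label
alphabetical = [city for _, city, _ in sorted(rows, key=lambda r: (r[0], r[2]))]
```

Sorting the labels alphabetically would put “evening” before “midday”, which would have me arriving in Tucson before passing through Mesa.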

Now, let’s convert all of my points into two lines corresponding to the two parts of the trip:

# Get line of first part of trip
part1 = shapely.geometry.LineString(
    gdf.query('part == 1').sort_values(by=['day', 'time'])['geometry'].values
)
display(part1)

part2 = shapely.geometry.LineString(
    gdf.query('part == 2').sort_values(by=['day', 'time'])['geometry'].values
)

display(part2)

Wow, that was easier than I thought! Now, let’s figure out how to overlay these lines onto my map.

# Make a geodataframe with the two parts of my trip
linegdf = gpd.GeoDataFrame(
    {'geometry': [part1, part2],
     'trip_part': ['part1', 'part2']}
)

# Make the plot
fig, ax = plt.subplots(figsize=(15, 10))

# Plot the state outlines
states.boundary.plot(color='black', linewidth=0.5, ax=ax)

## Plot Part 1 of the trip

# I'm getting an error when I pass in the tuple, not sure why
c = to_hex(sns.light_palette('green')[0])

# Plot line
linegdf.query('trip_part == "part1"').plot(
    color=c,
    linewidth=3,
    ax=ax
)

# Plot points colored by day
gdf.query('part == 1').plot(
    column='day_number',
    cmap=sns.light_palette('green', as_cmap=True),
    ax=ax,
    markersize=50,
    edgecolor='black', linewidth=0.5,
    zorder=1000 # force the points to be the top layer of the plot
)

## Part 2
c = to_hex(sns.light_palette('purple')[0])

# Plot line
linegdf.query('trip_part == "part2"').plot(
    color=c,
    linewidth=3,
    ax=ax
)

# Plot points colored by day
gdf.query('part == 2').plot(
    column='day_number',
    cmap=sns.light_palette('purple', as_cmap=True),
    ax=ax,
    markersize=50,
    edgecolor='black', linewidth=0.5,
    zorder=1000 # force the points to be the top layer of the plot
)

# Add basemap
ctx.add_basemap(ax=ax)

# Remove axes
ax.set_axis_off()

Ah it’s so beautiful!! Let me tell you a little bit more about what you’re seeing:

For the first part of my trip, I went from San Diego to Austin in about a week. Then, I headed to New Orleans to meet Ben, and we drove from New Orleans to Atlanta (I did a bad job of tracking things during this time). Then, through a comical series of events, we ended up driving back to Boston from Atlanta in about two days. You can see the straight shot from Georgia to Pennsylvania, which is that part (we stayed with his sister in Philly as our one stop).

In Part 2, Ben and I drove straight down to Nashville (with the Philly pit stop again, thanks Micah!). I went to Ithaca for a hackathon (not shown here, because I flew and therefore accrued zero mileage) and Ben went back to Boston. Then I headed to Memphis for an amazing dinner with live music, and then made my way West to Utah. The little spike you see in Arkansas was when I tried to go to the Ozarks, but it was raining (and I had to drive through at least 6 inches of water to get out of my campground oops), so I made last-minute changes and headed to Hot Springs to have a nice soak. Similarly, the upshot into Colorado after Texas was when I realized that Great Sand Dunes National Park was actually not a huge detour from my current path, and that I wouldn’t have another good time to go there. Then I headed down to El Malpais (awesome place), Flagstaff to see the QIIME 2 team, and then up to the best two weeks of my trip: the national lands bonanza that is northern Arizona and Utah. You can see how my trajectory went north to the Grand Canyon, then east to get around the canyon and back west to hit Zion. Then I made my way through amazing Utah, through Bryce and Grand Staircase Escalante, until I finally headed to Denver to meet up with Ben. Denver has a lot of back-and-forth: Ben and I went into the mountains for the weekend and then came back to Denver, and I also had to spend an extra day in town to watch Nathaniel’s defense and figure out some job stuff (after spending the night in the prairie about two hours away). Then I headed up to Badlands in South Dakota, Yellowstone in Wyoming, and met up with Janyne and some friends at Catnip Reservoir in Nevada. After that, the home stretch was just California: Susanville with Janyne, SF with Jeremy and my brothers, and finally back down to San Diego.

It’s awesome how looking at this map makes it so easy to remember each little part of my trip! I love it. :D

I’m going to go ahead and post this map as-is, even though there’s lots of ways I’d want to improve it. For example, I’d love to be able to see more clearly what day each part of the trip corresponded to, since there were some days where I drove huge distances and others when I got to stay and play where I was. Maybe I’ll try to cross-reference this map with my spreadsheet where I tracked the camps I stayed at. Or perhaps I’ll do the manual work of geocoding the camps directly, but that is a bit more work than here because I wasn’t very specific (e.g. when I stayed at my friend Jettie’s house in Austin, I wrote down “Jettie’s house” - which I really hope the geocoder won’t be able to find!)
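One possible starting point for the per-day view is the odometer column itself. Here is a sketch that credits the miles between consecutive readings to the day of the later reading, using the first five rows of the mileage table. It assumes the readings are already in trip order and simply skips missing values, which is a simplification.

```python
import math

# (day, odometer) pairs from the first rows of the mileage table, in trip order
readings = [
    ('2019-02-17', 41491.0),
    ('2019-02-17', 41889.0),
    ('2019-02-18', 41948.0),
    ('2019-02-18', 42067.0),
    ('2019-02-18', 42179.0),
]

miles_per_day = {}
previous = None
for day, odometer in readings:
    if math.isnan(odometer):
        continue  # skip entries where the mileage wasn't written down
    if previous is not None:
        # Credit the miles since the last reading to the day of this reading
        miles_per_day[day] = miles_per_day.get(day, 0.0) + (odometer - previous)
    previous = odometer
```

For these rows, that gives 398 miles on the first day and 290 on the second. Days with no reading at all would get lumped into the next day that has one.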

Also, maybe the next step on this map is to make it interactive… I’ve recently learned about folium, which seems to be a super easy way to make interactive maps. Hopefully I find time to give it a go!
