Car Keys or Bike Helmet?

Decision trees and why it’s so hard to predict non-car trips, or any rare event.

Introduction

This analysis dives into some key transportation habits of residents of the Seattle/Puget Sound region. I particularly look at mode choice: what type of transportation (car, transit, bike, etc.) people use to take a given trip. As a transportation planner by training, I’m interested in how we can get more people out of private cars and onto healthier, more sustainable forms of transportation by improving infrastructure. And as someone who hasn’t owned a car in 12 years, I have a personal interest in improving car-free modes of transport.

I used the Household Travel Survey, a 2021 data set produced by the Puget Sound Regional Council from a week-long survey of 6,000 households and 125,000 trips in the region. Household travel surveys are common ways to learn more about travel behavior and are conducted by agencies across the United States.

I did this analysis in R, and you can view the code on Github. Here I’ll summarize the results.

What determines mode choice for a given trip?

Transportation planners are interested in getting people out of cars, which cause traffic and pollution, and encouraging more sustainable forms of transportation such as transit, walking, and biking. To do this, it’s helpful to first understand how people are traveling now as well as how they decide which mode to use for a given trip.

Disappointingly for me (but not surprisingly), three quarters of the trips in the travel survey dataset are taken by car, with a smaller number by transit, biking, and walking:

Mode               Share of trips
Bike               1.3%
Car                76.7%
Transit            3.8%
Walk/Wheelchair    17.3%
Other              0.9%

Travel mode used for all trips reported in the survey.

My goal is to see how well the metrics collected in this survey can predict mode choice for any given trip.


First Try: Decision Tree

To start, I used a tool called a decision tree to model how different factors determine how somebody chooses to travel for a given trip. You can think of a decision tree like a flow chart, except that the output gives the likelihood of each outcome. For example, if I had to sketch a decision tree off the top of my head for what influences mode choice, it might look something like this:

Chart created by author at http://www.miro.com

In reality, R builds its own version of this for me by looking at all the variables I give it and determining mathematically which splits best fit my data. In fact, it does this many, many times using a tool called a random forest, which is really a whole bunch of decision trees averaged together to improve accuracy. Each individual tree is too complex to reproduce here, so I focus on the results.
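Still, to give a flavor of the mechanics, here's a minimal sketch of what fitting and plotting a single decision tree looks like in R with the rpart package. This isn't my actual code, and the data frame and column names (trips, mode, distance_miles, and so on) are placeholders rather than the real survey fields.

```r
# A minimal sketch of one classification tree (placeholder column names,
# not the actual survey fields).
library(rpart)
library(rpart.plot)

tree <- rpart(
  mode ~ distance_miles + num_vehicles + has_license + purpose,
  data = trips,        # one row per trip; `mode` is a factor
  method = "class"     # classification rather than regression
)

rpart.plot(tree)       # draws the fitted tree as a flow chart
```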

To build the random forest model, I gave it these inputs:

Person-Level Variables

  • Age
  • Gender
  • Education level
  • Income
  • Whether they get free transit benefits from work
  • Whether they have a driver’s license
  • Number of vehicles they own
  • Where they live

Trip-Level Variables

  • Purpose
  • Distance
  • Number of people traveling together
  • Origin area

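Fitting the random forest itself is only a few lines of R. The sketch below is illustrative rather than my exact script: it assumes the trips have already been split into training and test sets (trips_train and trips_test), and the predictor names (transit_benefit, party_size, and so on) are stand-ins for the real survey columns.

```r
# Sketch of the random forest fit. `trips_train` has one row per trip
# and a factor column `mode`; predictor names here are illustrative.
library(randomForest)

set.seed(42)
rf <- randomForest(
  mode ~ age + gender + education + income + transit_benefit +
    has_license + num_vehicles + home_area +
    purpose + distance_miles + party_size + origin_area,
  data = trips_train,
  ntree = 500,         # number of trees averaged together
  importance = TRUE    # track which variables matter most
)
```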
And here’s how well the random forest did at predicting mode choice by trip. The columns are the predicted mode, and the rows are the mode that was actually used, so the percentage at the end of each row tells us what share of trips actually taken by that mode were predicted correctly. Ideally, every trip would fall on the diagonal, where the predicted mode matches the actual one.

Actual \ Predicted      Bike    Drive   Other  Transit    Walk   Total trips actually taken (% correct)
Bike                      35      239       0       75      75       424  (8%)
Drive                      2   18,722      16      321     977    20,038 (93%)
Other                      0      257      48      101      86       492 (10%)
Transit                    1    1,124      11    1,684     297     3,117 (54%)
Walk                       3    1,262       6      213   6,026     7,510 (80%)
Total trips predicted     41   21,604      81    2,394   7,461

This table, called a confusion matrix, shows how well our predictions fared against reality.
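A confusion matrix like this can be produced by comparing the model's predictions on held-out trips against the modes actually used, roughly like so (again a sketch using the placeholder names from above):

```r
# Predict the mode for held-out trips and tabulate predicted vs. actual.
pred <- predict(rf, newdata = trips_test)

conf <- table(actual = trips_test$mode, predicted = pred)
conf

# The "% correct" column: share of each actual mode predicted correctly.
round(100 * diag(conf) / rowSums(conf), 1)
```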

If you look at just the Drive row, the model appears to do reasonably well. Out of 20,038 actual trips taken by car, the random forest correctly predicted 18,722 of them (93%). But the rows for the other modes tell a very different story: only 8% of bike trips were predicted correctly, and only 54% of transit trips. In all these cases, the model predicted that more people drove than actually did. Any model can be good at predicting driving if it simply assumes that everyone drives, just as a doctor could be sure to never miss a cancer patient by telling everyone they have cancer. But a model is only truly useful if it is also good at predicting who doesn’t drive.


By this standard, the random forest I created with these inputs didn’t do a great job of predicting mode choice. One reason is that the data is highly unbalanced: most trips in the sample are taken by car, regardless of the characteristics of the person or trip. That makes it easy to correctly predict that a trip was taken by car, but hard to catch the few instances where another mode is used, because apart from walking the other modes are rare. Predicting rare events is famously hard in predictive modeling; it’s the same challenge faced by people who build models to flag fraudulent credit card transactions or to diagnose a rare disease.

Trying Again: Undersampling

Once I realized the data was unbalanced, I tried again with a different approach. Part of what makes this hard is that there is far more data on driving than on any other mode, but I have plenty of data overall, so why not remove some of those drive trips from the model to make the data a little more balanced? This is known as undersampling, and it is one way to deal with the problem of rare events.

I took a new sample of the dataset, this time one that was 50% drive trips and 50% trips from all other modes. This should help with predicting transit and walking trips, though “other” and bike trips might still be underrepresented.
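In code, the undersampling can be as simple as randomly dropping drive trips before refitting, for example by keeping only as many drive trips as there are trips by every other mode combined. A rough sketch, again with placeholder names rather than my actual script:

```r
# Undersample drive trips so they make up roughly half the training data.
set.seed(42)

drive_trips <- subset(trips_train, mode == "Drive")
other_trips <- subset(trips_train, mode != "Drive")

# Keep a random subset of drive trips equal in size to all other trips.
drive_sample <- drive_trips[sample(nrow(drive_trips), nrow(other_trips)), ]
trips_balanced <- rbind(drive_sample, other_trips)

rf_balanced <- randomForest(mode ~ ., data = trips_balanced, ntree = 500)
```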

Here’s how we did with the new model:

Actual \ Predicted      Bike    Drive   Other  Transit    Walk   Total trips actually taken (% correct)
Bike                      55      184       1       88      77       405 (14%)
Drive                      4   10,348       7      337     868    11,564 (89%)
Other                      1      212      48      118     105       484 (10%)
Transit                    2      845      16    1,954     340     3,157 (62%)
Walk                       3      771       6      234   6,460     7,474 (86%)
Total trips predicted     65   12,360      78    2,731   7,850

Confusion matrix after undersampling drive data.

This table shows some modest improvements for non-drive modes. This time the model correctly predicted 62% of transit trips (up from 54%), 14% of bike trips (up from 8%), and 86% of walk trips (up from 80%). Still, those numbers are not as high as we’d like to see and, as I expected, they are worst for the modes that are still rare even after a bit of rebalancing. We also got slightly worse at predicting drive trips (89%, down from 93%).

Conclusion

I started this post hoping it was going to be about a cool model that I wrote to predict travel habits. Instead, I’m writing about how it didn’t go as well as I’d hoped. Why am I sharing this? Because sharing challenges is a great way to learn from each other, and sharing models that aren’t perfect is an important part of data integrity. If I only reported the strongest models, I would leave out the inconvenient fact that sometimes variables are not as related as we would like them to be, or give the impression that a relationship is more significant than it really is.

Rebalancing the sample helped us more accurately predict walking trips, since there were still plenty of those, but the model struggles to predict bike and “other” trips, which remain a tiny fraction of the trips even in the rebalanced sample.

Another reason our predictions are still off is that the variables I chose can only go so far in determining a person’s choice of mode for any given trip. The data doesn’t tell us whether the destination has free parking, what kind of bike infrastructure is in the neighborhood, how good the transit service is, how much of a hurry the person is in, or how rainy it is that day. It’s also really hard with this data to measure how much choice the person actually has in how they get around, or whether driving is simply their default because nothing else is feasible in their neighborhood. I also don’t really know what the “Other” category refers to, and those trips may have little in common with each other, making them hard to predict.


This analysis did not revolutionize my understanding of how people choose their travel mode, but it helped me gain an appreciation for the complexity of that choice. More data on non-car trips and focused research in this area could help planners better understand what it will take to get people out of their cars.

For those smarter than me on this topic, what else would you try here?

Read more about undersampling with this helpful article at Analytics Vidhya, which informed some of this analysis.


