COVID-19 and air travel, with Pandas

Irina Truong
Published in j-bennet codes, Mar 21, 2021

A recent article in the Washington Post claimed that air travel is significantly less dangerous than driving. A lot of people prefer to travel by car these days, since it’s possible to catch COVID-19 either on the plane or at the airport. The article, however, claims that the risk of dying in a car accident is quite a bit higher than the risk of catching COVID-19 on the plane and dying from it.

https://www.washingtonpost.com/opinions/2021/03/15/flying-safer-than-driving-pandemic/

I shared the article with my friends, but some of them were disappointed not to find any useful information for their age group, since the article only spoke about people younger than 65 years old.

Digging deeper into the original source, we can do much better than that!

The original source

The original data comes from the CDC website:

https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Public-Use-Data/vbim-akqf

This is the data on all COVID-19 cases in the United States, with their outcomes, as of March 17, 2021. Each case is labeled with age group, race, and sex. I wanted to calculate the probability of catching COVID-19 on a flight and dying from it for every race and age group. Let’s go.

Loading and aggregating data

First, let’s just read the original dataset.

I’m only loading selected columns here. I’m also providing dtypes, to specify data types explicitly, which reduces memory usage.
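A minimal sketch of that loading step might look like this. The tiny inline CSV is made up so the snippet runs on its own, and every column name except death_yn is my assumption about the dataset's schema; swap in the path to the real download.

```python
import io

import pandas as pd

# Made-up rows standing in for the CDC export (not real data).
SAMPLE_CSV = """\
cdc_case_earliest_dt,sex,age_group,race_ethnicity_combined,death_yn
2021/01/05,Female,20 - 29 Years,"White, Non-Hispanic",No
2021/01/06,Male,80+ Years,"White, Non-Hispanic",Yes
2021/01/07,Male,80+ Years,"Black, Non-Hispanic",No
2021/01/08,Female,20 - 29 Years,Hispanic/Latino,No
"""

# Assumed column names (only death_yn is mentioned in the text).
COLUMNS = ["age_group", "race_ethnicity_combined", "death_yn"]

df = pd.read_csv(
    io.StringIO(SAMPLE_CSV),  # replace with the path to the downloaded CSV
    usecols=COLUMNS,  # skip the columns we do not need
    dtype={col: "category" for col in COLUMNS},  # few distinct values, so categories are compact
)
print(df.dtypes)
print(df.memory_usage(deep=True))
```

Categorical dtypes store each distinct string once, which is what keeps memory usage low on a file with millions of rows and only a handful of distinct values per column.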

Every record in the original dataset corresponds to one COVID-19 patient. For this patient, the column death_yn specifies whether the person died. I want to group and aggregate this data, to know how many people got sick, and how many died, in each age group and race/ethnicity.
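The grouping could be sketched roughly as follows. The rows are made up, the column names other than death_yn are assumptions, and I treat any death_yn value other than "Yes" as "did not die", which is a simplification.

```python
import pandas as pd

# Made-up patient-level rows, one per case.
df = pd.DataFrame(
    {
        "age_group": ["20 - 29 Years"] * 2 + ["80+ Years"] * 3,
        "race_ethnicity_combined": ["Hispanic/Latino"] * 2 + ["White, Non-Hispanic"] * 3,
        "death_yn": ["No", "No", "Yes", "No", "Yes"],
    }
)

# Flag deaths; anything other than "Yes" counts as "did not die" (assumption).
df["died"] = df["death_yn"] == "Yes"

# One row per group: number of cases (rows) and number of deaths (True flags).
grouped = (
    df.groupby(["age_group", "race_ethnicity_combined"])
    .agg(cases=("died", "size"), deaths=("died", "sum"))
    .reset_index()
)
print(grouped)
```

If the grouping columns were loaded as categoricals, passing observed=True to groupby keeps only the combinations that actually occur in the data.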

Now that we have cases and deaths in the same dataframe, we can calculate the probability of dying for each of the above groups. Calculating the probability is simple:

P = number of deaths / number of cases

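However, the CDC estimates that only about 1 in 4.6 COVID-19 cases is detected, because a lot of people are asymptomatic or have mild symptoms, so they never even get tested. The true number of cases is therefore about 4.6 times the reported number, and our probability has to be reduced by a factor of 4.6:

P = number of deaths / (number of cases * 4.6)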
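In pandas, the adjustment is a one-liner. This is a sketch on made-up counts; the column names cases, deaths and p_dying are my own choices.

```python
import pandas as pd

UNDERCOUNT_FACTOR = 4.6  # CDC estimate quoted above: about 1 case in 4.6 is detected

# Made-up aggregated counts, for illustration only.
grouped = pd.DataFrame(
    {
        "age_group": ["20 - 29 Years", "80+ Years"],
        "cases": [1000, 1000],
        "deaths": [1, 230],
    }
)

# Scale reported cases up to estimated true infections before dividing.
grouped["p_dying"] = grouped["deaths"] / (grouped["cases"] * UNDERCOUNT_FACTOR)
print(grouped)
```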

Next step: how many people get sick when flying

Now we know the probability of dying of COVID-19 for each age / race group. This is not enough, however. We also have to know the chances of catching COVID-19 when traveling by air. And the following study has the numbers:

https://www.medrxiv.org/content/10.1101/2020.07.02.20143826v4.full-text

It calculates separate probabilities for a full flight and for a flight with empty middle seats. Probabilities are different for aisle, middle, and window seats, but I disregarded these differences and used an average here. If you’re curious, go ahead and calculate those probabilities yourself!

I copied the numbers into a CSV file and loaded it into a dataframe:
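A sketch of that step, with the CSV inlined so the snippet is self-contained. The column names are hypothetical, and the probabilities are placeholders I picked for illustration, not the figures from the study; fill in the study's averages before drawing any conclusions.

```python
import io

import pandas as pd

# Placeholder values, NOT the study's numbers.
FLIGHT_CSV = """\
scenario,p_getting_sick
full_flight,0.00025
empty_middle_seats,0.000125
"""

flights = pd.read_csv(io.StringIO(FLIGHT_CSV))
print(flights)
```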

This dataframe can now be used to calculate the joint probability of catching COVID-19 on a flight and dying from it. The probability of dying that we calculated above is conditional on getting sick, so the joint probability is the product of the two:

P catching and dying = P getting sick * P dying
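The multiplication above can be sketched like this, producing one column per flight scenario. All numbers are made up, and the frame and column names are assumptions for illustration.

```python
import pandas as pd

# Made-up per-group probability of dying once infected.
risk = pd.DataFrame(
    {
        "age_group": ["20 - 29 Years", "80+ Years"],
        "p_dying": [0.0002, 0.05],
    }
)

# Placeholder per-flight infection probabilities (not the study's values).
flights = pd.DataFrame(
    {
        "scenario": ["full_flight", "empty_middle_seats"],
        "p_getting_sick": [0.00025, 0.000125],
    }
)

# Combined probability = P(getting sick on the flight) * P(dying if sick).
for scenario, p_sick in flights.set_index("scenario")["p_getting_sick"].items():
    risk[f"p_{scenario}"] = risk["p_dying"] * p_sick

print(risk)
```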

And here are the actual numbers:

The numbers are very small. They are easier to digest if converted into odds:

Odds = 1 / P
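Here is a small sketch of the conversion, on made-up probabilities. Note that a group with zero deaths has P = 0, which would give infinite odds, so such rows need to be filtered out or handled separately before casting to integers.

```python
import pandas as pd

# Made-up probabilities of catching COVID-19 on a flight and dying.
p = pd.Series([2e-8, 1.25e-5], index=["20 - 29 Years", "80+ Years"])

# "1 in N" odds, rounded to a whole number.
odds = (1 / p).round().astype("int64")

for group, n in odds.items():
    print(f"{group}: 1 in {n:,}")
```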

A picture is worth a thousand words

Let’s visualize this data to make it even more friendly.
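One way to draw such a chart is a horizontal bar plot on a log scale, since the odds span several orders of magnitude. This is only a sketch with made-up numbers, not necessarily how the plot in this post was produced; it assumes matplotlib is installed.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch also runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Made-up "1 in N" odds per age group and flight scenario.
odds = pd.DataFrame(
    {
        "full_flight": [50_000_000, 80_000],
        "empty_middle_seats": [100_000_000, 160_000],
    },
    index=["20 - 29 Years", "80+ Years"],
)

ax = odds.plot.barh(figsize=(8, 4))
ax.set_xscale("log")  # odds differ by orders of magnitude between groups
ax.set_xlabel("Odds of catching COVID-19 on a flight and dying (1 in N)")
ax.set_ylabel("Age group")
plt.tight_layout()
# plt.savefig("flight_risk.png")  # uncomment to write the image to disk
```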

Here is the final plot:

[Figure: Probability of catching COVID-19 on a flight and dying]

As for the driving

Data about driving deaths comes from this source:

The probability of dying in a car crash, of course, depends on the trip length. It is also linked to the age of the driver, but we don’t have the data split on both criteria. Let’s just calculate the probability of dying in a car crash for a few trip lengths, based on the statistics for 2019 (the latest record in the table).
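A rough sketch of that calculation, assuming the table gives deaths per 100 million vehicle miles traveled. The rate below is a placeholder I chose for illustration, so substitute the 2019 value from the table; the trip lengths are arbitrary too.

```python
import pandas as pd

# Placeholder rate, to be replaced with the 2019 figure from the source table.
DEATHS_PER_100M_MILES = 1.1

trips = pd.DataFrame({"miles": [100, 500, 1000, 2500]})

# For probabilities this small, risk grows (almost exactly) linearly with distance.
trips["p_dying"] = trips["miles"] * DEATHS_PER_100M_MILES / 100_000_000
trips["odds_1_in"] = (1 / trips["p_dying"]).round().astype("int64")
print(trips)
```

The result can be compared directly with the per-flight odds, as long as both are expressed per trip.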

Here is the result:

And now you’re armed with all the data you need to figure out whether it’s safer to drive or fly, based on your age and how far you’re going to travel. Right?

Well, yes, this was a fun little exercise. Our data is not perfect, though. First of all, the datasets on both motor vehicle deaths and COVID-19 infections only cover the United States, not the whole world. And of course, there are many more factors that come into play, such as the level of personal protection and driving experience, so one always has to use common sense when making such a decision. But I hope this was somewhat useful, if only to learn a bit about probabilities from such an interesting example.
