How to lie with data, Part 1, or California’s COVID rate is now twice that of Florida’s

Irina Truong
Published in j-bennet codes
Nov 12, 2021 · 5 min read

I recently read this article:

And something didn’t quite add up in my head. I decided to look at the data a little more closely. The numbers don’t lie, so why is California doing so badly? Did Florida really get it right, while Californians suffered from the lockdowns for no good reason?

Let’s find out. First of all, we need some COVID-19 data by state, and a great resource for that is this CDC dataset:

Let’s use sodapy to retrieve the data, and Pandas to work with it:
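A minimal sketch of this retrieval step, not necessarily the exact snippet from the original post. It assumes the CDC dataset is served from `data.cdc.gov` under the Socrata identifier `9mfq-cb36` and exposes the columns `submission_date`, `state`, `tot_cases`, `new_case`, `tot_death` and `new_death`. The dataset id, the row limit, and the helper names `fetch_state` and `reshape` are assumptions made for illustration.

```python
import pandas as pd

# Assumed Socrata coordinates of the CDC dataset; verify before relying on them.
CDC_DOMAIN = "data.cdc.gov"
DATASET_ID = "9mfq-cb36"

NUMERIC_COLS = ["tot_cases", "new_case", "tot_death", "new_death"]
KEEP_COLS = ["submission_date", "state"] + NUMERIC_COLS


def fetch_state(state, limit=5000):
    """Download the raw records for one state as a list of dicts."""
    from sodapy import Socrata  # imported lazily so reshape() also works offline

    client = Socrata(CDC_DOMAIN, None)  # None = no app token (throttled access)
    return client.get(DATASET_ID, state=state, limit=limit)


def reshape(records):
    """Keep only the columns of interest and assign proper dtypes."""
    df = pd.DataFrame.from_records(records)[KEEP_COLS].copy()
    df["submission_date"] = pd.to_datetime(df["submission_date"])
    df[NUMERIC_COLS] = df[NUMERIC_COLS].apply(pd.to_numeric)
    return df.sort_values("submission_date").reset_index(drop=True)


# Typical usage (requires network access):
# ca = reshape(fetch_state("CA"))
# fl = reshape(fetch_state("FL"))
```

The Socrata API returns every value as a string, which is why the explicit datetime and numeric conversions are needed before any arithmetic or plotting.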

In the snippet above, I’m slightly reshaping the original CA and FL dataframes, making sure to assign correct datatypes to datetime and numeric values. I’m also omitting columns I won’t use. At this point, each dataframe looks like this:

Florida and California are both large states, but their populations are far from equal. So I’d like to see not only the absolute numbers, but also the same numbers recalculated per 100,000 people. I downloaded the population dataset from https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html. Here are the numbers:

In 2019, the estimated population of California was 39,512,223 people, while Florida’s was 21,477,737 people. Unfortunately, I don’t have the numbers for 2020 and 2021, but normalizing to the 2019 population is better than nothing.
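The normalization itself is simple: divide each count by the state’s population and multiply by 100,000. Here is a sketch of that step, assuming each state dataframe has the source columns `new_case`, `new_death`, `tot_cases` and `tot_death` (assumed names); `add_per_100k` and `combine` are hypothetical helper names.

```python
import pandas as pd

# 2019 Census Bureau population estimates quoted in the text.
POPULATION_2019 = {"CA": 39_512_223, "FL": 21_477_737}

# Source column -> derived per-100K column.
PER_100K_COLS = {
    "new_case": "new_case_per_100k",
    "new_death": "new_death_per_100k",
    "tot_cases": "tot_case_per_100k",
    "tot_death": "tot_death_per_100k",
}


def add_per_100k(df, population):
    """Return a copy of df with every count also expressed per 100,000 residents."""
    out = df.copy()
    for src, dst in PER_100K_COLS.items():
        out[dst] = out[src] / population * 100_000
    return out


def combine(ca, fl):
    """Normalize both state dataframes and stack them into a single one."""
    return pd.concat(
        [
            add_per_100k(ca, POPULATION_2019["CA"]),
            add_per_100k(fl, POPULATION_2019["FL"]),
        ],
        ignore_index=True,
    )
```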

In the snippet above, we add new calculated columns to the dataframes: new_case_per_100k, new_death_per_100k, tot_case_per_100k, tot_death_per_100k. We also combine the two dataframes into one.

Plotting the absolute numbers

We can visualize the numbers now. Here is a chart of new cases per day:

New cases per day, FL vs CA

And here is a chart of new deaths per day:

New deaths per day, FL vs CA

The code that was used to create these charts (with seaborn):

Things in California look pretty awful, don’t they? California is having larger, spikier waves, while Florida’s waves look pretty moderate.

Ah, but this is not the whole story. For the whole story, let’s look at the numbers normalized per 100K of the population.

Plotting the normalized numbers

New cases per day, adjusted per 100K of the population:

New cases per day normalized, FL vs CA

New deaths per day, adjusted per 100K of the population:

New deaths per day normalized, FL vs CA

The picture looks a lot different now, doesn’t it? The state with the spikier waves is now Florida, and this is especially true of the 1st and the 3rd waves. In the 2nd wave, the two states did about equally well (or equally badly), and we can see that the Florida wave has multiple spikes, while the California wave spikes once, but very sharply.

But why is California’s new case rate now 2x that of Florida? If we look closely at the 3rd wave, we can see that the two states are simply at different stages of it. The Florida wave spiked a lot higher, but it’s now coming down. The California wave is smaller, but it arrived later, and is still in the process of flattening.
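One way to check the “different stages” reading numerically is to find, for each state, the date and height of its peak and compare that with the latest value. The sketch below uses made-up numbers for two fictional states, purely to show the mechanics; `wave_stage` is a hypothetical helper, and the column names are assumed to match the ones used earlier.

```python
import pandas as pd


def wave_stage(df, column="new_case_per_100k"):
    """For each state, report the peak of `column` and the latest value."""
    df = df.sort_values("submission_date")
    rows = []
    for state, grp in df.groupby("state"):
        peak = grp.loc[grp[column].idxmax()]  # row with the highest value
        latest = grp.iloc[-1]                 # most recent row
        rows.append({
            "state": state,
            "peak_date": peak["submission_date"],
            "peak_value": peak[column],
            "latest_value": latest[column],
            "latest_share_of_peak": latest[column] / peak[column],
        })
    return pd.DataFrame(rows).set_index("state")


# Synthetic illustration only, NOT real CDC figures:
# state "A" peaks high and early, state "B" peaks lower and later.
dates = pd.date_range("2021-08-01", periods=5, freq="W")
toy = pd.DataFrame({
    "submission_date": list(dates) * 2,
    "state": ["A"] * 5 + ["B"] * 5,
    "new_case_per_100k": [40, 100, 60, 20, 10, 5, 15, 30, 35, 25],
})
print(wave_stage(toy))
```

In the toy data, state B shows 2.5 times state A’s rate on the last date, even though A’s peak was almost three times higher. A snapshot taken on that date alone would point at the wrong state.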

The code that was used to create the above charts:

Plotting the cumulative numbers

So did the California lockdowns do anything? To answer that question, let’s plot the total cases and deaths, and this time, focus only on the numbers per 100K of the population.

Total cases per 100K:

Total cases per 100K, Florida is orange, California is blue

Total deaths per 100K:

Total deaths per 100K, Florida is orange, California is blue

We can see that, over time, Florida accumulates cases and deaths a lot more quickly, while in California cases and deaths don’t balloon at quite the same speed. I’m not saying California is doing great, but is it doing better? To me, there’s hardly any doubt about that.

The code that was used to create these visualizations:

In the CDC dataset that I used, the last reported date is Nov 10, 2021. Let’s take one final look at the cumulative totals of cases and deaths, adjusted per 100K of the population:

To put it into words, as of Nov 10, 2021, 281 people per 100K had died of COVID-19 in Florida. In California, the corresponding number is 182 people per 100K. In Florida, about 17K people per 100K got sick with COVID-19. In California, the corresponding number is 12.6K people per 100K.
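To put the gap in relative terms, the per-100K figures can simply be divided by each other. The short calculation below uses only the numbers quoted in this paragraph; the case figures are rounded, so the resulting ratio is approximate.

```python
# Cumulative totals per 100K as of Nov 10, 2021, as quoted in the text.
totals_per_100k = {
    "FL": {"deaths": 281, "cases": 17_000},
    "CA": {"deaths": 182, "cases": 12_600},
}


def fl_to_ca_ratio(metric):
    """How many times higher Florida's per-100K figure is than California's."""
    return totals_per_100k["FL"][metric] / totals_per_100k["CA"][metric]


for metric in ("deaths", "cases"):
    print(f"{metric}: FL / CA = {fl_to_ca_ratio(metric):.2f}")
# deaths: FL / CA = 1.54
# cases: FL / CA = 1.35
```

In other words, relative to population, Florida had roughly one and a half times California’s deaths and about a third more cases.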

So how do we lie with data?

Now, for the interesting part. The title of this article promised you’d learn to lie with data! There are quite a few techniques, but the Fox News article above used three:

  1. Use accurate numbers, but take them out of context.
  2. Compare the absolute numbers, without any normalization. The comparison looks believable as long as the numbers fall in a similar range.
  3. Compare the subjects at the same point in time, but at two different stages of development.
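The second technique is easy to demonstrate with a few lines of arithmetic. The populations below are the 2019 estimates used earlier; the daily case counts are hypothetical, invented only to show how a ranking can flip once the numbers are normalized.

```python
# 2019 population estimates used earlier in the article.
POPULATION_2019 = {"CA": 39_512_223, "FL": 21_477_737}

# Hypothetical single-day case counts, invented for this illustration.
new_cases = {"CA": 6_000, "FL": 4_000}


def per_100k(count, population):
    """Express a raw count per 100,000 residents."""
    return count / population * 100_000


absolute_leader = max(new_cases, key=new_cases.get)
rates = {s: per_100k(new_cases[s], POPULATION_2019[s]) for s in new_cases}
normalized_leader = max(rates, key=rates.get)

print("higher absolute count:", absolute_leader)    # CA
print("higher rate per 100K: ", normalized_leader)  # FL
```

With these made-up counts, California reports 50% more cases in absolute terms, yet its rate is about 15 per 100K against roughly 19 per 100K in Florida.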

Misleading people using accurate data is a very interesting topic to me, and I plan to write more articles about it, using more examples. Stay tuned.
