COVID-19 Open-Data
COVID-19 Open-Data attempts to assemble the largest COVID-19 epidemiological database, along with a powerful set of expansive covariates. It includes open, publicly sourced, licensed data relating to demographics, economy, epidemiology, geography, health, hospitalizations, mobility, government response, weather, and more.
The details are on GitHub.
It's easy to insert this data into ClickHouse...
Note
The following commands were executed on a Production instance of ClickHouse Cloud. You can easily run them on a local install as well.
- Let's see what the data looks like:
DESCRIBE url(
'https://storage.googleapis.com/covid19-open-data/v3/epidemiology.csv',
'CSVWithNames'
);
The CSV file has 10 columns:
┌─name─────────────────┬─type─────────────┐
│ date │ Nullable(Date) │
│ location_key │ Nullable(String) │
│ new_confirmed │ Nullable(Int64) │
│ new_deceased │ Nullable(Int64) │
│ new_recovered │ Nullable(Int64) │
│ new_tested │ Nullable(Int64) │
│ cumulative_confirmed │ Nullable(Int64) │
│ cumulative_deceased │ Nullable(Int64) │
│ cumulative_recovered │ Nullable(Int64) │
│ cumulative_tested │ Nullable(Int64) │
└──────────────────────┴──────────────────┘
10 rows in set. Elapsed: 0.745 sec.
- Now let's view some of the rows:
SELECT *
FROM url('https://storage.googleapis.com/covid19-open-data/v3/epidemiology.csv')
LIMIT 100;
Notice the url function easily reads data from a CSV file:
┌─c1─────────┬─c2───────────┬─c3────────────┬─c4───────────┬─c5────────────┬─c6─────────┬─c7───────────────────┬─c8──────────────────┬─c9───────────────────┬─c10───────────────┐
│ date │ location_key │ new_confirmed │ new_deceased │ new_recovered │ new_tested │ cumulative_confirmed │ cumulative_deceased │ cumulative_recovered │ cumulative_tested │
│ 2020-04-03 │ AD │ 24 │ 1 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ 466 │ 17 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │
│ 2020-04-04 │ AD │ 57 │ 0 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ 523 │ 17 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │
│ 2020-04-05 │ AD │ 17 │ 4 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ 540 │ 21 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │
│ 2020-04-06 │ AD │ 11 │ 1 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ 551 │ 22 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │
│ 2020-04-07 │ AD │ 15 │ 2 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ 566 │ 24 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │
│ 2020-04-08 │ AD │ 23 │ 2 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ 589 │ 26 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │
└────────────┴──────────────┴───────────────┴──────────────┴───────────────┴────────────┴──────────────────────┴─────────────────────┴──────────────────────┴───────────────────┘
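In the output above, the header line of the file shows up as the first data row and the columns get the generic names c1 through c10. A minimal sketch of the same preview with the format passed explicitly, as in the earlier DESCRIBE call, so that the first line is treated as column names (the LIMIT value is arbitrary):

```sql
-- Same preview, but 'CSVWithNames' tells ClickHouse to read the
-- first line of the file as column names instead of as data.
SELECT *
FROM url(
    'https://storage.googleapis.com/covid19-open-data/v3/epidemiology.csv',
    'CSVWithNames'
)
LIMIT 5;
```

With the format given, the result columns should carry the names listed by DESCRIBE (date, location_key, and so on) rather than c1 through c10.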
- We will create a table now that we know what the data looks like:
CREATE TABLE covid19 (
date Date,
location_key LowCardinality(String),
new_confirmed Int32,
new_deceased Int32,
new_recovered Int32,
new_tested Int32,
cumulative_confirmed Int32,
cumulative_deceased Int32,
cumulative_recovered Int32,
cumulative_tested Int32
)
ENGINE = MergeTree
ORDER BY (location_key, date);
- The following command inserts the entire dataset into the covid19 table:
INSERT INTO covid19
SELECT *
FROM
url(
'https://storage.googleapis.com/covid19-open-data/v3/epidemiology.csv',
'CSVWithNames',
'date Date,
location_key LowCardinality(String),
new_confirmed Int32,
new_deceased Int32,
new_recovered Int32,
new_tested Int32,
cumulative_confirmed Int32,
cumulative_deceased Int32,
cumulative_recovered Int32,
cumulative_tested Int32'
);
- It goes pretty quickly. Let's see how many rows were inserted:
SELECT formatReadableQuantity(count())
FROM covid19;
┌─formatReadableQuantity(count())─┐
│ 12.53 million │
└─────────────────────────────────┘
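Beyond the row count, a quick sanity check on the loaded data can be useful. The following is a sketch, not part of the original walkthrough; the column aliases are illustrative names chosen here:

```sql
-- How many distinct locations were loaded, and what date range
-- do the rows cover?
SELECT
    uniqExact(location_key) AS locations,
    min(date) AS first_date,
    max(date) AS last_date
FROM covid19;
```

If the date range or the number of locations looks off, it is worth re-checking the column types passed to the url function before running further queries.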
- Let's see how many total COVID-19 cases were recorded:
SELECT formatReadableQuantity(sum(new_confirmed))
FROM covid19;
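The same aggregate can be broken down by location. This is an illustrative sketch rather than a query from the original walkthrough; the alias total_confirmed and the LIMIT value are assumptions made here:

```sql
-- Total confirmed cases per location, largest first.
-- Sorting uses the numeric sum, not the formatted string.
SELECT
    location_key,
    formatReadableQuantity(sum(new_confirmed)) AS total_confirmed
FROM covid19
GROUP BY location_key
ORDER BY sum(new_confirmed) DESC
LIMIT 10;
```

To look at a single location over time instead, a filter such as WHERE location_key = 'AD' (a key that appears in the sample rows) combined with ORDER BY date lines up with the table's ORDER BY (location_key, date) sorting key.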