Are you looking for a dataset for data exploration and visualization?
Maybe you should consider the Palmer penguins dataset, which was published as an R package recently (Horst, Hill, and Gorman 2020).
I created the Julia package PalmerPenguins.jl to simplify its use with the Julia programming language and increase its adoption within the Julia community.
TL;DR
The Palmer penguins dataset is an alternative to the controversial iris dataset for data exploration and visualization (but, of course, not the only one).
The Julia package PalmerPenguins.jl provides access to the raw and simplified versions of this dataset, similar to the original R package, without having to download and parse the raw data manually.
Palmer penguins dataset
The Palmer penguins dataset was proposed as an alternative to the iris dataset (Fisher 1936) for data exploration and visualization.
🐧🐧🐧
This penguin data is a great alternative to iris & available for use by CC0 🤩 Thank you Dr. Kristen Gorman w/ @UAFcfos, Marty Downs w/ @USLTER, & @PalmerLTER for help, info & making it available for use
Data, examples, & use info here: https://t.co/dSIqWNFlVw 🧵 1/6 pic.twitter.com/2Eu4AxoeZl
— Allison Horst (@allison_horst) June 8, 2020
Fisher was a vocal proponent of eugenics and published the iris dataset in the Annals of Eugenics in 1936 (!).
Hence there is growing sentiment in the scientific community that the use of the iris dataset is inappropriate.
One does not publish in the Annals of Eugenics in 1936 on a misunderstanding. By using this dataset in 2020, we are sending a very strong message.
— Timothée Poisot
Many people using iris will be unaware that it was first published in work by R A Fisher, a eugenicist with vile and harmful views on race. In fact, the iris dataset was originally published in the Annals of Eugenics. It is clear to me that knowingly using work that was itself used in pursuit of racist ideals is totally unacceptable.
— Megan Stodel
I've long known about Ronald Fisher's eugenicist past, but I admit that I have often thoughtlessly turned to iris when needing a small, boring data set to demonstrate a coding or data principle.
But Daniella and Timothée Poisot are right: it's time to retire iris.
Apart from that, the iris dataset is quite boring. It contains no missing values, and:
With the exception of one or two points, the classes are linearly separable, and so classification algorithms reach almost perfect accuracy.
— Timothée Poisot
The Palmer penguins dataset consists of measurements of 344 penguins from three islands in the Palmer Archipelago, Antarctica, that were collected by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER (Gorman, Williams, and Fraser 2014).
The simplified version of the dataset contains at most seven measurements for each penguin, namely the species (Adelie, Chinstrap, and Gentoo), the island (Torgersen, Biscoe, and Dream), the bill length (measured in mm), the bill depth (measured in mm), the flipper length (measured in mm), the body mass (measured in g), and the sex (male and female).
In total, 19 measurements are missing.
Julia package
The Julia package PalmerPenguins.jl is available in the standard Julia package registry, so you can install it and load it in the usual way by running
using Pkg
Pkg.add("PalmerPenguins")
using PalmerPenguins
in the Julia REPL.
The package uses DataDeps.jl to download a fixed (and hence reproducible) version of the dataset once instead of including a copy of the original dataset.
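By default, DataDeps.jl asks for confirmation before it downloads anything, which gets in the way in scripts or CI jobs. DataDeps.jl documents the environment variable DATADEPS_ALWAYS_ACCEPT for this case. A minimal sketch, assuming the variable is set before the dataset is loaded for the first time:

```julia
# Accept the DataDeps.jl download prompt automatically (useful in scripts and CI).
# This has to be set before the first call to `PalmerPenguins.load()`.
ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"

using PalmerPenguins

# The first call downloads the data; later calls reuse the local copy.
data = PalmerPenguins.load()
```

In an interactive REPL session this is not needed; you can simply confirm the prompt once.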
As explained in the package’s README, the simplified and the raw version of the Palmer penguins dataset can be loaded in a Tables.jl-compatible format. We can inspect the names and types of the features in the simplified and the raw version by running
using Tables
Tables.schema(PalmerPenguins.load())
Tables.Schema:
:species InlineStrings.String15
:island InlineStrings.String15
:bill_length_mm Union{Missing, Float64}
:bill_depth_mm Union{Missing, Float64}
:flipper_length_mm Union{Missing, Int64}
:body_mass_g Union{Missing, Int64}
:sex Union{Missing, InlineStrings.String7}
and
Tables.schema(PalmerPenguins.load(; raw=true))
Tables.Schema:
:studyName InlineStrings.String7
Symbol("Sample Number") Int64
:Species String
:Region InlineStrings.String7
:Island InlineStrings.String15
:Stage InlineStrings.String31
Symbol("Individual ID") InlineStrings.String7
Symbol("Clutch Completion") Bool
Symbol("Date Egg") Dates.Date
Symbol("Culmen Length (mm)") Union{Missing, Float64}
Symbol("Culmen Depth (mm)") Union{Missing, Float64}
Symbol("Flipper Length (mm)") Union{Missing, Int64}
Symbol("Body Mass (g)") Union{Missing, Int64}
:Sex Union{Missing, InlineStrings.String7}
Symbol("Delta 15 N (o/oo)") Union{Missing, Float64}
Symbol("Delta 13 C (o/oo)") Union{Missing, Float64}
:Comments Union{Missing, String}
We also see that the names of the features in the simplified dataset are normalized to lowercase characters without whitespace and brackets.
You might want to convert the tables to a DataFrame object for downstream analyses.
The following code extracts the first five rows of the simplified dataset:
using DataFrames
first(DataFrame(PalmerPenguins.load()), 5)
5×7 DataFrame
 Row │ species   island     bill_length_mm  bill_depth_mm  flipper_length_mm  ⋯
     │ String15  String15   Float64?        Float64?       Int64?             ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Adelie    Torgersen            39.1           18.7                181  ⋯
   2 │ Adelie    Torgersen            39.5           17.4                186
   3 │ Adelie    Torgersen            40.3           18.0                195
   4 │ Adelie    Torgersen         missing        missing            missing
   5 │ Adelie    Torgersen            36.7           19.3                193  ⋯
                                                               2 columns omitted
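Row 4 shows some of the missing measurements mentioned above. As a rough sketch (the variable names are my own and not part of the package), the missing values can be counted per column, and incomplete rows dropped, with standard DataFrames.jl functions:

```julia
using PalmerPenguins
using DataFrames

df = DataFrame(PalmerPenguins.load())

# Number of missing entries in each column, keyed by column name
missing_per_column = Dict(name => count(ismissing, df[!, name]) for name in names(df))

# Total number of missing measurements in the simplified dataset
total_missing = sum(values(missing_per_column))

# Keep only the penguins for which every measurement is available
complete = dropmissing(df)
```

Whether to drop incomplete rows or keep them depends on the analysis; for plotting, most packages simply skip missing values.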
Data can also be extracted through the Tables.jl interface without creating a DataFrame object, as shown in the following visualizations of the Palmer penguins dataset.
The following plots replicate the official examples (even interactively!).
using PlotlyJS

# Scatter plot of flipper length against body mass;
# the columns are read directly through the Tables.jl interface
trace = let data = PalmerPenguins.load()
    scatter(;
        mode="markers",
        x=Tables.getcolumn(data, :flipper_length_mm),
        y=Tables.getcolumn(data, :body_mass_g),
        # Split the single trace into one group per species
        transforms=[
            attr(;
                type="groupby",
                groups=Tables.getcolumn(data, :species),
            ),
        ],
    )
end
layout = Layout(;
    title=attr(; text="Flipper length and body mass", x=0.5, xanchor="center"),
    xaxis=attr(; title="Flipper length (mm)"),
    yaxis=attr(; title="Body mass (g)"),
    template=templates["simple_white"],
)
plt = PlotlyJS.plot([trace], layout)

# Overlaid histograms of flipper length, one per species
trace = let data = PalmerPenguins.load()
    histogram(;
        x=Tables.getcolumn(data, :flipper_length_mm),
        opacity=0.75,
        # Split the single trace into one group per species
        transforms=[
            attr(;
                type="groupby",
                groups=Tables.getcolumn(data, :species),
            ),
        ],
    )
end
layout = Layout(;
    title=attr(; text="Flipper length", x=0.5, xanchor="center"),
    xaxis=attr(; title="Flipper length (mm)"),
    yaxis=attr(; title="Frequency"),
    barmode="overlay",  # draw the histograms on top of each other
    template=templates["simple_white"],
)
plt = PlotlyJS.plot([trace], layout)
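
The same column access also works for quick numerical summaries without a DataFrame. A minimal sketch, assuming only the Statistics standard library in addition to the packages used above (the variable names are my own):

```julia
using PalmerPenguins
using Tables
using Statistics

data = PalmerPenguins.load()
species = Tables.getcolumn(data, :species)
mass = Tables.getcolumn(data, :body_mass_g)

# Mean body mass per species, ignoring missing measurements
for s in unique(species)
    vals = skipmissing(mass[species .== s])
    println(s, ": ", round(mean(vals); digits=1), " g")
end
```

For anything beyond such one-off summaries, converting to a DataFrame and using its grouping functions is probably more convenient.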