Published: July 28, 2020
Modified: November 25, 2024
The Palmer penguins dataset is an alternative to the controversial `iris` dataset for data exploration and visualization (but, of course, not the only one). The Julia package `PalmerPenguins.jl` provides access to the raw and simplified versions of this dataset, similar to the original R package, without having to download and parse the raw data manually.
Are you looking for a dataset for data exploration and visualization? Maybe you should consider the Palmer penguins dataset, which was published as an R package recently (Horst, Hill, and Gorman 2020). I created the Julia package `PalmerPenguins.jl` to simplify its use with the Julia programming language and increase its adoption within the Julia community.
The Palmer penguins dataset was proposed as an alternative to the `iris` dataset by Fisher (1936) for data exploration and visualization.
Fisher was a vocal proponent of eugenics and published the `iris` dataset in the Annals of Eugenics in 1936 (!). Hence there is growing sentiment in the scientific community that the use of the `iris` dataset is inappropriate.
> One does not publish in the Annals of Eugenics in 1936 on a misunderstanding. By using this dataset in 2020, we are sending a very strong message.
>
> — Timothée Poisot
> Many people using iris will be unaware that it was first published in work by R A Fisher, a eugenicist with vile and harmful views on race. In fact, the iris dataset was originally published in the Annals of Eugenics. It is clear to me that knowingly using work that was itself used in pursuit of racist ideals is totally unacceptable.
>
> — Megan Stodel
> I’ve long known about Ronald Fisher’s eugenicist past, but I admit that I have often thoughtlessly turned to iris when needing a small, boring data set to demonstrate a coding or data principle.
>
> But Daniella and Timothée Poisot are right: it’s time to retire iris.
>
> — Garrick Aden-Buie
Apart from that, the `iris` dataset is quite boring. It contains no missing values, and its classes are easy to separate:
> With the exception of one or two points, the classes are linearly separable, and so classification algorithms reach almost perfect accuracy.
>
> — Timothée Poisot
The Palmer penguins dataset consists of measurements of 344 penguins from three islands in the Palmer Archipelago, Antarctica, that were collected by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER (Gorman, Williams, and Fraser 2014). The simplified version of the dataset contains at most seven measurements for each penguin, namely the species (`Adelie`, `Chinstrap`, and `Gentoo`), the island (`Torgersen`, `Biscoe`, and `Dream`), the bill length (measured in mm), the bill depth (measured in mm), the flipper length (measured in mm), the body mass (measured in g), and the sex (`male` and `female`). In total, 19 measurements are missing.
The Julia package `PalmerPenguins.jl` is available in the standard Julia package registry, so you can install it and load it in the usual way by running `using Pkg; Pkg.add("PalmerPenguins"); using PalmerPenguins` in the Julia REPL. The package uses `DataDeps.jl` to download a fixed (and hence reproducible) version of the dataset once instead of including a copy of the original dataset.
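As explained in the package’s README, the simplified and the raw version of the Palmer penguins dataset can be loaded in a Tables.jl-compatible format. We can inspect the names and types of the features in the simplified and the raw version by running `Tables.schema(PalmerPenguins.load())` and `Tables.schema(PalmerPenguins.load(; raw=true))` (after `using Tables`), which return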
Tables.Schema:
:species InlineStrings.String15
:island InlineStrings.String15
:bill_length_mm Union{Missing, Float64}
:bill_depth_mm Union{Missing, Float64}
:flipper_length_mm Union{Missing, Int64}
:body_mass_g Union{Missing, Int64}
:sex Union{Missing, InlineStrings.String7}
and
Tables.Schema:
:studyName InlineStrings.String7
Symbol("Sample Number") Int64
:Species String
:Region InlineStrings.String7
:Island InlineStrings.String15
:Stage InlineStrings.String31
Symbol("Individual ID") InlineStrings.String7
Symbol("Clutch Completion") Bool
Symbol("Date Egg") Dates.Date
Symbol("Culmen Length (mm)") Union{Missing, Float64}
Symbol("Culmen Depth (mm)") Union{Missing, Float64}
Symbol("Flipper Length (mm)") Union{Missing, Int64}
Symbol("Body Mass (g)") Union{Missing, Int64}
:Sex Union{Missing, InlineStrings.String7}
Symbol("Delta 15 N (o/oo)") Union{Missing, Float64}
Symbol("Delta 13 C (o/oo)") Union{Missing, Float64}
:Comments Union{Missing, String}
We also see that the names of the features in the simplified dataset are normalized to lowercase characters without whitespace and brackets.
You might want to convert the tables to a `DataFrame` object for downstream analyses. The first five rows of the simplified dataset, extracted with `first(DataFrame(PalmerPenguins.load()), 5)` after `using DataFrames`, are:
| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
Data can also be extracted with the Tables.jl interface, without creating a `DataFrame` object, as shown in the following visualizations of the Palmer penguins dataset. The following plots replicate the official examples (even interactively!).
using PlotlyJS
trace = let data = PalmerPenguins.load()
    scatter(;
        mode="markers",
        x=Tables.getcolumn(data, :flipper_length_mm),
        y=Tables.getcolumn(data, :body_mass_g),
        transforms=[
            attr(;
                type="groupby",
                groups=Tables.getcolumn(data, :species),
            ),
        ],
    )
end
layout = Layout(;
    xaxis=attr(; title="Flipper length (mm)"),
    yaxis=attr(; title="Body mass (g)"),
    template=templates["simple_white"],
)
PlotlyJS.plot([trace], layout)

trace = let data = PalmerPenguins.load()
    histogram(;
        x=Tables.getcolumn(data, :flipper_length_mm),
        opacity=0.75,
        transforms=[
            attr(;
                type="groupby",
                groups=Tables.getcolumn(data, :species),
            ),
        ],
    )
end
layout = Layout(;
    xaxis=attr(; title="Flipper length (mm)"),
    yaxis=attr(; title="Frequency"),
    barmode="overlay",
    template=templates["simple_white"],
)
PlotlyJS.plot([trace], layout)
---
title: "PalmerPenguins.jl"
date: 2020-07-28
date-modified: 2024-11-25
aliases:
- "/blog/2020/07/palmerpenguins"
categories:
- penguins
- julia
abstract: >
  The Palmer penguins dataset is an alternative to the controversial `iris` dataset for data exploration and visualization (but, of course, [not the only one](https://www.meganstodel.com/posts/no-to-iris/)).
  The Julia package [`PalmerPenguins.jl`](https://github.com/devmotion/PalmerPenguins.jl) provides access to the raw and simplified versions of this dataset, similar to the original R package, without having to download and parse the raw data manually.
format:
  html:
    code-fold: show
    code-tools: true
    code-links:
      - text: Project.toml
        icon: file-code
        href: Project.toml
      - text: Manifest.toml
        icon: file-code
        href: Manifest.toml
bibliography: references.bib
engine: julia
julia:
  env: ["DATADEPS_ALWAYS_ACCEPT=true"]
---
Are you looking for a dataset for data exploration and visualization?
Maybe you should consider the [Palmer penguins dataset](https://allisonhorst.github.io/palmerpenguins/), which was published as an [R package](https://cloud.r-project.org/web/packages/palmerpenguins/index.html) recently [@Horst2020].
I created the Julia package [`PalmerPenguins.jl`](https://github.com/devmotion/PalmerPenguins.jl) to simplify its use with the Julia programming language and increase its adoption within the Julia community.
## Palmer penguins dataset {#palmer-penguins-dataset}
The Palmer penguins dataset was proposed as an alternative to the [`iris` dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set) by @Fisher1936 for data exploration and visualization.
{{< share-post https://twitter.com/allison_horst/status/1270046399418138625 >}}
Fisher was a vocal proponent of eugenics and published the `iris` dataset in the **Annals of Eugenics** in 1936 (!).
Hence there is growing sentiment in the scientific community that the use of the `iris` dataset is inappropriate.
> One does not publish in the Annals of Eugenics in 1936 on a misunderstanding.
> By using this dataset in 2020, we are sending a very strong message.
>
> — [Timothée Poisot](https://armchairecology.blog/iris-dataset)
> Many people using iris will be unaware that it was first published in work by R A Fisher, a eugenicist with vile and harmful views on race.
> In fact, the iris dataset was originally published in the Annals of Eugenics.
> It is clear to me that knowingly using work that was itself used in pursuit of racist ideals is totally unacceptable.
>
> — [Megan Stodel](https://www.meganstodel.com/posts/no-to-iris)
> I’ve long known about Ronald Fisher’s eugenicist past, but I admit that I have often thoughtlessly turned to iris when needing a small, boring data set to demonstrate a coding or data principle.
>
> But Daniella and Timothée Poisot are right: it’s time to retire iris.
>
> — [Garrick Aden-Buie](https://www.garrickadenbuie.com/blog/lets-move-on-from-iris)
Apart from that, the `iris` dataset is quite boring.
It contains no missing values, and its classes are easy to separate:
> With the exception of one or two points, the classes are linearly separable, and so classification algorithms reach almost perfect accuracy.
>
> — [Timothée Poisot](https://armchairecology.blog/iris-dataset)
The Palmer penguins dataset consists of measurements of 344 penguins from three islands in the [Palmer Archipelago, Antarctica](https://en.wikipedia.org/wiki/Palmer_Archipelago), that were collected by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and the [Palmer Station, Antarctica LTER](https://pal.lternet.edu/) [@Gorman2014].
The simplified version of the dataset contains at most seven measurements for each penguin, namely the species (`Adelie`, `Chinstrap`, and `Gentoo`), the island (`Torgersen`, `Biscoe`, and `Dream`), the bill length (measured in mm), the bill depth (measured in mm), the flipper length (measured in mm), the body mass (measured in g), and the sex (`male` and `female`).
In total, 19 measurements are missing.
 by [\@allison_horst](https://twitter.com/allison_horst).](https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png)
## Julia package {#julia-package}
The Julia package `PalmerPenguins.jl` is available in the standard Julia package registry, so you can install it and load it in the usual way by running
```{julia}
#| output: false
using Pkg
Pkg.add("PalmerPenguins")
using PalmerPenguins
```
in the Julia REPL.
The package uses [`DataDeps.jl`](https://github.com/oxinabox/DataDeps.jl) to download a fixed (and hence reproducible) version of the dataset once instead of including a copy of the original dataset.
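
For non-interactive use, such as scripts or rendered documents, `DataDeps.jl` can be told to accept the download without prompting through the `DATADEPS_ALWAYS_ACCEPT` environment variable, which is also what the front matter of this post sets. A minimal sketch, assuming the variable is set before the dataset is loaded for the first time:

```julia
# Accept DataDeps.jl download prompts automatically.
# This has to happen before the first call that triggers the download.
ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"

# Afterwards the dataset can be loaded as usual (this needs PalmerPenguins.jl
# to be installed and, on first use, network access):
# using PalmerPenguins
# table = PalmerPenguins.load()
```
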
As explained in the [package's README](https://github.com/devmotion/PalmerPenguins.jl/blob/master/README.md), the simplified and the raw version of the Palmer penguins dataset can be loaded in a [Tables.jl-compatible format](https://github.com/JuliaData/Tables.jl).
We can inspect the names and types of the features in the simplified and the raw version by running
```{julia}
using Tables
Tables.schema(PalmerPenguins.load())
```
and
```{julia}
Tables.schema(PalmerPenguins.load(; raw=true))
```
We also see that the names of the features in the simplified dataset are normalized to lowercase characters without whitespace and brackets.
You might want to convert the tables to a `DataFrame` object for downstream analyses.
The following code extracts the first five rows of the simplified dataset:
```{julia}
using DataFrames
first(DataFrame(PalmerPenguins.load()), 5)
```
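
Because several columns have element types such as `Union{Missing, Float64}`, downstream code has to handle `missing` explicitly. The following sketch does not use the package at all: it copies the first five rows of the simplified dataset by hand into a `NamedTuple` of vectors and relies only on Base and the `Statistics` standard library. The variable names (`sample`, `missing_per_column`, and so on) and the hand-copied subset are illustrative assumptions, not part of `PalmerPenguins.jl`.

```julia
using Statistics

# First five rows of the simplified dataset, copied by hand for illustration.
sample = (
    species = ["Adelie", "Adelie", "Adelie", "Adelie", "Adelie"],
    island = ["Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen"],
    bill_length_mm = [39.1, 39.5, 40.3, missing, 36.7],
    bill_depth_mm = [18.7, 17.4, 18.0, missing, 19.3],
    flipper_length_mm = [181, 186, 195, missing, 193],
    body_mass_g = [3750, 3800, 3250, missing, 3450],
    sex = ["male", "female", "female", missing, "female"],
)

# Number of missing entries in each column (the result is again a NamedTuple).
missing_per_column = map(col -> count(ismissing, col), sample)

# Total number of missing entries in this five-row subset.
total_missing = sum(missing_per_column)

# Mean bill length over the rows where it was recorded.
mean_bill_length = mean(skipmissing(sample.bill_length_mm))

println(missing_per_column)
println("total missing: ", total_missing)
println("mean bill length (mm): ", mean_bill_length)
```

A `NamedTuple` of vectors is itself a valid Tables.jl column table, so the same pattern should carry over to the full dataset, for example via `Tables.columntable(PalmerPenguins.load())`.
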
Data can also be extracted with the Tables.jl interface, without creating a `DataFrame` object, as shown in the following visualizations of the Palmer penguins dataset.
The following plots replicate the [official examples](https://allisonhorst.github.io/palmerpenguins/#examples) (even interactively!).
```{julia}
#| fig-cap: "Body mass versus flipper length."
using PlotlyJS
trace = let data = PalmerPenguins.load()
    scatter(;
        mode="markers",
        x=Tables.getcolumn(data, :flipper_length_mm),
        y=Tables.getcolumn(data, :body_mass_g),
        transforms=[
            attr(;
                type="groupby",
                groups=Tables.getcolumn(data, :species),
            ),
        ],
    )
end
layout = Layout(;
    xaxis=attr(; title="Flipper length (mm)"),
    yaxis=attr(; title="Body mass (g)"),
    template=templates["simple_white"],
)
PlotlyJS.plot([trace], layout)
```
```{julia}
#| fig-cap: "Flipper length."
trace = let data = PalmerPenguins.load()
    histogram(;
        x=Tables.getcolumn(data, :flipper_length_mm),
        opacity=0.75,
        transforms=[
            attr(;
                type="groupby",
                groups=Tables.getcolumn(data, :species),
            ),
        ],
    )
end
layout = Layout(;
    xaxis=attr(; title="Flipper length (mm)"),
    yaxis=attr(; title="Frequency"),
    barmode="overlay",
    template=templates["simple_white"],
)
PlotlyJS.plot([trace], layout)
```
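
The `groupby` transform used in both plots splits a single trace into one group per label. Conceptually, this amounts to partitioning a value column by a label column. The sketch below illustrates that idea in plain Julia. `group_values` is a hypothetical helper (not part of `PalmerPenguins.jl`, `Tables.jl`, or `PlotlyJS.jl`), and the two input vectors are the sex and flipper length entries of the first five rows of the simplified dataset, copied by hand.

```julia
# Hypothetical helper: collect the values that belong to each label,
# skipping rows where either the label or the value is missing.
function group_values(labels, values)
    groups = Dict{String,Vector{Int}}()
    for (label, value) in zip(labels, values)
        (ismissing(label) || ismissing(value)) && continue
        push!(get!(groups, label, Int[]), value)
    end
    return groups
end

# Sex and flipper length (mm) of the first five penguins, copied by hand.
sex = ["male", "female", "female", missing, "female"]
flipper_length_mm = [181, 186, 195, missing, 193]

groups = group_values(sex, flipper_length_mm)
for (label, values) in groups
    println(label, ": ", values)
end
```

The column types of the real dataset differ (for example, the labels are `String15` values), so the type annotations of the `Dict` would need adjusting before applying the helper to columns obtained with `Tables.getcolumn`.
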
## References