penguins
julia
Published: July 28, 2020
Modified: November 25, 2024

Abstract

The Palmer penguins dataset is an alternative to the controversial iris dataset for data exploration and visualization (but, of course, not the only one). The Julia package PalmerPenguins.jl provides access to the raw and simplified versions of this dataset, similar to the original R package, without having to download and parse the raw data manually.

Are you looking for a dataset for data exploration and visualization? Maybe you should consider the Palmer penguins dataset, which was published as an R package recently (Horst, Hill, and Gorman 2020). I created the Julia package PalmerPenguins.jl to simplify its use with the Julia programming language and increase its adoption within the Julia community.

Palmer penguins dataset

The Palmer penguins dataset was proposed as an alternative to the iris dataset, published by Fisher (1936), for data exploration and visualization.

Fisher was a vocal proponent of eugenics and published the iris dataset in the Annals of Eugenics in 1936 (!). Hence there is growing sentiment in the scientific community that the use of the iris dataset is inappropriate.

One does not publish in the Annals of Eugenics in 1936 on a misunderstanding. By using this dataset in 2020, we are sending a very strong message.

Timothée Poisot

Many people using iris will be unaware that it was first published in work by R A Fisher, a eugenicist with vile and harmful views on race. In fact, the iris dataset was originally published in the Annals of Eugenics. It is clear to me that knowingly using work that was itself used in pursuit of racist ideals is totally unacceptable.

Megan Stodel

I’ve long known about Ronald Fisher’s eugenicist past, but I admit that I have often thoughtlessly turned to iris when needing a small, boring data set to demonstrate a coding or data principle.

But Daniella and Timothée Poisot are right: it’s time to retire iris.

Garrick Aden-Buie

Apart from that, the iris dataset is quite boring: it contains no missing values, and classification is almost trivial:

With the exception of one or two points, the classes are linearly separable, and so classification algorithms reach almost perfect accuracy.

Timothée Poisot

The Palmer penguins dataset consists of measurements of 344 penguins from three islands in the Palmer Archipelago, Antarctica, that were collected by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER (Gorman, Williams, and Fraser 2014). The simplified version of the dataset contains up to seven variables for each penguin: the species (Adelie, Chinstrap, or Gentoo), the island (Torgersen, Biscoe, or Dream), the bill length (in mm), the bill depth (in mm), the flipper length (in mm), the body mass (in g), and the sex (male or female). In total, 19 values are missing.

Palmer penguins. Artwork by @allison_horst.

Julia package

The Julia package PalmerPenguins.jl is available in the standard Julia package registry, so you can install it and load it in the usual way by running

using Pkg
Pkg.add("PalmerPenguins")

using PalmerPenguins

in the Julia REPL. The package uses DataDeps.jl to download a fixed (and hence reproducible) version of the dataset once instead of including a copy of the original dataset.

As explained in the package’s README, the simplified and the raw version of the Palmer penguins dataset can be loaded in a Tables.jl-compatible format. We can inspect the names and types of the features in the simplified and the raw version by running

using Tables

Tables.schema(PalmerPenguins.load())
Tables.Schema:
 :species            InlineStrings.String15
 :island             InlineStrings.String15
 :bill_length_mm     Union{Missing, Float64}
 :bill_depth_mm      Union{Missing, Float64}
 :flipper_length_mm  Union{Missing, Int64}
 :body_mass_g        Union{Missing, Int64}
 :sex                Union{Missing, InlineStrings.String7}

and

Tables.schema(PalmerPenguins.load(; raw=true))
Tables.Schema:
 :studyName                     InlineStrings.String7
 Symbol("Sample Number")        Int64
 :Species                       String
 :Region                        InlineStrings.String7
 :Island                        InlineStrings.String15
 :Stage                         InlineStrings.String31
 Symbol("Individual ID")        InlineStrings.String7
 Symbol("Clutch Completion")    Bool
 Symbol("Date Egg")             Dates.Date
 Symbol("Culmen Length (mm)")   Union{Missing, Float64}
 Symbol("Culmen Depth (mm)")    Union{Missing, Float64}
 Symbol("Flipper Length (mm)")  Union{Missing, Int64}
 Symbol("Body Mass (g)")        Union{Missing, Int64}
 :Sex                           Union{Missing, InlineStrings.String7}
 Symbol("Delta 15 N (o/oo)")    Union{Missing, Float64}
 Symbol("Delta 13 C (o/oo)")    Union{Missing, Float64}
 :Comments                      Union{Missing, String}

We also see that the names of the features in the simplified dataset are normalized: they are lowercase, use underscores instead of whitespace, and contain no brackets.
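The normalization of the names can be illustrated with a few lines of plain Julia. This is only a minimal sketch: normalize_name is a hypothetical helper written for this post, not a function of PalmerPenguins.jl, and it is not how the simplified dataset was actually produced.

```julia
# Hypothetical helper (not part of PalmerPenguins.jl): lowercase the name,
# drop round brackets, and replace runs of whitespace with underscores.
function normalize_name(name::AbstractString)
    cleaned = replace(lowercase(name), r"[()]" => "")
    return Symbol(replace(strip(cleaned), r"\s+" => "_"))
end

# Three column names taken from the raw schema shown above
raw_names = ["Flipper Length (mm)", "Body Mass (g)", "Sex"]
normalized = normalize_name.(raw_names)
```

Note that such a rule does not cover everything: the raw culmen columns appear as bill_length_mm and bill_depth_mm in the simplified version, so some features were renamed as well.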

You might want to convert the tables to a DataFrame object for downstream analyses. The following code extracts the first five rows of the simplified dataset:

using DataFrames

first(DataFrame(PalmerPenguins.load()), 5)
5×7 DataFrame
 Row │ species   island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
     │ String15  String15   Float64?        Float64?       Int64?             Int64?       String7?
─────┼─────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Adelie    Torgersen            39.1           18.7                181         3750  male
   2 │ Adelie    Torgersen            39.5           17.4                186         3800  female
   3 │ Adelie    Torgersen            40.3           18.0                195         3250  female
   4 │ Adelie    Torgersen         missing        missing            missing      missing  missing
   5 │ Adelie    Torgersen            36.7           19.3                193         3450  female
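The fourth row shows why missing values matter in practice. The following sketch uses only the five rows printed above, copied by hand into a named tuple of vectors (the island column is omitted for brevity), so it runs without downloading the dataset. The variable names are chosen for illustration only.

```julia
# First five rows of the simplified dataset, copied from the output above
penguins = (
    species = ["Adelie", "Adelie", "Adelie", "Adelie", "Adelie"],
    bill_length_mm = [39.1, 39.5, 40.3, missing, 36.7],
    bill_depth_mm = [18.7, 17.4, 18.0, missing, 19.3],
    flipper_length_mm = [181, 186, 195, missing, 193],
    body_mass_g = [3750, 3800, 3250, missing, 3450],
    sex = ["male", "female", "female", missing, "female"],
)

# Number of missing entries in each column (returns a named tuple)
missing_counts = map(col -> count(ismissing, col), penguins)

# Reductions propagate missing values unless they are skipped explicitly
naive_total = sum(penguins.body_mass_g)               # evaluates to missing
total_mass = sum(skipmissing(penguins.body_mass_g))   # sums the four known masses
```

Applied to the full dataset, the same count should add up to the 19 missing values mentioned earlier.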

Columns can also be extracted directly through the Tables.jl interface, without creating a DataFrame object, as shown in the following visualizations of the Palmer penguins dataset. The plots replicate the official examples (even interactively!).

using PlotlyJS

trace = let data = PalmerPenguins.load()
    scatter(;
        mode="markers",
        x=Tables.getcolumn(data, :flipper_length_mm),
        y=Tables.getcolumn(data, :body_mass_g),
        transforms=[
            attr(;
                type="groupby",
                groups=Tables.getcolumn(data, :species),
            ),
        ],
    )
end

layout = Layout(;
    xaxis=attr(; title="Flipper length (mm)"),
    yaxis=attr(; title="Body mass (g)"),
    template=templates["simple_white"],
)

PlotlyJS.plot([trace], layout)

The distribution of flipper lengths for each species can be shown as overlaid histograms:

trace = let data = PalmerPenguins.load()
    histogram(;
        x=Tables.getcolumn(data, :flipper_length_mm),
        opacity=0.75,
        transforms=[
            attr(;
                type="groupby",
                groups=Tables.getcolumn(data, :species),
            ),
        ],
    )
end

layout = Layout(;
    xaxis=attr(; title="Flipper length (mm)"),
    yaxis=attr(; title="Frequency"),
    barmode="overlay",
    template=templates["simple_white"],
)

PlotlyJS.plot([trace], layout)
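
The Tables.jl accessors are not limited to plotting. As a rough sketch, the following code iterates over rows with Tables.rows and computes the mean body mass per sex. To keep it independent of the download, it uses the five rows printed earlier, copied by hand into a named tuple of vectors (which Tables.jl treats as a column table). The function mean_mass_by_sex is a hypothetical helper, not part of any package; I would expect it to work with the output of PalmerPenguins.load() as well.

```julia
using Tables

# Two columns of the five rows shown earlier, as a Tables.jl column table
table = (
    sex = ["male", "female", "female", missing, "female"],
    body_mass_g = [3750, 3800, 3250, missing, 3450],
)

# Hypothetical helper: mean body mass per sex, skipping incomplete rows
function mean_mass_by_sex(table)
    sums = Dict{String,Float64}()
    counts = Dict{String,Int}()
    for row in Tables.rows(table)
        sex = Tables.getcolumn(row, :sex)
        mass = Tables.getcolumn(row, :body_mass_g)
        # Skip penguins for which either value is unknown
        (ismissing(sex) || ismissing(mass)) && continue
        key = String(sex)
        sums[key] = get(sums, key, 0.0) + mass
        counts[key] = get(counts, key, 0) + 1
    end
    return Dict(key => sums[key] / counts[key] for key in keys(sums))
end

result = mean_mass_by_sex(table)
```

Row-wise iteration keeps the example close to the Tables.jl interface; for larger analyses, converting to a DataFrame as shown above is probably more convenient.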

References

Fisher, R. A. 1936. “The Use of Multiple Measurements in Taxonomic Problems.” Annals of Eugenics 7 (2): 179–88. https://doi.org/10.1111/j.1469-1809.1936.tb02137.x.
Gorman, K. B., T. D. Williams, and W. R. Fraser. 2014. “Ecological Sexual Dimorphism and Environmental Variability Within a Community of Antarctic Penguins (Genus Pygoscelis).” PLoS ONE 9 (3): e90081. https://doi.org/10.1371/journal.pone.0090081.
Horst, A. M., A. P. Hill, and K. B. Gorman. 2020. “Palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data.” https://doi.org/10.5281/zenodo.3960218.