PalmerPenguins.jl

Are you looking for a dataset for data exploration and visualization? Maybe you should consider the Palmer penguins dataset, which was published as an R package recently (Horst, Hill, and Gorman 2020). I created the Julia package PalmerPenguins.jl to simplify its use with the Julia programming language and increase its adoption within the Julia community.

TL;DR

The Palmer penguins dataset is an alternative to the controversial iris dataset for data exploration and visualization (but, of course, not the only one). The Julia package PalmerPenguins.jl provides access to the raw and simplified versions of this dataset, similar to the original R package, without having to download and parse the raw data manually.

Palmer penguins dataset

The Palmer penguins dataset was proposed as an alternative to Fisher's (1936) iris dataset for data exploration and visualization.

Fisher was a vocal proponent of eugenics and published the iris dataset in the Annals of Eugenics in 1936 (!). Hence there is growing sentiment in the scientific community that the use of the iris dataset is inappropriate.

One does not publish in the Annals of Eugenics in 1936 on a misunderstanding. By using this dataset in 2020, we are sending a very strong message.

-- Timothée Poisot

Many people using iris will be unaware that it was first published in work by R A Fisher, a eugenicist with vile and harmful views on race. In fact, the iris dataset was originally published in the Annals of Eugenics. It is clear to me that knowingly using work that was itself used in pursuit of racist ideals is totally unacceptable.

-- Megan Stodel

I’ve long known about Ronald Fisher’s eugenicist past, but I admit that I have often thoughtlessly turned to iris when needing a small, boring data set to demonstrate a coding or data principle.

But Daniella and Timothée Poisot are right: it’s time to retire iris.

-- Garrick Aden-Buie

Apart from that, the iris dataset is quite boring: it contains no missing values, and its classes are almost trivially easy to separate:

With the exception of one or two points, the classes are linearly separable, and so classification algorithms reach almost perfect accuracy.

-- Timothée Poisot

The Palmer penguins dataset consists of measurements of 344 penguins from three islands in the Palmer Archipelago, Antarctica, that were collected by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER (Gorman, Williams, and Fraser 2014). The simplified version of the dataset contains at most seven measurements for each penguin, namely the species (Adelie, Chinstrap, and Gentoo), the island (Torgersen, Biscoe, and Dream), the bill length (measured in mm), the bill depth (measured in mm), the flipper length (measured in mm), the body mass (measured in g), and the sex (male and female). In total, 19 measurements are missing.

Figure 1: Palmer penguins. Artwork by @allison_horst.

Julia package

The Julia package PalmerPenguins.jl is available in the standard Julia package registry, so you can install it and load it in the usual way by running

using Pkg
Pkg.add("PalmerPenguins")

using PalmerPenguins

in the Julia REPL. The package uses DataDeps.jl to download a fixed (and hence reproducible) version of the dataset once instead of including a copy of the original dataset.
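Because the data is fetched on first use, DataDeps.jl normally asks for confirmation before it downloads anything. In non-interactive settings such as scripts or CI jobs, that prompt can be approved in advance through an environment variable. The following is a minimal sketch, assuming the default DataDeps.jl behavior and that automatically accepting the download is fine in your environment:

```julia
# Accept DataDeps.jl download prompts automatically (useful in scripts and CI).
# This has to be set before the first call that triggers the download.
ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"

using PalmerPenguins

# The first call downloads the data; later calls reuse the local copy.
data = PalmerPenguins.load()
```

In an interactive REPL session this step is usually unnecessary, since you can simply answer the prompt once.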

As explained in the package’s README, the simplified and the raw version of the Palmer penguins dataset can be loaded in a Tables.jl-compatible format. We can inspect the names and types of the features in the simplified and the raw version by running

using Tables

Tables.schema(PalmerPenguins.load())
Tables.Schema:
 :species            InlineStrings.String15
 :island             InlineStrings.String15
 :bill_length_mm     Union{Missing, Float64}
 :bill_depth_mm      Union{Missing, Float64}
 :flipper_length_mm  Union{Missing, Int64}
 :body_mass_g        Union{Missing, Int64}
 :sex                Union{Missing, InlineStrings.String7}

and

Tables.schema(PalmerPenguins.load(; raw=true))
Tables.Schema:
 :studyName                     InlineStrings.String7
 Symbol("Sample Number")        Int64
 :Species                       String
 :Region                        InlineStrings.String7
 :Island                        InlineStrings.String15
 :Stage                         InlineStrings.String31
 Symbol("Individual ID")        InlineStrings.String7
 Symbol("Clutch Completion")    Bool
 Symbol("Date Egg")             Dates.Date
 Symbol("Culmen Length (mm)")   Union{Missing, Float64}
 Symbol("Culmen Depth (mm)")    Union{Missing, Float64}
 Symbol("Flipper Length (mm)")  Union{Missing, Int64}
 Symbol("Body Mass (g)")        Union{Missing, Int64}
 :Sex                           Union{Missing, InlineStrings.String7}
 Symbol("Delta 15 N (o/oo)")    Union{Missing, Float64}
 Symbol("Delta 13 C (o/oo)")    Union{Missing, Float64}
 :Comments                      Union{Missing, String}

The schemas also show that the feature names in the simplified dataset are normalized: they are lowercase and contain no whitespace or parentheses, and some are renamed (for example, "Culmen Length (mm)" becomes bill_length_mm).
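The difference in naming matters in practice when you access columns by name. The sketch below contrasts the two styles, using only the column names from the schemas printed above. The variable names are illustrative, and the final length check assumes that both versions contain one row per penguin.

```julia
using PalmerPenguins
using Tables

simple = Tables.columns(PalmerPenguins.load())
raw = Tables.columns(PalmerPenguins.load(; raw=true))

# Normalized names are plain identifiers, so the short symbol syntax works.
mass_simple = Tables.getcolumn(simple, :body_mass_g)

# Raw names contain spaces and parentheses, so they need Symbol("...").
mass_raw = Tables.getcolumn(raw, Symbol("Body Mass (g)"))

# Both columns are expected to hold one entry per penguin.
same_length = length(mass_simple) == length(mass_raw)
```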

You might want to convert the tables to a DataFrame object for downstream analyses. The following code extracts the first five rows of the simplified dataset:

using DataFrames

first(DataFrame(PalmerPenguins.load()), 5)
5×7 DataFrame
 Row │ species   island     bill_length_mm  bill_depth_mm  flipper_length_mm   ⋯
     │ String15  String15   Float64?        Float64?       Int64?              ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Adelie    Torgersen            39.1           18.7                181   ⋯
   2 │ Adelie    Torgersen            39.5           17.4                186
   3 │ Adelie    Torgersen            40.3           18.0                195
   4 │ Adelie    Torgersen       missing        missing              missing
   5 │ Adelie    Torgersen            36.7           19.3                193   ⋯
                                                               2 columns omitted
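Row 4 above already shows that some measurements are missing. One possible way to inspect and handle them with DataFrames.jl is sketched below. The variable names are illustrative, and whether dropping incomplete rows is appropriate depends on your analysis.

```julia
using DataFrames
using PalmerPenguins

df = DataFrame(PalmerPenguins.load())

# Number of missing entries in each column.
missing_per_column = describe(df, :nmissing)

# Total number of missing entries across the whole table.
total_missing = sum(count(ismissing, col) for col in eachcol(df))

# Keep only the penguins for which all seven measurements were recorded.
complete = dropmissing(df)
```

The value of total_missing should agree with the 19 missing measurements mentioned in the dataset description.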

Data can also be extracted through the Tables.jl interface without creating a DataFrame object, as shown in the following visualizations of the Palmer penguins dataset. The plots replicate the official examples (even interactively!).

using PlotlyJS

trace = let data = PalmerPenguins.load()
    scatter(;
        mode="markers",
        x=Tables.getcolumn(data, :flipper_length_mm),
        y=Tables.getcolumn(data, :body_mass_g),
        transforms=[
            attr(;
                type="groupby",
                groups=Tables.getcolumn(data, :species),
            ),
        ],
    )
end

layout = Layout(;
    title=attr(; text="Flipper length and body mass", x=0.5, xanchor="center"),
    xaxis=attr(; title="Flipper length (mm)"),
    yaxis=attr(; title="Body mass (g)"),
    template=templates["simple_white"],
)

plt = PlotlyJS.plot([trace], layout)

The histogram of flipper lengths, grouped by species, is built the same way:

trace = let data = PalmerPenguins.load()
    histogram(;
        x=Tables.getcolumn(data, :flipper_length_mm),
        opacity=0.75,
        transforms=[
            attr(;
                type="groupby",
                groups=Tables.getcolumn(data, :species),
            ),
        ],
    )
end

layout = Layout(;
    title=attr(; text="Flipper length", x=0.5, xanchor="center"),
    xaxis=attr(; title="Flipper length (mm)"),
    yaxis=attr(; title="Frequency"),
    barmode="overlay",
    template=templates["simple_white"],
)

plt = PlotlyJS.plot([trace], layout)
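The column accessors used for the plots work just as well for simple numerical summaries. As a minimal sketch, the following computes the mean body mass per species without building a DataFrame. The name mean_mass is illustrative, and penguins without a recorded body mass are skipped.

```julia
using PalmerPenguins
using Statistics
using Tables

cols = Tables.columns(PalmerPenguins.load())
species = Tables.getcolumn(cols, :species)
mass = Tables.getcolumn(cols, :body_mass_g)

# Mean body mass (in g) per species, ignoring missing measurements.
mean_mass = Dict(
    String(s) => mean(skipmissing(mass[species .== s])) for s in unique(species)
)
```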

References

Fisher, R. A. 1936. “The Use of Multiple Measurements in Taxonomic Problems.” Annals of Eugenics 7 (2): 179–88. https://doi.org/10.1111/j.1469-1809.1936.tb02137.x.
Gorman, K. B., T. D. Williams, and W. R. Fraser. 2014. “Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis).” PLoS ONE 9 (3): e90081. https://doi.org/10.1371/journal.pone.0090081.
Horst, A. M., A. P. Hill, and K. B. Gorman. 2020. “Palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data.” https://doi.org/10.5281/zenodo.3960218.