Tidy Data

Column headers are values, not variable names

In the pew dataset, the column headers are values (income brackets), not variable names.
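
A minimal sketch of the usual fix, assuming pandas is available: melt the value-bearing headers into a single column. The frame below is a toy stand-in with made-up counts, and the names income and freq are illustrative choices, not taken from the original data.

```python
import pandas as pd

# Toy stand-in for the pew table: counts are made up for illustration.
messy = pd.DataFrame({
    "religion": ["Agnostic", "Atheist"],
    "<$10k": [1, 2],
    "$10-20k": [3, 4],
})

# Move the income-bracket headers into an "income" column,
# and the cell contents into a "freq" column.
tidy = messy.melt(id_vars="religion", var_name="income", value_name="freq")
print(tidy)
```

Each row of the result describes one (religion, income) combination, so the income bracket can be filtered or grouped like any other variable.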

Columns containing multiple variables

In the tuberculosis (TB) dataset, columns contain multiple variables: sex and age.
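
One way to separate the two variables, sketched with pandas on an already melted toy frame. The codes such as m014 (a sex letter followed by an age band) and the column names column and cases are assumptions made for illustration, as are the counts.

```python
import pandas as pd

# Toy melted TB table: "column" holds combined codes such as "m014".
# Codes, column names, and counts are illustrative assumptions.
molten = pd.DataFrame({
    "country": ["AD", "AD", "AE"],
    "year": [2000, 2000, 2000],
    "column": ["m014", "f1524", "m65"],
    "cases": [0, 1, 2],
})

# The first character is the sex; the remaining digits encode the age band.
parts = molten["column"].str.extract(r"^(?P<sex>[mf])(?P<age>\d+)$")

# Replace the combined column with the two separate variables.
tidy = pd.concat([molten.drop(columns="column"), parts], axis=1)
print(tidy)
```

The age codes are kept as strings here; mapping them to readable ranges would be a separate step.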

Variables in both rows and columns

In the weather dataset, variables are stored in individual columns (id, year, month), spread across columns (day, d1-d31), and spread across rows (tmin, tmax).
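
A hedged sketch of a two-step fix with pandas: melt the day columns into rows, then pivot the tmin/tmax rows into columns. The column name element (holding tmin or tmax), the station id, and the temperatures are assumptions made for this toy example, which uses only two day columns.

```python
import pandas as pd

# Toy weather table: only d1 and d2 are shown, and all values are made up.
# "element" is an assumed name for the column holding tmin/tmax.
messy = pd.DataFrame({
    "id": ["S1", "S1"],
    "year": [2010, 2010],
    "month": [1, 1],
    "element": ["tmax", "tmin"],
    "d1": [27.8, 14.5],
    "d2": [27.3, 14.4],
})

# Step 1: gather the day columns into rows and turn "d1" into the integer 1.
molten = messy.melt(id_vars=["id", "year", "month", "element"],
                    var_name="day", value_name="value")
molten["day"] = molten["day"].str[1:].astype(int)

# Step 2: spread the tmin/tmax rows into one column each.
tidy = (molten
        .pivot_table(index=["id", "year", "month", "day"],
                     columns="element", values="value")
        .reset_index())
tidy.columns.name = None
print(tidy)
```

The result has one row per station and day, with tmax and tmin as ordinary columns.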

Multiple observational units in a table (normalization)

Each type of observational unit should be stored in its own table. The billboard dataset needs to be broken down into two datasets: a song dataset which stores artist, song name and time, and a ranking dataset which gives the rank of the song in each week.
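
The split can be sketched with pandas as follows. The toy rows, the column names track, week, and rank, and the generated song_id key are illustrative assumptions, not the original dataset's schema.

```python
import pandas as pd

# Toy billboard-style table in long form; all values are made up.
molten = pd.DataFrame({
    "artist": ["A", "A", "B"],
    "track": ["X", "X", "Y"],
    "time": ["3:30", "3:30", "4:05"],
    "week": [1, 2, 1],
    "rank": [87, 82, 91],
})

# Song table: one row per song, with a generated key.
songs = (molten[["artist", "track", "time"]]
         .drop_duplicates()
         .reset_index(drop=True))
songs["song_id"] = songs.index

# Ranking table: one row per song and week, referring to the song by key.
ranks = molten.merge(songs, on=["artist", "track", "time"])[
    ["song_id", "week", "rank"]
]
print(songs)
print(ranks)
```

Song attributes are now stored once per song rather than repeated for every week, and the two tables can be joined back on song_id when needed.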

A single observational unit is stored in multiple tables.
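
A common remedy, sketched here with pandas under assumed data: add a column recording which table each row came from, then stack the tables. The per-year frames and their column names are hypothetical.

```python
import pandas as pd

# Hypothetical per-year tables describing the same observational unit.
tables = {
    2014: pd.DataFrame({"name": ["Ann", "Bob"], "count": [10, 7]}),
    2015: pd.DataFrame({"name": ["Ann", "Cy"], "count": [12, 5]}),
}

# Record the source (here, the year) as a column, then stack the tables.
combined = pd.concat(
    [df.assign(year=year) for year, df in tables.items()],
    ignore_index=True,
)
print(combined)
```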
