week | topic |
---|---|
BSMM8740-2-R-2024F [WEEK - 1]
week | topic |
---|---|
1 | The Tidyverse, EDA & Git |
2 | The Recipes Package |
3 | Regression Methods |
4 | The Tidymodels Packages |
5 | Classification & Clustering Methods |
6 | Time Series Methods |
7 | Causality: DAGs |
8 | Causality: Methods |
9 | Monte-Carlo Methods |
10 | Bayesian Methods |
A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative).
Values are organised in two ways. Every value belongs to a variable and an observation.
A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.
Tidy data in practice:
# A tibble: 6 × 3
country year rate
<chr> <dbl> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Some packages in the tidyverse.
The dplyr
package gives a grammar for data wrangling, including these 5 verbs for working with data frames.
select()
: take a subset of the columns (i.e., features, variables)filter()
: take a subset of the rows (i.e., observations)mutate()
: add or modify existing columnsarrange()
: sort the rowssummarize()
: aggregate the data across rows (e.g., group it according to some criteria)Each of these functions takes a data frame
as its first argument, and returns a data frame
.
Being able to combine these verbs with nouns (i.e., data frames) and adverbs (i.e., arguments) creates a flexible and powerful way to wrangle data.
filter()
The two simplest of the five verbs are filter()
and select()
, which return a subset of the rows or columns of a data frame, respectively.
The filter() function. At left, a data frame that contains matching entries in a certain column for only a subset of the rows. At right, the resulting data frame after filtering.
select()
The select() function. At left, a data frame, from which we retrieve only a few of the columns. At right, the resulting data frame after selecting those columns.
# A tibble: 12 × 4
name start end party
<chr> <date> <date> <chr>
1 Eisenhower 1953-01-20 1961-01-20 Republican
2 Kennedy 1961-01-20 1963-11-22 Democratic
3 Johnson 1963-11-22 1969-01-20 Democratic
4 Nixon 1969-01-20 1974-08-09 Republican
5 Ford 1974-08-09 1977-01-20 Republican
6 Carter 1977-01-20 1981-01-20 Democratic
7 Reagan 1981-01-20 1989-01-20 Republican
8 Bush 1989-01-20 1993-01-20 Republican
9 Clinton 1993-01-20 2001-01-20 Democratic
10 Bush 2001-01-20 2009-01-20 Republican
11 Obama 2009-01-20 2017-01-20 Democratic
12 Trump 2017-01-20 2021-01-20 Republican
To get just the names and parties of these presidents, use select()
. The first argument is the data frame, followed by the column names.
# A tibble: 12 × 2
name party
<chr> <chr>
1 Eisenhower Republican
2 Kennedy Democratic
3 Johnson Democratic
4 Nixon Republican
5 Ford Republican
6 Carter Democratic
7 Reagan Republican
8 Bush Republican
9 Clinton Democratic
10 Bush Republican
11 Obama Democratic
12 Trump Republican
Similarly, the first argument to filter()
is a data frame, and subsequent arguments are logical conditions that are evaluated on any involved columns.
# A tibble: 7 × 4
name start end party
<chr> <date> <date> <chr>
1 Eisenhower 1953-01-20 1961-01-20 Republican
2 Nixon 1969-01-20 1974-08-09 Republican
3 Ford 1974-08-09 1977-01-20 Republican
4 Reagan 1981-01-20 1989-01-20 Republican
5 Bush 1989-01-20 1993-01-20 Republican
6 Bush 2001-01-20 2009-01-20 Republican
7 Trump 2017-01-20 2021-01-20 Republican
Combining the filter()
and select()
commands enables one to drill down to very specific pieces of information.
As written the filter()
operation is nested inside the select()
operation.
With the pipe (%>%
), we can write the same expression as above in a more readable syntax.
mutate()
We might want to create, re-define, or rename some of our variables. A graphical illustration of the mutate()
operation is shown below
The mutate() function creating a column. At right, the resulting data frame after adding a new column.
my_presidents <- presidential %>%
dplyr::mutate(
term.length = lubridate::interval(start, end) / lubridate::dyears(1)
)
my_presidents
# A tibble: 12 × 5
name start end party term.length
<chr> <date> <date> <chr> <dbl>
1 Eisenhower 1953-01-20 1961-01-20 Republican 8
2 Kennedy 1961-01-20 1963-11-22 Democratic 2.84
3 Johnson 1963-11-22 1969-01-20 Democratic 5.16
4 Nixon 1969-01-20 1974-08-09 Republican 5.55
5 Ford 1974-08-09 1977-01-20 Republican 2.45
6 Carter 1977-01-20 1981-01-20 Democratic 4
7 Reagan 1981-01-20 1989-01-20 Republican 8
8 Bush 1989-01-20 1993-01-20 Republican 4
9 Clinton 1993-01-20 2001-01-20 Democratic 8
10 Bush 2001-01-20 2009-01-20 Republican 8
11 Obama 2009-01-20 2017-01-20 Democratic 8
12 Trump 2017-01-20 2021-01-20 Republican 4
The mutate()
function can be used to modify existing columns. Below we add a variable containing the year in which each president was elected assuming that every president was elected in the year before he took office.
# A tibble: 4 × 6
name start end party term.length elected
<chr> <date> <date> <chr> <dbl> <dbl>
1 Eisenhower 1953-01-20 1961-01-20 Republican 8 1952
2 Kennedy 1961-01-20 1963-11-22 Democratic 2.84 1960
3 Johnson 1963-11-22 1969-01-20 Democratic 5.16 1962
4 Nixon 1969-01-20 1974-08-09 Republican 5.55 1968
Some entries in this data set are wrong, because presidential elections are only held every four years, and some presidents are not elected (e.g. Johnson and Ford).
# A tibble: 6 × 6
name start end party term.length elected
<chr> <date> <date> <chr> <dbl> <dbl>
1 Eisenhower 1953-01-20 1961-01-20 Republican 8 1952
2 Kennedy 1961-01-20 1963-11-22 Democratic 2.84 1960
3 Johnson 1963-11-22 1969-01-20 Democratic 5.16 NA
4 Nixon 1969-01-20 1974-08-09 Republican 5.55 1968
5 Ford 1974-08-09 1977-01-20 Republican 2.45 NA
6 Carter 1977-01-20 1981-01-20 Democratic 4 1976
rename()
It is considered bad practice to use a period in names (functions, variables, columns) - we should change the name of the term.length
column that we created earlier.
# A tibble: 9 × 6
name start end party term_length elected
<chr> <date> <date> <chr> <dbl> <dbl>
1 Eisenhower 1953-01-20 1961-01-20 Republican 8 1952
2 Kennedy 1961-01-20 1963-11-22 Democratic 2.84 1960
3 Johnson 1963-11-22 1969-01-20 Democratic 5.16 NA
4 Nixon 1969-01-20 1974-08-09 Republican 5.55 1968
5 Ford 1974-08-09 1977-01-20 Republican 2.45 NA
6 Carter 1977-01-20 1981-01-20 Democratic 4 1976
7 Reagan 1981-01-20 1989-01-20 Republican 8 1980
8 Bush 1989-01-20 1993-01-20 Republican 4 1988
9 Clinton 1993-01-20 2001-01-20 Democratic 8 1992
arrange()
The function sort()
will sort a vector but not a data frame. The arrange()
function sorts a data frame:
The arrange() function. At left, a data frame with an ordinal variable. At right, the resulting data frame after sorting the rows in descending order of that variable.
To use arrange
you have to specify the data frame, and the column by which you want it to be sorted. You also have to specify the direction in which you want it to be sorted.
# A tibble: 9 × 6
name start end party term_length elected
<chr> <date> <date> <chr> <dbl> <dbl>
1 Eisenhower 1953-01-20 1961-01-20 Republican 8 1952
2 Reagan 1981-01-20 1989-01-20 Republican 8 1980
3 Clinton 1993-01-20 2001-01-20 Democratic 8 1992
4 Bush 2001-01-20 2009-01-20 Republican 8 2000
5 Obama 2009-01-20 2017-01-20 Democratic 8 2008
6 Nixon 1969-01-20 1974-08-09 Republican 5.55 1968
7 Johnson 1963-11-22 1969-01-20 Democratic 5.16 NA
8 Carter 1977-01-20 1981-01-20 Democratic 4 1976
9 Bush 1989-01-20 1993-01-20 Republican 4 1988
To break ties, we can further sort by other variables
# A tibble: 12 × 6
name start end party term_length elected
<chr> <date> <date> <chr> <dbl> <dbl>
1 Clinton 1993-01-20 2001-01-20 Democratic 8 1992
2 Obama 2009-01-20 2017-01-20 Democratic 8 2008
3 Eisenhower 1953-01-20 1961-01-20 Republican 8 1952
4 Reagan 1981-01-20 1989-01-20 Republican 8 1980
5 Bush 2001-01-20 2009-01-20 Republican 8 2000
6 Nixon 1969-01-20 1974-08-09 Republican 5.55 1968
7 Johnson 1963-11-22 1969-01-20 Democratic 5.16 NA
8 Carter 1977-01-20 1981-01-20 Democratic 4 1976
9 Bush 1989-01-20 1993-01-20 Republican 4 1988
10 Trump 2017-01-20 2021-01-20 Republican 4 2016
11 Kennedy 1961-01-20 1963-11-22 Democratic 2.84 1960
12 Ford 1974-08-09 1977-01-20 Republican 2.45 NA
summarize()
with group_by()
The summarize
verb is often used with group_by
The summarize() function. At left, a data frame. At right, the resulting data frame after aggregating four of the columns.
When used without grouping, summarize()
collapses a data frame into a single row.
my_presidents %>%
dplyr::summarize(
N = n(),
first_year = min(year(start)),
last_year = max(year(end)),
num_dems = sum(party == "Democratic"),
years = sum(term_length),
avg_term_length = mean(term_length)
)
# A tibble: 1 × 6
N first_year last_year num_dems years avg_term_length
<int> <dbl> <dbl> <int> <dbl> <dbl>
1 12 1953 2021 5 68 5.67
To make comparisons, we can first group then summarize, giving us one summary row for each group.
my_presidents %>%
dplyr::group_by(party) %>%
dplyr::summarize(
N = n(),
first_year = min(year(start)),
last_year = max(year(end)),
num_dems = sum(party == "Democratic"),
years = sum(term_length),
avg_term_length = mean(term_length)
)
# A tibble: 2 × 7
party N first_year last_year num_dems years avg_term_length
<chr> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 Democratic 5 1961 2017 5 28 5.6
2 Republican 7 1953 2021 0 40 5.71
# attach package magrittr
require(magrittr)
url <-
"https://data.cityofchicago.org/api/views/5neh-572f/rows.csv?accessType=DOWNLOAD&bom=true&format=true"
all_stations <-
# Step 1: Read in the data.
readr::read_csv(url) %>%
# Step 2: select columns and rename stationname
dplyr::select(station = stationname, date, rides) %>%
# Step 3: Convert the character date field to a date encoding.
# Also, put the data in units of 1K rides
dplyr::mutate(date = lubridate::mdy(date), rides = rides / 1000) %>%
# Step 4: Summarize the multiple records using the maximum.
dplyr::group_by(date, station) %>%
dplyr::summarize(rides = max(rides), .groups = "drop")
Topic | Magrittr 2.0.3 | Base 4.3.0 |
---|---|---|
Operator | %>% %<>% %T>% |
|> (since 4.1.0) |
Function call | 1:3 %>% sum() |
1:3 |> sum() |
1:3 %>% sum |
Needs brackets / parentheses | |
1:3 %>% `+`(4) |
Some functions are not supported | |
Placeholder | . |
_ (since 4.2.0) |
based on a stackoverflow post comparing magrittr pipe to base R pipe.
# functional programming
airlines <- fivethirtyeight::airline_safety %>%
# filter rows
dplyr::filter( stringr::str_detect(airline, 'Air') )
# assignment
airlines %<>%
# filter columns and assign result to airlines
dplyr::select(avail_seat_km_per_week, incidents_85_99, fatalities_85_99)
# side effects
airlines %T>%
# report the dimensions
( \(x) print(dim(x)) ) %>%
# summarize
dplyr::summarize(avail_seat_km_per_week = sum(avail_seat_km_per_week))
# named function
is_awesome <- function(x = 'Bob') {
paste(x, 'is awesome!')
}
is_awesome('Keith')
# anonymous function
(function (x) {paste(x, 'is awesome!')})('Keith')
# also anonymous function
(\(x) paste(x, 'is awesome!'))('Keith')
# a function from a formula in the tidyverse
c('Bob','Ted') %>% purrr::map_chr(~paste(.x, 'is awesome!'))
The Tidyverse offers a consistent and efficient framework for manipulating, transforming, and cleaning datasets.
Functions like filter()
, select()
, mutate()
, and group_by()
allow users to easily subset, reorganize, add, and aggregate data, and the pipe (%>%
or |>
) enables a sequential and readable flow of operations.
The following examples show a few more of the many useful data wrangling functions in the tidyverse.
openintro::email %>%
dplyr::select(-from, -sent_email) %>%
dplyr::mutate(
day_of_week = lubridate::wday(time) # new variable: day of week
, month = lubridate::month(time) # new variable: month
) %>%
dplyr::select(-time) %>%
dplyr::mutate(
cc = cut(cc, breaks = c(0, 1)) # discretize cc
, attach = cut(attach, breaks = c(0, 1)) # discretize attach
, dollar = cut(dollar, breaks = c(0, 1)) # discretize dollar
) %>%
dplyr::mutate(
inherit =
cut(inherit, breaks = c(0, 1, 5, 10, 20)) # discretize inherit, by intervals
, password = dplyr::ntile(password, 5) # discretize password, by quintile
)
iris %>%
dplyr::mutate(across(c(Sepal.Length, Sepal.Width), round))
iris %>%
dplyr::mutate(across(c(1, 2), round))
iris %>%
dplyr::group_by(Species) %>%
dplyr::summarise(
across( starts_with("Sepal"), list(mean = mean, sd = sd) )
)
iris %>%
dplyr::group_by(Species) %>%
dplyr::summarise(
across( starts_with("Sepal"), ~ mean(.x, na.rm = TRUE) )
)
The verb rowwise creates a special type of grouping where each group consists of a single row.
A nested data frame is a data frame where one (or more) columns is a list of data frames.
# A tibble: 142 × 3
continent country data
<fct> <fct> <list<tibble[,5]>>
1 Africa Algeria [12 × 5]
2 Africa Angola [12 × 5]
3 Africa Benin [12 × 5]
4 Africa Botswana [12 × 5]
5 Africa Burkina Faso [12 × 5]
6 Africa Burundi [12 × 5]
7 Africa Cameroon [12 × 5]
8 Africa Central African Republic [12 × 5]
9 Africa Chad [12 × 5]
10 Africa Comoros [12 × 5]
# ℹ 132 more rows
# A tibble: 142 × 4
continent country data model
<fct> <fct> <list<tibble[,5]>> <list>
1 Africa Algeria [12 × 5] <lm>
2 Africa Angola [12 × 5] <lm>
3 Africa Benin [12 × 5] <lm>
4 Africa Botswana [12 × 5] <lm>
5 Africa Burkina Faso [12 × 5] <lm>
6 Africa Burundi [12 × 5] <lm>
7 Africa Cameroon [12 × 5] <lm>
8 Africa Central African Republic [12 × 5] <lm>
9 Africa Chad [12 × 5] <lm>
10 Africa Comoros [12 × 5] <lm>
# ℹ 132 more rows
Main verbs, each taking a pattern as input
x <- c("why", "video", "cross", "extra", "deal", "authority")
stringr::str_detect(x, "[aeiou]") # identifies any matches
stringr::str_count(x, "[aeiou]") # counts number of patterns
stringr::str_subset(x, "[aeiou]") # extracts matching components
stringr::str_extract(x, "[aeiou]") # extracts text of the match
stringr::str_replace(x, "[aeiou]", "?") # replaces matches with new text:
stringr::str_split(x, ",") # splits up a string
<SQL>
SELECT `x_bin`, COUNT(*) AS `n`
FROM (
SELECT
`temp_table`.*,
CASE
WHEN (`x` <= 0.0) THEN NULL
WHEN (`x` <= 33.0) THEN 'low'
WHEN (`x` <= 66.0) THEN 'mid'
WHEN (`x` <= 100.0) THEN 'high'
WHEN (`x` > 100.0) THEN NULL
END AS `x_bin`
FROM `temp_table`
) AS `q01`
GROUP BY `x_bin`
When some of the column names are not names of variables, but values of a variable.
# A tibble: 3 × 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
When an observation is scattered across multiple rows.
# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
We can join related tables in a variety of ways:
semi_join(x, y)
keeps all observations in x
that have a match in y
( i.e. no NAs).anti_join(x, y)
drops all observations in x
that have a match in y
.When tidy dataset x and y have the same variables, set operations work as expected:
intersect(x, y)
: return only observations in both x
and y
.union(x, y)
: return unique observations in x
and y
.setdiff(x, y)
: return observations in x
, but not in y
.tidyr::table3 %>%
tidyr::separate_wider_delim(
cols = rate
, delim = "/"
, names = c("cases", "population")
)
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <chr> <chr>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Exploratory data analysis is the process of understanding a new dataset by looking at the data, constructing graphs, tables, and models. We want to understand three aspects:
each individual variable by itself;
each individual variable in the context of other, relevant, variables; and
the data that are not there.
We will perform two broad categories of EDA:
Our first dataset is a sample of categorical variables from the General Social Survey, a long-running US survey conducted by the independent research organization NORC at the University of Chicago.
# A tibble: 6 × 9
year marital age race rincome partyid relig denom tvhours
<int> <fct> <int> <fct> <fct> <fct> <fct> <fct> <int>
1 2000 Never married 26 White $8000 to 9999 Ind,near r… Prot… Sout… 12
2 2000 Divorced 48 White $8000 to 9999 Not str re… Prot… Bapt… NA
3 2000 Widowed 67 White Not applicable Independent Prot… No d… 2
4 2000 Never married 39 White Not applicable Ind,near r… Orth… Not … 4
5 2000 Divorced 25 White Not applicable Not str de… None Not … 1
6 2000 Married 25 White $20000 - 24999 Strong dem… Prot… Sout… NA
Use dplyr::glimpse()
to see every column in a data.frame
Rows: 10
Columns: 9
$ year <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000
$ marital <fct> Never married, Divorced, Widowed, Never married, Divorced, Mar…
$ age <int> 26, 48, 67, 39, 25, 25, 36, 44, 44, 47
$ race <fct> White, White, White, White, White, White, White, White, White,…
$ rincome <fct> $8000 to 9999, $8000 to 9999, Not applicable, Not applicable, …
$ partyid <fct> "Ind,near rep", "Not str republican", "Independent", "Ind,near…
$ relig <fct> Protestant, Protestant, Protestant, Orthodox-christian, None, …
$ denom <fct> "Southern baptist", "Baptist-dk which", "No denomination", "No…
$ tvhours <int> 12, NA, 2, 4, 1, NA, 3, NA, 0, 3
Use dplyr::slice_sample()
to see a random selection of rows in a data.frame
# A tibble: 10 × 9
year marital age race rincome partyid relig denom tvhours
<int> <fct> <int> <fct> <fct> <fct> <fct> <fct> <int>
1 2002 Divorced 33 Black $8000 to 9999 Ind,near … Prot… Other 4
2 2014 Married 52 White $25000 or more Strong re… Orth… Not … 0
3 2000 Divorced 63 White $25000 or more Strong re… Prot… No d… NA
4 2010 Divorced 46 White Not applicable Not str d… Prot… Sout… NA
5 2014 Married 72 White Not applicable Not str d… Prot… Unit… NA
6 2002 Never married 22 White Not applicable Independe… None Not … NA
7 2014 Married 47 White Not applicable Strong re… Prot… Wi e… NA
8 2000 Married 75 White $20000 - 24999 Independe… Prot… Othe… NA
9 2006 Married 50 White Not applicable Other par… None Not … 5
10 2006 Married 69 White Not applicable Independe… Cath… Not … 5
There are many dplyr::slice_{X} variants, along with dplyr::filter
The base R function summary()
can be used for key summary statistics of the data.
year marital age race
Min. :2000 No answer : 17 Min. :18.00 Other : 1959
1st Qu.:2002 Never married: 5416 1st Qu.:33.00 Black : 3129
Median :2006 Separated : 743 Median :46.00 White :16395
Mean :2007 Divorced : 3383 Mean :47.18 Not applicable: 0
3rd Qu.:2010 Widowed : 1807 3rd Qu.:59.00
Max. :2014 Married :10117 Max. :89.00
NA's :76
rincome partyid relig
$25000 or more:7363 Independent :4119 Protestant:10846
Not applicable:7043 Not str democrat :3690 Catholic : 5124
$20000 - 24999:1283 Strong democrat :3490 None : 3523
$10000 - 14999:1168 Not str republican:3032 Christian : 689
$15000 - 19999:1048 Ind,near dem :2499 Jewish : 388
Refused : 975 Strong republican :2314 Other : 224
(Other) :2603 (Other) :2339 (Other) : 689
denom tvhours
Not applicable :10072 Min. : 0.000
Other : 2534 1st Qu.: 1.000
No denomination : 1683 Median : 2.000
Southern baptist: 1536 Mean : 2.981
Baptist-dk which: 1457 3rd Qu.: 4.000
United methodist: 1067 Max. :24.000
(Other) : 3134 NA's :10146
The function skimr::skim()
gives an enhanced version of base R’s summary()
.
Other packages, such as DataExplorer::
rely more on graphing.
skimr::skim()
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
year | 0 | 1.00 | 2006.50 | 4.45 | 2000 | 2002 | 2006 | 2010 | 2014 | ▇▃▇▂▆ |
age | 76 | 1.00 | 47.18 | 17.29 | 18 | 33 | 46 | 59 | 89 | ▇▇▇▅▂ |
tvhours | 10146 | 0.53 | 2.98 | 2.59 | 0 | 1 | 2 | 4 | 24 | ▇▂▁▁▁ |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
marital | 0 | 1 | FALSE | 6 | Mar: 10117, Nev: 5416, Div: 3383, Wid: 1807 |
race | 0 | 1 | FALSE | 3 | Whi: 16395, Bla: 3129, Oth: 1959, Not: 0 |
rincome | 0 | 1 | FALSE | 16 | $25: 7363, Not: 7043, $20: 1283, $10: 1168 |
partyid | 0 | 1 | FALSE | 10 | Ind: 4119, Not: 3690, Str: 3490, Not: 3032 |
relig | 0 | 1 | FALSE | 15 | Pro: 10846, Cat: 5124, Non: 3523, Chr: 689 |
denom | 0 | 1 | FALSE | 30 | Not: 10072, Oth: 2534, No : 1683, Sou: 1536 |
Most of the columns here are factors
(categories). Use these to count the number of observations per category.
# A tibble: 16 × 2
f n
<fct> <int>
1 Protestant 10846
2 Catholic 5124
3 None 3523
4 Christian 689
5 Jewish 388
6 Other 224
7 Buddhism 147
8 Inter-nondenominational 109
9 Moslem/islam 104
10 Orthodox-christian 95
11 No answer 93
12 Hinduism 71
13 Other eastern 32
14 Native american 23
15 Don't know 15
16 Not applicable 0
# A tibble: 15 × 2
relig n
<fct> <int>
1 Protestant 10846
2 Catholic 5124
3 None 3523
4 Christian 689
5 Jewish 388
6 Other 224
7 Buddhism 147
8 Inter-nondenominational 109
9 Moslem/islam 104
10 Orthodox-christian 95
11 No answer 93
12 Hinduism 71
13 Other eastern 32
14 Native american 23
15 Don't know 15
Var1 Freq
1 Protestant 10846
2 Catholic 5124
3 None 3523
4 Christian 689
5 Jewish 388
6 Other 224
7 Buddhism 147
8 Inter-nondenominational 109
9 Moslem/islam 104
10 Orthodox-christian 95
11 No answer 93
12 Hinduism 71
13 Other eastern 32
14 Native american 23
15 Don't know 15
16 Not applicable 0
is_protestant
partyid 0 1
No answer 102 52
Don't know 1 0
Other party 233 160
Strong republican 720 1594
Not str republican 1198 1834
Ind,near rep 878 913
Independent 2436 1683
Ind,near dem 1473 1026
Not str democrat 1972 1718
Strong democrat 1624 1866
There are three main categories of missing data
We can think of a few options for dealing with missing data
Drop observations with missing data.
Impute the mean of observations without missing data.
Use multiple imputation.
Note
Multiple imputation involves generating several estimates for the missing values and then averaging the outcomes.
dat %>% dplyr::select(partyid) %>% table() %>% tibble::as_tibble() %>%
dplyr::left_join(
dat %>% dplyr::filter(is.na(age)) %>%
dplyr::select(na_partyid = partyid) %>% table() %>% tibble::as_tibble()
, by = c("partyid" = "na_partyid")
, suffix = c("_partyid", "_na_partyid")
) %>%
dplyr::mutate(pct_na = n_na_partyid / n_partyid)
# A tibble: 10 × 4
partyid n_partyid n_na_partyid pct_na
<chr> <int> <int> <dbl>
1 No answer 154 9 0.0584
2 Don't know 1 0 0
3 Other party 393 3 0.00763
4 Strong republican 2314 8 0.00346
5 Not str republican 3032 8 0.00264
6 Ind,near rep 1791 2 0.00112
7 Independent 4119 18 0.00437
8 Ind,near dem 2499 2 0.000800
9 Not str democrat 3690 11 0.00298
10 Strong democrat 3490 15 0.00430
Customer ID | Age | Income | Purchase Frequency | Satisfaction Rating |
---|---|---|---|---|
1 | 35 | $60,000 | High | 8 |
2 | 28 | $45,000 | Low | - |
3 | 42 | $70,000 | Medium | 7 |
4 | 30 | $50,000 | Low | - |
5 | 55 | $80,000 | High | 9 |
6 | 26 | $40,000 | Low | - |
7 | 50 | $75,000 | Medium | 8 |
8 | 29 | $48,000 | Low | - |
Employee ID | Age | Job Tenure | Performance Score | Promotion Status |
---|---|---|---|---|
1 | 30 | 5 years | 85 | Yes |
2 | 45 | 10 years | - | No |
3 | 28 | 3 years | 90 | Yes |
4 | 50 | 12 years | 70 | No |
5 | 35 | 6 years | - | No |
6 | 32 | 4 years | 95 | Yes |
7 | 40 | 8 years | - | No |
8 | 29 | 2 years | 88 | Yes |
Finally, be aware that how missing data is encoded depends on the dataset
Feature engineering is the act of converting raw observations into desired features using statistical, mathematical, or machine learning approaches.
for continuous variables (usually the independent variables or covariates):
Other common transformations:
\[ x_i(\lambda) = \left\{ \begin{array}{cc} \frac{x_i^{\lambda}-1}{\lambda\tilde{x}^{\lambda-1}} & \lambda\ne 0\\ \tilde{x}\log x_i & \lambda=0 \end{array} \right . \]
Note
Box-cox is an example of a power transform; it is a technique used to stabilize variance, make the data more normal distribution-like.
One last common transformation:
\[ \text{logit}\left(p\right)=\log\frac{p}{1-p},\;p\in [0,1] \]
Note
The Logit transform is primarily used to transform binary response data, such as survival/non-survival or present/absent, to provide a continuous value in the range \(\left(-\infty,\infty\right)\), where p is the proportion of each sample that is 1 (or 0)
Why normalize or standardize?
for continuous variables (usually the target variables):
for categorical variables (either target or explanatory variables):
for date variables:
Why transform?
it can make your model perform better
requires an advanced technical skill set
requires domain expertise
is time-consuming and resource intensive
different analytics algorithms require different feature engineering
Today we have introduced tidy data, tidyverse verbs and the pipe operator.
We briefly discussed EDA
In the lab we will introduce Git and the data backup workflows.