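How to use tidyr

1. Introduction

Overview

A tool for reshaping and tidying data frames into long, wide, and nested forms
A redesigned, improved successor to reshape2 (with parts implemented in C++)
Shows its real value when used together with dplyr and purrr
Included in {tidyverse}, provided by RStudio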
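
The examples in this document rely on a few packages being attached. The following is a minimal setup sketch; the original does not show its setup, so attaching magrittr explicitly (for the %T>% tee pipe used in later examples) is an assumption.

```r
# Packages assumed by the examples in this document
library(tidyr)     # gather(), spread(), fill(), nest(), ...
library(dplyr)     # mutate(), group_by(), summarise()
library(magrittr)  # %T>% (tee pipe); tidyr and dplyr only re-export %>%
```

The tee pipe %T>% passes its left-hand side through unchanged, which is why the examples can print an intermediate result and still assign it.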

Exported functions

# Number of exported objects
funcCountFuncs(tidyr)

## [1] 41

# List of exported objects
funcCountFuncs(tidyr, FALSE, TRUE)

##  [1] "%>%"             "complete"        "complete_"      
##  [4] "crossing"        "crossing_"       "drop_na"        
##  [7] "drop_na_"        "expand"          "expand_"        
## [10] "extract"         "extract_"        "extract_numeric"
## [13] "fill"            "fill_"           "full_seq"       
## [16] "gather"          "gather_"         "nest"           
## [19] "nest_"           "nesting"         "nesting_"       
## [22] "population"      "replace_na"      "separate"       
## [25] "separate_"       "separate_rows"   "separate_rows_" 
## [28] "smiths"          "spread"          "spread_"        
## [31] "table1"          "table2"          "table3"         
## [34] "table4a"         "table4b"         "table5"         
## [37] "unite"           "unite_"          "unnest"         
## [40] "unnest_"         "who"

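References

2. Wide and long data

gather()

Converting to long format means stacking the non-key columns into name/value pairs: one column holds the original column names and another holds their values, alongside the key (identifier) columns
- Equivalent to melt() in reshape2
For the identifier column, choose a categorical field such as a character or factor column
The key argument names the column that will hold the original column names
The value argument names the column that will hold the values
Giving custom names to key and value is optional, but the two arguments themselves cannot be left out
... specifies the columns to gather; a minus sign excludes a column, so -Species gathers every column except Species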

# First rows and dimensions of the original data
iris %>% {
  head(.) %>% print()
  dim(.) %>% print()
}

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
## [1] 150   5

# Convert to long format (default column names)
iris %>% 
  gather(key, value, -Species) %>% {
    head(.) %>% print()
    dim(.) %>% print()
  }

##   Species          key value
## 1  setosa Sepal.Length   5.1
## 2  setosa Sepal.Length   4.9
## 3  setosa Sepal.Length   4.7
## 4  setosa Sepal.Length   4.6
## 5  setosa Sepal.Length   5.0
## 6  setosa Sepal.Length   5.4
## [1] 600   3

# Convert to long format (custom column names)
iris %>% 
  gather(key = KEY, value = VALUE, -Species) %>% {
    head(.) %>% print()
    dim(.) %>% print()
  }

##   Species          KEY VALUE
## 1  setosa Sepal.Length   5.1
## 2  setosa Sepal.Length   4.9
## 3  setosa Sepal.Length   4.7
## 4  setosa Sepal.Length   4.6
## 5  setosa Sepal.Length   5.0
## 6  setosa Sepal.Length   5.4
## [1] 600   3
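
The columns to gather can also be named directly instead of excluding the identifier. The sketch below compares the two selection styles on iris; the object names by_exclusion and by_range are illustrative only.

```r
library(tidyr)

# Select by exclusion: gather everything except Species
by_exclusion <- iris %>% gather(key, value, -Species)

# Select by range: gather the four measurement columns explicitly
by_range <- iris %>% gather(key, value, Sepal.Length:Petal.Width)

# Both selections cover the same four columns
dim(by_exclusion)
dim(by_range)
```

Exclusion is shorter when only a few identifier columns exist; an explicit range is safer when further columns might be added to the data later.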

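spread()

Converts long-format data to wide format
Equivalent to dcast() in reshape2
spread() needs a variable that identifies each row of the wide result, so an id column is created here
The only arguments you specify are key and value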

# Preparation: convert iris to long format
iris.l <- iris %>% 
            mutate(id = rownames(.)) %>% 
            gather(key, value , contains("l.")) %T>% {
              head(.) %>% print()  
            }

##   Species id          key value
## 1  setosa  1 Sepal.Length   5.1
## 2  setosa  2 Sepal.Length   4.9
## 3  setosa  3 Sepal.Length   4.7
## 4  setosa  4 Sepal.Length   4.6
## 5  setosa  5 Sepal.Length   5.0
## 6  setosa  6 Sepal.Length   5.4

# Convert to wide format
iris.l %>% spread(key, value) %>% head()

##   Species id Petal.Length Petal.Width Sepal.Length Sepal.Width
## 1  setosa  1          1.4         0.2          5.1         3.5
## 2  setosa 10          1.5         0.1          4.9         3.1
## 3  setosa 11          1.5         0.2          5.4         3.7
## 4  setosa 12          1.6         0.2          4.8         3.4
## 5  setosa 13          1.4         0.1          4.8         3.0
## 6  setosa 14          1.1         0.1          4.3         3.0

# Compute the mean of each measurement by group
iris.mean <- 
  iris.l %>% 
  group_by(Species, key) %>% 
  summarise(mean = mean(value)) %T>% {
    head(.) %>% print()
  }

## Source: local data frame [6 x 3]
## Groups: Species [2]
## 
##      Species          key  mean
##       <fctr>        <chr> <dbl>
## 1     setosa Petal.Length 1.462
## 2     setosa  Petal.Width 0.246
## 3     setosa Sepal.Length 5.006
## 4     setosa  Sepal.Width 3.428
## 5 versicolor Petal.Length 4.260
## 6 versicolor  Petal.Width 1.326
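
A summarised long table like the one above can be spread back to wide form, giving one row per species. This is a self-contained sketch that rebuilds the means rather than reusing iris.l; no id column is needed here because Species and key together already identify each value. The name iris.wide is illustrative.

```r
library(tidyr)
library(dplyr)

# Rebuild the per-group means in long format
iris.mean <- iris %>%
  gather(key, value, -Species) %>%
  group_by(Species, key) %>%
  summarise(mean = mean(value)) %>%
  ungroup()

# Spread the measurement names into columns: one row per species
iris.wide <- iris.mean %>% spread(key, mean)
print(iris.wide)
```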

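3. Handling NA

fill()

Fills NA with the previous or next non-missing value
Similar operations exist elsewhere, such as in {xts}; fill() performs this NA filling on data frames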

# Create the dataset
df <- data.frame(Month = 1:5, 
                 Year = c(2000, rep(NA, 4))) %T>% print()

##   Month Year
## 1     1 2000
## 2     2   NA
## 3     3   NA
## 4     4   NA
## 5     5   NA

# Fill downward (the default direction)
df %>% fill(Year)

##   Month Year
## 1     1 2000
## 2     2 2000
## 3     3 2000
## 4     4 2000
## 5     5 2000

# Fill upward (no change here: there is no value below the NAs)
df %>% fill(Year, .direction = "up")

##   Month Year
## 1     1 2000
## 2     2   NA
## 3     3   NA
## 4     4   NA
## 5     5   NA
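
Upward filling only has a visible effect when a non-missing value sits below the NAs. The sketch below uses a hypothetical data frame, df_up, that mirrors the one above with the known year in the last row.

```r
library(tidyr)

# Hypothetical variant of the dataset: the only known Year is in the last row
df_up <- data.frame(Month = 1:5,
                    Year = c(rep(NA, 4), 2000))

# Filling upward copies the last row's value into the NAs above it
filled <- df_up %>% fill(Year, .direction = "up")
print(filled)
```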

replace_na()

Specifies the replacement for NA separately for each field (column)
Note that the second argument is given as a list

# Preparation: create the dataset
df <- data_frame(x = c(1, 2, NA), 
                 y = c("a", NA, "b")) %T>% print()

## # A tibble: 3 × 2
##       x     y
##   <dbl> <chr>
## 1     1     a
## 2     2  <NA>
## 3    NA     b

# Specify replacements for both the numeric and the character column
df %>% replace_na(list(x = 0, y = "unknown"))

## # A tibble: 3 × 2
##       x       y
##   <dbl>   <chr>
## 1     1       a
## 2     2 unknown
## 3     0       b

# Specify a replacement for the numeric column only
df %>% replace_na(list(x = 0))

## # A tibble: 3 × 2
##       x     y
##   <dbl> <chr>
## 1     1     a
## 2     2  <NA>
## 3     0     b

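drop_na()

Removes records (rows) that contain NA
With no columns specified, it is much the same as na.omit() or data[complete.cases(data), ]
Rows can also be dropped based on NA in specified fields only
Handles NA in both numeric and character columns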

# Preparation: create the dataset
df <- data.frame(x = c(1, 2, NA), 
                 y = c("a", NA, "b")) %>% print()

##    x    y
## 1  1    a
## 2  2 <NA>
## 3 NA    b

# Keep records with no NA in any field
df %>% drop_na()

##   x y
## 1 1 a

# Keep records with no NA in the specified field
df %>% drop_na(x)

##   x    y
## 1 1    a
## 2 2 <NA>

df %>% drop_na(y)

##    x y
## 1  1 a
## 3 NA b
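
The correspondence with the base R idioms mentioned above can be checked side by side. This is a sketch on the same small dataset; the result names (by_tidyr, by_index, and so on) are illustrative.

```r
library(tidyr)

df <- data.frame(x = c(1, 2, NA),
                 y = c("a", NA, "b"))

# Whole-table case: three ways to keep only complete rows
by_tidyr <- df %>% drop_na()
by_index <- df[complete.cases(df), ]
by_omit  <- na.omit(df)

# Column-specific case: drop_na(x) corresponds to filtering on is.na(x)
by_tidyr_x <- df %>% drop_na(x)
by_index_x <- df[!is.na(df$x), ]
```

The base R versions may differ from drop_na() in row names and attributes (na.omit() records the removed rows in an na.action attribute), so compare values rather than using identical().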

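4. Combining and splitting columns

unite() / separate()

unite() is used to combine the information in several columns into one column
- col is the name of the combined field
- ... selects the fields to combine (helpers such as starts_with() are convenient)
- sep is the separator inserted between the values
separate() is the inverse of unite()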

# Combine several columns into one new column
iris_x <- iris %>% 
            unite(col = "Sepal.", starts_with("Sepal"), sep = "-") %>% 
            unite(col = "Petal.", starts_with("Petal"), sep = "-") %>% 
            head() %T>% print()

##    Sepal.  Petal. Species
## 1 5.1-3.5 1.4-0.2  setosa
## 2   4.9-3 1.4-0.2  setosa
## 3 4.7-3.2 1.3-0.2  setosa
## 4 4.6-3.1 1.5-0.2  setosa
## 5   5-3.6 1.4-0.2  setosa
## 6 5.4-3.9 1.7-0.4  setosa

# Split the columns that hold several pieces of information
iris_x %>% 
  separate(col = "Sepal.", into = c("Sepal.Length","Sepal.Width"), sep = "-") %>% 
  separate(col = "Petal.", into = c("Petal.Length","Petal.Width"), sep = "-")

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9           3          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5            5         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
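
In the output above the separated columns are character vectors, which is why values print as "3" and "5" rather than 3.0 and 5.0. A minimal sketch of restoring numeric types with the convert argument of separate() follows; the name iris_num is illustrative.

```r
library(tidyr)
library(dplyr)

# Unite the two Sepal columns of the first six rows
iris_x <- head(iris) %>%
  unite(col = "Sepal.", starts_with("Sepal"), sep = "-")

# convert = TRUE runs type conversion on the new columns after splitting
iris_num <- iris_x %>%
  separate(col = "Sepal.", into = c("Sepal.Length", "Sepal.Width"),
           sep = "-", convert = TRUE)
str(iris_num)
```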

extract()

When one column holds several pieces of information, splits it into columns using the capture groups of a regular expression

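# Preparation: create the dataset
df <- data.frame(x = c(NA, "a-b", "a-d", "b-c", "d-e"))

# The default regex "([[:alnum:]]+)" captures the first alphanumeric run
df %>% extract(x, "A")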

##      A
## 1 <NA>
## 2    a
## 3    a
## 4    b
## 5    d

# Two capture groups, one for each output column
df %>% extract(x, c("A", "B"), "([[:alnum:]]+)-([[:alnum:]]+)")

##      A    B
## 1 <NA> <NA>
## 2    a    b
## 3    a    d
## 4    b    c
## 5    d    e

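5. Generating all combinations

expand() / complete()

Creates every combination of the values in the specified fields
- expand() returns only the specified fields
- complete() also returns the other fields, filling combinations absent from the data with NA
Conventionally, you specify fields whose values repeat, such as attribute (categorical) fields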

# Preparation: create the dataset
df <- data.frame(groupid = c("A", "A", "A", "B", "B", "B", "A"), 
                 itemid = c(1:6, 6), 
                 value = c(3, 2, 1, 2, 3, 1, 22)) %T>% print()

##   groupid itemid value
## 1       A      1     3
## 2       A      2     2
## 3       A      3     1
## 4       B      4     2
## 5       B      5     3
## 6       B      6     1
## 7       A      6    22

# Output only the specified fields
df %>% expand(groupid, itemid)

## # A tibble: 12 × 2
##    groupid itemid
##     <fctr>  <dbl>
## 1        A      1
## 2        A      2
## 3        A      3
## 4        A      4
## 5        A      5
## 6        A      6
## 7        B      1
## 8        B      2
## 9        B      3
## 10       B      4
## 11       B      5
## 12       B      6

# Output all fields
df %>% complete(groupid, itemid)

## # A tibble: 12 × 3
##    groupid itemid value
##     <fctr>  <dbl> <dbl>
## 1        A      1     3
## 2        A      2     2
## 3        A      3     1
## 4        A      4    NA
## 5        A      5    NA
## 6        A      6    22
## 7        B      1    NA
## 8        B      2    NA
## 9        B      3    NA
## 10       B      4     2
## 11       B      5     3
## 12       B      6     1

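nesting() / crossing()

Helpers that build combinations of values
- nesting() returns only the combinations that actually occur in its arguments
- crossing() returns the Cartesian product, like expand.grid()
- Both take vectors as arguments (expand() takes a data frame)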

# Only the combinations that occur
nesting(x = 1:3, y = 3:1)

## # A tibble: 3 × 2
##       x     y
##   <int> <int>
## 1     1     3
## 2     2     2
## 3     3     1

# All combinations
crossing(x = 1:3, y = 3:1)

## # A tibble: 9 × 2
##       x     y
##   <int> <int>
## 1     1     1
## 2     1     2
## 3     1     3
## 4     2     1
## 5     2     2
## 6     2     3
## 7     3     1
## 8     3     2
## 9     3     3
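
The relationship between crossing() and expand.grid() can be seen by building both and comparing. This is a sketch under the assumption that only row order differs between the two for these inputs; the names cr, eg, and eg_sorted are illustrative.

```r
library(tidyr)

cr <- crossing(x = 1:3, y = 3:1)      # tibble, rows sorted by x then y
eg <- expand.grid(x = 1:3, y = 3:1)   # data.frame, first argument varies fastest

# Put the expand.grid() result into the same order before comparing
eg_sorted <- eg[order(eg$x, eg$y), ]
```

Both contain the same nine (x, y) pairs; crossing() additionally returns a tibble and sorts its output, which makes it convenient inside a pipeline.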

(Reference) nesting() is handy inside expand()

# Preparation: create the data frame
df <- data.frame(x = 1:3, y = 3:1, z = c(1, 2, 1)) %T>% print()

##   x y z
## 1 1 3 1
## 2 2 2 2
## 3 3 1 1

# All combinations
df %>% expand(x, y, z)

## # A tibble: 18 × 3
##        x     y     z
##    <int> <int> <dbl>
## 1      1     1     1
## 2      1     1     2
## 3      1     2     1
## 4      1     2     2
## 5      1     3     1
## 6      1     3     2
## 7      2     1     1
## 8      2     1     2
## 9      2     2     1
## 10     2     2     2
## 11     2     3     1
## 12     2     3     2
## 13     3     1     1
## 14     3     1     2
## 15     3     2     1
## 16     3     2     2
## 17     3     3     1
## 18     3     3     2

# Keep the observed (x, y) pairs fixed and cross them with z
df %>% expand(nesting(x, y), z)

## # A tibble: 6 × 3
##       x     y     z
##   <int> <int> <dbl>
## 1     1     3     1
## 2     1     3     2
## 3     2     2     1
## 4     2     2     2
## 5     3     1     1
## 6     3     1     2

6. Nesting data

nest() / unnest()

Nests a data frame by group
- Columns to nest are specified in ...
- Grouping columns are specified with a minus sign (-), which excludes them from nesting

# Nest directly
ndf <- iris %>% nest(-Species) %T>% print()

## # A tibble: 3 × 2
##      Species               data
##       <fctr>             <list>
## 1     setosa <tibble [50 × 4]>
## 2 versicolor <tibble [50 × 4]>
## 3  virginica <tibble [50 × 4]>

# Group first, then nest
iris %>% group_by(Species) %>% nest()

## # A tibble: 3 × 2
##      Species               data
##       <fctr>             <list>
## 1     setosa <tibble [50 × 4]>
## 2 versicolor <tibble [50 × 4]>
## 3  virginica <tibble [50 × 4]>

# Undo the nesting
ndf %>% unnest()

## # A tibble: 150 × 5
##    Species Sepal.Length Sepal.Width Petal.Length Petal.Width
##     <fctr>        <dbl>       <dbl>        <dbl>       <dbl>
## 1   setosa          5.1         3.5          1.4         0.2
## 2   setosa          4.9         3.0          1.4         0.2
## 3   setosa          4.7         3.2          1.3         0.2
## 4   setosa          4.6         3.1          1.5         0.2
## 5   setosa          5.0         3.6          1.4         0.2
## 6   setosa          5.4         3.9          1.7         0.4
## 7   setosa          4.6         3.4          1.4         0.3
## 8   setosa          5.0         3.4          1.5         0.2
## 9   setosa          4.4         2.9          1.4         0.2
## 10  setosa          4.9         3.1          1.5         0.1
## # ... with 140 more rows
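
A nested data frame becomes useful when each group's tibble is processed with purrr. The sketch below fits one linear model per species; the model formula and the column names fit and slope are illustrative choices, not taken from the original.

```r
library(tidyr)
library(dplyr)
library(purrr)

# One row per species, with the remaining columns nested in `data`
ndf <- iris %>% group_by(Species) %>% nest() %>% ungroup()

# Fit a model to each nested tibble and pull out one coefficient
models <- ndf %>%
  mutate(fit   = map(data, ~ lm(Sepal.Length ~ Sepal.Width, data = .x)),
         slope = map_dbl(fit, ~ coef(.x)[["Sepal.Width"]]))

models %>% select(Species, slope)
```

Keeping the models in a list column beside their group key means that later steps, such as extracting coefficients, stay inside the same data frame.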

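7. Other functions

full_seq()

Creates a sequence running from the minimum to the maximum of the given vector
Useful for building a vector in which the missing values of a sequence are filled in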

# Preparation: create the vector
vec <- c(1, 2, 4, 5, 10)

# Create the vector of consecutive values (step 1)
vec %>% full_seq(1)

##  [1]  1  2  3  4  5  6  7  8  9 10

# The same result with seq()
vec %>% {seq(min(.), max(.), 1)}

##  [1]  1  2  3  4  5  6  7  8  9 10
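
full_seq() pairs naturally with complete() to add the missing rows of a sequence to a data frame. The sketch below uses a hypothetical dataset, df_seq, whose day column has the same gaps as vec above.

```r
library(tidyr)

# Hypothetical data with gaps in `day`
df_seq <- data.frame(day = c(1, 2, 4, 5, 10),
                     n   = c(3, 1, 4, 1, 5))

# Add a row for every day from min(day) to max(day); new rows get NA in n
filled <- df_seq %>% complete(day = full_seq(day, 1))
print(filled)
```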
