はじめに

deepblueインターン生の中山です。
【集計、ソート、縦持ち・横持ち変換】PythonとRのデータフレーム操作比較 vol.3の続きです。
列分割・NA操作・サンプリングについて、tidyverseを用いたRとPythonの比較をしたいと思います。

連載

利用したデータ

使ったデータは、MineThatData E-Mail Analytics And Data Mining Challenge datasetです。
前回のblogとデータは同じです。

データ読み込み

Python

import pandas as pd
PATH = "http://www.minethatdata.com/Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv";
df = pd.read_csv(PATH)

R

library(tidyverse)
PATH <- "http://www.minethatdata.com/Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv"
df <- read.csv(PATH)

列の分割

正規表現を用いて1つの列を複数の列に分割するコード。
history_segmentsegmentの列をそれぞれ分割しました。
hitory_segmentにはエスケープの問題がありますが、その説明は割愛します。

Python

df1 = pd.DataFrame(df.history_segment.str.split("\) ").tolist(),
                   columns=["hs1", "hs2"])
df2 = pd.DataFrame(df.segment.str.split().tolist(),
                   columns=["seg1", "seg2"])
df_new = pd.concat([df, df1, df2], axis=1)
df_new = df_new.drop(["history_segment", "segment"], axis=1)

# df.head()
#    recency history_segment  history  mens  womens   zip_code  newbie channel        segment  visit  conversion  spend
# 0       10  2) $100 - $200   142.44     1       0  Surburban       0   Phone  Womens E-Mail      0           0    0.0
# 1        6  3) $200 - $350   329.08     1       1      Rural       1     Web      No E-Mail      0           0    0.0
# 2        7  2) $100 - $200   180.65     0       1  Surburban       1     Web  Womens E-Mail      0           0    0.0
# 3        9  5) $500 - $750   675.83     1       0      Rural       1     Web    Mens E-Mail      0           0    0.0
# 4        2    1) $0 - $100    45.34     1       0      Urban       0     Web  Womens E-Mail      0           0    0.0

# df_new.head()
#    recency  history  mens  womens   zip_code  newbie channel  visit  conversion  spend hs1          hs2    seg1    seg2
# 0       10   142.44     1       0  Surburban       0   Phone      0           0    0.0   2  $100 - $200  Womens  E-Mail
# 1        6   329.08     1       1      Rural       1     Web      0           0    0.0   3  $200 - $350      No  E-Mail
# 2        7   180.65     0       1  Surburban       1     Web      0           0    0.0   2  $100 - $200  Womens  E-Mail
# 3        9   675.83     1       0      Rural       1     Web      0           0    0.0   5  $500 - $750    Mens  E-Mail
# 4        2    45.34     1       0      Urban       0     Web      0           0    0.0   1    $0 - $100  Womens  E-Mail

R

df_new <- df %>%
  separate(history_segment, c("hs1", "hs2"), "[)] ") %>%
  separate(segment, c("seg1", "seg2"), " ")

# df %>% head()
#   recency history_segment history mens womens  zip_code newbie channel       segment visit conversion spend
# 1      10  2) $100 - $200  142.44    1      0 Surburban      0   Phone Womens E-Mail     0          0     0
# 2       6  3) $200 - $350  329.08    1      1     Rural      1     Web     No E-Mail     0          0     0
# 3       7  2) $100 - $200  180.65    0      1 Surburban      1     Web Womens E-Mail     0          0     0
# 4       9  5) $500 - $750  675.83    1      0     Rural      1     Web   Mens E-Mail     0          0     0
# 5       2    1) $0 - $100   45.34    1      0     Urban      0     Web Womens E-Mail     0          0     0
# 6       6  2) $100 - $200  134.83    0      1 Surburban      0   Phone Womens E-Mail     1          0     0

# df_new %>% head()
#   recency hs1         hs2 history mens womens  zip_code newbie channel   seg1   seg2 visit conversion spend
# 1      10   2 $100 - $200  142.44    1      0 Surburban      0   Phone Womens E-Mail     0          0     0
# 2       6   3 $200 - $350  329.08    1      1     Rural      1     Web     No E-Mail     0          0     0
# 3       7   2 $100 - $200  180.65    0      1 Surburban      1     Web Womens E-Mail     0          0     0
# 4       9   5 $500 - $750  675.83    1      0     Rural      1     Web   Mens E-Mail     0          0     0
# 5       2   1   $0 - $100   45.34    1      0     Urban      0     Web Womens E-Mail     0          0     0
# 6       6   2 $100 - $200  134.83    0      1 Surburban      0   Phone Womens E-Mail     1          0     0

NAの削除

データフレームの中にNAがある時に行削除をするコード。
コードの前半では、上記で説明した列の分割を利用して敢えてNAを作っています。
これは、history_segmentの列に7) $1,000 +が1308個含まれているためです。
行数が64000から、62692になっていることを確認できると思います。

Python     R
dropna()  :  drop_na()

Python

df1 = pd.DataFrame(df.history_segment.str.split().tolist(), columns=["hs1","hs2","hs3","hs4"])
df_na = pd.concat([df, df1], axis=1)
df_na = df_na.drop("history_segment", axis=1)
df_new = df_na.dropna()

# df_na.iloc[67:72,]
#     recency  history  mens  womens   zip_code  newbie       channel        segment  visit  conversion  spend hs1     hs2 hs3   hs4
# 67       11    53.81     1       0      Urban       1         Phone      No E-Mail      0           0    0.0  1)      $0   -  $100
# 68        9   154.15     0       1  Surburban       0           Web    Mens E-Mail      0           0    0.0  2)    $100   -  $200
# 69       10  1009.44     1       0  Surburban       1  Multichannel  Womens E-Mail      0           0    0.0  7)  $1,000   +  None
# 70        2   278.80     1       0      Rural       0           Web    Mens E-Mail      0           0    0.0  3)    $200   -  $350
# 71       10    31.12     1       0      Urban       0         Phone  Womens E-Mail      0           0    0.0  1)      $0   -  $100
#
# print(len(df_na))
# 62692

# df_new.iloc[67:72,]
#     recency  history  mens  womens   zip_code  newbie       channel        segment  visit  conversion  spend hs1     hs2 hs3   hs4
# 67       11    53.81     1       0      Urban       1         Phone      No E-Mail      0           0    0.0  1)      $0   -  $100
# 68        9   154.15     0       1  Surburban       0           Web    Mens E-Mail      0           0    0.0  2)    $100   -  $200
# 70        2   278.80     1       0      Rural       0           Web    Mens E-Mail      0           0    0.0  3)    $200   -  $350
# 71       10    31.12     1       0      Urban       0         Phone  Womens E-Mail      0           0    0.0  1)      $0   -  $100
# 72        2   428.74     1       0      Rural       0        Phone     Mens E-Mail      0           0    0.0  4)    $350   -  $500
#
# print(len(df_new))
# 64000

R

df_na <- df %>%
  separate(history_segment, c("hs1","hs2","hs3","hs4"), " ", convert=T)
df_new <- df_na %>%
  drop_na()

# df_na[68:72,]
#    recency hs1    hs2 hs3  hs4 history mens womens  zip_code newbie      channel       segment visit conversion spend
# 68      11  1)     $0   - $100   53.81    1      0     Urban      1        Phone     No E-Mail     0          0     0
# 69       9  2)   $100   - $200  154.15    0      1 Surburban      0          Web   Mens E-Mail     0          0     0
# 70      10  7) $1,000   + <NA> 1009.44    1      0 Surburban      1 Multichannel Womens E-Mail     0          0     0
# 71       2  3)   $200   - $350  278.80    1      0     Rural      0          Web   Mens E-Mail     0          0     0
# 72      10  1)     $0   - $100   31.12    1      0     Urban      0        Phone Womens E-Mail     0          0     0
#
# nrow(df_na)
# 64000

# df_new[68:72,]
#    recency hs1  hs2 hs3  hs4 history mens womens  zip_code newbie channel       segment visit conversion spend
# 68      11  1)   $0   - $100   53.81    1      0     Urban      1   Phone     No E-Mail     0          0     0
# 69       9  2) $100   - $200  154.15    0      1 Surburban      0     Web   Mens E-Mail     0          0     0
# 71       2  3) $200   - $350  278.80    1      0     Rural      0     Web   Mens E-Mail     0          0     0
# 72      10  1)   $0   - $100   31.12    1      0     Urban      0   Phone Womens E-Mail     0          0     0
# 73       2  4) $350   - $500  428.74    1      0     Rural      0   Phone   Mens E-Mail     0          0     0
#
# nrow(df_new)
# 62692

NAの置換

NAの削除をするコード。
NAを作り出すまでは、上記「NAの削除」と同様です。
「NAの削除」では、NAの削除を行っていますが、似たようなコードでNAを埋めることもできます。

Python     R
fillna(value={"hoge": "fuga"})  :  replace_na(list(hoge = "fuga"))

Python

df1 = pd.DataFrame(df.history_segment.str.split().tolist(), columns=["hs1","hs2","hs3","hs4"])
df_na = pd.concat([df,df1],axis=1)
df_na = df_na.drop("history_segment", axis=1)
df_new = df_na.fillna(value={"hs4":"$1000"})

# display(df_na.iloc[67:72,])
#     recency  history  mens  womens   zip_code  newbie       channel        segment  visit  conversion  spend hs1     hs2 hs3   hs4
# 67       11    53.81     1       0      Urban       1         Phone      No E-Mail      0           0    0.0  1)      $0   -  $100
# 68        9   154.15     0       1  Surburban       0           Web    Mens E-Mail      0           0    0.0  2)    $100   -  $200
# 69       10  1009.44     1       0  Surburban       1  Multichannel  Womens E-Mail      0           0    0.0  7)  $1,000   +  None
# 70        2   278.80     1       0      Rural       0           Web    Mens E-Mail      0           0    0.0  3)    $200   -  $350
# 71       10    31.12     1       0      Urban       0         Phone  Womens E-Mail      0           0    0.0  1)      $0   -  $100
#
# print(len(df))
# 64000

# display(df_new.iloc[67:72,])
#     recency  history  mens  womens   zip_code  newbie       channel        segment  visit  conversion  spend hs1     hs2 hs3    hs4
# 67       11    53.81     1       0      Urban       1         Phone      No E-Mail      0           0    0.0  1)      $0   -   $100
# 68        9   154.15     0       1  Surburban       0           Web    Mens E-Mail      0           0    0.0  2)    $100   -   $200
# 69       10  1009.44     1       0  Surburban       1  Multichannel  Womens E-Mail      0           0    0.0  7)  $1,000   +  $1000
# 70        2   278.80     1       0      Rural       0           Web    Mens E-Mail      0           0    0.0  3)    $200   -   $350
# 71       10    31.12     1       0      Urban       0         Phone  Womens E-Mail      0           0    0.0  1)      $0   -   $100
#
# print(len(df_new))
# 64000

R

df_na <- df %>%
  separate(history_segment, c("hs1","hs2","hs3","hs4"), " ", convert=T)
df_new <- df_na %>%
  replace_na(list(hs4="$1000"))

# df_na[68:72,]
#    recency hs1    hs2 hs3  hs4 history mens womens  zip_code newbie      channel       segment visit conversion spend
# 68      11  1)     $0   - $100   53.81    1      0     Urban      1        Phone     No E-Mail     0          0     0
# 69       9  2)   $100   - $200  154.15    0      1 Surburban      0          Web   Mens E-Mail     0          0     0
# 70      10  7) $1,000   + <NA> 1009.44    1      0 Surburban      1 Multichannel Womens E-Mail     0          0     0
# 71       2  3)   $200   - $350  278.80    1      0     Rural      0          Web   Mens E-Mail     0          0     0
# 72      10  1)     $0   - $100   31.12    1      0     Urban      0        Phone Womens E-Mail     0          0     0
#
# nrow(df_na)
# 64000

# df_new[68:72,]
#    recency hs1    hs2 hs3   hs4 history mens womens  zip_code newbie      channel       segment visit conversion spend
# 68      11  1)     $0   -  $100   53.81    1      0     Urban      1        Phone     No E-Mail     0          0     0
# 69       9  2)   $100   -  $200  154.15    0      1 Surburban      0          Web   Mens E-Mail     0          0     0
# 70      10  7) $1,000   + $1000 1009.44    1      0 Surburban      1 Multichannel Womens E-Mail     0          0     0
# 71       2  3)   $200   -  $350  278.80    1      0     Rural      0          Web   Mens E-Mail     0          0     0
# 72      10  1)     $0   -  $100   31.12    1      0     Urban      0        Phone Womens E-Mail     0          0     0
#
# nrow(df_new)
# 64000

サンプリング

行のサンプリングをするコード。
データフレームの中からいくつかの行を抽出します。
やり方は主に2種類あり、
・抽出する個数を選択する
・抽出する割合を選択する
のどちらかです。

Python     R
sample(a)  :  sample_n(a)
sample(frac = a)  :  sample_frac(a)

Python

df1 = df.sample(64) # 個数
df2 = df.sample(frac=0.001) # 割合

# df1.head()
#        recency   history_segment  history  mens  womens   zip_code  newbie  channel        segment  visit  conversion  spend
# 57905        4  6) $750 - $1,000   799.47     0       1      Urban       1      Web  Womens E-Mail      0           0    0.0
# 51278        7      1) $0 - $100    83.27     1       0      Urban       0    Phone      No E-Mail      0           0    0.0
# 12856       10    4) $350 - $500   362.40     1       1      Urban       0      Web      No E-Mail      0           0    0.0
# 2306        11    2) $100 - $200   146.22     0       1  Surburban       1    Phone  Womens E-Mail      0           0    0.0
# 6540        10      1) $0 - $100    93.61     1       0      Urban       1    Phone  Womens E-Mail      0           0    0.0
#
# df1.shape
# (64, 12)

# df2.head()
#        recency history_segment  history  mens  womens   zip_code  newbie  channel        segment  visit  conversion  spend
# 27973        5    1) $0 - $100    37.82     0       1      Urban       1      Web    Mens E-Mail      0           0    0.0
# 31988       12  2) $100 - $200   129.14     0       1      Urban       0      Web  Womens E-Mail      0           0    0.0
# 12988        2  2) $100 - $200   138.71     0       1  Surburban       0      Web      No E-Mail      1           0    0.0
# 21122        4  3) $200 - $350   345.04     1       0  Surburban       0    Phone    Mens E-Mail      0           0    0.0
# 17095        2    1) $0 - $100    39.85     1       0      Rural       0    Phone  Womens E-Mail      0           0    0.0
#
# df2.shape
# (64, 12)

R

df1 <- df %>%
  sample_n(64) # 個数
df2 <- df %>%
  sample_frac(0.001) # 割合

# df1 %>% head()
#   recency history_segment history mens womens  zip_code newbie      channel       segment visit conversion spend
# 1      10    1) $0 - $100   50.93    1      0 Surburban      1        Phone   Mens E-Mail     0          0     0
# 2       4  2) $100 - $200  124.94    0      1     Rural      0          Web   Mens E-Mail     0          0     0
# 3      10  5) $500 - $750  686.45    1      1 Surburban      1 Multichannel   Mens E-Mail     0          0     0
# 4       3     7) $1,000 + 1829.32    1      1     Rural      1 Multichannel   Mens E-Mail     0          0     0
# 5       1  3) $200 - $350  255.71    1      0 Surburban      0 Multichannel Womens E-Mail     0          0     0
# 6       5    1) $0 - $100   29.99    1      0 Surburban      1          Web   Mens E-Mail     1          0     0
#
# dim(df1)
# 64 12

# df2 %>% head()
#   recency history_segment history mens womens  zip_code newbie      channel       segment visit conversion spend
# 1       6    1) $0 - $100   29.99    1      0 Surburban      1          Web     No E-Mail     0          0     0
# 2      12  2) $100 - $200  125.92    0      1     Urban      1          Web Womens E-Mail     0          0     0
# 3       8  2) $100 - $200  157.14    0      1     Rural      0        Phone   Mens E-Mail     0          0     0
# 4      11  3) $200 - $350  284.96    0      1 Surburban      0 Multichannel   Mens E-Mail     1          0     0
# 5       2  4) $350 - $500  412.44    1      0     Rural      0          Web     No E-Mail     0          0     0
# 6       7    1) $0 - $100   49.97    1      0     Urban      0          Web Womens E-Mail     0          0     0
#
# dim(df2)
# 64 12

連載