[Summary] Comparing data frame operations in Python's pandas and R's dplyr/tidyverse

2021.08.12

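Introduction

I'm Nakayama, an intern at deepblue. I have written several articles on data frame processing in Python and R. Because each operation ended up in a separate article and the series became hard to read, this article gathers them in one place. The content largely overlaps with the earlier articles, so referring to those instead is fine. The operations shown here are only one way of doing each task; other approaches can produce the same result.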

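Data used

The data used here is the MineThatData E-Mail Analytics And Data Mining Challenge dataset.
It is the dataset analyzed in the book 効果検証入門.
The details of the data do not matter for this article and are omitted; if you are interested, have a look at the data on the website.
Many examples in this article apply the condition history > 3000. It only reduces the number of rows so that the output is easy to inspect and has no deeper meaning.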

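Basic operations

Before walking through the code, here are the differences in notation and keyboard shortcuts between Python and R.

Python (Jupyter Notebook)

Assignment: =
Toggle comment: ctrl + /
Redo: ctrl + y

R (RStudio)

Assignment: <-
Toggle comment: shift + ctrl + c
Redo: shift + ctrl + z
Pipe operator (%>%): shift + ctrl + m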

Loading data

Code that reads a CSV file and inspects the resulting data frame.

Python

pd.read_csv()

import pandas as pd
PATH = "http://www.minethatdata.com/Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv"
df = pd.read_csv(PATH)
df.head()

#   recency history_segment history mens womens  zip_code newbie channel       segment visit conversion spend
# 0      10  2) $100 - $200  142.44    1      0 Surburban      0   Phone Womens E-Mail     0          0     0
# 1       6  3) $200 - $350  329.08    1      1     Rural      1     Web     No E-Mail     0          0     0
# 2       7  2) $100 - $200  180.65    0      1 Surburban      1     Web Womens E-Mail     0          0     0
# 3       9  5) $500 - $750  675.83    1      0     Rural      1     Web   Mens E-Mail     0          0     0
# 4       2    1) $0 - $100   45.34    1      0     Urban      0     Web Womens E-Mail     0          0     0

R

read.csv()

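library(tidyverse)  # provides %>% and the dplyr/tidyr functions used below
PATH <- "http://www.minethatdata.com/Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv"
# R 4.0+ reads strings as character by default; stringsAsFactors = TRUE reproduces the factor columns shown below
df <- read.csv(PATH, stringsAsFactors = TRUE)
head(df)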

#   recency history_segment history mens womens  zip_code newbie channel       segment visit conversion spend
# 1      10  2) $100 - $200  142.44    1      0 Surburban      0   Phone Womens E-Mail     0          0     0
# 2       6  3) $200 - $350  329.08    1      1     Rural      1     Web     No E-Mail     0          0     0
# 3       7  2) $100 - $200  180.65    0      1 Surburban      1     Web Womens E-Mail     0          0     0
# 4       9  5) $500 - $750  675.83    1      0     Rural      1     Web   Mens E-Mail     0          0     0
# 5       2    1) $0 - $100   45.34    1      0     Urban      0     Web Womens E-Mail     0          0     0
# 6       6  2) $100 - $200  134.83    0      1 Surburban      0   Phone Womens E-Mail     1          0     0

Checking the number of rows and columns

Code that checks the number of rows and columns of the data.

Python

Rows x columns: shape
Rows: shape[0]
Columns: shape[1]

df.shape

# (64000, 12)

df.shape[0]

# 64000

df.shape[1]

# 12

R

Rows x columns: dim()
Rows: nrow()
Columns: ncol()

df %>% dim()
# dim(df)  # also works

# [1] 64000    12

df %>% nrow()
# nrow(df)  # also works

# [1] 64000

df %>% ncol()
# ncol(df)  # also works

# [1] 12

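Checking types

Code that checks the type of each column.
In R this is done with str(), short for "structure"; take care not to confuse it with a string type.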

Python

dtypes

df.dtypes

# recency              int64
# history_segment     object
# history            float64
# mens                 int64
# womens               int64
# zip_code            object
# newbie               int64
# channel             object
# segment             object
# visit                int64
# conversion           int64
# spend              float64
# dtype: object

R

str()

df %>% str()
# str(df)  # also works

# 'data.frame': 64000 obs. of  12 variables:
# $ recency        : int  10 6 7 9 2 6 9 9 9 10 ...
# $ history_segment: Factor w/ 7 levels "1) $0 - $100",..: 2 3 2 5 1 2 3 1 5 1 ...
# $ history        : num  142.4 329.1 180.7 675.8 45.3 ...
# $ mens           : int  1 1 0 1 1 0 1 0 1 0 ...
# $ womens         : int  0 1 1 0 0 1 0 1 1 1 ...
# $ zip_code       : Factor w/ 3 levels "Rural","Surburban",..: 2 1 2 1 3 2 2 3 1 3 ...
# $ newbie         : int  0 1 1 1 0 0 1 0 1 1 ...
# $ channel        : Factor w/ 3 levels "Multichannel",..: 2 3 3 3 3 2 2 2 2 3 ...
# $ segment        : Factor w/ 3 levels "Mens E-Mail",..: 3 2 3 1 3 3 3 3 1 3 ...
# $ visit          : int  0 0 0 0 0 1 0 0 0 0 ...
# $ conversion     : int  0 0 0 0 0 0 0 0 0 0 ...
# $ spend          : num  0 0 0 0 0 0 0 0 0 0 ...

Summary statistics

Code that gets summary statistics for each column of the data frame.
In R, summary() also covers qualitative variables (mainly factor columns).

Python

describe()

df.describe()

#             recency       history          mens        womens        newbie         visit    conversion         spend
# count  64000.000000  64000.000000  64000.000000  64000.000000  64000.000000  64000.000000  64000.000000  64000.000000
# mean       5.763734    242.085656      0.551031      0.549719      0.502250      0.146781      0.009031      1.050908
# std        3.507592    256.158608      0.497393      0.497526      0.499999      0.353890      0.094604     15.036448
# min        1.000000     29.990000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000
# 25%        2.000000     64.660000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000
# 50%        6.000000    158.110000      1.000000      1.000000      1.000000      0.000000      0.000000      0.000000
# 75%        9.000000    325.657500      1.000000      1.000000      1.000000      0.000000      0.000000      0.000000
# max       12.000000   3345.930000      1.000000      1.000000      1.000000      1.000000      1.000000    499.000000

R

summary()

df %>% summary()
# summary(df)  # also works

#    recency               history_segment     history             mens           womens            zip_code
# Min.   : 1.000   1) $0 - $100    :22970   Min.   :  29.99   Min.   :0.000   Min.   :0.0000   Rural    : 9563
# 1st Qu.: 2.000   2) $100 - $200  :14254   1st Qu.:  64.66   1st Qu.:0.000   1st Qu.:0.0000   Surburban:28776
# Median : 6.000   3) $200 - $350  :12289   Median : 158.11   Median :1.000   Median :1.0000   Urban    :25661
# Mean   : 5.764   4) $350 - $500  : 6409   Mean   : 242.09   Mean   :0.551   Mean   :0.5497
# 3rd Qu.: 9.000   5) $500 - $750  : 4911   3rd Qu.: 325.66   3rd Qu.:1.000   3rd Qu.:1.0000
# Max.   :12.000   6) $750 - $1,000: 1859   Max.   :3345.93   Max.   :1.000   Max.   :1.0000
#                  7) $1,000 +     : 1308
#
#     newbie               channel               segment          visit          conversion
# Min.   :0.0000   Multichannel: 7762   Mens E-Mail  :21307   Min.   :0.0000   Min.   :0.000000
# 1st Qu.:0.0000   Phone       :28021   No E-Mail    :21306   1st Qu.:0.0000   1st Qu.:0.000000
# Median :1.0000   Web         :28217   Womens E-Mail:21387   Median :0.0000   Median :0.000000
# Mean   :0.5022                                              Mean   :0.1468   Mean   :0.009031
# 3rd Qu.:1.0000                                              3rd Qu.:0.0000   3rd Qu.:0.000000
# Max.   :1.0000                                              Max.   :1.0000   Max.   :1.000000
#
#     spend
# Min.   :  0.000
# 1st Qu.:  0.000
# Median :  0.000
# Mean   :  1.051
# 3rd Qu.:  0.000
# Max.   :499.000
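
On the pandas side, describe() skips text columns by default, which is why the Python output above has fewer columns than the R one. A minimal sketch of two ways to get closer to summary(), using a made-up toy frame (df_demo and its values are illustrative assumptions, not the article's data):

```python
import pandas as pd

# Toy frame; the values are invented for illustration.
df_demo = pd.DataFrame({
    "channel": ["Phone", "Web", "Web"],
    "spend": [0.0, 10.0, 20.0],
})

# Default: numeric columns only.
numeric_only = df_demo.describe()

# include="all" adds count/unique/top/freq rows for text columns.
everything = df_demo.describe(include="all")

# Per-level counts, similar to what summary() prints for a factor.
by_value = df_demo["channel"].value_counts()

print(everything)
print(by_value)
```

value_counts() is probably the closer match to the per-level counts that summary() shows for factor columns.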

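Index numbering

Python and R differ in how numeric positions are specified, so the differences are listed here.
Code that displays the same data in Python and in R follows.

Python

First index: 0
a in [a:b]: included
b in [a:b]: excluded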

df.iloc[0:5,1:4]

#   history_segment  history  mens
# 0  2) $100 - $200   142.44     1
# 1  3) $200 - $350   329.08     1
# 2  2) $100 - $200   180.65     0
# 3  5) $500 - $750   675.83     1
# 4    1) $0 - $100    45.34     1

R

First index: 1
a in [a:b]: included
b in [a:b]: included

df[1:5,2:4]

#   history_segment history mens
# 1  2) $100 - $200  142.44    1
# 2  3) $200 - $350  329.08    1
# 3  2) $100 - $200  180.65    0
# 4  5) $500 - $750  675.83    1
# 5    1) $0 - $100   45.34    1
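
One pandas detail worth knowing alongside the rules above: the half-open rule applies to position-based iloc, while label-based loc includes both ends, much like R. A small sketch on a made-up frame (df_demo is an illustrative assumption):

```python
import pandas as pd

# Toy frame standing in for the article's data (values are illustrative).
df_demo = pd.DataFrame({
    "recency": [10, 6, 7, 9, 2, 6],
    "history": [142.44, 329.08, 180.65, 675.83, 45.34, 134.83],
    "mens": [1, 1, 0, 1, 1, 0],
})

# iloc is position-based and half-open: rows 0..4, columns 0..1.
by_position = df_demo.iloc[0:5, 0:2]

# loc is label-based and inclusive on both ends.
by_label = df_demo.loc[0:4, "recency":"history"]

print(by_position.equals(by_label))
```

Both selections return the same five rows and two columns here because the default index labels coincide with the positions.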

Selecting columns

Code that selects specific columns from the loaded data frame.
For R, the version that does not use tidyverse is also shown.

Python

[["hoge", "fuga"]]
loc[:,["hoge", "fuga"]]

df_new = df[["recency", "history", "zip_code"]]
# df_new = df.loc[:, ["recency", "history", "zip_code"]]  # also works

# df_new.head()
#    recency  history   zip_code
# 0       10   142.44  Surburban
# 1        6   329.08      Rural
# 2        7   180.65  Surburban
# 3        9   675.83      Rural
# 4        2    45.34      Urban

R

select(hoge, fuga)

# With tidyverse
df_new <- df %>%
  select(recency, history, zip_code)

# Without tidyverse
# df_new <- df[c("recency", "history", "zip_code")]

# df_new %>% head()
#   recency history  zip_code
# 1      10  142.44 Surburban
# 2       6  329.08     Rural
# 3       7  180.65 Surburban
# 4       9  675.83     Rural
# 5       2   45.34     Urban
# 6       6  134.83 Surburban

Filtering rows

Code that extracts the rows matching a condition from the data frame.
Two conditions are used here: history > 3000 and channel == "Phone".

Python

[(hoge > a) & (fuga == b)]

df_new = df[(df.history > 3000) & (df.channel == "Phone")]

# df_new
#        recency history_segment  history  mens  womens   zip_code  newbie  channel    segment  visit  conversion  spend
# 4579         2     7) $1,000 +  3345.93     1       1  Surburban       1    Phone  No E-Mail      0           0    0.0
# 43060        1     7) $1,000 +  3003.48     1       1      Urban       1    Phone  No E-Mail      0           0    0.0

R

filter(hoge > a, fuga == b)

# With tidyverse
df_new <- df %>%
  filter(history > 3000,
         channel == "Phone")

# Without tidyverse
# df_new <- df[df$history > 3000 & df$channel == "Phone",]

# df_new
#   recency history_segment history mens womens  zip_code newbie channel   segment visit conversion spend
# 1       2     7) $1,000 + 3345.93    1      1 Surburban      1   Phone No E-Mail     0          0     0
# 2       1     7) $1,000 + 3003.48    1      1     Urban      1   Phone No E-Mail     0          0     0
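
As a side note, pandas also has query(), which reads closer to dplyr's filter(). A sketch on a made-up frame (df_demo and its values are assumptions for illustration) showing that it matches the boolean-mask form:

```python
import pandas as pd

# Toy frame; values are illustrative only.
df_demo = pd.DataFrame({
    "history": [3345.93, 3215.97, 3003.48, 142.44],
    "channel": ["Phone", "Multichannel", "Phone", "Phone"],
})

# Boolean-mask form: each condition needs parentheses because
# & binds more tightly than > and ==.
masked = df_demo[(df_demo["history"] > 3000) & (df_demo["channel"] == "Phone")]

# Equivalent query() form with the conditions written as one string.
queried = df_demo.query("history > 3000 and channel == 'Phone'")

print(masked.equals(queried))
```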

Assigning values

Code that assigns a value to a column for the rows matching a condition.
Here 1000 is assigned to the spend column.

Python

assign(hoge = a)

df_new = df[df.history > 3000].assign(spend = 1000)

# df_new
#        recency history_segment  history  mens  womens   zip_code  newbie       channel        segment  visit  conversion  spend
# 4579         2     7) $1,000 +  3345.93     1       1  Surburban       1         Phone      No E-Mail      0           0   1000
# 38680        1     7) $1,000 +  3215.97     1       1      Urban       1  Multichannel    Mens E-Mail      1           0   1000
# 43060        1     7) $1,000 +  3003.48     1       1      Urban       1         Phone      No E-Mail      0           0   1000
# 52860        1     7) $1,000 +  3040.20     0       1      Urban       1           Web  Womens E-Mail      0           0   1000

R

mutate(hoge = a)

# With tidyverse
df_new <- df %>%
  filter(history > 3000) %>%
  mutate(spend = 1000)

# Without tidyverse (note: this overwrites df in place and keeps every row)
# df[df$history > 3000, "spend"] <- 1000

# df_new
#   recency history_segment history mens womens  zip_code newbie      channel       segment visit conversion spend
# 1       2     7) $1,000 + 3345.93    1      1 Surburban      1        Phone     No E-Mail     0          0  1000
# 2       1     7) $1,000 + 3215.97    1      1     Urban      1 Multichannel   Mens E-Mail     1          0  1000
# 3       1     7) $1,000 + 3003.48    1      1     Urban      1        Phone     No E-Mail     0          0  1000
# 4       1     7) $1,000 + 3040.20    0      1     Urban      1          Web Womens E-Mail     0          0  1000
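
The base-R one-liner above changes df itself, whereas assign() returns a new frame. If the in-place behavior is what you want in pandas, a .loc assignment is one way to get it. A sketch on a made-up frame (df_demo and df_inplace are illustrative names, not from the article):

```python
import pandas as pd

# Toy frame; values are illustrative only.
df_demo = pd.DataFrame({
    "history": [3345.93, 142.44, 3003.48],
    "spend": [0.0, 0.0, 0.0],
})

# assign() returns a new frame; df_demo itself is untouched.
df_new = df_demo[df_demo["history"] > 3000].assign(spend=1000)

# .loc with a boolean mask overwrites the matching rows in place and
# keeps all rows, like df[df$history > 3000, "spend"] <- 1000 in R.
df_inplace = df_demo.copy()
df_inplace.loc[df_inplace["history"] > 3000, "spend"] = 1000

print(df_new)
print(df_inplace)
```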

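Adding a column

Code that adds a new column.
The previous section overwrote existing values; here a new column is added instead.
In the code below, the extracted rows are stored in df_selected, and df_new is df_selected with the new column added.

Python

df["col"] = df.apply(lambda x: x.hoge * x.fuga, axis=1)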

df_selected = df[["recency", "history", "zip_code", "channel", "spend"]]
df_selected = df_selected[df_selected.history > 3000]
df_new = df_selected.copy()
df_new["col"] = df_new.apply(lambda x: x.recency * x.history, axis=1)
# df_new["col"] = df_new.recency * df_new.history  # also works

# df_new
#        recency  history   zip_code       channel  spend      col
# 4579         2  3345.93  Surburban         Phone    0.0  6691.86
# 38680        1  3215.97      Urban  Multichannel    0.0  3215.97
# 43060        1  3003.48      Urban         Phone    0.0  3003.48
# 52860        1  3040.20      Urban           Web    0.0  3040.20

R

df %>% mutate(col = hoge * fuga)

df_selected <- df %>%
  select(recency, history, zip_code, channel, spend) %>%
  filter(history > 3000)
df_new <- df_selected %>%
  mutate(col = recency * history)

# df_new
#   recency history  zip_code      channel spend     col
# 1       2 3345.93 Surburban        Phone     0 6691.86
# 2       1 3215.97     Urban Multichannel     0 3215.97
# 3       1 3003.48     Urban        Phone     0 3003.48
# 4       1 3040.20     Urban          Web     0 3040.20
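
The two Python forms above (row-wise apply and plain column arithmetic) give the same values; the vectorized form is usually the faster one on large frames. A sketch on made-up data (df_demo is an assumption), also showing assign() as a rough counterpart of mutate():

```python
import pandas as pd

# Toy frame; values are illustrative only.
df_demo = pd.DataFrame({
    "recency": [2, 1, 1, 1],
    "history": [3345.93, 3215.97, 3003.48, 3040.20],
})

# Row-wise apply: the lambda runs once per row.
via_apply = df_demo.apply(lambda x: x.recency * x.history, axis=1)

# Vectorized multiplication over whole columns.
via_vector = df_demo["recency"] * df_demo["history"]

# assign() returns a new frame with the extra column, leaving df_demo as is.
df_new = df_demo.assign(col=via_vector)

print(df_new)
```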

Removing duplicates

Code that removes duplicates from a data frame.
Here duplicates in the channel column are removed.

Python

drop_duplicates()

df_new = df.channel.drop_duplicates()

# df_new
# 0            Phone
# 1              Web
# 12    Multichannel
# Name: channel, dtype: object

R

distinct()

df_new <- df %>%
  distinct(channel)

# df_new
#        channel
# 1        Phone
# 2          Web
# 3 Multichannel
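
Note that the Python result above is a Series that keeps the original row labels, while distinct() returns a renumbered data frame. A sketch of two variants that get closer to the R behavior, on a made-up frame (df_demo is an illustrative assumption):

```python
import pandas as pd

# Toy frame; values are illustrative only.
df_demo = pd.DataFrame({
    "channel": ["Phone", "Web", "Web", "Phone", "Multichannel"],
    "spend": [0.0, 0.0, 29.99, 0.0, 0.0],
})

# One-column DataFrame of distinct values with a fresh index,
# similar to distinct(channel).
channels = df_demo[["channel"]].drop_duplicates().reset_index(drop=True)

# Whole rows, keeping the first occurrence per channel,
# similar to distinct(channel, .keep_all = TRUE).
first_rows = df_demo.drop_duplicates(subset="channel")

print(channels)
print(first_rows)
```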

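Renaming columns

Code that renames a column.
Note that hoge and fuga are written in opposite orders: pandas maps the old name to the new name, while dplyr writes new = old.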

Python

rename(columns={"hoge" : "fuga"})

df_new = df.rename(columns={"recency":"RECENCY"})

# df_new.head()
#    RECENCY history_segment  history  mens  womens   zip_code  newbie channel        segment  visit  conversion  spend
# 0       10  2) $100 - $200   142.44     1       0  Surburban       0   Phone  Womens E-Mail      0           0    0.0
# 1        6  3) $200 - $350   329.08     1       1      Rural       1     Web      No E-Mail      0           0    0.0
# 2        7  2) $100 - $200   180.65     0       1  Surburban       1     Web  Womens E-Mail      0           0    0.0
# 3        9  5) $500 - $750   675.83     1       0      Rural       1     Web    Mens E-Mail      0           0    0.0
# 4        2    1) $0 - $100    45.34     1       0      Urban       0     Web  Womens E-Mail      0           0    0.0

R

rename(fuga = hoge)

df_new <- df %>%
  rename(RECENCY = recency)

# df_new %>% head()
#   RECENCY history_segment history mens womens  zip_code newbie channel       segment visit conversion spend
# 1      10  2) $100 - $200  142.44    1      0 Surburban      0   Phone Womens E-Mail     0          0     0
# 2       6  3) $200 - $350  329.08    1      1     Rural      1     Web     No E-Mail     0          0     0
# 3       7  2) $100 - $200  180.65    0      1 Surburban      1     Web Womens E-Mail     0          0     0
# 4       9  5) $500 - $750  675.83    1      0     Rural      1     Web   Mens E-Mail     0          0     0
# 5       2    1) $0 - $100   45.34    1      0     Urban      0     Web Womens E-Mail     0          0     0
# 6       6  2) $100 - $200  134.83    0      1 Surburban      0   Phone Womens E-Mail     1          0     0

Binding columns

Code that joins data frames column-wise.
df is split into df1 and df2, which are then bound by column.

Python

concat([df1, df2], axis=1)

df1 = df[["recency", "history", "zip_code"]]
df2 = df[["channel", "spend"]]
df_new = pd.concat([df1, df2], axis=1)

# df_new.head()
#    recency  history   zip_code channel  spend
# 0       10   142.44  Surburban   Phone    0.0
# 1        6   329.08      Rural     Web    0.0
# 2        7   180.65  Surburban     Web    0.0
# 3        9   675.83      Rural     Web    0.0
# 4        2    45.34      Urban     Web    0.0

R

bind_cols(df1, df2)

df1 <- df %>%
  select(recency, history, zip_code)
df2 <- df %>%
  select(channel, spend)
df_new <- df1 %>%
  bind_cols(df2)

# Without tidyverse
# df_new <- cbind(df1, df2)

# df_new %>% head()
#   recency history  zip_code channel spend
# 1      10  142.44 Surburban   Phone     0
# 2       6  329.08     Rural     Web     0
# 3       7  180.65 Surburban     Web     0
# 4       9  675.83     Rural     Web     0
# 5       2   45.34     Urban     Web     0
# 6       6  134.83 Surburban   Phone     0
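
One difference to keep in mind: pd.concat(axis=1) matches rows by index label, not by position. In the example above both pieces share df's index, so the result is the same, but with differing indexes the output changes. A sketch with made-up frames (left and right are illustrative names):

```python
import pandas as pd

# Two toy frames whose index labels only partly overlap.
left = pd.DataFrame({"recency": [10, 6, 7]})
right = pd.DataFrame({"spend": [0.0, 1.5, 2.5]}, index=[2, 3, 4])

# concat(axis=1) aligns on index labels, so unmatched labels yield NaN.
aligned = pd.concat([left, right], axis=1)

# Resetting the index first gives a purely positional bind.
positional = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)], axis=1
)

print(aligned)
print(positional)
```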

Binding rows

Code that joins data frames row-wise.
df is split into df1 and df2, which are then bound by row.

Python

concat([df1, df2], axis=0)

df1 = df[df.history > 3000]
df2 = df[df.history == 30]
df_new = pd.concat([df1, df2], axis=0)

# df_new
#        recency history_segment  history  mens  womens   zip_code  newbie       channel        segment  visit  conversion  spend
# 4579         2     7) $1,000 +  3345.93     1       1  Surburban       1         Phone      No E-Mail      0           0    0.0
# 38680        1     7) $1,000 +  3215.97     1       1      Urban       1  Multichannel    Mens E-Mail      1           0    0.0
# 43060        1     7) $1,000 +  3003.48     1       1      Urban       1         Phone      No E-Mail      0           0    0.0
# 52860        1     7) $1,000 +  3040.20     0       1      Urban       1           Web  Womens E-Mail      0           0    0.0
# 8645         9    1) $0 - $100    30.00     1       0      Rural       0         Phone      No E-Mail      0           0    0.0
# 56175       10    1) $0 - $100    30.00     1       0  Surburban       0         Phone    Mens E-Mail      0           0    0.0

R

bind_rows(df1, df2)

df1 <- df %>%
  filter(history > 3000)
df2 <- df %>%
  filter(history == 30)
df_new <- df1 %>%
  bind_rows(df2)

# Without tidyverse
# df_new <- rbind(df1, df2)

# df_new
#   recency history_segment history mens womens  zip_code newbie      channel       segment visit conversion spend
# 1       2     7) $1,000 + 3345.93    1      1 Surburban      1        Phone     No E-Mail     0          0     0
# 2       1     7) $1,000 + 3215.97    1      1     Urban      1 Multichannel   Mens E-Mail     1          0     0
# 3       1     7) $1,000 + 3003.48    1      1     Urban      1        Phone     No E-Mail     0          0     0
# 4       1     7) $1,000 + 3040.20    0      1     Urban      1          Web Womens E-Mail     0          0     0
# 5       9    1) $0 - $100   30.00    1      0     Rural      0        Phone     No E-Mail     0          0     0
# 6      10    1) $0 - $100   30.00    1      0 Surburban      0        Phone   Mens E-Mail     0          0     0
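
The Python output above keeps the original row labels, while the R output is renumbered from 1. If a fresh index is preferred in pandas, ignore_index=True is one option. A sketch with made-up frames (the values only mimic the article's output):

```python
import pandas as pd

# Toy frames with non-contiguous index labels.
df1 = pd.DataFrame({"history": [3345.93, 3215.97]}, index=[4579, 38680])
df2 = pd.DataFrame({"history": [30.0]}, index=[8645])

# Default: the original labels survive the concatenation.
kept = pd.concat([df1, df2], axis=0)

# ignore_index=True renumbers the rows 0, 1, 2.
renumbered = pd.concat([df1, df2], axis=0, ignore_index=True)

print(kept)
print(renumbered)
```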

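Merging on a column

Code that merges data frames on a key column.
df is split into df1 and df2, which are then inner-joined on index.
For how the index column is built, see the section of this article on adding the index as a column (インデックスを列に追加).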

Python

pd.merge(df1, df2, on = "hoge", how = "inner")

Besides how = "inner", options such as left, right, and outer are also available.

df1 = df.reset_index().loc[df.history > 3000, 
                           ["index", "recency", "history", "zip_code"]]
df2 = df.reset_index().loc[:, ["index", "channel", "spend"]]
df_new = pd.merge(df1, df2, on = "index", how = "inner")

# df_new
#    index  recency  history   zip_code       channel  spend
# 0   4579        2  3345.93  Surburban         Phone    0.0
# 1  38680        1  3215.97      Urban  Multichannel    0.0
# 2  43060        1  3003.48      Urban         Phone    0.0
# 3  52860        1  3040.20      Urban           Web    0.0

R

inner_join(df1, df2, by = "hoge")

Besides inner_join, there are also left_join, right_join, full_join, and others.

df1 <- df %>%
  rowid_to_column("index") %>%
  filter(history > 3000) %>%
  select(index, recency, history, zip_code)
df2 <- df %>%
  rowid_to_column("index") %>%
  select(index, channel, spend)
df_new <- df1 %>%
  inner_join(df2, by = "index")

# df_new
#   index recency history  zip_code      channel spend
# 1  4580       2 3345.93 Surburban        Phone     0
# 2 38681       1 3215.97     Urban Multichannel     0
# 3 43061       1 3003.48     Urban        Phone     0
# 4 52861       1 3040.20     Urban          Web     0
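
To make the difference between the how options concrete, here is a small sketch on made-up frames whose keys only partly overlap (df1, df2, and their values are illustrative assumptions):

```python
import pandas as pd

# Keys 2 and 3 appear in both frames; 1 only in df1, 4 only in df2.
df1 = pd.DataFrame({"index": [1, 2, 3], "recency": [2, 1, 1]})
df2 = pd.DataFrame({"index": [2, 3, 4], "spend": [0.0, 29.99, 0.0]})

inner = pd.merge(df1, df2, on="index", how="inner")  # keys present in both
left = pd.merge(df1, df2, on="index", how="left")    # every key of df1
outer = pd.merge(df1, df2, on="index", how="outer")  # every key of either

print(inner)
print(left)
print(outer)
```

Rows without a partner are filled with NaN in the left and outer results, which corresponds to the NA values left_join and full_join produce in R.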

Aggregating data

Code that splits the data with groupby and computes a statistic for every column.
Here df is grouped by channel and the mean of each column is computed.

Python

groupby("hoge")

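# numeric_only=True is needed on pandas 2.0+, where text columns would otherwise raise a TypeError
df_new = df.groupby("channel").mean(numeric_only=True)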

# df_new
#                recency     history      mens    womens    newbie     visit  conversion     spend
# channel
# Multichannel  4.768488  520.970370  0.632698  0.624195  0.595723  0.171734    0.012626  1.401777
# Phone         5.897541  202.807184  0.539203  0.538953  0.493309  0.127155    0.007744  0.908892
# Web           5.904632  204.375017  0.540313  0.539923  0.485417  0.159407    0.009321  1.095420

R

group_by(hoge)

df_new <- df %>%
  group_by(channel) %>%
  summarise_all(mean)  # funs() is deprecated; non-numeric columns give NA with a warning

# df_new
# channel        recency history_segment history  mens womens zip_code newbie segment visit conversion spend
# <fct>            <dbl>           <dbl>   <dbl> <dbl>  <dbl>    <dbl>  <dbl>   <dbl> <dbl>      <dbl> <dbl>
# 1 Multichannel    4.77              NA    521. 0.633  0.624       NA  0.596      NA 0.172    0.0126  1.40
# 2 Phone           5.90              NA    203. 0.539  0.539       NA  0.493      NA 0.127    0.00774 0.909
# 3 Web             5.90              NA    204. 0.540  0.540       NA  0.485      NA 0.159    0.00932 1.10
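
The two outputs above differ in how text columns are treated: the pandas result contains only numeric columns, while the R result shows NA for factors. A sketch of two explicit ways to restrict a grouped mean to numeric columns in pandas, on a made-up frame (df_demo is an illustrative assumption):

```python
import pandas as pd

# Toy frame with one text column besides the grouping key.
df_demo = pd.DataFrame({
    "channel": ["Phone", "Web", "Phone", "Web"],
    "zip_code": ["Urban", "Rural", "Urban", "Urban"],
    "spend": [0.0, 10.0, 4.0, 20.0],
})

# numeric_only=True skips zip_code instead of trying to average text.
means = df_demo.groupby("channel").mean(numeric_only=True)

# Selecting the columns up front is an equivalent, more explicit form.
means_explicit = df_demo.groupby("channel")[["spend"]].mean()

print(means)
```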

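Summary statistics by group

Code that groups the data, picks the columns of interest, and computes summary statistics.
The Python and R code correspond as shown below.
Here df is grouped by channel and zip_code, and the mean and maximum of recency, history, and spend are computed.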

Python

groupby(["hoge", "fuga"])
[["col1", "col2"]].agg(["mean", "max"])

df_new = df.groupby(["channel","zip_code"])[["recency","history","spend"]].agg(["mean","max"])

# df_new
#                         recency       history              spend
#                         mean     max  mean        max      mean      max
# channel      zip_code
# Multichannel Rural      4.773451  12  523.810743  2079.58  1.124389  328.36
#              Surburban  4.760173  12  521.458574  2809.79  1.357048  499.00
#              Urban      4.775813  12  519.422769  3215.97  1.549687  421.76
# Phone        Rural      5.827175  12  207.631497  2080.78  1.070033  499.00
#              Surburban  5.946969  12  200.612477  3345.93  0.809698  499.00
#              Urban      5.868908  12  203.447781  3003.48  0.958960  499.00
# Web          Rural      5.889335  12  202.806019  2039.13  1.336614  362.20
#              Surburban  5.944801  12  203.407815  2816.01  1.076664  499.00
#              Urban      5.864510  12  206.074893  3040.20  1.025505  499.00

R

group_by(hoge, fuga)
summarise_at(vars(col1, col2), list(mean = mean, max = max))

df_new <- df %>%
  group_by(channel, zip_code) %>%
  # summarise_each() and funs() are deprecated in current dplyr
  summarise_at(vars(recency, history, spend),
               list(mean = mean, max = max))

# df_new
#   channel      zip_code  recency_mean history_mean spend_mean recency_max history_max spend_max
#   <fct>        <fct>            <dbl>        <dbl>      <dbl>       <int>       <dbl>     <dbl>
# 1 Multichannel Rural             4.77         524.      1.12           12       2080.      328.
# 2 Multichannel Surburban         4.76         521.      1.36           12       2810.      499
# 3 Multichannel Urban             4.78         519.      1.55           12       3216.      422.
# 4 Phone        Rural             5.83         208.      1.07           12       2081.      499
# 5 Phone        Surburban         5.95         201.      0.810          12       3346.      499
# 6 Phone        Urban             5.87         203.      0.959          12       3003.      499
# 7 Web          Rural             5.89         203.      1.34           12       2039.      362.
# 8 Web          Surburban         5.94         203.      1.08           12       2816.      499
# 9 Web          Urban             5.86         206.      1.03           12       3040.      499
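
The pandas result above has two-level column names, whereas the R result uses flat names such as recency_mean. If flat names are preferred in pandas, the levels can be joined by hand. A sketch on a made-up frame (df_demo, stats, and flat are illustrative names):

```python
import pandas as pd

# Toy frame; values are illustrative only.
df_demo = pd.DataFrame({
    "channel": ["Phone", "Phone", "Web", "Web"],
    "recency": [2, 10, 6, 4],
    "spend": [0.0, 4.0, 10.0, 20.0],
})

# Same pattern as in the article: two statistics per selected column.
stats = df_demo.groupby("channel")[["recency", "spend"]].agg(["mean", "max"])

# Join the two column levels with "_" to get names such as recency_mean,
# then turn the group key back into an ordinary column.
flat = stats.copy()
flat.columns = ["_".join(pair) for pair in stats.columns]
flat = flat.reset_index()

print(flat)
```

The column order here is grouped by variable (recency_mean, recency_max, ...), which differs from the order in the R output.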

Sorting data

Code that sorts the data by a chosen column.
The rows are ordered so that history is ascending.

Python

sort_values(["hoge"])

df_new = df[df.history > 3000]
df_new = df_new.sort_values(["history"])

# df_new
#        recency history_segment  history  mens  womens   zip_code  newbie       channel        segment  visit  conversion  spend
# 43060        1     7) $1,000 +  3003.48     1       1      Urban       1         Phone      No E-Mail      0           0    0.0
# 52860        1     7) $1,000 +  3040.20     0       1      Urban       1           Web  Womens E-Mail      0           0    0.0
# 38680        1     7) $1,000 +  3215.97     1       1      Urban       1  Multichannel    Mens E-Mail      1           0    0.0
# 4579         2     7) $1,000 +  3345.93     1       1  Surburban       1         Phone      No E-Mail      0           0    0.0

R

arrange(hoge)

df_new <- df %>%
  filter(history > 3000) %>%
  arrange(history)

# df_new
#   recency history_segment history mens womens  zip_code newbie      channel       segment visit conversion spend
# 1       1     7) $1,000 + 3003.48    1      1     Urban      1        Phone     No E-Mail     0          0     0
# 2       1     7) $1,000 + 3040.20    0      1     Urban      1          Web Womens E-Mail     0          0     0
# 3       1     7) $1,000 + 3215.97    1      1     Urban      1 Multichannel   Mens E-Mail     1          0     0
# 4       2     7) $1,000 + 3345.93    1      1 Surburban      1        Phone     No E-Mail     0          0     0
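
For completeness, a sketch of descending and multi-key sorting in pandas, which would correspond roughly to arrange(desc(history)) and arrange(recency, desc(history)) in dplyr. The frame is made up for illustration:

```python
import pandas as pd

# Toy frame; values are illustrative only.
df_demo = pd.DataFrame({
    "recency": [1, 2, 1, 1],
    "history": [3003.48, 3345.93, 3215.97, 3040.20],
})

# Descending order on a single column.
desc = df_demo.sort_values("history", ascending=False)

# Multiple keys: recency ascending, then history descending.
multi = df_demo.sort_values(["recency", "history"], ascending=[True, False])

print(desc)
print(multi)
```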

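Adding the index as a column

Code that adds the original row number as the first column of the data.
The original row number is stored in a column and the index is then renumbered. Because pandas numbers rows from 0 and R from 1, the ids in the two outputs differ by one.

Python

insert(0, "id", df.index)

reset_index() also works, but insert is more convenient when you want to choose where the column goes.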

df_new = df.copy()
df_new.insert(0, "id", df_new.index)
df_new = df_new[df_new.history > 3000]
df_new = df_new.reset_index(drop=True)

# df_new
#       id  recency history_segment  history  mens  womens   zip_code  newbie       channel        segment  visit  conversion  spend
# 0   4579        2     7) $1,000 +  3345.93     1       1  Surburban       1         Phone      No E-Mail      0           0    0.0
# 1  38680        1     7) $1,000 +  3215.97     1       1      Urban       1  Multichannel    Mens E-Mail      1           0    0.0
# 2  43060        1     7) $1,000 +  3003.48     1       1      Urban       1         Phone      No E-Mail      0           0    0.0
# 3  52860        1     7) $1,000 +  3040.20     0       1      Urban       1           Web  Womens E-Mail      0           0    0.0

R

rowid_to_column("id")

df_new <- df %>%
  rowid_to_column("id") %>%
  filter(history > 3000)

# df_new
#      id recency history_segment history mens womens  zip_code newbie      channel       segment visit conversion spend
# 1  4580       2     7) $1,000 + 3345.93    1      1 Surburban      1        Phone     No E-Mail     0          0     0
# 2 38681       1     7) $1,000 + 3215.97    1      1     Urban      1 Multichannel   Mens E-Mail     1          0     0
# 3 43061       1     7) $1,000 + 3003.48    1      1     Urban      1        Phone     No E-Mail     0          0     0
# 4 52861       1     7) $1,000 + 3040.20    0      1     Urban      1          Web Womens E-Mail     0          0     0

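Converting to long format

Code that gathers multiple columns into a single column.
The data is reshaped to have three parts: an id, a variable name, and a value.
The Python and R code are written to give the same result.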

Python

stack()

df_vertical = df[df.history > 3000]
df_vertical = pd.DataFrame(df_vertical.stack())
df_vertical = df_vertical.rename(columns={0: "values"})
df_vertical.index.names = ["id", "keys"]

# df_vertical.head(24)
#                              values
# id    keys
# 4579  recency                     2
#       history_segment   7) $1,000 +
#       history               3345.93
#       mens                        1
#       womens                      1
#       zip_code            Surburban
#       newbie                      1
#       channel                 Phone
#       segment             No E-Mail
#       visit                       0
#       conversion                  0
#       spend                       0
# 38680 recency                     1
#       history_segment   7) $1,000 +
#       history               3215.97
#       mens                        1
#       womens                      1
#       zip_code                Urban
#       newbie                      1
#       channel          Multichannel
#       segment           Mens E-Mail
#       visit                       1
#       conversion                  0
#       spend                       0

R

gather()

df_vertical <- df %>%
  rowid_to_column("id") %>%
  filter(history > 3000) %>%
  gather(key = keys, value = values, -id) %>%
  arrange(id)

# df_vertical %>% head(24)
#       id            keys       values
# 1   4580         recency            2
# 2   4580 history_segment  7) $1,000 +
# 3   4580         history      3345.93
# 4   4580            mens            1
# 5   4580          womens            1
# 6   4580        zip_code    Surburban
# 7   4580          newbie            1
# 8   4580         channel        Phone
# 9   4580         segment    No E-Mail
# 10  4580           visit            0
# 11  4580      conversion            0
# 12  4580           spend            0
# 13 38681         recency            1
# 14 38681 history_segment  7) $1,000 +
# 15 38681         history      3215.97
# 16 38681            mens            1
# 17 38681          womens            1
# 18 38681        zip_code        Urban
# 19 38681          newbie            1
# 20 38681         channel Multichannel
# 21 38681         segment  Mens E-Mail
# 22 38681           visit            1
# 23 38681      conversion            0
# 24 38681           spend            0
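
stack() puts id and keys into the index rather than into columns. If ordinary columns like those in the gather() output are wanted, melt() is another option in pandas. A sketch on a made-up frame that already has an id column (df_demo and df_long are illustrative names):

```python
import pandas as pd

# Toy frame; the id values only mimic the article's output.
df_demo = pd.DataFrame({
    "id": [4579, 38680],
    "recency": [2, 1],
    "channel": ["Phone", "Multichannel"],
})

# id_vars plays the role of "-id" in gather(): every other column is stacked.
# A stable sort keeps the variables in their original order within each id.
df_long = (
    df_demo.melt(id_vars="id", var_name="keys", value_name="values")
    .sort_values("id", kind="mergesort")
    .reset_index(drop=True)
)

print(df_long)
```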

Converting long format back to wide

Code that turns df_vertical, which was made long with stack and gather above, back into wide format.

Python

unstack()

df_horizontal = df_vertical.unstack()

# df_horizontal
#        values
# keys  recency history_segment  history mens womens   zip_code newbie       channel        segment visit conversion spend
# id
# 4579        2     7) $1,000 +  3345.93    1      1  Surburban      1         Phone      No E-Mail     0          0     0
# 38680       1     7) $1,000 +  3215.97    1      1      Urban      1  Multichannel    Mens E-Mail     1          0     0
# 43060       1     7) $1,000 +  3003.48    1      1      Urban      1         Phone      No E-Mail     0          0     0
# 52860       1     7) $1,000 +   3040.2    0      1      Urban      1           Web  Womens E-Mail     0          0     0

However, the column header now has two levels, so flatten it as follows to get back to the original layout.

cols = [v2 for v1, v2 in df_horizontal.columns.values]
df_horizontal.columns = cols

# df_horizontal
#       recency history_segment  history mens womens   zip_code newbie       channel        segment visit conversion spend
# id
# 4579        2     7) $1,000 +  3345.93    1      1  Surburban      1         Phone      No E-Mail     0          0     0
# 38680       1     7) $1,000 +  3215.97    1      1      Urban      1  Multichannel    Mens E-Mail     1          0     0
# 43060       1     7) $1,000 +  3003.48    1      1      Urban      1         Phone      No E-Mail     0          0     0
# 52860       1     7) $1,000 +   3040.2    0      1      Urban      1           Web  Womens E-Mail     0          0     0

R

spread()

df_horizontal <- df_vertical %>%
  spread(key = keys, value = values)

# df_horizontal
#      id      channel conversion history history_segment mens newbie recency       segment spend visit womens  zip_code
# 1  4580        Phone          0 3345.93     7) $1,000 +    1      1       2     No E-Mail     0     0      1 Surburban
# 2 38681 Multichannel          0 3215.97     7) $1,000 +    1      1       1   Mens E-Mail     0     1      1     Urban
# 3 43061        Phone          0 3003.48     7) $1,000 +    1      1       1     No E-Mail     0     0      1     Urban
# 4 52861          Web          0  3040.2     7) $1,000 +    0      1       1 Womens E-Mail     0     0      1     Urban
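
When the long data has plain id/keys/values columns (as in the R version), pivot() is a pandas counterpart of spread(). A sketch on a made-up long frame (df_long and df_wide are illustrative names), including the type conversion that becomes necessary because all values shared one column:

```python
import pandas as pd

# Toy long-format frame; values are illustrative only.
df_long = pd.DataFrame({
    "id": [4579, 4579, 38680, 38680],
    "keys": ["recency", "channel", "recency", "channel"],
    "values": [2, "Phone", 1, "Multichannel"],
})

# pivot() spreads "keys" into columns; the new columns come back sorted by name.
df_wide = df_long.pivot(index="id", columns="keys", values="values")
df_wide.columns.name = None   # drop the leftover "keys" label
df_wide = df_wide.reset_index()

# The values were stored in one object column, so numeric columns
# have to be converted back explicitly.
df_wide["recency"] = pd.to_numeric(df_wide["recency"])

print(df_wide)
```

The alphabetical column order matches what the spread() output above shows on the R side.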

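Splitting a column

Code that splits one column into several using a regular expression.
The history_segment and segment columns are each split.
history_segment needs escaping because of the ")" character; the details are omitted here.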

Python

pd.DataFrame(df.hoge.str.split().tolist(), columns=["col1", "col2"])

df1 = pd.DataFrame(df.history_segment.str.split(r"\) ").tolist(),
                   columns=["hs1", "hs2"])
df2 = pd.DataFrame(df.segment.str.split().tolist(),
                   columns=["seg1", "seg2"])
df_new = pd.concat([df, df1, df2], axis=1)
df_new = df_new.drop(["history_segment", "segment"], axis=1)

# df.head()
#    recency history_segment  history  mens  womens   zip_code  newbie channel        segment  visit  conversion  spend
# 0       10  2) $100 - $200   142.44     1       0  Surburban       0   Phone  Womens E-Mail      0           0    0.0
# 1        6  3) $200 - $350   329.08     1       1      Rural       1     Web      No E-Mail      0           0    0.0
# 2        7  2) $100 - $200   180.65     0       1  Surburban       1     Web  Womens E-Mail      0           0    0.0
# 3        9  5) $500 - $750   675.83     1       0      Rural       1     Web    Mens E-Mail      0           0    0.0
# 4        2    1) $0 - $100    45.34     1       0      Urban       0     Web  Womens E-Mail      0           0    0.0

# df_new.head()
#    recency  history  mens  womens   zip_code  newbie channel  visit  conversion  spend hs1          hs2    seg1    seg2
# 0       10   142.44     1       0  Surburban       0   Phone      0           0    0.0   2  $100 - $200  Womens  E-Mail
# 1        6   329.08     1       1      Rural       1     Web      0           0    0.0   3  $200 - $350      No  E-Mail
# 2        7   180.65     0       1  Surburban       1     Web      0           0    0.0   2  $100 - $200  Womens  E-Mail
# 3        9   675.83     1       0      Rural       1     Web      0           0    0.0   5  $500 - $750    Mens  E-Mail
# 4        2    45.34     1       0      Urban       0     Web      0           0    0.0   1    $0 - $100  Womens  E-Mail
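As a supplementary sketch of the same split, pandas can also return the pieces directly as columns with `expand=True`, which avoids the `tolist()` round trip. The small `toy` frame below is made up for illustration and is not the real dataset. Note the raw string: `)` is a regex metacharacter, so it is escaped with a backslash.

```python
import pandas as pd

# Hypothetical two-row frame that mimics the columns split above.
toy = pd.DataFrame({
    "history": [142.44, 1009.44],
    "history_segment": ["2) $100 - $200", "7) $1,000 +"],
    "segment": ["Womens E-Mail", "No E-Mail"],
})

# A multi-character pattern is treated as a regular expression by default,
# so ")" is escaped; n=1 splits only at the first match.
hs = toy["history_segment"].str.split(r"\) ", n=1, expand=True)
hs.columns = ["hs1", "hs2"]

# With no pattern, str.split splits on whitespace.
seg = toy["segment"].str.split(n=1, expand=True)
seg.columns = ["seg1", "seg2"]

toy_new = pd.concat(
    [toy.drop(columns=["history_segment", "segment"]), hs, seg], axis=1
)
print(toy_new)
```

Because `expand=True` keeps the original index, the pieces line up with the source rows even when the frame does not have a default 0..n-1 index.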

R

separate(hoge, c("col1", "col2"), " ")

df_new <- df %>%
  separate(history_segment, c("hs1", "hs2"), "[)] ") %>%
  separate(segment, c("seg1", "seg2"), " ")

# df %>% head()
#   recency history_segment history mens womens  zip_code newbie channel       segment visit conversion spend
# 1      10  2) $100 - $200  142.44    1      0 Surburban      0   Phone Womens E-Mail     0          0     0
# 2       6  3) $200 - $350  329.08    1      1     Rural      1     Web     No E-Mail     0          0     0
# 3       7  2) $100 - $200  180.65    0      1 Surburban      1     Web Womens E-Mail     0          0     0
# 4       9  5) $500 - $750  675.83    1      0     Rural      1     Web   Mens E-Mail     0          0     0
# 5       2    1) $0 - $100   45.34    1      0     Urban      0     Web Womens E-Mail     0          0     0
# 6       6  2) $100 - $200  134.83    0      1 Surburban      0   Phone Womens E-Mail     1          0     0

# df_new %>% head()
#   recency hs1         hs2 history mens womens  zip_code newbie channel   seg1   seg2 visit conversion spend
# 1      10   2 $100 - $200  142.44    1      0 Surburban      0   Phone Womens E-Mail     0          0     0
# 2       6   3 $200 - $350  329.08    1      1     Rural      1     Web     No E-Mail     0          0     0
# 3       7   2 $100 - $200  180.65    0      1 Surburban      1     Web Womens E-Mail     0          0     0
# 4       9   5 $500 - $750  675.83    1      0     Rural      1     Web   Mens E-Mail     0          0     0
# 5       2   1   $0 - $100   45.34    1      0     Urban      0     Web Womens E-Mail     0          0     0
# 6       6   2 $100 - $200  134.83    0      1 Surburban      0   Phone Womens E-Mail     1          0     0

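Removing NA

Code that drops rows when the data frame contains NA.
The first half of the code deliberately creates NA values using the column split described above.
This works because the history_segment column contains 1,308 rows with the value 7) $1,000 +, which splits into only three parts and leaves the fourth column empty.
You can confirm that the row count goes from 64,000 to 62,692.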

Python

dropna()

df1 = pd.DataFrame(df.history_segment.str.split().tolist(), columns=["hs1","hs2","hs3","hs4"])
df_na = pd.concat([df, df1], axis=1)
df_na = df_na.drop("history_segment", axis=1)
df_new = df_na.dropna()

# df_na.iloc[67:72,]
#     recency  history  mens  womens   zip_code  newbie       channel        segment  visit  conversion  spend hs1     hs2 hs3   hs4
# 67       11    53.81     1       0      Urban       1         Phone      No E-Mail      0           0    0.0  1)      $0   -  $100
# 68        9   154.15     0       1  Surburban       0           Web    Mens E-Mail      0           0    0.0  2)    $100   -  $200
# 69       10  1009.44     1       0  Surburban       1  Multichannel  Womens E-Mail      0           0    0.0  7)  $1,000   +  None
# 70        2   278.80     1       0      Rural       0           Web    Mens E-Mail      0           0    0.0  3)    $200   -  $350
# 71       10    31.12     1       0      Urban       0         Phone  Womens E-Mail      0           0    0.0  1)      $0   -  $100
#
# print(len(df_na))
# 64000

# df_new.iloc[67:72,]
#     recency  history  mens  womens   zip_code  newbie       channel        segment  visit  conversion  spend hs1     hs2 hs3   hs4
# 67       11    53.81     1       0      Urban       1         Phone      No E-Mail      0           0    0.0  1)      $0   -  $100
# 68        9   154.15     0       1  Surburban       0           Web    Mens E-Mail      0           0    0.0  2)    $100   -  $200
# 70        2   278.80     1       0      Rural       0           Web    Mens E-Mail      0           0    0.0  3)    $200   -  $350
# 71       10    31.12     1       0      Urban       0         Phone  Womens E-Mail      0           0    0.0  4)    $350   -  $500
# 72        2   428.74     1       0      Rural       0        Phone     Mens E-Mail      0           0    0.0  4)    $350   -  $500
#
# print(len(df_new))
# 62692
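Before dropping rows, it can help to count how many would be lost. The following is a minimal sketch on a made-up three-row frame (not the real dataset): it counts missing values with `isna().sum()` and then restricts `dropna()` to a single column via `subset`, so NA values in other columns would not trigger a drop.

```python
import pandas as pd

# Hypothetical frame: the middle value splits into three parts, not four.
toy = pd.DataFrame(
    {"history_segment": ["1) $0 - $100", "7) $1,000 +", "3) $200 - $350"]}
)

# Splitting on whitespace pads the shorter row with a missing value.
parts = toy["history_segment"].str.split(expand=True)
parts.columns = ["hs1", "hs2", "hs3", "hs4"]

# Count the rows that would be dropped, then drop only on the hs4 column.
n_missing = int(parts["hs4"].isna().sum())
dropped = parts.dropna(subset=["hs4"])

print(n_missing, len(parts), len(dropped))
```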

R

drop_na()

df_na <- df %>%
  separate(history_segment, c("hs1","hs2","hs3","hs4"), " ", convert=T)
df_new <- df_na %>%
  drop_na()

# df_na[68:72,]
#    recency hs1    hs2 hs3  hs4 history mens womens  zip_code newbie      channel       segment visit conversion spend
# 68      11  1)     $0   - $100   53.81    1      0     Urban      1        Phone     No E-Mail     0          0     0
# 69       9  2)   $100   - $200  154.15    0      1 Surburban      0          Web   Mens E-Mail     0          0     0
# 70      10  7) $1,000   + <NA> 1009.44    1      0 Surburban      1 Multichannel Womens E-Mail     0          0     0
# 71       2  3)   $200   - $350  278.80    1      0     Rural      0          Web   Mens E-Mail     0          0     0
# 72      10  1)     $0   - $100   31.12    1      0     Urban      0        Phone Womens E-Mail     0          0     0
#
# nrow(df_na)
# 64000

# df_new[68:72,]
#    recency hs1  hs2 hs3  hs4 history mens womens  zip_code newbie channel       segment visit conversion spend
# 68      11  1)   $0   - $100   53.81    1      0     Urban      1   Phone     No E-Mail     0          0     0
# 69       9  2) $100   - $200  154.15    0      1 Surburban      0     Web   Mens E-Mail     0          0     0
# 71       2  3) $200   - $350  278.80    1      0     Rural      0     Web   Mens E-Mail     0          0     0
# 72      10  1)   $0   - $100   31.12    1      0     Urban      0   Phone Womens E-Mail     0          0     0
# 73       2  4) $350   - $500  428.74    1      0     Rural      0   Phone   Mens E-Mail     0          0     0
#
# nrow(df_new)
# 62692

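Replacing NA

Code that replaces NA values.
The steps up to creating the NA values are the same as in the previous section on removing NA.
That section drops the rows containing NA, but very similar code can fill them in instead.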

Python

fillna(value={"hoge": "fuga"})

df1 = pd.DataFrame(df.history_segment.str.split().tolist(), columns=["hs1","hs2","hs3","hs4"])
df_na = pd.concat([df,df1],axis=1)
df_na = df_na.drop("history_segment", axis=1)
df_new = df_na.fillna(value={"hs4":"$1000"})

# display(df_na.iloc[67:72,])
#     recency  history  mens  womens   zip_code  newbie       channel        segment  visit  conversion  spend hs1     hs2 hs3   hs4
# 67       11    53.81     1       0      Urban       1         Phone      No E-Mail      0           0    0.0  1)      $0   -  $100
# 68        9   154.15     0       1  Surburban       0           Web    Mens E-Mail      0           0    0.0  2)    $100   -  $200
# 69       10  1009.44     1       0  Surburban       1  Multichannel  Womens E-Mail      0           0    0.0  7)  $1,000   +  None
# 70        2   278.80     1       0      Rural       0           Web    Mens E-Mail      0           0    0.0  3)    $200   -  $350
# 71       10    31.12     1       0      Urban       0         Phone  Womens E-Mail      0           0    0.0  1)      $0   -  $100
#
# print(len(df_na))
# 64000

# display(df_new.iloc[67:72,])
#     recency  history  mens  womens   zip_code  newbie       channel        segment  visit  conversion  spend hs1     hs2 hs3    hs4
# 67       11    53.81     1       0      Urban       1         Phone      No E-Mail      0           0    0.0  1)      $0   -   $100
# 68        9   154.15     0       1  Surburban       0           Web    Mens E-Mail      0           0    0.0  2)    $100   -   $200
# 69       10  1009.44     1       0  Surburban       1  Multichannel  Womens E-Mail      0           0    0.0  7)  $1,000   +  $1000
# 70        2   278.80     1       0      Rural       0           Web    Mens E-Mail      0           0    0.0  3)    $200   -   $350
# 71       10    31.12     1       0      Urban       0         Phone  Womens E-Mail      0           0    0.0  1)      $0   -   $100
#
# print(len(df_new))
# 64000
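One detail worth seeing in isolation is that the dictionary form of `fillna` only touches the columns it names. The sketch below uses a made-up frame with missing values in two columns; only `hs4` is filled, `hs3` keeps its missing value, and the original frame is left unchanged because `fillna` returns a new object.

```python
import pandas as pd

# Hypothetical frame with one missing value in each column.
toy = pd.DataFrame({
    "hs3": ["-", None, "-"],
    "hs4": ["$100", None, "$350"],
})

# Only the column named in the dictionary is filled.
filled = toy.fillna(value={"hs4": "$1000"})

print(filled)
print(len(toy), len(filled))
```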

R

replace_na(list(hoge = "fuga"))

df_na <- df %>%
  separate(history_segment, c("hs1","hs2","hs3","hs4"), " ", convert=T)
df_new <- df_na %>%
  replace_na(list(hs4="$1000"))

# df_na[68:72,]
#    recency hs1    hs2 hs3  hs4 history mens womens  zip_code newbie      channel       segment visit conversion spend
# 68      11  1)     $0   - $100   53.81    1      0     Urban      1        Phone     No E-Mail     0          0     0
# 69       9  2)   $100   - $200  154.15    0      1 Surburban      0          Web   Mens E-Mail     0          0     0
# 70      10  7) $1,000   + <NA> 1009.44    1      0 Surburban      1 Multichannel Womens E-Mail     0          0     0
# 71       2  3)   $200   - $350  278.80    1      0     Rural      0          Web   Mens E-Mail     0          0     0
# 72      10  1)     $0   - $100   31.12    1      0     Urban      0        Phone Womens E-Mail     0          0     0
#
# nrow(df_na)
# 64000

# df_new[68:72,]
#    recency hs1    hs2 hs3   hs4 history mens womens  zip_code newbie      channel       segment visit conversion spend
# 68      11  1)     $0   -  $100   53.81    1      0     Urban      1        Phone     No E-Mail     0          0     0
# 69       9  2)   $100   -  $200  154.15    0      1 Surburban      0          Web   Mens E-Mail     0          0     0
# 70      10  7) $1,000   + $1000 1009.44    1      0 Surburban      1 Multichannel Womens E-Mail     0          0     0
# 71       2  3)   $200   -  $350  278.80    1      0     Rural      0          Web   Mens E-Mail     0          0     0
# 72      10  1)     $0   -  $100   31.12    1      0     Urban      0        Phone Womens E-Mail     0          0     0
#
# nrow(df_new)
# 64000

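Sampling

Code that samples rows.
It randomly extracts some rows from the data frame.
There are two main approaches:
・specify the number of rows to extract
・specify the fraction of rows to extract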

Python

sample(a)
sample(frac = a)

df1 = df.sample(64) # count
df2 = df.sample(frac=0.001) # fraction

# df1.head()
#        recency   history_segment  history  mens  womens   zip_code  newbie  channel        segment  visit  conversion  spend
# 57905        4  6) $750 - $1,000   799.47     0       1      Urban       1      Web  Womens E-Mail      0           0    0.0
# 51278        7      1) $0 - $100    83.27     1       0      Urban       0    Phone      No E-Mail      0           0    0.0
# 12856       10    4) $350 - $500   362.40     1       1      Urban       0      Web      No E-Mail      0           0    0.0
# 2306        11    2) $100 - $200   146.22     0       1  Surburban       1    Phone  Womens E-Mail      0           0    0.0
# 6540        10      1) $0 - $100    93.61     1       0      Urban       1    Phone  Womens E-Mail      0           0    0.0
#
# df1.shape
# (64, 12)

# df2.head()
#        recency history_segment  history  mens  womens   zip_code  newbie  channel        segment  visit  conversion  spend
# 27973        5    1) $0 - $100    37.82     0       1      Urban       1      Web    Mens E-Mail      0           0    0.0
# 31988       12  2) $100 - $200   129.14     0       1      Urban       0      Web  Womens E-Mail      0           0    0.0
# 12988        2  2) $100 - $200   138.71     0       1  Surburban       0      Web      No E-Mail      1           0    0.0
# 21122        4  3) $200 - $350   345.04     1       0  Surburban       0    Phone    Mens E-Mail      0           0    0.0
# 17095        2    1) $0 - $100    39.85     1       0      Rural       0    Phone  Womens E-Mail      0           0    0.0
#
# df2.shape
# (64, 12)
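Since sampling is random, the rows shown above change on every run. If reproducible results are needed, `sample` accepts a `random_state` seed. The sketch below uses a made-up 1,000-row frame; the seed value 0 is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical frame with 1,000 rows.
toy = pd.DataFrame({"id": range(1000)})

by_count = toy.sample(n=10, random_state=0)     # count
by_frac = toy.sample(frac=0.01, random_state=0)  # fraction: 1% of 1,000 rows

# The same seed reproduces the same rows.
again = toy.sample(n=10, random_state=0)

print(len(by_count), len(by_frac), by_count.equals(again))
```

By default sampling is without replacement, so no row appears twice in the result.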

R

sample_n(a)
sample_frac(a)

df1 <- df %>%
  sample_n(64) # count
df2 <- df %>%
  sample_frac(0.001) # fraction

# df1 %>% head()
#   recency history_segment history mens womens  zip_code newbie      channel       segment visit conversion spend
# 1      10    1) $0 - $100   50.93    1      0 Surburban      1        Phone   Mens E-Mail     0          0     0
# 2       4  2) $100 - $200  124.94    0      1     Rural      0          Web   Mens E-Mail     0          0     0
# 3      10  5) $500 - $750  686.45    1      1 Surburban      1 Multichannel   Mens E-Mail     0          0     0
# 4       3     7) $1,000 + 1829.32    1      1     Rural      1 Multichannel   Mens E-Mail     0          0     0
# 5       1  3) $200 - $350  255.71    1      0 Surburban      0 Multichannel Womens E-Mail     0          0     0
# 6       5    1) $0 - $100   29.99    1      0 Surburban      1          Web   Mens E-Mail     1          0     0
#
# dim(df1)
# 64 12

# df2 %>% head()
#   recency history_segment history mens womens  zip_code newbie      channel       segment visit conversion spend
# 1       6    1) $0 - $100   29.99    1      0 Surburban      1          Web     No E-Mail     0          0     0
# 2      12  2) $100 - $200  125.92    0      1     Urban      1          Web Womens E-Mail     0          0     0
# 3       8  2) $100 - $200  157.14    0      1     Rural      0        Phone   Mens E-Mail     0          0     0
# 4      11  3) $200 - $350  284.96    0      1 Surburban      0 Multichannel   Mens E-Mail     1          0     0
# 5       2  4) $350 - $500  412.44    1      0     Rural      0          Web     No E-Mail     0          0     0
# 6       7    1) $0 - $100   49.97    1      0     Urban      0          Web Womens E-Mail     0          0     0
#
# dim(df2)
# 64 12

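Splitting into training and test data

There are several ways to split data; this section covers two of them: train_test_split from sklearn and createDataPartition from caret.
Here the data is split with segment as the target variable. The training set is 20% of the whole, and the split is set up so that the proportions of segment match those of the original data.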

Python

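train_test_split takes a train_size argument that specifies the size of the training set. train_size accepts either a proportion or a count, so replacing 0.2 in the code below with an integer performs the same kind of split with that many training rows.
Replacing train_size with test_size applies the same control to the test set instead.
Passing y to stratify (as in stratification) keeps the class proportions of y the same in both sets.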

from sklearn.model_selection import train_test_split

X = df.drop("segment", axis=1)
y = df[["segment"]]
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.2, stratify = y)

Next, let's check the contents of the data frames. The proportions of segment in the training and test data match those of the original data.

# ==========Check the original data==========
df.head()
#    recency history_segment  history  mens  womens   zip_code  newbie channel        segment  visit  conversion  spend
# 0       10  2) $100 - $200   142.44     1       0  Surburban       0   Phone  Womens E-Mail      0           0    0.0
# 1        6  3) $200 - $350   329.08     1       1      Rural       1     Web      No E-Mail      0           0    0.0
# 2        7  2) $100 - $200   180.65     0       1  Surburban       1     Web  Womens E-Mail      0           0    0.0
# 3        9  5) $500 - $750   675.83     1       0      Rural       1     Web    Mens E-Mail      0           0    0.0
# 4        2    1) $0 - $100    45.34     1       0      Urban       0     Web  Womens E-Mail      0           0    0.0

df.shape
# (64000, 12)

df[["segment"]].value_counts()
# segment
# Womens E-Mail    21387
# Mens E-Mail      21307
# No E-Mail        21306
# dtype: int64

# ==========Check X==========
X_train.head()
#        recency history_segment  history  mens  womens   zip_code  newbie       channel  visit  conversion  spend
# 26855        6    1) $0 - $100    67.49     1       0  Surburban       0         Phone      0           0    0.0
# 14959        6  2) $100 - $200   152.28     1       0      Urban       0           Web      0           0    0.0
# 25884        2  5) $500 - $750   649.71     0       1      Rural       1  Multichannel      0           0    0.0
# 4999        10  4) $350 - $500   374.86     0       1  Surburban       0           Web      0           0    0.0
# 2814        11    1) $0 - $100    71.49     1       0      Urban       0           Web      0           0    0.0

print(X_train.shape, X_test.shape)
# (12800, 11) (51200, 11)

# ==========Check y==========
y_train.head()
#              segment
# 26855      No E-Mail
# 14959    Mens E-Mail
# 25884    Mens E-Mail
# 4999       No E-Mail
# 2814   Womens E-Mail

print(y_train.shape, y_test.shape)
# (12800, 1) (51200, 1)

y_train.value_counts()
# segment
# Womens E-Mail    4278
# No E-Mail        4261
# Mens E-Mail      4261
# dtype: int64

y_test.value_counts()
# segment
# Womens E-Mail    17109
# Mens E-Mail      17046
# No E-Mail        17045
# dtype: int64

R

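createDataPartition takes the argument p, which specifies the proportion of training data.
list defaults to TRUE; setting it to FALSE returns train_id as a matrix rather than a list. A matrix is wanted here, so the code below uses list = FALSE.
times is the number of partitions to create. With times = k, train_id is returned as a matrix with k columns, which is handy when you want to run several simulations.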

library(caret)
train_id <- df$segment %>%
  createDataPartition(p = 0.2, list = FALSE, times = 1)
X_train <- df[train_id, !(colnames(df) %in% "segment")]
X_test <- df[-train_id, !(colnames(df) %in% "segment")]
y_train <- df[train_id, "segment"]
y_test <- df[-train_id, "segment"]

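Next, as with Python, let's check the contents of the data frames. The proportions of segment in the training and test data match those of the original data. Note, however, that the results differ slightly from the Python case: the training set has 12,802 rows here rather than 12,800.
(The final table() calls can be replaced with fct_count() to get nearly the same result.)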

# ==========Check the original data==========
df %>%
  head()
#   recency history_segment history mens womens  zip_code newbie channel       segment visit conversion spend
# 1      10  2) $100 - $200  142.44    1      0 Surburban      0   Phone Womens E-Mail     0          0     0
# 2       6  3) $200 - $350  329.08    1      1     Rural      1     Web     No E-Mail     0          0     0
# 3       7  2) $100 - $200  180.65    0      1 Surburban      1     Web Womens E-Mail     0          0     0
# 4       9  5) $500 - $750  675.83    1      0     Rural      1     Web   Mens E-Mail     0          0     0
# 5       2    1) $0 - $100   45.34    1      0     Urban      0     Web Womens E-Mail     0          0     0
# 6       6  2) $100 - $200  134.83    0      1 Surburban      0   Phone Womens E-Mail     1          0     0

df %>%
  dim()
# [1] 64000    12

df %>%
  select(segment) %>%
  table()
# .
# Mens E-Mail     No E-Mail Womens E-Mail
# 21307         21306         21387

# ==========Check X==========
X_train %>%
  head()
#    recency  history_segment history mens womens  zip_code newbie      channel visit conversion spend
# 2        6   3) $200 - $350  329.08    1      1     Rural      1          Web     0          0     0
# 4        9   5) $500 - $750  675.83    1      0     Rural      1          Web     0          0     0
# 16       3     1) $0 - $100   58.13    1      0     Urban      1          Web     1          0     0
# 20       5 6) $750 - $1,000  828.42    1      0 Surburban      1 Multichannel     0          0     0
# 33       6   2) $100 - $200  128.01    0      1     Urban      0          Web     0          0     0
# 47       2   4) $350 - $500  391.33    1      0 Surburban      0          Web     0          0     0

X_train %>%
  dim()
# [1] 12802    11
X_test %>%
  dim()
# [1] 51198    11

# ==========Check y==========
y_train %>%
  head()
# [1] No E-Mail   Mens E-Mail No E-Mail   Mens E-Mail Mens E-Mail No E-Mail
# Levels: Mens E-Mail No E-Mail Womens E-Mail

y_train %>%
  length()
# [1] 12802
y_test %>%
  length()
# [1] 51198

y_train %>%
  table()
# .
# Mens E-Mail     No E-Mail Womens E-Mail
# 4262          4262          4278
y_test %>%
  table()
# .
# Mens E-Mail     No E-Mail Womens E-Mail
# 17045         17044         17109

Closing

That concludes this comparison of data frame operations using Python's pandas and R's dplyr. I plan to add more comparisons as they come up.
