Problem description
I'm trying to analyze a large data set of Airbnb listings. The amenities column lists the amenities that each listing has.
For example,
{"Wireless Internet","Air conditioning",Kitchen,Heating,"Fire extinguisher",Essentials,Shampoo,Hangers}
and
{TV,"Wireless Internet","Air conditioning",Kitchen,"Elevator in building",Heating,"Suitable for events","Smoke detector","Carbon monoxide detector","First aid kit",Essentials,Shampoo,"Lock on bedroom door",Hangers,"Hair dryer",Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}
I have two questions:
- I would like to split the string into different columns, e.g. there will be a column titled TV. If the string contains TV, the entry in the corresponding cell will be 1, and 0 otherwise. How can I do this?
- How can I remove the "translation missing: ....." variables?
Recommended answer
Here is an approach which also uses dcast() from the data.table package, as in this answer, but which additionally addresses the tedious but important details of data cleaning.
library(data.table)
# read data file, returning one column
raw <- fread("AirBnB.csv", header = FALSE, sep = "\n", col.names = "amenities")
# add column with row numbers
raw[, rn := seq_len(.N)]
# remove opening and closing curly braces
raw[, amenities := stringr::str_replace_all(amenities, "^\\{|\\}$", "")]
# split amenities, thereby reshaping from wide to long format
long <- raw[, strsplit(amenities, ",", fixed = TRUE), by = rn]
# remove double quotes and leading and trailing whitespace
long[, V1 := stringr::str_trim(stringr::str_replace_all(V1, '["]', ""))]
# reshape from long to wide format, omitting rows which contain "translation missing..."
dcast(long[!V1 %like% "^translation missing"], rn ~ V1, length, value.var = "rn", fill = 0)
#   rn Air conditioning Carbon monoxide detector Elevator in building Essentials
#1:  1                1                        0                    0          1
#2:  2                1                        1                    1          1
#   Fire extinguisher First aid kit Hair dryer Hangers Heating Iron Kitchen
#1:                 1             0          0       1       1    0       1
#2:                 0             1          1       1       1    1       1
#   Laptop friendly workspace Lock on bedroom door Shampoo Smoke detector
#1:                         0                    0       1              0
#2:                         1                    1       1              1
#   Suitable for events TV Wireless Internet
#1:                   0  0                 1
#2:                   1  1                 1
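If the amenities already sit in a column of a larger table rather than in a one-column file, the same steps might be sketched as follows. The listings table, its id column, and the shortened sample strings are hypothetical stand-ins and not part of the OP's data. Base gsub(), trimws(), and startsWith() are used in place of the stringr calls only to reduce dependencies.

```r
library(data.table)

# Hypothetical in-memory table standing in for the full listings data
listings <- data.table(
  id = c(101L, 102L),
  amenities = c(
    '{"Wireless Internet",Kitchen,Heating}',
    '{TV,"Wireless Internet","translation missing: en.hosting_amenity_49"}'
  )
)

# strip the curly braces and split on commas: one row per (id, amenity)
long <- listings[
  , .(amenity = unlist(strsplit(gsub("^\\{|\\}$", "", amenities), ",", fixed = TRUE))),
  by = id
]

# remove double quotes and surrounding whitespace
long[, amenity := trimws(gsub('"', "", amenity, fixed = TRUE))]

# drop the "translation missing" placeholders and any empty strings
long <- long[!startsWith(amenity, "translation missing") & amenity != ""]

# reshape to one 0/1 indicator column per amenity
wide <- dcast(long, id ~ amenity, fun.aggregate = length,
              value.var = "amenity", fill = 0L)

# attach the indicator columns to the original table
result <- merge(listings, wide, by = "id", all.x = TRUE)
```

A listing whose amenities are all filtered out would get NA in the indicator columns after the merge. Like the code above, this sketch splits on every comma, so it assumes that no amenity name contains a comma.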
Data file
The OP has provided only two data samples, which were copied into a data file named AirBnB.csv:
{"Wireless Internet","Air conditioning",Kitchen,Heating,"Fire extinguisher",Essentials,Shampoo,Hangers}
{TV,"Wireless Internet","Air conditioning",Kitchen,"Elevator in building",Heating,"Suitable for events","Smoke detector","Carbon monoxide detector","First aid kit",Essentials,Shampoo,"Lock on bedroom door",Hangers,"Hair dryer",Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}
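To reproduce the answer without creating the file by hand, the two samples can be written out from R first. This is only a convenience sketch: the variable name sample_lines is made up here, and the file is written to the current working directory.

```r
# The two sample strings from the question, one listing per line
sample_lines <- c(
  '{"Wireless Internet","Air conditioning",Kitchen,Heating,"Fire extinguisher",Essentials,Shampoo,Hangers}',
  '{TV,"Wireless Internet","Air conditioning",Kitchen,"Elevator in building",Heating,"Suitable for events","Smoke detector","Carbon monoxide detector","First aid kit",Essentials,Shampoo,"Lock on bedroom door",Hangers,"Hair dryer",Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}'
)

# write them to the file name that the fread() call expects
writeLines(sample_lines, "AirBnB.csv")

# read the file back to confirm that both lines were written unchanged
check <- readLines("AirBnB.csv")
```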