Problem description
I have a data table with several social media users and their followers. The original data table has the following format:
X.USERID FOLLOWERS
1081 4053807021,2476584389,4713715543, ...
So each row contains a user ID together with a vector of that user's followers (separated by commas). In total I have 24,000 unique user IDs and 160,000,000 unique followers. I wish to convert my original table into the following format:
   X.USERID          FOLLOWERS
1:     1081         4053807021
2:     1081         2476584389
3:     1081         4713715543
4:     1081          580410695
5:     1081         4827723557
6:     1081 704326016165142528
To get this data table I used the following line of code (assume that my original data table is called dt):
uf <- dt[,list(FOLLOWERS = unlist(strsplit(x = FOLLOWERS, split= ','))), by = X.USERID]
However, when I run this code on the entire dataset I get the following error:
negative length vectors are not allowed
According to this post on Stack Overflow (Negative number of rows in data.table after incorrect use of set), it seems that I am bumping into the memory limits of a column in data.table. As a workaround, I ran the code in smaller blocks (per 10,000) and this seemed to work.
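The block-wise workaround described above can be sketched roughly as follows. This is only an illustration on a toy table: the values in `dt` and the names `chunk_size`, `blocks`, and `pieces` are made up for the example, and the block size of 2 stands in for the 10,000 mentioned in the question.

```r
library(data.table)

# Toy stand-in for the real table (hypothetical values)
dt <- data.table(
  X.USERID  = c(1081, 1082, 1083),
  FOLLOWERS = c("4053807021,2476584389,4713715543",
                "580410695,4827723557",
                "704326016165142528")
)

chunk_size <- 2L  # hypothetical; the question used blocks of 10,000

# Group the row indices of dt into consecutive blocks of chunk_size rows
row_ids <- seq_len(nrow(dt))
blocks  <- split(row_ids, ceiling(row_ids / chunk_size))

# Apply the same split-and-unlist step from the question to each block
pieces <- lapply(blocks, function(idx) {
  dt[idx,
     list(FOLLOWERS = unlist(strsplit(FOLLOWERS, split = ",", fixed = TRUE))),
     by = X.USERID]
})
```

Each element of `pieces` is a long-format data.table for one block. Binding all of them back together with `rbindlist(pieces)` would presumably run into the same limit once the combined result is too large, so on the full data it may be safer to process each piece separately or write it to disk, for example with `fwrite(piece, file, append = TRUE)`.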
My question is: if I change my code, can I prevent this error from occurring, or am I bumping into the limits of R?
PS. I have a machine with 140 GB of RAM at my disposal, so physical memory should not be the issue.
> memory.limit()
[1] 147446
Recommended answer
This problem occurs when the number of rows in your dataset exceeds the limit of 2^31-1 (roughly 2.1 billion), the largest value R's 32-bit signed integers can hold. One way to deal with this problem is to read your dataset in chunks (within a loop). It looks like your file is sorted by the X.USERID field, so your chunks (when you read the file) should overlap enough to ensure that each user belongs to at least one chunk that contains all of that user's followers. The way you process these chunks would very much depend on what you need to do with your data.
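If the diagnosis above is right, one way to check it before expanding anything is to count how many rows the long table would have and compare that with R's largest integer, `.Machine$integer.max` (2^31-1). The sketch below uses base R only and a made-up `followers` vector in place of `dt$FOLLOWERS`. It assumes every entry lists at least one follower, so the count per user is the number of commas plus one.

```r
# Hypothetical stand-in for dt$FOLLOWERS
followers <- c("4053807021,2476584389,4713715543",
               "580410695,4827723557",
               "704326016165142528")

# Followers per user = number of commas + 1
# (assumes no entry is empty)
n_commas   <- nchar(followers) - nchar(gsub(",", "", followers, fixed = TRUE))
n_per_user <- n_commas + 1

# Sum as doubles so the total itself cannot overflow the integer range
total_rows <- sum(as.numeric(n_per_user))

# TRUE would mean the long table cannot be indexed with 32-bit integers
too_big <- total_rows > .Machine$integer.max
```

On the real data, a `total_rows` above `.Machine$integer.max` would suggest that the long table has to stay split into chunks rather than be held as one table.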