我是TensorFlow的新手,正在尝试将Estimator API用于一些简单的分类实验。我在libsvm format中有一个稀疏的数据集。以下输入函数适用于小型数据集:
def libsvm_input_function(file):
def input_function():
indexes_raw = []
indicators_raw = []
values_raw = []
labels_raw = []
i=0
for line in open(file, "r"):
data = line.split(" ")
label = int(data[0])
for fea in data[1:]:
id, value = fea.split(":")
indexes_raw.append([i,int(id)])
indicators_raw.append(int(1))
values_raw.append(float(value))
labels_raw.append(label)
i=i+1
indexes = tf.SparseTensor(indices=indexes_raw,
values=indicators_raw,
dense_shape=[i, num_features])
values = tf.SparseTensor(indices=indexes_raw,
values=values_raw,
dense_shape=[i, num_features])
labels = tf.constant(labels_raw, dtype=tf.int32)
return {"indexes": indexes, "values": values}, labels
return input_function
但是,对于几个GB大小的数据集,我得到以下错误:
如何避免此错误?如何编写输入函数以读取中型稀疏数据集(libsvm格式)?
最佳答案
使用估算器时,对于libsvm数据输入,可以创建密集的index
列表,密集的value
列表,然后使用feature_column.categorical_column_with_identity
和feature_column.weighted_categorical_column
创建要素列,最后,将要素列放入估算器中。也许您输入要素的长度是可变的,您可以使用padded_batch来处理它。
这里是一些代码:
## here is input_fn
def input_fn(data_dir, is_training, batch_size):
def parse_csv(value):
## here some process to create feature_indices list, feature_values list and labels
return {"index": feature_indices, "value": feature_values}, labels
dataset = tf.data.Dataset.from_tensor_slices(your_filenames)
ds = dataset.flat_map(
lambda f: tf.data.TextLineDataset(f).map(parse_csv)
)
ds = ds.padded_batch(batch_size, ds.output_shapes, padding_values=(
{
"index": tf.constant(-1, dtype=tf.int32),
"value": tf.constant(0, dtype=tf.float32),
},
tf.constant(False, dtype=tf.bool)
))
return ds.repeat().prefetch(batch_size)
## create feature column
def build_model_columns():
categorical_column = tf.feature_column.categorical_column_with_identity(
key='index', num_buckets=your_feature_dim)
sparse_columns = tf.feature_column.weighted_categorical_column(
categorical_column=categorical_column, weight_feature_key='value')
dense_columns = tf.feature_column.embedding_column(sparse_columns, your_embedding_dim)
return [sparse_columns], [dense_columns]
## when created feature column, you can put them into estimator, eg. put dense_columns into DNN, and sparse_columns into linear model.
## for export savedmodel
def raw_serving_input_fn():
feature_spec = {"index": tf.placeholder(shape=[None, None], dtype=tf.int32),
"value": tf.placeholder(shape=[None, None], dtype=tf.float32)}
return tf.estimator.export.build_raw_serving_input_receiver_fn(feature_spec)
换句话说,您可以创建自定义功能列,如下所示:_SparseArrayCategoricalColumn
关于TensorFlow输入函数,用于读取稀疏数据(libsvm格式),我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/47158700/