How to convert a Cassandra map to a Pandas DataFrame


Question

I want to read data from a Cassandra column family that has a column of type map<string, int> and convert it to a Pandas DataFrame. I then want to use that DataFrame to train a model in Python, as is done in the iris species classification example.

If I had used a CSV file to train the model, it would have looked like this:

label,  f1, f2, f3, f4, f5
  0  ,  11 , 1, 6 , 1,  2
  1  ,  5,   5, 1 , 2,  6
  0  ,  12,  9, 3 , 6,  8
  0  ,  9,  3,  8,  1,  0

Cassandra column family:

                  FeatureSet                    |   label

{'f1': 11, 'f2': 1, 'f3': 6, 'f4': 1, 'f5': 2}  |     0
{'f1': 5, 'f2':  5, 'f3': 1, 'f4': 2, 'f5': 6}  |     1
{'f1': 12, 'f2': 9, 'f3': 3, 'f4': 6, 'f5': 8}  |     0
{'f1': 9, 'f2': 3, 'f3': 8, 'f4': 1, 'f5': 0}   |     0

Code:

import pandas as pd
from sklearn2pmml import PMMLPipeline
from sklearn.tree import DecisionTreeClassifier
from cassandra.cluster import Cluster

CASSANDRA_HOST = ['172.16.X.Y','172.16.X1.Y1']
CASSANDRA_PORT = 9042
CASSANDRA_DB = "KEYSPACE"
CASSANDRA_TABLE = "COLUMNFAMILY"

cluster = Cluster(contact_points=CASSANDRA_HOST, port=CASSANDRA_PORT)
session = cluster.connect(CASSANDRA_DB)

sql_query = "SELECT * FROM {}.{};".format(CASSANDRA_DB, CASSANDRA_TABLE)

df = pd.DataFrame()

for row in session.execute(sql_query):
    # What should I write here to get X_train and Y_train as pandas DataFrames?
    pass



iris_pipeline = PMMLPipeline([
    ("classifier", DecisionTreeClassifier())
])
iris_pipeline.fit(X_train, Y_train)

Recommended answer

I posted a working solution here for the same question. It reads an OrderedMapSerializedKey Cassandra map field into your DataFrame as a dict.

In the previous solution I replaced only the first (0th) row of the Cassandra dataset. Here rows is a list of tuples, where every tuple is one row in Cassandra, and the loop converts the map values in every row:

import pandas as pd
from cassandra.util import OrderedMapSerializedKey

def pandas_factory(colnames, rows):
    # Convert each row tuple into a list (tuple elements cannot be replaced in place)
    rows = [list(row) for row in rows]

    # Convert only the OrderedMapSerializedKey values into plain dicts
    for row in rows:
        for idx, value in enumerate(row):
            if isinstance(value, OrderedMapSerializedKey):
                row[idx] = dict(value)

    return pd.DataFrame(rows, columns=colnames)
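
The factory above leaves each map as a single dict-valued column, so one more step is needed to get the X_train and Y_train asked for in the question. The sketch below is one possible way to do that step, not part of the original answer. It uses a hand-built DataFrame as a stand-in for the query result so that it runs without a cluster. The helper name split_features_and_label and the column names "featureset" and "label" are assumptions made for illustration; adjust them to whatever your table actually returns.

```python
import pandas as pd


def split_features_and_label(df, map_col="featureset", label_col="label"):
    """Expand a column of dicts into one column per key and return (X, y).

    Hypothetical helper; the default column names are assumptions.
    """
    # Each dict becomes one row and its keys become the column names.
    features = pd.DataFrame(df[map_col].tolist(), index=df.index)
    # Sort the columns by name so the feature order does not depend on dict ordering.
    features = features.reindex(sorted(features.columns), axis=1)
    return features, df[label_col]


# Stand-in for the DataFrame that pandas_factory would build from the query.
# With a live session this would typically be obtained along these lines
# (assumption, not shown in the original answer):
#   session.row_factory = pandas_factory
#   session.default_fetch_size = None  # fetch all rows in one page
#   df = session.execute(sql_query)._current_rows
df = pd.DataFrame({
    "featureset": [
        {"f1": 11, "f2": 1, "f3": 6, "f4": 1, "f5": 2},
        {"f1": 5, "f2": 5, "f3": 1, "f4": 2, "f5": 6},
        {"f1": 12, "f2": 9, "f3": 3, "f4": 6, "f5": 8},
        {"f1": 9, "f2": 3, "f3": 8, "f4": 1, "f5": 0},
    ],
    "label": [0, 1, 0, 0],
})

X_train, Y_train = split_features_and_label(df)
print(X_train)
print(Y_train)
```

X_train and Y_train can then be passed to iris_pipeline.fit(X_train, Y_train) as in the question. If some rows lack a key, the corresponding cells come out as NaN and would need to be filled or dropped before training.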
