Is there a special pyarrow data type I should use for columns which have lists of dictionaries when I save to a parquet file?
If I save lists or lists of dictionaries as a string, I normally have to .apply(eval) to the field when I read it back into memory so that pandas recognizes the data as a list (and I can then normalize it with pd.json_normalize):
column_a:
[
{"id": "something", "value": "else"},
{"id": "something2", "value": "else2"},
]
column_b:
["test", "test2", "test3"]
Just wondering if I should save this data as something else besides a string.
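For example, the round trip described above looks roughly like this (a minimal sketch; the file name is just a placeholder):

import pandas as pd

# The list-of-dicts column was written to parquet as a string ...
df = pd.read_parquet('tickets.parquet')

# ... so it has to be re-hydrated into Python objects first ...
df['column_a'] = df['column_a'].apply(eval)

# ... before pd.json_normalize can flatten it.
flat = pd.json_normalize(df['column_a'].explode().tolist())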
Edit - pasting a snippet of some raw JSON from Zendesk. The audits field has a field called events, which is a list of dictionaries. Inside that, there can be other lists of dictionaries as well (attachments, which in turn contains a list of dictionaries called thumbnails).
Are you able to use pa.map_ to handle situations like this? I sometimes need to retrieve data from these nested fields, which I do not even know exist initially. In my current parquet dataset, the events field is just a single column (string type) even though there are many nested fields within it.
udt = pa.map_(pa.string(), pa.string())
.
"audit": {
"id": ,
"ticket_id": ,
"created_at": "",
"author_id": ,
"events": [
{
"id": ,
"type": "",
"author_id": ,
"body": "" ,
"plain_body": "",
"public": false,
"attachments": [
{
"url": "",
"id": ,
"file_name": "",
"content_url": "",
"content_type": "image/png",
"size": 2888,
"width": 100,
"height": 30,
"inline": false,
"deleted": false,
"thumbnails": [
{
"url": "",
"id": ,
"file_name": "",
"content_url": "",
"mapped_content_url": "",
"content_type": "image/png",
"size": 2075,
"width": 80,
"height": 24,
"inline": false,
"deleted": false
}
]
},
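For reference, this is roughly how that nesting could be spelled out with nested pyarrow types instead of a single map (only a partial, hypothetical sketch using a few of the fields above; the int/bool types are guesses):

import pyarrow as pa

# Hypothetical, partial schema for the snippet: each level of nesting
# becomes a list_ of struct fields (only a few fields are shown).
thumbnail = pa.struct([
    pa.field('url', pa.string()),
    pa.field('content_type', pa.string()),
    pa.field('size', pa.int64()),
])
attachment = pa.struct([
    pa.field('file_name', pa.string()),
    pa.field('content_type', pa.string()),
    pa.field('thumbnails', pa.list_(thumbnail)),
])
event = pa.struct([
    pa.field('type', pa.string()),
    pa.field('public', pa.bool_()),
    pa.field('attachments', pa.list_(attachment)),
])
events_type = pa.list_(event)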
Assuming you have a df with "dictionary" and string columns, and the dictionaries all have the same keys (id, value in your case):
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({
    'col1': pd.Series([
        {"id": "something", "value": "else"},
        {"id": "something2", "value": "else2"}
    ]),
    'col2': pd.Series(['foo', 'bar'])
})

# Describe each dictionary as a struct with two string fields.
udt = pa.struct([pa.field('id', pa.string()), pa.field('value', pa.string())])
schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])

table = pa.Table.from_pandas(df, schema)
df = table.to_pandas()
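You can then write the table to parquet and read it back; the struct column should come back as a column of dicts in pandas (a minimal sketch; the file name is arbitrary):

import pyarrow.parquet as pq

# Round-trip through parquet: the struct type is preserved in the file.
pq.write_table(table, 'example.parquet')
df_back = pq.read_table('example.parquet').to_pandas()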
If your dictionaries don't have the same keys or you don't know the keys of the dictionaries in advance, you can do this:
df = pd.DataFrame({
    'col1': pd.Series([
        [('id', 'something'), ('value', 'else')],
        [('id', 'something2'), ('value', 'else2')],
    ]),
    'col2': pd.Series(['foo', 'bar'])
})

# A map<string, string> allows arbitrary keys per row.
udt = pa.map_(pa.string(), pa.string())
schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])
table = pa.Table.from_pandas(df, schema)
Note that the format for col1 is different (it uses a list of key/value pairs instead of a dict). Also, you can't convert your table back to pandas, as that is not supported (yet):
table.to_pandas()
>>> ArrowNotImplementedError: No known equivalent Pandas block for Arrow data of type map<string, string> is known.
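You can still write the map-typed table to parquet and pull the values back out as plain Python pairs if you need them (a rough sketch, assuming the standard pyarrow APIs):

import pyarrow.parquet as pq

# Writing to parquet works even though to_pandas() does not.
pq.write_table(table, 'example_map.parquet')

# Each map value comes back as a list of (key, value) tuples.
pairs = table.column('col1').to_pylist()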