csv将财务数据导入Python

csv将财务数据导入Python

本文介绍了使用read_csv将财务数据导入Python Pandas的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个.csv具有以下结构:

I have a .csv with the following structure:

date_begin,date_end,name,name_code,active_accounts,transaction_amount,transaction_count
1/1/2008,1/31/2008,Name_1,1001,"123,456","$7,890,123.45","67,890"
2/1/2008,2/29/2008,Name_1,1001,"43,210","$987,654.32","109,876"
3/1/2008,3/31/2008,Name_1,1001,"485,079","$1,265,789,433.98","777,888"
...
12/1/2008,12/31/2008,Name_1,1001,"87,543","$432,098,987","87,987"
1/1/2008,1/31/2008,Name_2,1002,"268,456","$890,123.45","97,890"
2/1/2008,2/29/2008,Name_2,1002,"53,210","$987,654.32","109,876"
...
etc

我试图通过使用下面的代码将它们读入一个pandas数据框:

I am trying to read them into into a pandas dataframe by using the following code:

import pandas as pd

data = pd.read_csv('my_awesome_csv.csv'),parse_dates=[[0,1]],
                   infer_datetime_format=True)

精细,除了我想控制每列中的数据类型。当我在解释器中运行以下代码时,我发现引号中的数字不会被识别为数字,无论是美元还是其他。

This works just fine except that I would like to control the data types in each column. When I run the following code in the interpreter I discover that the numbers in quotes do not get recognized as numbers, either dollars or otherwise.

In [10]: data.dtypes
Out[10]:
date_begin_date_end       object
name                      object
name_code                  int64
active_accounts           object  # Problem, I want this to be a number
transaction_amount        object  # Ditto, I want this to be a number (it's a dollar amount)
transaction_count         object  # Still a number!
dtype: object

我已经在,但没有找到我在看的内容用于声明类型,这些类型是在csv中以逗号和美元符号的形式保存为字符串时的金额。

I have done some snooping around in the Pandas csv documentation but haven't found what I'm looking for about declaring types that are amounts when they are saved as strings with commas and dollar signs in the csv. My ultimate goal here is to be able to do some arithmetic operations on the values in these columns.

任何想法?

推荐答案

您可以使用在调用
read_csv 后解析这些列:

You could use vectorized string methods to parse those columns after the call toread_csv:

import pandas as pd
import decimal
D = decimal.Decimal

data = pd.read_csv('data', parse_dates=[[0,1]], infer_datetime_format=True)

for col in ('active_accounts', 'transaction_count'):
    data[col] = data[col].str.replace(r',', '').astype(int)

data['transaction_amount'] = (data['transaction_amount']
                              .str.replace(r'[^-+\d.]', '').astype(D))


print(data.dtypes)
# date_begin_date_end    object
# name                   object
# name_code               int64
# active_accounts         int64
# transaction_amount     object
# transaction_count       int64
# dtype: object

print(data)

产生

    date_begin_date_end    name  name_code  active_accounts  \
0    1/1/2008 1/31/2008  Name_1       1001           123456
1    2/1/2008 2/29/2008  Name_1       1001            43210
2    3/1/2008 3/31/2008  Name_1       1001           485079
3  12/1/2008 12/31/2008  Name_1       1001            87543
4    1/1/2008 1/31/2008  Name_2       1002           268456
5    2/1/2008 2/29/2008  Name_2       1002            53210

  transaction_amount  transaction_count
0         7890123.45              67890
1          987654.32             109876
2      1265789433.98             777888
3          432098987              87987
4          890123.45              97890
5          987654.32             109876

PS。 read_csv 确实有
可以提供一个函数来解析有问题的列。这些
函数对每个字符串调用一次。如果你有很多行,那可能需要很多Python函数
调用。使用向量化字符串方法处理列,如上所示
应该快得多。

PS. read_csv does have a convertersparameterwith which you could supply a function to parse the problematic columns. Thesefunctions are called once for each string. If you have a lot of rows, that could be require a lot of Python functioncalls. Handling the columns using vectorized string methods, as shown aboveshould be far quicker.

import pandas as pd
import re
import decimal
D = decimal.Decimal

def make_parser(cls):
    def parse_commas(text):
        return cls(re.sub(r'[^-+\d.]', '', text))
    return parse_commas

to_int = make_parser(int)
to_decimal = make_parser(D)

data = pd.read_csv('data', parse_dates=[[0,1]], infer_datetime_format=True
                   , converters={4: to_int, 5: to_decimal, 6: to_int})

print(data)

>

yields

    date_begin_date_end    name  name_code  active_accounts  \
0    1/1/2008 1/31/2008  Name_1       1001           123456
1    2/1/2008 2/29/2008  Name_1       1001            43210
2    3/1/2008 3/31/2008  Name_1       1001           485079
3  12/1/2008 12/31/2008  Name_1       1001            87543
4    1/1/2008 1/31/2008  Name_2       1002           268456
5    2/1/2008 2/29/2008  Name_2       1002            53210

  transaction_amount  transaction_count
0         7890123.45              67890
1          987654.32             109876
2      1265789433.98             777888
3          432098987              87987
4          890123.45              97890
5          987654.32             109876

,并且 transaction_amount 列中的值为dec​​imal.Decimals:

and the values in the transaction_amount column are decimal.Decimals:

In [64]: data.loc[0, 'transaction_amount']
Out[64]: Decimal('7890123.45')

这篇关于使用read_csv将财务数据导入Python Pandas的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

07-30 10:34