当我不得不使用Python从PowerPivot模型中读取一些数据时,这项看似微不足道的任务变成了真正的噩梦。我相信在过去的几天里我已经很好地研究了这一点,但是现在我遇到了困难,希望能得到python/ssas/ado社区的帮助。
基本上,我只想通过编程方式访问存储在PowerPivot模型中的原始数据-我的想法是通过下面列出的方法之一连接到基础PowerPivot(即MS Analysis Services)引擎,列出模型中包含的表,然后从每个表中提取原始数据。使用简单的DAX查询(类似于EVALUATE (table_name)
)。别紧张,对吧?嗯,也许不是。
0。一些背景信息
如你所见,我尝试过几种不同的方法。我将尽可能仔细地记录所有内容,以便那些不熟悉PowerPivot功能的人能够很好地了解我想要做什么。
首先,对AnalysisServices引擎的编程访问有一些背景(上面说是2005年的SQL Server,但所有这些都应该仍然适用):SQL Server Data Mining Programmability和Data providers used for Analysis Services connections。
我将在下面的示例中使用的示例Excel/PowerPivot文件可以在以下位置找到:Microsoft PowerPivot for Excel 2010 and PowerPivot in Excel 2013 Samples。
另外,请注意我使用的是Excel2010,所以我的一些代码是特定于版本的。例如,如果使用Excel 2013,则wb.Connections["PowerPivot Data"].OLEDBConnection.ADOConnection
应为wb.Model.DataModelConnection.ModelConnection.ADOConnection
。
我将在整个问题中使用的连接字符串是基于此处找到的信息:Connect to PowerPivot engine with C#。另外,一些方法显然需要在数据检索之前对PowerPivot模型进行某种初始化。请参见此处:Automating PowerPivot Refresh operation from VBA。
最后,这里有几个链接表明这应该是可以实现的(但是请注意,这些链接主要是指c,而不是python):
Made connection to PowerPivot DataModel, how can I fill a dataset with it?
Connecting to PowerPivot with C#
2013 C# connection to PowerPivot DataModel
Connecting Tableau and PowerPivot. It just works.(显示外部应用程序实际上可以读取PowerPivot模型数据-请注意,Tableau加载项安装了Interop.ADODB.dll
程序集,我猜这是它用来访问PowerPivot数据的原因)
1。使用ADOMD
import clr
clr.AddReference("Microsoft.AnalysisServices.AdomdClient")
import Microsoft.AnalysisServices.AdomdClient as ADOMD
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"
Connection = ADOMD.AdomdConnection(ConnString)
Connection.Open()
在这里,问题似乎是PowerPivot模型尚未初始化:
AdomdConnectionException: A connection cannot be made. Ensure that the server is running.
2。使用AMO
import clr
clr.AddReference("Microsoft.AnalysisServices")
import Microsoft.AnalysisServices as AMO
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"
Connection = AMO.Server()
Connection.Connect(ConnString)
同样的情况,“服务器没有运行”:
ConnectionException: A connection cannot be made. Ensure that the server is running.
请注意,从技术上讲,AMO并不用于查询数据,但我将它作为连接到PowerPivot模型的一种潜在方法包括在内。
三。使用ADO.NET
import clr
clr.AddReference("System.Data")
import System.Data.OleDb as ADONET
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"
Connection = ADONET.OleDbConnection()
Connection.ConnectionString = ConnString
Connection.Open()
这类似于What's the simplest way to access mssql with python or ironpython?。不幸的是,这也不起作用:
OleDbException: OLE DB error: OLE DB or ODBC error: The following system error occurred:
The requested name is valid, but no data of the requested type was found.
4。通过ADODAPI模块使用ADO
import adodbapi
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"
Connection = adodbapi.connect(ConnString)
类似于Opposite Workings of OLEDB/ODBC between Python and MS Access VBA。我得到的错误是:
OperationalError: (com_error(-2147352567, 'Exception occurred.', (0, u'Microsoft OLE DB
Provider for SQL Server 2012 Analysis Services.', u'OLE DB error: OLE DB or ODBC error: The
following system error occurred: The requested name is valid, but no data of the requested
type was found...
这基本上与上面的ADO.NET相同。
5。通过Excel/win32com模块使用ADO
from win32com.client import Dispatch
Xlfile = "H:\\PowerPivotTutorialSample.xlsx"
XlApp = Dispatch("Excel.Application")
Workbook = XlApp.Workbooks.Open(Xlfile)
Workbook.Connections["PowerPivot Data"].Refresh()
Connection = Workbook.Connections["PowerPivot Data"].OLEDBConnection.ADOConnection
Recordset = Dispatch('ADODB.Recordset')
Query = "EVALUATE(dbo_DimDate)" #sample DAX query
Recordset.Open(Query, Connection)
这种方法的想法来自于使用vba:Export a table or DAX query from Power Pivot to CSV using VBA的博客文章。注意,这种方法使用一个显式的刷新命令来初始化模型(即“服务器”)。以下是错误消息:
com_error: (-2147352567, 'Exception occurred.', (0, u'ADODB.Recordset', u'Arguments are of
the wrong type, are out of acceptable range, or are in conflict with one another.',
u'C:\\Windows\\HELP\\ADO270.CHM', 1240641, -2146825287), None)
但是,似乎已经建立了ADO连接:
type(Connection)
返回instance
print(Connection)
返回Provider=MSOLAP.5;Persist Security Info=True;Initial Catalog=Microsoft_SQLServer_AnalysisServices;Data Source=$Embedded$;MDX Compatibility=1;Safety Options=2;ConnectTo=11.0;MDX Missing Member Mode=Error;Subqueries=2;Optimize Response=3;Cell Error Mode=TextValue
问题似乎在于创建adodb.recordset对象。
6。通过excel/win32com使用ado,直接使用adodb.connection
from win32com.client import Dispatch
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"
Connection = Dispatch('ADODB.Connection')
Connection.Open(ConnString)
类似于Connection to Access from Python [duplicate]和Query access using ADO in Win32 platform (Python recipe)。不幸的是,python发出的错误与上面两个示例中的相同:
com_error: (-2147352567, 'Exception occurred.', (0, u'Microsoft OLE DB Provider for SQL
Server 2012 Analysis Services.', u'OLE DB error: OLE DB or ODBC error: The following system
error occurred: The requested name is valid, but no data of the requested type was found.
..', None, 0, -2147467259), None)
7。通过Excel/win32com使用ADO,直接使用ADODB.Connection Plus模型刷新
from win32com.client import Dispatch
Xlfile = "H:\\PowerPivotTutorialSample.xlsx"
XlApp = Dispatch("Excel.Application")
Workbook = XlApp.Workbooks.Open(Xlfile)
Workbook.Connections["PowerPivot Data"].Refresh()
ConnStringInternal = "Provider=MSOLAP.5;Persist Security Info=True;Initial Catalog=
Microsoft_SQLServer_AnalysisServices;Data Source=$Embedded$;MDX
Compatibility=1;Safety Options=2;ConnectTo=11.0;MDX Missing Member
Mode=Error;Optimize Response=3;Cell Error Mode=TextValue"
Connection = Dispatch('ADODB.Connection')
Connection.Open(ConnStringInternal)
我希望可以初始化Excel实例,然后初始化PowerPivot模型,然后使用Excel用于嵌入PowerPivot数据的内部连接字符串创建连接(类似于How do you copy the powerpivot data into the excel workbook as a table?—请注意,连接字符串与我在其他地方使用的连接字符串不同)。不幸的是,这不起作用,我猜是python在一个单独的实例中启动adodb.connection进程(当我在不首先初始化Excel的情况下执行最后三行时,会收到相同的错误消息等):
com_error: (-2147352567, 'Exception occurred.', (0, u'Microsoft OLE DB Provider for SQL
Server 2012 Analysis Services.', u'Either the user, ****** (masked), does not have access
to the Microsoft_SQLServer_AnalysisServices database, or the database does not exist.',
None, 0, -2147467259), None)
最佳答案
瞧,我终于设法解决了这个问题——事实证明,使用Python访问PowerPivot数据确实是可能的!下面是我所做的简要回顾-您可以在这里找到更详细的描述:Analysis Services (SSAS) on a shoestring。注意:代码在效率和优雅方面都没有优化。
安装Microsoft Power BI桌面(随附免费的Analysis Services服务器,因此不需要昂贵的SQL Server许可证—但是,如果您有适当的许可证,同样的方法显然也可以工作)。
通过首先创建msmdsrv.ini设置文件启动as引擎,然后从abf文件还原数据库(使用amo.net),然后使用adomd.net提取数据。
下面是演示as-engine+amo.net部分的python代码:
import psutil, subprocess, random, os, zipfile, shutil, clr, sys, pandas
def initialSetup(pathPowerBI):
sys.path.append(pathPowerBI)
#required Analysis Services assemblies
clr.AddReference("Microsoft.PowerBI.Amo.Core")
clr.AddReference("Microsoft.PowerBI.Amo")
clr.AddReference("Microsoft.PowerBI.AdomdClient")
global AMO, ADOMD
import Microsoft.AnalysisServices as AMO
import Microsoft.AnalysisServices.AdomdClient as ADOMD
def restorePowerPivot(excelName, pathTarget, port, pathPowerBI):
#create random folder
os.chdir(pathTarget)
folder = os.getcwd()+str(random.randrange(10**6, 10**7))
os.mkdir(folder)
#extract PowerPivot model (abf backup)
archive = zipfile.ZipFile(excelName)
for member in archive.namelist():
if ".data" in member:
filename = os.path.basename(member)
abfname = os.path.join(folder, filename) + ".abf"
source = archive.open(member)
target = file(os.path.join(folder, abfname), 'wb')
shutil.copyfileobj(source, target)
del target
archive.close()
#start the cmd.exe process to get its PID
listPIDpre = [proc for proc in psutil.process_iter()]
process = subprocess.Popen('cmd.exe /k', stdin=subprocess.PIPE)
listPIDpost = [proc for proc in psutil.process_iter()]
pid = [proc for proc in listPIDpost if proc not in listPIDpre if "cmd.exe" in str(proc)][0]
pid = str(pid).split("=")[1].split(",")[0]
#msmdsrv.ini
msmdsrvText = '''<ConfigurationSettings>
<DataDir>{0}</DataDir>
<TempDir>{0}</TempDir>
<LogDir>{0}</LogDir>
<BackupDir>{0}</BackupDir>
<DeploymentMode>2</DeploymentMode>
<RecoveryModel>1</RecoveryModel>
<DisklessModeRequested>0</DisklessModeRequested>
<CleanDataFolderOnStartup>1</CleanDataFolderOnStartup>
<AutoSetDefaultInitialCatalog>1</AutoSetDefaultInitialCatalog>
<Network>
<Requests>
<EnableBinaryXML>1</EnableBinaryXML>
<EnableCompression>1</EnableCompression>
</Requests>
<Responses>
<EnableBinaryXML>1</EnableBinaryXML>
<EnableCompression>1</EnableCompression>
<CompressionLevel>9</CompressionLevel>
</Responses>
<ListenOnlyOnLocalConnections>1</ListenOnlyOnLocalConnections>
</Network>
<Port>{1}</Port>
<PrivateProcess>{2}</PrivateProcess>
<InstanceVisible>0</InstanceVisible>
<Language>1033</Language>
<Debug>
<CallStackInError>0</CallStackInError>
</Debug>
<Log>
<Exception>
<CrashReportsFolder>{0}</CrashReportsFolder>
</Exception>
<FlightRecorder>
<Enabled>0</Enabled>
</FlightRecorder>
</Log>
<AllowedBrowsingFolders>{0}</AllowedBrowsingFolders>
<ResourceGovernance>
<GovernIMBIScheduler>0</GovernIMBIScheduler>
</ResourceGovernance>
<Feature>
<ManagedCodeEnabled>1</ManagedCodeEnabled>
</Feature>
<VertiPaq>
<EnableDisklessTMImageSave>0</EnableDisklessTMImageSave>
<EnableProcessingSimplifiedLocks>1</EnableProcessingSimplifiedLocks>
</VertiPaq>
</ConfigurationSettings>'''
#save ini file to disk, fill it with required parameters
msmdsrvini = open(folder+"\\msmdsrv.ini", "w")
msmdsrvText = msmdsrvText.format(folder, port, pid) #{0},{1},{2}
msmdsrvini.write(msmdsrvText)
msmdsrvini.close()
#run AS engine inside the cmd.exe process
initString = "\"{0}\\msmdsrv.exe\" -c -s \"{1}\""
initString = initString.format(pathPowerBI.replace("/","\\"),folder)
process.stdin.write(initString + " \n")
#connect to the AS instance from Python
AMOServer = AMO.Server()
AMOServer.Connect("localhost:{0}".format(port))
#restore database from PowerPivot abf backup, disconnect
AMORestoreInfo = AMO.RestoreInfo(os.path.join(folder, abfname))
AMOServer.Restore(AMORestoreInfo)
AMOServer.Disconnect()
return process
数据提取部分:
def runQuery(query, port, flag):
#ADOMD assembly
ADOMDConn = ADOMD.AdomdConnection("Data Source=localhost:{0}".format(port))
ADOMDConn.Open()
ADOMDCommand = ADOMDConn.CreateCommand()
ADOMDCommand.CommandText = query
#read data in via AdomdDataReader object
DataReader = ADOMDCommand.ExecuteReader()
#get metadata, number of columns
SchemaTable = DataReader.GetSchemaTable()
numCol = SchemaTable.Rows.Count #same as DataReader.FieldCount
#get column names
columnNames = []
for i in range(numCol):
columnNames.append(str(SchemaTable.Rows[i][0]))
#fill with data
data = []
while DataReader.Read()==True:
row = []
for j in range(numCol):
try:
row.append(DataReader[j].ToString())
except:
row.append(DataReader[j])
data.append(row)
df = pandas.DataFrame(data)
df.columns = columnNames
if flag==0:
DataReader.Close()
ADOMDConn.Close()
return df
else:
#metadata table
metadataColumnNames = []
for j in range(SchemaTable.Columns.Count):
metadataColumnNames.append(SchemaTable.Columns[j].ToString())
metadata = []
for i in range(numCol):
row = []
for j in range(SchemaTable.Columns.Count):
try:
row.append(SchemaTable.Rows[i][j].ToString())
except:
row.append(SchemaTable.Rows[i][j])
metadata.append(row)
metadf = pandas.DataFrame(metadata)
metadf.columns = metadataColumnNames
DataReader.Close()
ADOMDConn.Close()
return df, metadf
然后通过如下方式提取原始数据:
pathPowerBI = "C:/Program Files/Microsoft Power BI Desktop/bin"
initialSetup(pathPowerBI)
session = restorePowerPivot("D:/Downloads/PowerPivotTutorialSample.xlsx", "D:/", 60000, pathPowerBI)
df, metadf = runQuery("EVALUATE dbo_DimProduct", 60000, 1)
endSession(session)
关于python - 使用Python从PowerPivot模型中提取原始数据,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/34846090/