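I am using RDKit to calculate molecular similarity, based on the Tanimoto coefficient, between two lists of molecules given as SMILES structures.
Right now I can extract the SMILES structures from two separate csv files. How do I put these structures into RDKit's fingerprint module, and how do I calculate the similarity pairwise, one by one, between the two lists of molecules?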
from rdkit import Chem
from rdkit import DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols
ms = [Chem.MolFromSmiles('CCOC'), Chem.MolFromSmiles('CCO'), ... Chem.MolFromSmiles('COC')]
fps = [FingerprintMols.FingerprintMol(x) for x in ms]
DataStructs.FingerprintSimilarity(fps[0],fps[1])
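I want to put all the SMILES structures I have (more than 10,000) into the "ms" list and get their fingerprints.
Then I will compare the similarity between each pair of molecules from the two lists. Maybe a for loop is needed here?
Thanks in advance!
I used pandas dataframes to select and print the lists with the structures, and then saved them into list_1 and list_2. When it runs to the ms1 line, it gives the following error: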
TypeError: No registered converter was able to produce a C++ rvalue of type std::__cxx11::basic_string<wchar_t,
std::char_traits<wchar_t>, std::allocator<wchar_t> > from this Python object of type float
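Then I checked the file, and the "smiles" column contains only SMILES. But when I manually put some molecule structures into a list to test, there was still an error about fpArgs['minSize'].
For example, the SMILES of gadodiamide is "O=C1[O-][Gd+3]234567[O]=C(C[N]2(CC[N]3(CC([O-]4)=O)CC[N]5(CC(=[O]6)NC)CC(=O)[O-]7)C1)NC", and the error is as follows (when running the fps line):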
ArgumentError: Python argument types in
rdkit.Chem.rdmolops.RDKFingerprint(NoneType, int, int, int, int, int, float, int)
did not match C++ signature:
RDKFingerprint(RDKit::ROMol mol, unsigned int minPath=1,
unsigned int maxPath=7, unsigned int fpSize=2048, unsigned int nBitsPerHash=2,
bool useHs=True, double tgtDensity=0.0, unsigned int minSize=128, bool branchedPaths=True,
bool useBondOrder=True, boost::python::api::object atomInvariants=0, boost::python::api::object fromAtoms=0,
boost::python::api::object atomBits=None, boost::python::api::object bitInfo=None).
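If the original csv file looks like the following, how can I include the molecule names together with the similarity values in the output file:
name,smiles,value,value2
Molecule1,CCOCN(C)(C),0.25,A
Molecule2,CCO,1.12,B
Molecule3,COC,2.25,C
I added the following code to include the molecule names in the output file. It produces some array value errors for the names (especially for d2):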
name_1 = df_1['id1']
name_2 = df_2['id2']
name_3 = pd.concat([name_1, name_2])
# create a list for the dataframe
d1, qu, d2, ta, sim = [], [], [], [], []
for n in range(len(fps)-1):
    s = DataStructs.BulkTanimotoSimilarity(fps[n], fps[n+1:])
    #print(c_smiles[n], c_smiles[n+1:])
    for m in range(len(s)):
        qu.append(c_smiles[n])
        ta.append(c_smiles[n+1:][m])
        sim.append(s[m])
        d1.append(name_3[n])
        d2.append(name_3[n+1:][m])
    #print()
d = {'ID_1':d1, 'query':qu, 'ID_2':d2, 'target':ta, 'Similarity':sim}
df_final = pd.DataFrame(data=d)
df_final = df_final.sort_values('Similarity', ascending=False)
for index, row in df.iterrows():
    print (row["ID_1"], row["query"], row["ID_2"], row["target"], row["Similarity"])
print(df_final)
# save as csv
df_final.to_csv('RESULT_3.csv', index=False, sep=',')
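Best answer
Answer edited to incorporate all comments.
RDKit has bulk functions for similarity, so you can compare one fingerprint against a list of fingerprints. Just loop over the list of fingerprints.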
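To make clear what such a one-against-many comparison computes, here is a minimal pure-Python sketch. It is not RDKit's implementation: the functions `tanimoto` and `bulk_tanimoto` and the fingerprints are made up for illustration, and each fingerprint is modelled as a plain set of "on" bit positions. The Tanimoto coefficient is the number of shared bits divided by the number of bits set in either fingerprint.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two sets of 'on' bit positions."""
    shared = len(bits_a & bits_b)
    total = len(bits_a | bits_b)
    # Returning 0.0 for two empty sets is a choice made in this sketch.
    return shared / total if total else 0.0


def bulk_tanimoto(query, targets):
    """Compare one fingerprint against a list, one value per target."""
    return [tanimoto(query, t) for t in targets]


# Made-up fingerprints: each set lists the bits that are switched on.
fps = [{1, 2, 3, 4}, {2, 3, 4, 5}, {1, 2}, {7, 8}]

# Compare each fingerprint only with the ones after it, so every
# unordered pair appears once and nothing is compared with itself.
for n in range(len(fps) - 1):
    print(n, bulk_tanimoto(fps[n], fps[n + 1:]))
```

With real molecules, RDKit's `DataStructs.BulkTanimotoSimilarity(fp, list_of_fps)` plays the role of `bulk_tanimoto`, and the slicing pattern `fps[n + 1:]` stays the same.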
If the CSVs look like this
First csv, with an invalid SMILES
smiles,value,value2
CCOCN(C)(C)C,0.25,A
CCO,1.12,B
COC,2.25,C
Second csv, with correct SMILES
smiles,value,value2
CCOCC,0.55,D
CCCO,2.58,E
CCCCO,5.01,F
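This is a way to read the SMILES, drop the invalid ones, compute the fingerprint similarities (without duplicate pairs), and save the sorted values.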
from rdkit import Chem
from rdkit import DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols
import pandas as pd

# read and concatenate the csv files
df_1 = pd.read_csv('first.csv')
df_2 = pd.read_csv('second.csv')
df_3 = pd.concat([df_1, df_2])

# check the SMILES and make a list of the valid, canonical ones
df_smiles = df_3['smiles']
c_smiles = []
for ds in df_smiles:
    try:
        cs = Chem.CanonSmiles(ds)
        c_smiles.append(cs)
    except Exception:
        # unparseable SMILES and non-string cells (e.g. NaN) end up here
        print('Invalid SMILES:', ds)
print()

# make a list of mols
ms = [Chem.MolFromSmiles(x) for x in c_smiles]

# make a list of fingerprints (fp)
fps = [FingerprintMols.FingerprintMol(x) for x in ms]

# the lists for the dataframe
qu, ta, sim = [], [], []

# compare all fp pairwise without duplicates
for n in range(len(fps)-1):  # -1 so the last fp is not used as a query
    s = DataStructs.BulkTanimotoSimilarity(fps[n], fps[n+1:])  # +1: compare only with the fps after this one
    print(c_smiles[n], c_smiles[n+1:])  # which mol is compared with which group
    # collect the SMILES and values
    for m in range(len(s)):
        qu.append(c_smiles[n])
        ta.append(c_smiles[n+1:][m])
        sim.append(s[m])
print()

# build the dataframe and sort it
d = {'query':qu, 'target':ta, 'Similarity':sim}
df_final = pd.DataFrame(data=d)
df_final = df_final.sort_values('Similarity', ascending=False)
print(df_final)

# save as csv
df_final.to_csv('third.csv', index=False, sep=',')
Printed output:
Invalid SMILES: CCOCN(C)(C)C
CCO ['COC', 'CCOCC', 'CCCO', 'CCCCO']
COC ['CCOCC', 'CCCO', 'CCCCO']
CCOCC ['CCCO', 'CCCCO']
CCCO ['CCCCO']
   query target  Similarity
9   CCCO  CCCCO    0.769231
2    CCO   CCCO    0.600000
1    CCO  CCOCC    0.500000
7  CCOCC   CCCO    0.466667
3    CCO  CCCCO    0.461538
8  CCOCC  CCCCO    0.388889
4    COC  CCOCC    0.333333
5    COC   CCCO    0.272727
0    CCO    COC    0.250000
6    COC  CCCCO    0.214286
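On the follow-up about molecule names: a likely cause of the errors around d2 is that pd.concat keeps the original row labels, so the concatenated names have duplicate labels, and that the names still include molecules whose SMILES were dropped, so the positions no longer line up with the fingerprints. One possible way around both is to keep each name together with its SMILES from the start and to index everything by position. The sketch below is pure Python so that it runs without RDKit. The helper names (`keep_valid_records`, `pairwise_rows`), the molecule names, and the toy set-based fingerprints are all assumptions made for illustration.

```python
from itertools import combinations


def keep_valid_records(names, smiles_column):
    """Keep (name, smiles) pairs whose SMILES cell is a non-empty string.

    Empty CSV cells are read by pandas as float NaN, which is not a string.
    """
    records = []
    for name, smi in zip(names, smiles_column):
        if isinstance(smi, str) and smi.strip():
            records.append((name, smi.strip()))
        else:
            print('Skipping', name, '- no usable SMILES:', smi)
    return records


def pairwise_rows(records, fingerprints, similarity):
    """Build one row per unordered pair, carrying both names along.

    fingerprints[i] must belong to records[i]; similarity is any
    function that takes two fingerprints and returns a number.
    """
    rows = []
    for i, j in combinations(range(len(records)), 2):
        name_i, smi_i = records[i]
        name_j, smi_j = records[j]
        rows.append({
            'ID_1': name_i, 'query': smi_i,
            'ID_2': name_j, 'target': smi_j,
            'Similarity': similarity(fingerprints[i], fingerprints[j]),
        })
    # highest similarity first
    return sorted(rows, key=lambda r: r['Similarity'], reverse=True)


def toy_similarity(a, b):
    """Tanimoto on sets of bit positions (stand-in for a real fingerprint)."""
    return len(a & b) / len(a | b)


# Made-up names; the second molecule has an empty (NaN) SMILES cell.
names = ['mol1', 'mol2', 'mol3', 'mol4']
smiles = ['CCO', float('nan'), 'COC', 'CCCO']
records = keep_valid_records(names, smiles)

# Toy fingerprints, one per remaining record, in the same order.
toy_fps = [{1, 2, 3}, {3, 4}, {1, 2, 3, 5}]

rows = pairwise_rows(records, toy_fps, toy_similarity)
for row in rows:
    print(row)
```

With RDKit, the fingerprints would be built from the SMILES in `records`, after also dropping any entry for which `Chem.MolFromSmiles` returns None, and `DataStructs.TanimotoSimilarity` could be passed as `similarity`. The list of dicts can be handed to `pd.DataFrame(rows)` and saved with `to_csv` as in the code above.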