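I. How do you look up CIDs in bulk?
With Python and the pubchempy library, looking up CIDs in bulk is straightforward.
Before you start, make sure the libraries used in the code below are installed.
1. Install pubchempy: run the following command from cmd.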
pip install pubchempy
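2. Write the query script: write a Python script that uses the pubchempy library to send queries to the PubChem database and look up the compounds' CIDs in bulk.
3. Process the results: tidy up the CIDs and related information and save them, for example as an Excel file, for later analysis and use.
4. If you want to write the Excel file directly, make sure the openpyxl library is installed as well.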
pip install openpyxl
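Before running the scripts it can help to confirm that the packages are importable. The snippet below is only a small sketch; `missing_packages` is a helper name I made up for illustration, and the package list simply mirrors the libraries used later in this post.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Libraries used in this post (requests is needed in Part II)
    missing = missing_packages(["pandas", "pubchempy", "openpyxl", "requests"])
    if missing:
        print("Please install with pip:", ", ".join(missing))
    else:
        print("All required libraries are installed.")
```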
1. Read the chemical names from a file
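import pandas as pd

# Absolute path of the CSV file
csv_file_path = r'C:\Users\cy184\Desktop\111.csv'  # put the path of the file holding your compound names here
# Read the data from the CSV file
data = pd.read_csv(csv_file_path)
# Select the column that contains the chemical names.
# dropna() matters here: it removes the entries that have no data.
chemical_names = data["Chemical Name"].dropna().tolist()
# Print the extracted list of chemical names
print("Extracted chemical names:")
print(chemical_names)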
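Name lists exported from spreadsheets often contain stray spaces and repeated entries, and every repeated name costs another lookup. As a rough sketch (the helper `clean_names` is my own illustrative name, and treating names that differ only in case as duplicates is an assumption you may not want), the list can be tidied before querying:

```python
def clean_names(names):
    """Strip whitespace, drop blanks, and remove duplicates while keeping the original order."""
    seen = set()
    cleaned = []
    for raw in names:
        name = str(raw).strip()
        key = name.casefold()  # assumption: "Aspirin" and "aspirin" count as the same name
        if not name or key in seen:
            continue
        seen.add(key)
        cleaned.append(name)
    return cleaned


print(clean_names([" aspirin", "Aspirin", "", "caffeine ", "aspirin"]))
```

Something like `chemical_names = clean_names(chemical_names)` would then go right before the CID lookup.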
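2. Look up the CIDs
My first modified version of this code sent many duplicate queries, and once I fixed that it turned out to run very slowly. So I made the following optimizations:
1. Batched queries: split the compound names into batches and query one batch at a time instead of all compounds in one go. This helps avoid triggering PubChem's rate limit. The batch size is 100.
2. Caching: query results are cached in a JSON file so the same compound is not queried twice. If a compound's CID has already been looked up, it is read straight from the cache, which cuts unnecessary network requests.
3. Saving the cache: once the queries finish, the cache is written to a file so it can be reused the next time.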
import json
import time
import pandas as pd
import pubchempy as pcp

# Look up the CID for each compound name
def get_cids(compound_names, batch_size=100, cache_file='cid_cache.json', pause=1.0):
    # Try to load the cache left by a previous run
    try:
        with open(cache_file, 'r', encoding='utf-8') as f:
            cid_cache = json.load(f)
    except FileNotFoundError:
        cid_cache = {}
    results = []
    for i in range(0, len(compound_names), batch_size):
        batch = compound_names[i:i+batch_size]
        queried = False
        for name in batch:
            if name in cid_cache:
                results.append({"Chemical Name": name, "CID": cid_cache[name]})
                continue
            queried = True
            try:
                # Query PubChem by compound name and take the first match
                compound = pcp.get_compounds(name, 'name')[0]
                cid_cache[name] = compound.cid
                results.append({"Chemical Name": name, "CID": compound.cid})
            except IndexError:
                # No compound found: record the CID as None
                cid_cache[name] = None
                results.append({"Chemical Name": name, "CID": None})
            except Exception as e:
                # Report other errors (e.g. network problems) without caching them,
                # so the name is queried again on the next run
                print(f"Error while querying {name!r}: {e}")
                results.append({"Chemical Name": name, "CID": None})
        # Pause between batches that hit the network (the pause length is an arbitrary choice)
        if queried:
            time.sleep(pause)
    # Save the cache
    with open(cache_file, 'w', encoding='utf-8') as f:
        json.dump(cid_cache, f)
    return results
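# Look up the CIDs and process the results
results = get_cids(chemical_names)
results_df = pd.DataFrame(results)
# Turn anything that is not a number (e.g. a cached error message) into NaN
results_df['CID'] = pd.to_numeric(results_df['CID'], errors='coerce')
# Drop the rows whose CID is NaN
results_df = results_df.dropna(subset=['CID'])
# Convert the CID column to integers
results_df['CID'] = results_df['CID'].astype(int)
# Save the results to an Excel file
output_file_path = 'compound_cids.xlsx'
results_df.to_excel(output_file_path, index=False)
# Print the list of CIDs
print("\nCIDs found:")
print(list(results_df['CID']))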
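Network lookups sometimes fail for temporary reasons, and retrying after a short wait is often enough. Below is a minimal sketch of retrying with an increasing delay; `with_retries` and its parameters are names I chose for illustration and are not part of pubchempy.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); if it raises, wait and try again, doubling the delay each time.

    The exception from the last attempt is re-raised if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Hypothetical usage inside the lookup loop (not executed here):
# compounds = with_retries(lambda: pcp.get_compounds(name, 'name'))
```

The `sleep` argument exists only so the waiting can be swapped out when trying the helper without real delays.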
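II. How do you retrieve other information, such as properties, from a CID?
This part follows the article "Puchem化合物数据批量抓取采集_宝典_教程_Python爬虫".
With a few modifications its approach becomes more general and can be used to scrape all kinds of compound properties.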
import requests

def get_Pubchem(url):
    '''
    Fetch the data for one compound from PubChem.
    Input:
        url: the PubChem API address, e.g.
            https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/2244/JSON/?heading=Names+and+Identifiers
            The number in the URL is the compound's CID (PubChem CID, Compound ID)
            and changes from compound to compound.
    Output:
        the response returned by PubChem, whose body is JSON
    '''
    # Send the GET request; the timeout keeps a stalled connection from hanging the script
    response = requests.get(url=url, timeout=30)
    return response
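The API address only differs by CID and heading, so it can be assembled by a small function. This is just a sketch: `pug_view_url` and `PUG_VIEW_BASE` are names I introduced, and the URL pattern is the one used in the loops later in this post.

```python
from urllib.parse import quote_plus

# Base address copied from the URLs used in this post
PUG_VIEW_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound"


def pug_view_url(cid, heading):
    """Build the PUG View address for one CID and one heading."""
    # quote_plus turns spaces into '+', e.g. "Names and Identifiers" -> "Names+and+Identifiers"
    return f"{PUG_VIEW_BASE}/{int(cid)}/JSON/?heading={quote_plus(heading)}"


print(pug_view_url(2244, "Names and Identifiers"))
```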
import re

# Extract the value stored under a given heading in the response
def get_name_value(response, name):
    '''
    Input:
        response: the response returned by get_Pubchem
        name: a heading such as Canonical SMILES, Molecular Formula, CAS, Molecular Weight
    Output:
        the value under that heading (SMILES, MF, CAS, Molecular Weight, ...),
        or None if it cannot be found
    '''
    try:
        data = response.text
        # Keep only the first "Information" entry under the requested heading
        data = data.split(f'"TOCHeading": "{name}",')[1].split('"Information": [')[1].split('},')[0]
    except IndexError:
        return None
    # Text values are stored under "String"; spaces inside the value are preserved
    match = re.search(r'"String":\s*"([^"]*)"', data)
    if match:
        return match.group(1)
    # Numeric values are stored under "Number"
    match = re.search(r'"Number":\s*\[\s*([^\]]*?)\s*\]', data)
    return match.group(1) if match else None
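Splitting the raw text depends on the exact layout of the response. An alternative is to walk the parsed JSON instead. The sketch below assumes the response is a tree of sections, each carrying a `TOCHeading`, optional nested `Section` entries, and `Information` entries whose `Value` holds either `StringWithMarkup` or `Number`. The function names and the `sample` dictionary are my own, hand-written for illustration and not copied from a real PubChem response.

```python
def find_section(sections, heading):
    """Depth-first search for the section whose TOCHeading equals `heading`."""
    for section in sections:
        if section.get("TOCHeading") == heading:
            return section
        found = find_section(section.get("Section", []), heading)
        if found is not None:
            return found
    return None


def first_value(record_json, heading):
    """Return the first value stored under `heading`, or None if it is absent."""
    section = find_section(record_json.get("Record", {}).get("Section", []), heading)
    if section is None or not section.get("Information"):
        return None
    value = section["Information"][0].get("Value", {})
    if value.get("StringWithMarkup"):
        return value["StringWithMarkup"][0].get("String")
    if value.get("Number"):
        return value["Number"][0]
    return None


# Hand-written sample that only imitates the assumed structure
sample = {"Record": {"RecordNumber": 2244, "RecordTitle": "Aspirin", "Section": [
    {"TOCHeading": "Names and Identifiers", "Section": [
        {"TOCHeading": "Computed Descriptors", "Section": [
            {"TOCHeading": "IUPAC Name", "Information": [
                {"Value": {"StringWithMarkup": [{"String": "2-acetyloxybenzoic acid"}]}}]}]},
        {"TOCHeading": "Molecular Formula", "Information": [
            {"Value": {"StringWithMarkup": [{"String": "C9H8O4"}]}}]},
        {"TOCHeading": "Heavy Atom Count", "Information": [
            {"Value": {"Number": [13]}}]}]}]}}

print(first_value(sample, "IUPAC Name"))
```

With a real response, `first_value(response.json(), 'IUPAC Name')` would play the role of `get_name_value(response, 'IUPAC Name')`, provided the structure matches the assumption above.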
import time
import pandas as pd

# CID list produced by the first script
all_cid = list(results_df['CID'])
# Headings to extract from the "Names and Identifiers" section
all_name = ['IUPAC Name', 'InChI', 'InChI Key', 'Canonical SMILES', 'Molecular Formula', 'CAS']
rows = []
for cid in all_cid:
    url = f'https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON/?heading=Names+and+Identifiers'
    response = get_Pubchem(url)
    if response.status_code != 200:
        # Skip compounds whose request failed instead of crashing on the JSON parsing
        print(f'cid: {cid}, status_code: {response.status_code}, skipped')
        time.sleep(0.5)
        continue
    record = response.json()['Record']
    d = {}
    d['cid'] = cid
    d['compound_name'] = record['RecordTitle']
    d['cid_paras'] = record['RecordNumber']
    for name in all_name:
        Value = get_name_value(response=response, name=name)
        d[name.replace(' ', '_')] = Value
    rows.append(d)
    print(f'cid: {cid}, status_code: {response.status_code}, df: {len(rows)}')
    # Wait between requests
    time.sleep(0.5)
# Build the table once from all collected rows
df = pd.DataFrame(rows)
df
import time
import pandas as pd

# CID list produced by the first script
all_cid = list(results_df['CID'])
# Headings to extract from the "Computed Properties" section
Property_Name = [
    'Molecular Weight', 'XLogP3', 'Hydrogen Bond Donor Count',
    'Hydrogen Bond Acceptor Count', 'Rotatable Bond Count', 'Exact Mass',
    'Monoisotopic Mass', 'Topological Polar Surface Area', 'Heavy Atom Count',
    'Formal Charge', 'Complexity', 'Isotope Atom Count',
    'Defined Atom Stereocenter Count', 'Undefined Atom Stereocenter Count',
    'Defined Bond Stereocenter Count', 'Undefined Bond Stereocenter Count',
    'Covalently-Bonded Unit Count', 'Compound Is Canonicalized'
]
rows = []
for cid in all_cid:
    url = f'https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON/?heading=Computed+Properties'
    response = get_Pubchem(url)
    if response.status_code != 200:
        # Skip compounds whose request failed instead of crashing on the JSON parsing
        print(f'cid: {cid}, status_code: {response.status_code}, skipped')
        time.sleep(0.5)
        continue
    record = response.json()['Record']
    d = {}
    d['cid'] = cid
    d['compound_name'] = record['RecordTitle']
    d['cid_paras'] = record['RecordNumber']
    for name in Property_Name:
        Value = get_name_value(response=response, name=name)
        d[name.replace(' ', '_').replace('-', '_')] = Value
    rows.append(d)
    print(f'cid: {cid}, status_code: {response.status_code}, df: {len(rows)}')
    # Wait between requests
    time.sleep(0.5)
# Build the table once from all collected rows
df = pd.DataFrame(rows)
df
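For computed properties there is also PubChem's PUG REST interface, which can return a property table for several CIDs in a single request. The sketch below only builds the address and reads a response of the shape I expect. The helper names are mine, the property names (`MolecularWeight`, `XLogP`, `TPSA`) should be checked against the PUG REST documentation, and `sample_payload` is hand-written for illustration rather than real output.

```python
# Base address of the PUG REST compound endpoint
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid"


def property_url(cids, properties):
    """Build one PUG REST address that asks for several properties of several CIDs."""
    cid_part = ",".join(str(int(c)) for c in cids)
    return f"{PUG_REST_BASE}/{cid_part}/property/{','.join(properties)}/JSON"


def rows_from_property_table(payload):
    """Return the list of per-compound dictionaries from a property-table response."""
    return payload.get("PropertyTable", {}).get("Properties", [])


# Hand-written payload imitating the expected response shape (values are illustrative)
sample_payload = {"PropertyTable": {"Properties": [
    {"CID": 2244, "MolecularWeight": "180.16", "XLogP": 1.2, "TPSA": 63.6},
]}}

print(property_url([2244, 2519], ["MolecularWeight", "XLogP", "TPSA"]))
print(rows_from_property_table(sample_payload))
```

The rows returned this way are plain dictionaries, so `pd.DataFrame(rows_from_property_table(response.json()))` would give a table much like the one built above, with far fewer requests.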
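Of course, with a few simple changes to the code you can retrieve whatever other information you need.
*Some of the code comes from the internet; it will be removed on request if it infringes.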