如何避免将数据复制到mysql

本文介绍了如何避免将数据复制到mysql的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我编写了代码以从ebay.com上报废汽车信息(标题，品牌，型号，变速器，年份，价格)数据，并保存在mysql中，我想如果所有行的(标题，制造商，模型，...)项都与另一行相似，则避免将此数据插入到mysql中，*仅当所有行的项都相似时(因为某些标题是相似的或某种模型)或...)

I write the code to scrap car info(title, make, model, transmission, year, price) data from ebay.com and save in the mysql,I want if all row's(title, make, model, ...) item's is be similar to another row then avoid to insert this data to the mysql, *only when all row's item be similar(because some title is simialr or some model or...)

代码:

import requests
from bs4 import BeautifulSoup
import re
import mysql.connector

conn = mysql.connector.connect(user='root', password='******',
host='127.0.0.1', database='web_scraping')
cursor = conn.cursor()
url = 'https://www.ebay.com/b/Cars-Trucks/6001?_ fsrp=0&_sacat=6001&LH_BIN=1&LH_ItemCondition=3000%7C1000%7C2500&rt=nc&_stpos=95125&Model%2520Year=2020%7C2019%7C2018%7C2017%7C2016%7C2015'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
ebay_cars = soup.find_all('li', class_='s-item')
for car_info in ebay_cars:
    title_div = car_info.find('div', class_='s-item__wrapper clearfix')
    title_sub_div = title_div.find('div', class_='s-item__info clearfix')
    title_p = title_sub_div.find('span', class_='s-item__price')
    title_tag = title_sub_div.find('a', class_='s-item__link')
    title_maker = title_sub_div.find('span', class_='s-item__dynamic s-
    item__dynamicAttributes1')
    title_model = title_sub_div.find('span', class_='s-item__dynamic s-
    item__dynamicAttributes2')
    title_trans = title_sub_div.find('span', class_='s-item__dynamic s-
    item__dynamicAttributes3')



name_of_car = re.sub(r'\d{4}', '', title_tag.text)
maker_of_car = re.sub(r'Make: ','', title_maker.text)
model_of_car = re.sub(r'Model: ', '', title_model.text)
try:
    if title_trans.text.startswith(r'Transmission: '):
        trans_of_car = re.sub(r'Transmission: ', '', title_trans.text)
    else:
        trans_of_car = ''
except AttributeError:
    trans_of_car = ''
year_of_car = re.findall(r'\d{4}', title_tag.text)
year_of_car = ''.join(str(x) for x in year_of_car)

price_of_car = title_p.text
print(name_of_car ,trans_of_car )
sql = 'INSERT INTO car_info(Title, Maker, Model, Transmission, Year, Price)
VALUES (%s, %s, %s, %s, %s, %s)'
cursor.execute(sql , (name_of_car, maker_of_car, model_of_car, trans_of_car,
year_of_car, price_of_car))



conn.commit()
conn.close()

推荐答案

一个选项使用not exists:

insert into car_info (title, maker, model, transmission, year, price)
select v.*
from (select %s title, %s maker, %s model, %s transmission, %s year, %s price) v
where not exists (
    select 1
    from car_info c
    where
        (c.title, c.maker, c.model, c.transmission, c.year, c.price)
         = (v.title, v.maker, v.model, v.transmission, v.year, v.price)
);

但是在表的所有列上创建唯一键会更简单，例如:

But it would be simpler to create a unique key on all columns of the table, like:

create unique index idx_car_info_uniq
    on car_info(title, maker, model, transmission, year, price);

这可以防止任何进程在表中插入重复项.您可以优雅地忽略使用on duplicate key语法会产生的错误:

This prevents any process from inserting duplicates in the table. You can elegantly ignore the erros that would otherwise have been raised with the on duplicate key syntax:

insert into car_info (title, maker, model, transmission, year, price)
values (%s, %s, %s, %s, %s, %s)
on duplicate key update title = values(title);

这篇关于如何避免将数据复制到mysql的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持！