Problem description
I need to do some real estate market research, and for this I need the prices and other values of new houses.
My idea was to go to the website where I get the information, open the Main-Search-Site, and scrape all the RealEstateIDs. These would take me directly to the individual page for each house, where I can then extract the information I need.
My problem is: how do I get all the real estate IDs from the main page and store them in a list, so that in the next step I can use them to build the URLs of the actual pages?
I tried it with BeautifulSoup but failed, because I don't understand how to search for a specific word and extract what comes after it.
The HTML code looks like this:
""realEstateId":110356727,"newHomeBuilder":"false","disabledGrouping":"false","resultlist.realEstate":{"@xsi.type":"search:ApartmentBuy","@id":"110356727","title":"
Since the value "realEstateId" appears around 60 times, I want to scrape the number that comes after it each time (here: 110356727) and store it in a list, so that I can use the numbers later.
import time
import urllib.request
from urllib.request import urlopen
import bs4 as bs
import datetime as dt
import matplotlib.pyplot as plt
from matplotlib import style
import numpy as np
import os
import pandas as pd
import pandas_datareader.data as web
import pickle
import requests
from requests import get
url = 'https://www.immobilienscout24.de/Suche/S-T/Wohnung-Kauf/Nordrhein-Westfalen/Duesseldorf/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/true?enteredFrom=result_list'
response = get(url)
from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
def expose_IDs():
    resp = requests.get('https://www.immobilienscout24.de/Suche/S-T/Wohnung-Kauf/Nordrhein-Westfalen/Duesseldorf/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/true?enteredFrom=result_list')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('resultListModel')
    tickers = []
    for row in table.findAll('realestateID')[1:]:
        ticker = row.findAll(',')[0].text
        tickers.append(ticker)
    with open("exposeID.pickle", "wb") as f:
        pickle.dump(tickers, f)
    return tickers

expose_IDs()
Recommended answer
Something like this? The dictionary contains 68 keys, each of which is an ID. I use a regex to grab the same script you are after, trim off an unwanted trailing character, then load the result with json.loads and access the resulting JSON object.
import requests
import json
from bs4 import BeautifulSoup as bs
import re
res = requests.get('https://www.immobilienscout24.de/Suche/S-T/Wohnung-Kauf/Nordrhein-Westfalen/Duesseldorf/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/true?enteredFrom=result_list')
soup = bs(res.content, 'lxml')
r = re.compile(r'resultListModel:(.*)')
data = soup.find('script', text=r).text
script = r.findall(data)[0].rstrip(',')
#resultListModel:
results = json.loads(script)
ids = list(results['searchResponseModel']['entryInformation'].keys())
print(ids)
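If only the numeric IDs are needed, a regular expression can also pull them straight out of the raw page text, which is closer to what the question asked for ("search for a specific word and extract what comes after it"). The following is a minimal sketch, assuming the page source contains fragments shaped like the one quoted in the question. The sample string and the helper name extract_ids are illustrative and not part of the original answer.

```python
import re

# Illustrative fragment shaped like the snippet quoted in the question.
sample_html = (
    '"realEstateId":110356727,"newHomeBuilder":"false",'
    '"resultlist.realEstate":{"@id":"110356727"},'
    '"realEstateId":110356728,"newHomeBuilder":"false"'
)


def extract_ids(text):
    """Return the numbers that follow "realEstateId":, in order, without repeats."""
    found = re.findall(r'"realEstateId"\s*:\s*(\d+)', text)
    # dict.fromkeys keeps first-seen order while dropping duplicates
    return list(dict.fromkeys(found))


ids = extract_ids(sample_html)
print(ids)
```

On the live page, the same helper could be called on the response text (for example res.text) instead of the sample string. This approach depends only on the key name, not on the nesting of the JSON, so it may survive layout changes that break the dictionary lookups.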
Since the website was updated:
import requests
import json
from bs4 import BeautifulSoup as bs
import re
res = requests.get('https://www.immobilienscout24.de/Suche/S-T/Wohnung-Kauf/Nordrhein-Westfalen/Duesseldorf/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/-/true?enteredFrom=result_list')
soup = bs(res.content, 'lxml')
r = re.compile(r'resultListModel:(.*)')
data = soup.find('script', text=r).text
script = r.findall(data)[0].rstrip(',')
results = json.loads(script)
ids = [item['@id'] for item in results['searchResponseModel']['resultlist.resultlist']['resultlistEntries'][0]['resultlistEntry']]
print(ids)
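The trim-and-parse step can be tried offline, without hitting the site. The sketch below runs the same regex, rstrip(',') and json.loads sequence on a small stand-in for the script text, then builds one URL per ID as the question intended. The stand-in script, the variable names, and in particular the expose URL pattern are assumptions for illustration; the source does not confirm the URL format.

```python
import json
import re

# Illustrative stand-in for the text of the <script> tag; the real one is far larger.
script_text = '''
    var model = {
        resultListModel: {"searchResponseModel": {"resultlist.resultlist": {"resultlistEntries": [{"resultlistEntry": [{"@id": "110356727"}, {"@id": "110356728"}]}]}}},
        other: 1
    };
'''

pattern = re.compile(r'resultListModel:(.*)')

# (.*) stops at the end of the line, so the capture still ends with the
# trailing comma; strip whitespace, then trim that comma before parsing.
raw = pattern.findall(script_text)[0].strip().rstrip(',')
results = json.loads(raw)

entries = (results['searchResponseModel']['resultlist.resultlist']
           ['resultlistEntries'][0]['resultlistEntry'])
ids = [item['@id'] for item in entries]

# ASSUMPTION: this URL pattern is hypothetical and should be checked against the site.
BASE_URL = 'https://www.immobilienscout24.de/expose/{}'
urls = [BASE_URL.format(real_estate_id) for real_estate_id in ids]
print(urls)
```

The trailing comma has to go because json.loads rejects any non-whitespace text after the closing brace of the object.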