I am trying to scrape a particular website, https://birdeye.so/find-gems?chain=solana, but I am unable to load the data within the table. I am only able to get the table's headers, such as Token, Trending, etc.
Are some pages just impossible to scrape? If so, why exactly?
Below is my code. I've attempted to scrape this page using Selenium, but I am unable to load all of the content. What am I doing wrong?
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.maximize_window()
driver.get("https://birdeye.so/find-gems?chain=solana")

# Grab the rendered page source and parse it
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
print(soup)
The code below should work:
from selenium import webdriver
from selenium.webdriver.edge.service import Service
from selenium.webdriver import EdgeOptions
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as soup

url = "https://birdeye.so/find-gems?chain=solana"
# Adjust this to your local msedgedriver path
service = Service(executable_path=r'C:\Users\10696\Desktop\access\zhihu\msedgedriver\msedgedriver.exe')

# Options that make the browser look less like an automated session
edge_options = EdgeOptions()
edge_options.add_experimental_option('excludeSwitches', ['enable-automation'])
edge_options.add_experimental_option('useAutomationExtension', False)
edge_options.add_argument('lang=zh-CN,zh,zh-TW,en-US,en')
edge_options.add_argument("disable-blink-features=AutomationControlled")

driver = webdriver.Edge(options=edge_options, service=service)
driver.get(url)

# Wait until the AJAX call has rendered at least one table cell
WebDriverWait(driver, timeout=10).until(lambda d: d.find_element(By.CLASS_NAME, "ant-table-cell"))

# Pull the populated table body out of the live DOM and parse it
pag = driver.find_element(By.CLASS_NAME, "ant-table-tbody")
pag = driver.execute_script("return arguments[0].innerHTML;", pag)
table = soup(pag, "html.parser")
print(table)
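If you want the data in tabular form, here is a minimal sketch of flattening the parsed table body into a pandas DataFrame. The "ant-table-row" and "ant-table-cell" class names are assumptions based on the Ant Design markup the page currently uses, so verify them against the live DOM:

import pandas as pd

# Assumed Ant Design class names; check these against the page's actual markup
rows = []
for tr in table.find_all("tr", class_="ant-table-row"):
    rows.append([td.get_text(strip=True) for td in tr.find_all("td", class_="ant-table-cell")])

df = pd.DataFrame(rows)
print(df.head())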
Is this maybe a race condition? Looking at the page, it creates the table headers, sends an AJAX request to get the information in the table, then populates the table. I suspect your code is getting the page source after the table is created but before it is populated. Does adding a wait help?
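For instance, a minimal sketch of adding an explicit wait to your original Chrome code, blocking until at least one data cell has been rendered; the "ant-table-cell" class name is carried over from the answer above and is an assumption about the page's markup:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://birdeye.so/find-gems?chain=solana")

# Block until the AJAX call has populated at least one cell in the table body
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CLASS_NAME, "ant-table-cell"))
)

soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.find(class_="ant-table-tbody"))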
“Impossible” is a strong word, but some pages are extremely difficult to scrape. If the page is protected by Cloudflare and it detects you as a bot, it serves up a CAPTCHA, which is not easy to bypass, for obvious reasons. You might want to use a proxy if this is the case (see the sketch below). Nick's suggestion is the first thing to try, though.
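If you do go the proxy route, here is a minimal sketch of pointing Chrome at one via Selenium; the proxy address is a hypothetical placeholder, not a working endpoint:

from selenium import webdriver

options = webdriver.ChromeOptions()
# Hypothetical placeholder address; substitute your own proxy here
options.add_argument("--proxy-server=http://203.0.113.10:8080")

driver = webdriver.Chrome(options=options)
driver.get("https://birdeye.so/find-gems?chain=solana")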