I am currently scraping Yahoo Finance with Selenium. While doing so, I discovered a discrepancy between the HTML I get from the driver (driver.page_source, after calling driver.get) and the HTML shown when I right-click and inspect the same page in the browser.
From what I could glean from online sources, this is because JavaScript (plus CSS) is used to generate content on the page dynamically. According to related posts, the solution basically boils down to “wait until the content you are looking for has loaded, then get the HTML”. However, despite following guides, tutorials and threads on Stack Overflow, none of the approaches I tried retrieves the HTML of a Yahoo Finance ticker page as it appears in a browser.
To narrow the scope of the question, let’s suppose that the HTML I want is the buy/sell/hold analyst rating of a stock, and that the stock/ticker I am interested in is Affirm Holdings, Inc. (AFRM):
And the associated HTML is here:
[Screenshot: the HTML I am interested in retrieving]
At present, this is the code that has come closest to achieving what I want:
# Import the necessary Selenium components
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Run the driver headless, i.e. do not open a browser window;
# everything runs in the background
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)
# Set an implicit wait: element lookups poll for up to 10 seconds
driver.implicitly_wait(10)
driver.maximize_window()
# Get the desired webpage
driver.get("https://finance.yahoo.com/quote/AFRM?p=AFRM&.tsrc=fin-srch")
# Handle the cookie consent popup that appears:
# find the button to click, i.e. the element with the name "reject"
button = driver.find_element(By.NAME, 'reject')
# Click the button by first moving to it and then clicking
ActionChains(driver).move_to_element(button).click().perform()
# Wait up to 10 seconds for the element with id=mrt-node-Col2-10-QuoteModule
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_all_elements_located((By.ID, "mrt-node-Col2-10-QuoteModule")))
# Get the underlying HTML
html = driver.page_source
Why I am convinced this ought to work, and why I am confused that it doesn’t
When inspecting the HTML source as seen in the browser, you can see that the part of the HTML that defines the analyst rating graphs has the following attribute:
id=mrt-node-Col2-10-QuoteModule
This can be seen when inspecting the HTML in the browser.
And my code does the following:
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_all_elements_located((By.ID, "mrt-node-Col2-10-QuoteModule")))
Which, if I have not misunderstood things, waits for up to 10 seconds or until the HTML element with the id “mrt-node-Col2-10-QuoteModule” is visible.
However, when I inspect the resulting HTML, it still appears that I get the HTML without said components, i.e. I am still getting the HTML without anything rendered by JavaScript. What am I doing wrong here?
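To check this without eyeballing the page source, one option is to parse the string returned by driver.page_source and look for the id directly. This is only a minimal sketch using the standard library; IdFinder, has_element_with_id and the two sample strings are hypothetical names and stand-ins introduced here for illustration, not part of my actual script:

```python
from html.parser import HTMLParser


class IdFinder(HTMLParser):
    """Records whether any start tag carries the target id attribute."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if dict(attrs).get("id") == self.target_id:
            self.found = True


def has_element_with_id(html, target_id):
    """Return True if the HTML string contains an element with this id."""
    finder = IdFinder(target_id)
    finder.feed(html)
    return finder.found


# Hypothetical stand-ins for driver.page_source (assumed, not real Yahoo HTML)
rendered = '<div id="mrt-node-Col2-10-QuoteModule"><span>Buy</span></div>'
unrendered = "<div id='app'></div>"

print(has_element_with_id(rendered, "mrt-node-Col2-10-QuoteModule"))
print(has_element_with_id(unrendered, "mrt-node-Col2-10-QuoteModule"))
```

In the real script the first argument would be the html variable from the code above, which shows whether the element ever made it into the retrieved source.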
Did you try the Python module yfinance? It doesn’t need Selenium.
Some servers may send different HTML to different users and to different devices (desktop, tablet, phone). They can also use random values to stop scripts/bots/hackers/spammers.
Maybe test the code without "--headless" to see what you really get in the browser.
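A rough sketch of how such a comparison could be set up, so the same script can be run with and without headless mode. The helper names build_chrome_arguments and make_driver are hypothetical, and the fixed window size and optional user-agent override are assumptions about what might affect the served page, not something confirmed for Yahoo Finance:

```python
def build_chrome_arguments(headless=True, user_agent=None):
    """Return the Chrome command-line arguments for one test run."""
    # Assumption: a fixed window size keeps both runs comparable
    args = ["--window-size=1920,1080"]
    if headless:
        args.append("--headless")
    if user_agent:
        # Optional override, in case the server varies HTML by user agent
        args.append(f"--user-agent={user_agent}")
    return args


def make_driver(headless=True, user_agent=None):
    """Create a Chrome driver configured with the arguments above."""
    # Imported lazily so the helper above works without Selenium installed
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in build_chrome_arguments(headless, user_agent):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

Calling make_driver(headless=False) opens a visible browser window, so the page the script receives can be compared directly with the HTML obtained from a headless run.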