I have a Selenium script that navigates to a webpage and downloads it; I then go over it with BeautifulSoup and get the next <p> tag after the <p> tag containing ‘Live Updates:’.
However, Selenium is not fully rendering the page, both headless and regular. For some reason it leaves out a whole bunch of <p> tags; they just aren’t there.
How can I make Selenium load the whole page?
URL: https://www.israelnationalnews.com/news/378955
Code:
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
import time
def get_webpage():
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument("--no-sandbox")
    #chrome_options.add_argument("--headless")
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument('--disable-dev-shm-usage')
    # Create a new WebDriver with the specified options
    driver = webdriver.Chrome(options=chrome_options)
    url = "https://www.israelnationalnews.com/news/378955"
    driver.get(url)  # Open the webpage in the browser
    time.sleep(10)
    webpage_content = driver.page_source
    driver.quit()
    # Step 2: Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(webpage_content, 'html.parser')
    target_p_tag = soup.find('p', text="Live Updates:")
    # Extract the text from the next <p> tag
    if target_p_tag:
        next_p_tag = target_p_tag.findNext('p')
        if next_p_tag:
            extracted_text = next_p_tag.get_text()
            print(extracted_text)
        else:
            print("No next <p> tag found.")
    else:
        print("No <p> tag with 'Live Updates:' found.")
I recommend opening the webpage in your browser and using inspect element to see how it is structured. I find this whole question quite perplexing; the elements Selenium isn’t getting are just regular <p> tags.
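One thing worth checking before blaming rendering: soup.find('p', text=...) only matches a tag whose entire contents are exactly that one string, so a <p> that wraps the text in a child tag or carries extra text will never match, even when it is present in the HTML. A minimal offline sketch of that pitfall (the HTML below is an invented stand-in, not the live page):

```python
from bs4 import BeautifulSoup

# Invented stand-in HTML, not the live page; assumes the headline text
# sits inside a <strong> within the <p>.
html = """
<p><strong>Live Updates:</strong> Scroll down for the latest.</p>
<p>First update paragraph.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# string= (the newer name for text=) only matches a tag whose entire
# contents are that one string, so the wrapping <p> is not found.
print(soup.find("p", string="Live Updates:"))  # None

# Matching the inner <strong> and walking to the following <p> works.
strong = soup.find("strong", string="Live Updates:")
next_p = strong.find_parent("p").find_next("p")
print(next_p.get_text())  # First update paragraph.
```

If this reproduces on the real markup, the paragraphs were loaded all along and the lookup was the problem, not Selenium.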
You are searching incorrectly: there is no <p> tag whose text is ‘Live Updates:’. Try soup.find('strong', string='Live Updates:') instead. But there is an easier option: you can use a regex.
import requests
import re
from bs4 import BeautifulSoup

response = requests.get('https://www.israelnationalnews.com/news/378955')
soup = BeautifulSoup(response.text, 'lxml')

# Join the text of every <p> tag that has no class attribute
all_news = " ".join(
    news.get_text().replace(': ', '')
    for news in soup.find_all('p', class_="")
)

dates = re.findall(r'\S+, \d+:\d+ \S.\S\W+', all_news)
news_description = re.split(r'\S+, \d+:\d+ \S.\S\W+', all_news)
news_description.pop(0)  # drop the empty field before the first timestamp
for i, news in enumerate(news_description):
    print(dates[i].strip(), news.strip())
OUTPUT:
Sunday, 2:50 p.m. Red alert sirens activated in the Gaza periphery.
Sunday, 2:45 p.m. Tisir al-Jouti, a senior member of the Islamic Jihad terrorist organization, was reportedly killed with his family in an IDF airstrike in Rafah.
Sunday, 2:40 p.m. Red alert sirens activated in Netiv HaAsara.
Sunday, 2:30 p.m. Red alert sirens activated in the Gaza periphery
Sunday, 2:10 p.m. IAF aircraft directed by the IDF attacked military buildings used by the terrorist organization Hamas in Gaza. In addition, anti-tank posts, observation posts, and military infrastructure were attacked. Security forces eliminated terrorists who shot at them as well as terrorists identified on the coastline in the Zikim area in the Gaza periphery.
Sunday, 12:57 p.m. The IDF distributed leaflets in the Gaza Strip calling on terrorists to turn themselves in. The leaflets read: "Disarm, raise your hands, if possible - wave a white flag. Act in accordance with the IDF's instructions and there will be no need to bring food and water with you - we will take care of that."
Sunday, 12:19 p.m. Following the rocket barrage on central Israel, a fire was caused in a building in Ramat Gan, and a man was injured while running to a protected area.
Sunday, 12:01 p.m. Red alert sirens activated in Tel Aviv, Ra'anana, Holon, and many other communities in central Israel.
Sunday, 11:04 a.m. The Red Crescent in Gaza claims that the IDF ordered the immediate evacuation of Al-Quds Hospital in Gaza.
...
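To see what the regex above is doing, the findall/split pairing can be run on a short stand-in string. The two entries below are invented sample data, not scraped output; findall extracts the timestamps, and splitting on the same pattern yields the text between them:

```python
import re

# Invented sample data standing in for the concatenated <p> text.
all_news = ("Sunday, 2:50 p.m. Red alert sirens activated. "
            "Sunday, 2:45 p.m. Rocket fire was reported in the south.")

# One timestamp: weekday, comma, H:MM, then "a.m."/"p.m." and trailing space
stamp = r'\S+, \d+:\d+ \S.\S\W+'

dates = re.findall(stamp, all_news)       # the timestamps themselves
descriptions = re.split(stamp, all_news)  # the text between timestamps
descriptions.pop(0)                       # drop the empty leading field

for d, n in zip(dates, descriptions):
    print(d.strip(), n.strip())
```

Because the same pattern drives both calls, the i-th date always lines up with the i-th description, which is what the pop(0) relies on.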
Same problem here. I’ve been trying to figure it out, with no luck.