I’m trying to scrape the links located in the middle of this webpage, under the title "By City", using the requests module.
The script fails miserably: although the status code is 200, I end up getting nothing on the console.
Here is what I’ve tried:
import requests
from bs4 import BeautifulSoup
link = 'https://www.nursinghomes.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    'Referer': 'https://www.nursinghomes.com/',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9'
}
res = requests.get(link, headers=headers)
soup = BeautifulSoup(res.text, "html.parser")
for item in soup.select("li > strong > a[class^='text-link']"):
    print(item.get("href"))
Expected results (partial):
/tx/austin/
/md/baltimore/
/ny/bronx/
/ny/brooklyn/
/il/chicago/
If you add print(res.text)
you’ll see the following HTML response:
... Request unsuccessful. Incapsula incident ID: 15190006 ...
The website detects that the request is not coming from a real browser and rejects it, so bs4 never receives any useful HTML to parse.
You’ll need to find a way to make the website trust the request.
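One way to fail fast instead of silently printing nothing is to check the response body for the Incapsula block page before handing it to BeautifulSoup. Here is a minimal sketch; the marker string is taken from the error text shown above, and the helper name is just an illustration:

```python
def is_incapsula_blocked(body: str) -> bool:
    """Return True if the response body looks like an Incapsula block page."""
    return "Request unsuccessful. Incapsula incident ID" in body

# Sample bodies: the block page you got, and a fragment of the HTML you expected.
blocked_page = "... Request unsuccessful. Incapsula incident ID: 15190006 ..."
real_page = '<li><strong><a class="text-link" href="/tx/austin/">Austin</a></strong></li>'

print(is_incapsula_blocked(blocked_page))  # True
print(is_incapsula_blocked(real_page))     # False
```

If the check fires, sending more browser-like headers alone is unlikely to help, since Incapsula typically runs JavaScript challenges; a real browser context (e.g. Selenium or Playwright) or an official API, if the site offers one, would be the usual next step.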