I’ve set up an HttpClient to retrieve metadata from websites so that I can build a preview when a URL is posted in a user message on my website. This works for links to most sites, but a significant number do not return the expected data.
For example, with https://www.opendemocracy.net/en/uk/ both Facebook and Twitter retrieve the metadata and display a preview, but the HTML returned to the C# HttpClient does not include the required metadata, and the title is “Attention Required”:
<head>
<title>Attention Required! | Cloudflare</title>
<meta charset="UTF-8" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=Edge" />
I’ve been checking against HeyMeta, and it too fails to retrieve the metadata: https://www.heymeta.com/url/www.opendemocracy.net/en/url/www.opendemocracy.net/en/uk/
Here’s my code to retrieve the full HTML, from whose head the metadata can be extracted:
HttpResponseMessage response = await httpClient.GetAsync(uri).ConfigureAwait(false);
HttpContent content = response.Content;
var html = content.ReadAsStringAsync().Result;
Example of metadata returned successfully:
<title>Forge Bridge Cottage | Coniston</title>
<meta property="og:title" content="Forge Bridge Cottage | Coniston">
<meta property="og:description" content="Coppermines Cottages | Forge Bridge Cottage | Coniston Lake District Cottages">
And the HeyMeta result: https://www.heymeta.com/url/www.coppermines.co.uk/accommodation/forge-bridge-cottage-coniston
How can my code be adapted to consistently retrieve the metadata from websites, just as Facebook and Twitter do, without apparently being identified as a scraping bot? Is there a way to indicate in the request that I’m only interested in metadata that should be publicly available?
Thanks to a comment from @Charliface, the answer is to add a User-Agent header (for how, see “How do I set a default User Agent on an HttpClient?”).
As a friendly bot that only wants to provide a link back to a site referenced by a user in a post, my API only needs the metadata from the head, but this can be extracted from the full HTML returned:
using (HttpClient httpClient = new HttpClient())
{
    // Send a User-Agent; without one, the request received the "Attention Required" page
    httpClient.DefaultRequestHeaders.UserAgent.ParseAdd("C# App");
    httpClient.Timeout = TimeSpan.FromSeconds(20);
    using (HttpResponseMessage response = await httpClient.GetAsync(uri).ConfigureAwait(false))
    {
        HttpContent content = response.Content;
        // Read the full HTML; the meta tags are extracted from the head afterwards
        var html = await content.ReadAsStringAsync();
        return html;
    }
}
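Since the answer mentions extracting the metadata from the full HTML, here is a minimal sketch of one way that step might look. The class name MetaExtractor and its Extract method are hypothetical and not part of the original code. It assumes the tags are written with property before content, as in the examples above, and uses regular expressions only for brevity; a real HTML parser would be more robust.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not from the original post.
public static class MetaExtractor
{
    // Assumes the attribute order property="og:..." content="...".
    // The \k<q> backreference makes the closing quote match the opening one.
    private static readonly Regex OgTag = new Regex(
        @"<meta\s+property=[""']og:(?<name>[^""']+)[""']\s+content=(?<q>[""'])(?<value>.*?)\k<q>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    private static readonly Regex TitleTag = new Regex(
        @"<title[^>]*>(?<value>.*?)</title>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the page title (key "title") and any og:* properties found.
    public static Dictionary<string, string> Extract(string html)
    {
        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

        Match title = TitleTag.Match(html);
        if (title.Success)
        {
            result["title"] = WebUtility.HtmlDecode(title.Groups["value"].Value.Trim());
        }

        foreach (Match m in OgTag.Matches(html))
        {
            // A later duplicate of the same property overwrites an earlier one.
            result["og:" + m.Groups["name"].Value] =
                WebUtility.HtmlDecode(m.Groups["value"].Value);
        }

        return result;
    }
}
```

Calling MetaExtractor.Extract(html) on the returned string gives a dictionary keyed by "title", "og:title", "og:description" and so on. If the only entry is a title such as “Attention Required! | Cloudflare”, the request was probably still served the block page rather than the real document.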
Example linkback to OpenDemocracy.net created from the extracted metadata:
Side note: you should use await, not .Result, like var html = await content.ReadAsStringAsync(); Also, response needs a using.
Try adding a reasonable looking User Agent. Ultimately that Cloudflare page is trying to prevent bots, which is what you are.
Thank you, I’ve added the await and the using on response. The key was adding the User-Agent as described here: stackoverflow.com/questions/44076962/… (use either httpClient.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0 (compatible; AcmeInc/1.0)") or DefaultRequestHeaders.Add("User-Agent", "C# App");)
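The two options mentioned in that comment can be compared side by side. The sketch below is illustrative only: the class UserAgentSetup and its method names are hypothetical, and the User-Agent string is simply the example value from the comment. Both methods set a default header that is sent with every request made through the client.

```csharp
using System;
using System.Net.Http;

// Hypothetical helper, not from the original post.
public static class UserAgentSetup
{
    // Option 1: the typed header collection parses the value into product and comment parts.
    public static HttpClient CreateWithParseAdd(string userAgent)
    {
        var client = new HttpClient();
        client.DefaultRequestHeaders.UserAgent.ParseAdd(userAgent);
        return client;
    }

    // Option 2: add the header by name; the value is validated when it is added.
    public static HttpClient CreateWithAdd(string userAgent)
    {
        var client = new HttpClient();
        client.DefaultRequestHeaders.Add("User-Agent", userAgent);
        return client;
    }
}
```

Either client can then be used in place of the one in the answer, for example UserAgentSetup.CreateWithParseAdd("Mozilla/5.0 (compatible; AcmeInc/1.0)"). Reading client.DefaultRequestHeaders.UserAgent.ToString() afterwards is a quick way to confirm what will be sent.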