Is it possible to configure HttpClient in C# to consistently retrieve website meta data without being blocked?

I’ve set up an HttpClient to retrieve meta data from websites so I can build a preview when a URL is posted in a user message on my website. This works for links to most sites, but a significant number do not return the expected data.

For example, for https://www.opendemocracy.net/en/uk/ both Facebook and Twitter retrieve the meta data and display a preview, but the HTML returned to the C# HttpClient does not include the required meta data, and the title is “Attention Required”:

<head>
<title>Attention Required! | Cloudflare</title>
<meta charset="UTF-8" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=Edge" />

I’ve been checking against HeyMeta and it too fails to retrieve the meta – https://www.heymeta.com/url/www.opendemocracy.net/en/url/www.opendemocracy.net/en/uk/

Here’s my code to retrieve the full HTML, from whose head the meta data can be extracted:

HttpResponseMessage response = await httpClient.GetAsync(uri).ConfigureAwait(false);
HttpContent content = response.Content;
var html = content.ReadAsStringAsync().Result;

Example of meta data returned successfully:

<title>Forge Bridge Cottage | Coniston</title>
<meta property="og:title" content="Forge Bridge Cottage | Coniston">
<meta property="og:description" content="Coppermines Cottages | Forge Bridge Cottage | Coniston  Lake District Cottages">

And the HeyMeta result: https://www.heymeta.com/url/www.coppermines.co.uk/accommodation/forge-bridge-cottage-coniston

How can my code be adapted to consistently retrieve the meta data from websites just like Facebook and Twitter without apparently being identified as a scraping bot? Is there a way to indicate in the request that I’m only interested in what should be publicly available meta data?

  • Side note: you should use await rather than .Result, as in var html = await content.ReadAsStringAsync();. Also, response needs a using.

  • Try adding a reasonable looking User Agent. Ultimately that Cloudflare page is trying to prevent bots, which is what you are.

  • Thank you, I’ve added the await and a using on response. The key was adding the User-Agent as described here: stackoverflow.com/questions/44076962/… (use either httpClient.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0 (compatible; AcmeInc/1.0)") or httpClient.DefaultRequestHeaders.Add("User-Agent", "C# App");)

Thanks to the comment from @Charliface, the answer is to add a User-Agent header (for how, see “How do I set a default User Agent on an HttpClient?”).

As a friendly bot that just wants to provide a link back to a site referenced by a user in a post, my API only needs the meta data from the head, but this can be extracted from the full HTML returned:

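        using (HttpClient httpClient = new HttpClient())
        {
            // Identify the client; without a User-Agent the request was met with the Cloudflare block page.
            httpClient.DefaultRequestHeaders.UserAgent.ParseAdd("C# App");
            httpClient.Timeout = TimeSpan.FromSeconds(20);

            using (HttpResponseMessage response = await httpClient.GetAsync(uri).ConfigureAwait(false))
            {
                // Read the full HTML; the meta data is extracted from its head afterwards.
                HttpContent content = response.Content;
                var html = await content.ReadAsStringAsync().ConfigureAwait(false);

                return html;
            }
        }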
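
Once the HTML comes back, the title and Open Graph tags can be pulled out of it. The following is only a minimal sketch of that step, not the code actually used on my site: MetaExtractor and Extract are hypothetical names, and the regular expressions assume double-quoted attributes with property appearing before content, as in the successful example in the question. Real-world pages vary more than that, so a proper HTML parser would likely be more robust.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original answer.
public static class MetaExtractor
{
    // Matches <title>...</title>; Singleline lets '.' span line breaks.
    private static readonly Regex TitleRegex = new Regex(
        @"<title[^>]*>(?<text>.*?)</title>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Assumption: property comes before content and both use double quotes.
    private static readonly Regex OgRegex = new Regex(
        @"<meta\s+property\s*=\s*""(?<name>og:[^""]+)""\s+content\s*=\s*""(?<value>[^""]*)""",
        RegexOptions.IgnoreCase);

    // Returns "title" plus any og:* properties found in the HTML.
    public static Dictionary<string, string> Extract(string html)
    {
        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

        Match title = TitleRegex.Match(html);
        if (title.Success)
        {
            // Decode entities such as &amp; so the preview shows plain text.
            result["title"] = WebUtility.HtmlDecode(title.Groups["text"].Value.Trim());
        }

        foreach (Match m in OgRegex.Matches(html))
        {
            result[m.Groups["name"].Value] = WebUtility.HtmlDecode(m.Groups["value"].Value);
        }

        return result;
    }
}
```

Calling MetaExtractor.Extract(html) on the string returned above gives a dictionary keyed by "title", "og:title", "og:description" and so on. If only "title" comes back and it reads “Attention Required! | Cloudflare”, the request was most likely still blocked.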

[Image: example linkback to OpenDemocracy.net created from the extracted meta data]
