Bangla text and CSV export problem with Beautiful Soup output in Python [duplicate]

How can I change the code below so that the output (title, snippet, date) is shown in Bangla, and also save the output to a CSV file?

import json
import requests
from bs4 import BeautifulSoup


def getNewsData():
    # Send a browser-like User-Agent with the request.
    headers = {
        "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36"
    }
    response = requests.get(
        "https://www.google.com/search?q=মুক্তিযুদ্ধ&gl=bd&tbm=nws&num=1", headers=headers
    )
    soup = BeautifulSoup(response.content, "html.parser")
    news_results = []

    # Collect one dictionary per div.SoaBEf element found in the page.
    for el in soup.select("div.SoaBEf"):
        news_results.append(
            {
                "link": el.find("a")["href"],
                "title": el.select_one("div.MBeuO").get_text(),
                "snippet": el.select_one(".GI74Re").get_text(),
                "date": el.select_one(".LfVVr").get_text(),
                "source": el.select_one(".NUnG9d span").get_text()
            }
        )

    print(json.dumps(news_results, indent=2))


getNewsData()

Here’s the output.

[
  {
    "link": "https://dmpnews.org/%E0%A6%AE%E0%A6%B9%E0%A6%BE%E0%A6%A8-%E0%A6%AC%E0%A6%BF%E0%A6%9C%E0%A7%9F-%E0%A6%A6%E0%A6%BF%E0%A6%AC%E0%A6%B8-%E0%A6%89%E0%A6%AA%E0%A6%B2%E0%A6%95%E0%A7%8D%E0%A6%B7%E0%A7%87-%E0%A6%AC%E0%A6%BE-2/",
    "title": "\u09ae\u09b9\u09be\u09a8 \u09ac\u09bf\u099c\u09af\u09bc \u09a6\u09bf\u09ac\u09b8 \u0989\u09aa\u09b2\u0995\u09cd\u09b7\u09c7 \u09ac\u09be\u0982\u09b2\u09be\u09a6\u09c7\u09b6 \u09aa\u09c1\u09b2\u09bf\u09b6 \u09ae\u09c1\u0995\u09cd\u09a4\u09bf\u09af\u09c1\u09a6\u09cd\u09a7 \u099c\u09be\u09a6\u09c1\u0998\u09b0\u09c7\u09b0 \u0989\u09a6\u09cd\u09af\u09cb\u0997\u09c7 \n\u099a\u09bf\u09a4\u09cd\u09b0\u09be\u0999\u09cd\u0995\u09a8 \u09aa\u09cd\u09b0\u09a4\u09bf\u09af\u09cb\u0997\u09bf\u09a4\u09be \u0985\u09a8\u09c1\u09b7\u09cd\u09a0\u09bf\u09a4",
    "snippet": "\u09a1\u09bf\u098f\u09ae\u09aa\u09bf \u09a8\u09bf\u0989\u099c: \u09ae\u09b9\u09be\u09a8 \u09ac\u09bf\u099c\u09af\u09bc \u09a6\u09bf\u09ac\u09b8 \u0989\u09aa\u09b2\u0995\u09cd\u09b7\u09c7 \u09ac\u09be\u0982\u09b2\u09be\u09a6\u09c7\u09b6 \u09aa\u09c1\u09b2\u09bf\u09b6 \u09ae\u09c1\u0995\u09cd\u09a4\u09bf\u09af\u09c1\u09a6\u09cd\u09a7 \u099c\u09be\u09a6\u09c1\u0998\u09b0\u09c7\u09b0 \n\u0986\u09af\u09bc\u09cb\u099c\u09a8\u09c7 \u09aa\u09c1\u09b2\u09bf\u09b6 \u09b8\u09a6\u09b8\u09cd\u09af \u0993 \u09ac\u09be\u0982\u09b2\u09be\u09a6\u09c7\u09b6 \u09aa\u09c1\u09b2\u09bf\u09b6\u09c7...",
    "date": "\u09ea \u0998\u09a3\u09cd\u099f\u09be \u0986\u0997\u09c7",
    "source": "DMP News"
  }
]

What should I do to get the output in Bangla?

  • Those ARE in Bangla. By default, Python's json.dumps escapes every non-ASCII character as a \uXXXX backslash sequence, which is a valid JSON representation of the same text. If you read those into an application and display them, they will appear just fine.

  • Is it possible to print output in Bangla?

  • Of course. Keep in mind that JSON is not really “output”, though. It’s a data-interchange format, and its specification allows non-ASCII characters to be written as \uXXXX escapes (it does not require it). You could easily write yourself a short utility to read that with json.load and display the results the way you want. That’s how JSON is supposed to be used.

  • Use the ensure_ascii argument of json.dumps to control the encoding of strings: print(json.dumps(news_results, ensure_ascii=False, indent=2)).
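
To illustrate the comments above, here is a minimal sketch comparing the default behaviour of json.dumps with ensure_ascii=False. The single record is a shortened sample (only the date and source from the question's output), and the variable names escaped and readable are made up for this example. It also checks that both forms decode to the same data, which is why the escaped version is not actually "wrong".

```python
import json

# Shortened sample: only two fields from the question's output.
news_results = [{"date": "৪ ঘণ্টা আগে", "source": "DMP News"}]

# Default (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(news_results, indent=2)

# ensure_ascii=False: Bangla characters are written as-is.
readable = json.dumps(news_results, ensure_ascii=False, indent=2)

print(escaped)
print(readable)

# Both strings are valid JSON and decode to exactly the same Python data.
assert json.loads(escaped) == json.loads(readable) == news_results
```

Whether the Bangla text then renders correctly depends on the terminal or editor supporting UTF-8 and having a suitable font.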

There is a very useful tool for working with JSON files called “jq”. Here’s what it does when asked to display your JSON:

timr@Tims-NUC:~/src$ jq . < x.json
[
  {
    "link": "https://dmpnews.org/%E0%A6%AE%E0%A6%B9%E0%A6%BE%E0%A6%A8-%E0%A6%AC%E0%A6%BF%E0%A6%9C%E0%A7%9F-%E0%A6%A6%E0%A6%BF%E0%A6%AC%E0%A6%B8-%E0%A6%89%E0%A6%AA%E0%A6%B2%E0%A6%95%E0%A7%8D%E0%A6%B7%E0%A7%87-%E0%A6%AC%E0%A6%BE-2/",
    "title": "মহান বিজয় দিবস উপলক্ষে বাংলাদেশ পুলিশ মুক্তিযুদ্ধ জাদুঘরের উদ্যোগে \nচিত্রাঙ্কন প্রতিযোগিতা অনুষ্ঠিত",
    "snippet": "ডিএমপি নিউজ: মহান বিজয় দিবস উপলক্ষে বাংলাদেশ পুলিশ মুক্তিযুদ্ধ জাদুঘরের \nআয়োজনে পুলিশ সদস্য ও বাংলাদেশ পুলিশে...",
    "date": "৪ ঘণ্টা আগে",
    "source": "DMP News"
  }
]
timr@Tims-NUC:~/src$ 

So the characters are all there.
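
The question also asked about saving to a CSV file, which is not covered above. A minimal sketch using the standard csv module might look like the following. The helper name save_news_csv, the file name news_results.csv, and the shortened sample row are assumptions made for this example; in the real script you would pass the news_results list built inside getNewsData (for instance by returning it from the function). The utf-8-sig encoding is used on the assumption that the file may be opened in Excel, which relies on the byte-order mark to detect UTF-8.

```python
import csv


def save_news_csv(rows, path):
    """Write a list of result dictionaries to a UTF-8 CSV file."""
    fieldnames = ["link", "title", "snippet", "date", "source"]
    # utf-8-sig writes a BOM at the start of the file;
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


# Shortened, illustrative row shaped like the dictionaries in the question.
sample = [
    {
        "link": "https://dmpnews.org/",  # placeholder, not the full article URL
        "title": "মহান বিজয় দিবস",
        "snippet": "ডিএমপি নিউজ",
        "date": "৪ ঘণ্টা আগে",
        "source": "DMP News",
    }
]

save_news_csv(sample, "news_results.csv")

# Read the file back to confirm the Bangla text survived the round trip.
with open("news_results.csv", encoding="utf-8-sig", newline="") as f:
    rows_back = list(csv.DictReader(f))

print(rows_back[0]["date"])
```

Unlike JSON with the default settings, the CSV file stores the Bangla characters directly, so no escaping is involved here; only the file encoding matters.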
