How can I change the code below to get the output in Bangla (title, snippet, date) and also save the output to a CSV file?
import json
import requests
from bs4 import BeautifulSoup

def getNewsData():
    headers = {
        "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36"
    }
    response = requests.get(
        "https://www.google.com/search?q=মুক্তিযুদ্ধ&gl=bd&tbm=nws&num=1", headers=headers
    )
    soup = BeautifulSoup(response.content, "html.parser")
    news_results = []
    for el in soup.select("div.SoaBEf"):
        news_results.append(
            {
                "link": el.find("a")["href"],
                "title": el.select_one("div.MBeuO").get_text(),
                "snippet": el.select_one(".GI74Re").get_text(),
                "date": el.select_one(".LfVVr").get_text(),
                "source": el.select_one(".NUnG9d span").get_text()
            }
        )
    print(json.dumps(news_results, indent=2))

getNewsData()
Here’s the output.
[
{
"link": "https://dmpnews.org/%E0%A6%AE%E0%A6%B9%E0%A6%BE%E0%A6%A8-%E0%A6%AC%E0%A6%BF%E0%A6%9C%E0%A7%9F-%E0%A6%A6%E0%A6%BF%E0%A6%AC%E0%A6%B8-%E0%A6%89%E0%A6%AA%E0%A6%B2%E0%A6%95%E0%A7%8D%E0%A6%B7%E0%A7%87-%E0%A6%AC%E0%A6%BE-2/",
"title": "\u09ae\u09b9\u09be\u09a8 \u09ac\u09bf\u099c\u09af\u09bc \u09a6\u09bf\u09ac\u09b8 \u0989\u09aa\u09b2\u0995\u09cd\u09b7\u09c7 \u09ac\u09be\u0982\u09b2\u09be\u09a6\u09c7\u09b6 \u09aa\u09c1\u09b2\u09bf\u09b6 \u09ae\u09c1\u0995\u09cd\u09a4\u09bf\u09af\u09c1\u09a6\u09cd\u09a7 \u099c\u09be\u09a6\u09c1\u0998\u09b0\u09c7\u09b0 \u0989\u09a6\u09cd\u09af\u09cb\u0997\u09c7 \n\u099a\u09bf\u09a4\u09cd\u09b0\u09be\u0999\u09cd\u0995\u09a8 \u09aa\u09cd\u09b0\u09a4\u09bf\u09af\u09cb\u0997\u09bf\u09a4\u09be \u0985\u09a8\u09c1\u09b7\u09cd\u09a0\u09bf\u09a4",
"snippet": "\u09a1\u09bf\u098f\u09ae\u09aa\u09bf \u09a8\u09bf\u0989\u099c: \u09ae\u09b9\u09be\u09a8 \u09ac\u09bf\u099c\u09af\u09bc \u09a6\u09bf\u09ac\u09b8 \u0989\u09aa\u09b2\u0995\u09cd\u09b7\u09c7 \u09ac\u09be\u0982\u09b2\u09be\u09a6\u09c7\u09b6 \u09aa\u09c1\u09b2\u09bf\u09b6 \u09ae\u09c1\u0995\u09cd\u09a4\u09bf\u09af\u09c1\u09a6\u09cd\u09a7 \u099c\u09be\u09a6\u09c1\u0998\u09b0\u09c7\u09b0 \n\u0986\u09af\u09bc\u09cb\u099c\u09a8\u09c7 \u09aa\u09c1\u09b2\u09bf\u09b6 \u09b8\u09a6\u09b8\u09cd\u09af \u0993 \u09ac\u09be\u0982\u09b2\u09be\u09a6\u09c7\u09b6 \u09aa\u09c1\u09b2\u09bf\u09b6\u09c7...",
"date": "\u09ea \u0998\u09a3\u09cd\u099f\u09be \u0986\u0997\u09c7",
"source": "DMP News"
}
]
What should I do to get the output in Bangla?
There is a very useful tool for working with JSON files called “jq”. Here’s what it does when asked to display your JSON:
timr@Tims-NUC:~/src$ jq . < x.json
[
{
"link": "https://dmpnews.org/%E0%A6%AE%E0%A6%B9%E0%A6%BE%E0%A6%A8-%E0%A6%AC%E0%A6%BF%E0%A6%9C%E0%A7%9F-%E0%A6%A6%E0%A6%BF%E0%A6%AC%E0%A6%B8-%E0%A6%89%E0%A6%AA%E0%A6%B2%E0%A6%95%E0%A7%8D%E0%A6%B7%E0%A7%87-%E0%A6%AC%E0%A6%BE-2/",
"title": "মহান বিজয় দিবস উপলক্ষে বাংলাদেশ পুলিশ মুক্তিযুদ্ধ জাদুঘরের উদ্যোগে \nচিত্রাঙ্কন প্রতিযোগিতা অনুষ্ঠিত",
"snippet": "ডিএমপি নিউজ: মহান বিজয় দিবস উপলক্ষে বাংলাদেশ পুলিশ মুক্তিযুদ্ধ জাদুঘরের \nআয়োজনে পুলিশ সদস্য ও বাংলাদেশ পুলিশে...",
"date": "৪ ঘণ্টা আগে",
"source": "DMP News"
}
]
timr@Tims-NUC:~/src$
So the characters are all there.
Those ARE encoded in Bangla. JSON itself permits non-ASCII characters, but Python's json.dumps escapes them as \uXXXX sequences by default (ensure_ascii=True). If you read those into an application and display them, they will appear just fine.
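As a quick check of that claim, here is a minimal sketch that feeds the escaped "date" value from the output above back through json.loads. The variable names are only illustrative, and the printed text will look right only on a terminal that uses UTF-8 with a font covering Bangla.

```python
import json

# The "date" value exactly as it appears in the escaped output above,
# wrapped in quotes so that it is a complete JSON document.
escaped = '"\\u09ea \\u0998\\u09a3\\u09cd\\u099f\\u09be \\u0986\\u0997\\u09c7"'

# The JSON parser turns each \uXXXX escape back into the real character.
decoded = json.loads(escaped)

print(decoded)       # Bangla text, assuming a UTF-8 capable terminal
print(len(decoded))  # number of code points, not number of escape sequences typed
```

Nothing is lost in the escaped form; it is only a different spelling of the same string.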
Is it possible to print output in Bangla?
Of course. JSON is a data-interchange format rather than display output, and its specification allows non-ASCII characters to be written either literally or as \uXXXX escapes; json.dumps simply chooses the escaped form by default. You could easily write yourself a short utility to read that with json.load and display the results the way you want. That’s how JSON is supposed to be used.
Use the ensure_ascii argument of json.dumps to control the encoding of strings:
print(json.dumps(news_results, ensure_ascii=False, indent=2))
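The question also asks about saving to CSV, which the replies do not cover. One possible approach, sketched here with the standard csv module: the function name save_news_csv, the file name news_results.csv, and the sample row are all hypothetical, and the column names simply mirror the keys built in getNewsData(). The utf-8-sig encoding is an assumption aimed at Excel, which tends to mis-detect plain UTF-8 files.

```python
import csv

def save_news_csv(news_results, path="news_results.csv"):
    """Write a list of result dicts to a CSV file, keeping Bangla text intact."""
    fieldnames = ["title", "snippet", "date", "source", "link"]
    # utf-8-sig prepends a byte-order mark so spreadsheet tools detect UTF-8;
    # newline="" lets the csv module manage line endings itself.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(news_results)

# Hypothetical sample row, shaped like the dicts built in getNewsData().
sample = [{
    "link": "https://example.com/news",
    "title": "মুক্তিযুদ্ধ জাদুঘর",
    "snippet": "ডিএমপি নিউজ",
    "date": "৪ ঘণ্টা আগে",
    "source": "DMP News",
}]
save_news_csv(sample)
```

In the original script, getNewsData() would need to return news_results (rather than only printing it) so the list can be passed to a function like this.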