Problem with scraped UTF-encoded strings #664

BaronCzerny · 2022-02-08T09:47:29Z

Hello,
when I scrape content written in Spanish, which contains accented and other special characters, the scraped strings contain escape sequences such as "\u00ed". I believe these are the Unicode escapes for the corresponding characters, in this case "í". I would like to have these sequences converted to the characters they represent. Or is that unwise, and should I feed my MongoDB collections with the strings as they are?

I have seen that there is a CLI switch called "encoding". Can I also pass it as an argument to get_posts()? I have tried, but the scraper module then raises an error.

Thanks in advance for your help!

Miguel

neon-ninja · 2022-02-08T20:03:22Z

Collaborator

Can you share a link to a post that has this problem?

BaronCzerny Feb 10, 2022
Author

Thanks for your reply! It is not online yet. I'm just learning to use the module and testing. My initial script looks like this:

from facebook_scraper import get_posts
from pymongo import MongoClient
import json

client = MongoClient()

## Global variables
group_id = '513525892177643'

dict_keys = ["header", "page_id", "post_id", "user_id", "user_url", "username", "post_text", "text", "comments", "comments_full"]

f = open("barracas_scraped.txt", "a")

for post in get_posts(group_id, pages=2, cookies='cookies.json', options={"comments":True}):

   formatted_post = {x:post.get(x, "null") for x in dict_keys}

   text = json.dumps(formatted_post, indent = 4, default = str)
   f.write(text)

f.close()

And this is part of the scraped text, which contains several of these Unicode escapes:

{
"comments_full": [
{
"comment_id": "1791532587710294",
"comment_image": null,
"comment_reaction_count": null,
"comment_reactions": null,
"comment_reactors": [],
"comment_text": "Es precioso ver cuantas personas nos acordamos de las barracas y de los sitios donde cada uno vivimos nuestra juventut, en aquella \u00e9poca los que ten\u00edamos los 18 o 20 a\u00f1os , eramos los m\u00e1s felices ,porque ten\u00edamos salas de bailes , para relacionarnos ,y salieron muchos matrimonios que todav\u00eda vivimos y recordamos aquellos tiempos, yo llevo 59 a\u00f1os de casado",
"comment_time": "2022-02-08 07:23:46",
"comment_url": "https://facebook.com/1791532587710294",
"commenter_id": "100005651019408",
"commenter_meta": null,
"commenter_name": "Juan Ortiz Perez",
"commenter_url": "https://facebook.com/juan.ortizperez.716?groupid=513525892177643&refid=18&tn=R-R",
"replies": []
},
{

neon-ninja Feb 10, 2022
Collaborator

The escaping is done by json.dumps, which escapes non-ASCII characters by default. Pass the ensure_ascii=False argument to json.dumps, like so:

text = json.dumps(formatted_post, indent = 4, default = str, ensure_ascii=False)
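To illustrate the difference, here is a minimal sketch using only the standard library. The sample text is a fragment of the comment quoted above, and the variable names are made up for this example. It also shows that the escaped form is still valid JSON and decodes back to the same characters, so the two outputs differ only in how they look on disk.

```python
import json

# Fragment of the scraped comment quoted earlier in the thread.
comment = {"comment_text": "en aquella época los que teníamos los 18 o 20 años"}

# Default behaviour: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(comment)

# ensure_ascii=False: characters are written out literally.
literal = json.dumps(comment, ensure_ascii=False)

print(escaped)
print(literal)

# Both forms decode to the same Python string.
assert json.loads(escaped) == json.loads(literal) == comment
```

If the text file is only an intermediate step, the dictionaries returned by the scraper already hold the real characters; the escapes appear only in the string produced by json.dumps.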

BaronCzerny Feb 11, 2022
Author

Thanks so much, it works now! I have inserted the scraped texts into a MongoDB collection and they indeed contain the correct special characters.
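One related detail that may matter when writing the output to a file: with ensure_ascii=False the file receives the literal characters, and open() without an encoding argument uses the platform default, which is not UTF-8 everywhere. A small sketch, assuming a temporary output path and a one-field sample post (both invented for this example), that sets the encoding explicitly:

```python
import json
import os
import tempfile

# Sample post invented for this sketch; the text is from the comment above.
post = {"post_text": "todavía vivimos y recordamos aquellos tiempos"}

# Hypothetical output location; the original script uses barracas_scraped.txt.
out_path = os.path.join(tempfile.mkdtemp(), "scraped_example.txt")

# Set the encoding explicitly so the result does not depend on the platform.
with open(out_path, "a", encoding="utf-8") as f:
    f.write(json.dumps(post, indent=4, default=str, ensure_ascii=False))

# Reading the file back with the same encoding restores the original text.
with open(out_path, encoding="utf-8") as f:
    restored = json.load(f)

assert restored == post
```

The with block also closes the file if an exception occurs partway through scraping, which a bare f.close() at the end of the script would not do.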
