This seems like a potentially thorny problem to solve, given the broad variety of sources supported and the fact that every site does its formatting differently. I was recently trying out this repo to port a long webnovel I'm reading to an e-reader, and noticed many instances of superfluous spaces around quotes, parentheses, and the like. The behavior seems to derive from how the `TextCleaner` class approaches its work: it calls `.strip()` on everything it can and then uses `" ".join(...)` to put it back together. As a result, HTML such as `<p>(<span>"hi there"</span>)</p>` is reconstituted incorrectly as `( "hi there" )` or similar.
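To make the symptom concrete, here is a minimal reproduction of the strip-then-join pattern. It is not the repo's actual `TextCleaner` code; `TextCollector` and `strip_and_join` are hypothetical names, and the standard library's `html.parser` stands in for whatever parser the project uses.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects every text node in document order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_and_join(html):
    # Hypothetical stand-in for the behavior described above:
    # strip each text node, then glue the pieces back with single spaces.
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(c.strip() for c in collector.chunks if c.strip())


print(strip_and_join('<p>(<span>"hi there"</span>)</p>'))
# prints: ( "hi there" )
```

The three text nodes `(`, `"hi there"`, and `)` had no whitespace between them in the source, but the join inserts a space at every node boundary.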
Distinguishing between when whitespace is collapsed and when it is not is a hassle, but HTML does have a standard way to handle it. (In other words: it should be possible to join text with a space when a browser would, and avoid doing so when a browser wouldn't.) However, I didn't find a clean and simple way to change the code to get the desired result, and I'm not familiar with the history of the code or the breadth of problems it's been designed to deal with. So I figured I'd ask: is this something y'all are aware of, and do you have any specific intent to fix or not fix it?
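For illustration, here is a rough sketch of the "join like a browser would" idea, under simplifying assumptions: text is only separated at block-level tags, whitespace runs inside a block collapse to a single space, and nothing is inserted at inline tag boundaries. `BLOCK_TAGS`, `InlineAwareText`, and `browser_like_text` are hypothetical names, the tag list is a small assumed subset, and CSS (for example `white-space: pre` or `display` overrides) is ignored entirely, so this is not a full implementation of browser whitespace handling.

```python
import re
from html.parser import HTMLParser

# Assumed subset of block-level tags; a real list would be longer.
BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "blockquote"}

# ASCII whitespace only, so a non-breaking space (U+00A0) is left alone.
HTML_WS = " \t\n\r\f"


class InlineAwareText(HTMLParser):
    """Groups text by block; inline tags never introduce a space."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._current = []

    def _flush(self):
        # Concatenate the raw text nodes, then collapse whitespace runs.
        text = re.sub(r"[ \t\n\r\f]+", " ", "".join(self._current))
        text = text.strip(HTML_WS)
        if text:
            self.blocks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._current.append(data)

    def close(self):
        super().close()
        self._flush()


def browser_like_text(html):
    parser = InlineAwareText()
    parser.feed(html)
    parser.close()
    return parser.blocks


print(browser_like_text('<p>(<span>"hi there"</span>)</p>'))
```

The key difference from strip-then-join is that whitespace is only ever preserved or collapsed, never invented: a space appears between two inline pieces only if the source had whitespace there.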