Improved Text Parsing

I've had a long struggle in Stutter with the results returned from Mozilla's Readability API. The API returns two important objects: a "content" node with the HTML of the processed page, and a "textContent" node with just the text content of the page. We've been using textContent to display text in Stutter. Unfortunately, textContent as a DOM property is quite poor at providing human-quality results. It simply grabs the text nodes from everything in the DOM tree and concatenates them together. This means that sometimes the end of one paragraph butts up against the words at the start of the next one.
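To make the problem concrete, here is a small sketch. It does not use a real DOM; `naiveTextContent` is a hypothetical helper that only approximates what textContent does by dropping every tag and inserting nothing between neighbouring elements.

```javascript
// Hypothetical helper (not part of Stutter or Readability): approximates
// textContent by removing tags and joining the remaining text as-is.
function naiveTextContent(html) {
  return html.replace(/<[^>]*>/g, "");
}

const sample =
  "<p>The first paragraph ends here.</p><p>The second one starts here.</p>";

console.log(naiveTextContent(sample));
// The first paragraph ends here.The second one starts here.
```

Note the missing space after "here." where the two paragraphs meet. That is the kind of false collapse described above.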

I've implemented a lot of crazy regular expression patterns to identify the false collapses and fix them, but the result isn't perfect. I know some people have noticed this, and you can even see an example of it in the demo video.

Today I bit the bullet and decided to use the "content" object, with its full DOM, to construct a better text output. Initially I was going to leverage innerText to do this, but the browser implementations of that feature have diverged and I can't trust it. Thankfully this is an open-source world, and someone else has authored and maintained the html-to-text package, which does exactly what I want and even works in Node.
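The general idea of building text from the HTML structure can be sketched without any dependencies. This is not how html-to-text is implemented and not the code Stutter ships; `blockAwareText` and `BLOCK_TAGS` are hypothetical names, and the sketch assumes simple, well-formed markup (it does not decode entities or handle scripts, tables, or nested edge cases).

```javascript
// Assumption: a small, illustrative set of block-level tags.
const BLOCK_TAGS = ["p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6", "blockquote"];

// Hypothetical helper: turns block boundaries into blank lines before
// stripping tags, so adjacent paragraphs stay separated.
function blockAwareText(html) {
  const blockClose = new RegExp(`</(?:${BLOCK_TAGS.join("|")})\\s*>`, "gi");
  return html
    .replace(/<br\s*\/?>/gi, "\n")   // line breaks become newlines
    .replace(blockClose, "\n\n")     // end of a block becomes a blank line
    .replace(/<[^>]*>/g, "")         // drop every remaining tag
    .replace(/[ \t]+/g, " ")         // collapse runs of spaces and tabs
    .replace(/\n{3,}/g, "\n\n")      // never more than one blank line
    .trim();
}

const page =
  "<h1>Title</h1><p>First <em>paragraph</em>.</p><p>Second paragraph.</p>";

console.log(blockAwareText(page));
```

With this input, the heading and the two paragraphs come out separated by blank lines instead of being run together. A maintained library such as html-to-text covers far more cases than a few regular expressions can, which is why the release relies on it rather than on something like this.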

This release implements that new parsing algorithm, which should result in much better output, better background page highlighting, and a better experience overall.

v1.12.0
