A simple Python module to query HTML pages (or XML in general) using (almost) all available CSS selectors and rules. It doesn't bore you with weird objects: results are just plain old lists and dicts.
Why, if we already have BeautifulSoup? Because:
- bs4 doesn't support advanced selectors, such as a:not(.not-this-a) (the :not selector).
- It gets closer to the lxml performance range.
- I wanted to make something useful in Nim.
- Nim is a very flexible and powerful language that I am delving a bit deeper into.
- Nimquery is a great Nim package that gives us the querying capabilities.
- Nimpy is an awesome Nim package that builds a native Python extension (think numpy or pandas) from a Nim module.
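To make the a:not(.not-this-a) example from the list above concrete, here is a minimal standard-library sketch of what that selector is expected to match: every 'a' element whose class list does not contain 'not-this-a'. It does not use nemo at all; the AnchorCollector class and the sample markup are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects attributes of <a> tags that lack a given class (hypothetical helper)."""

    def __init__(self, excluded_class):
        super().__init__()
        self.excluded_class = excluded_class
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # The class attribute is a space-separated list; it may be absent.
        classes = (attributes.get("class") or "").split()
        if self.excluded_class not in classes:
            self.matches.append(attributes)


# Made-up sample markup with one excluded link and two that should match.
sample_html = (
    '<p>'
    '<a class="not-this-a" href="/skip">skip</a>'
    '<a class="keep" href="/keep">keep</a>'
    '<a href="/plain">plain</a>'
    '</p>'
)

collector = AnchorCollector("not-this-a")
collector.feed(sample_html)
matched_hrefs = [attributes["href"] for attributes in collector.matches]
print(matched_hrefs)
```

A real selector engine handles nesting, combinators and many more pseudo-classes; this only mirrors the single rule named above.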
Build it on your OS:
- Make sure you have nim and nimble installed and working
- Clone this repo
- Run nimble bld to generate the shared library
- Run nimble tst to test it with a bundled Python script
- And you are good to go!
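After the build, a quick way to confirm that Python can locate the module is to ask the import machinery before importing it. This is only a sketch: module_available is a hypothetical helper, and it assumes the built library is importable under the name nemo from the current working directory or another entry on sys.path.

```python
import importlib.util


def module_available(name):
    """Return True if Python can locate a top-level module called `name`."""
    return importlib.util.find_spec(name) is not None


# Reports whether the freshly built module is visible from this interpreter.
print("nemo importable:", module_available("nemo"))
```

If this reports False, the shared library is probably not in a directory listed on sys.path.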
Build it in a Docker container (for use with alpine or ubuntu containers):
- Be sure to have make and docker installed and working
- Clone this repo
- Run make build to get the alpine version (for ubuntu, set LINUX = ubuntu)
- Run make test to test it with the bundled Python script on the same container used to build
- There you have your nemo.so file to put into your desired container!
Prebuilt binaries (macosx, alpine and ubuntu only, for the lazy ones):
import nemo  # assumes the built nemo module is on Python's import path

queries = [
    'body span a:not(.first-item)',
    # all 'a's inside 'span's in 'body' that are not in the '.first-item' class
    '[href$=".pdf"]',
    # all links to PDFs
    'p, span'
    # all 'p's and all 'span's
]
results = dict(nemo.find(some_html, queries))
# results is a dict mapping each query string to a list of findings,
# where each finding is a dict of the element's attributes, with its content under the key 'text', like:
{
    'body span a:not(.first-item)': [{'tag': 'a', 'text': 'hi', 'class': 'last-item'}],
    '[href$=".pdf"]': [
        {'tag': 'a', 'href': 'link-to-pdf'},
        {'tag': 'a', 'href': 'link-to-other-pdf'}
    ],
    'p, span': [
        # loads of elements, or maybe none, who knows
    ]
}
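Since the results are plain dicts and lists, post-processing needs nothing beyond ordinary Python. The sketch below pulls one attribute out of every finding for a given query. collect_attribute is a hypothetical helper, not part of nemo, and sample_results is a hand-written copy of the example shape shown above, not real output.

```python
def collect_attribute(results, query, attribute):
    """Return the values of `attribute` for every finding of `query` that has it."""
    return [
        finding[attribute]
        for finding in results.get(query, [])
        if attribute in finding
    ]


# Hand-written stand-in for dict(nemo.find(some_html, queries)).
sample_results = {
    'body span a:not(.first-item)': [{'tag': 'a', 'text': 'hi', 'class': 'last-item'}],
    '[href$=".pdf"]': [
        {'tag': 'a', 'href': 'link-to-pdf'},
        {'tag': 'a', 'href': 'link-to-other-pdf'},
    ],
    'p, span': [],
}

pdf_links = collect_attribute(sample_results, '[href$=".pdf"]', 'href')
print(pdf_links)
```

Using results.get with an empty-list default keeps the helper safe for queries that are missing from the dict or that matched nothing.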