5.0.0rc1

Pre-release
@benoit74 benoit74 released this 07 Jan 10:33
9ecdb28

This is a major release with many breaking changes, but most of them are easy to adapt to.

It focuses on type safety with the introduction of runtime checks: any call to the zimscraperlib API must match the type definitions, or an exception is raised.
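To make the idea of a runtime check concrete, here is a minimal stdlib-only sketch. It is not how zimscraperlib performs its checks: the `checked` decorator and the `set_title` function are hypothetical, and only plain classes used as annotations are verified.

```python
import functools
import inspect
import typing


def checked(func):
    """Hypothetical decorator: reject arguments that don't match simple annotations."""
    hints = typing.get_type_hints(func)
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            # Only plain classes are checked in this sketch (no generics or unions)
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f"{name} must be {expected.__name__}, got {type(value).__name__}"
                )
        return func(*args, **kwargs)

    return wrapper


@checked
def set_title(title: str, max_length: int = 30) -> str:
    """Hypothetical API function used only to exercise the decorator."""
    return title[:max_length]


shortened = set_title("Hello", 3)  # matches the annotations, so the call goes through

try:
    set_title(42)  # wrong type for `title`
    rejected = False
except TypeError:
    rejected = True
```

The practical consequence for callers is the same as described above: a value of the wrong type fails at the call site instead of causing an obscure error later.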

Documentation is available as docstrings and on https://python-scraperlib.readthedocs.io

Main changes include:

  • ZIM metadata handling has completely changed with new types for each kind of metadata.
  • i18n module has been redesigned around a single main class Language
  • New rewriting module for HTML/CSS/JS (JS rewriting being done at runtime via Wombat)
  • Now supporting only Python 3.12

Added

  • Documentation using mkdocs, published on readthedocs.io (#92)
  • rewriting module to rewrite URLs in content for generic scrapers
    • rewriting.css to rewrite URLs in CSS files
    • rewriting.html to rewrite URLs in HTML files
    • rewriting.js to rewrite URLs in JS files (at runtime, using wombat)
      • wombat-setup javascript module in javascript/
  • typing module with custom types:
    • Callback to use where we expect callbacks
    • SupportsWrite, SupportsRead, SupportsSeeking, SupportsSeekableRead and SupportsSeekableWrite: protocols for IO type annotations
  • zim.metadata module with a type-based approach for each kind of metadata and helpers for custom ones
    • [zim.metadata] APPLY_RECOMMENDATIONS: general flag to toggle openZIM-recommended constraints
    • [zim.metadata] Type-based classes: Metadata, TextBasedMetadata, TextListBasedMetadata, DateBasedMetadata, IllustrationBasedMetadata
    • [zim.metadata] Usage-based classes: NameMetadata, LanguageMetadata, DefaultIllustrationMetadata, etc.
    • [zim.metadata] StandardMetadataList to package the standard metadata
    • See details for additional API endpoints and variables
  • [constants] DEFAULT_WEB_REQUESTS_TIMEOUT exposed for download module
  • [download] stream_file() now accepts timeout: int param (defaults to constant timeout) (#222)
  • [filesystem] path_from context manager to acquire a pathlib Path from Path or TemporaryDirectory
  • [i18n] Language, get_language() and get_language_or_none(). See breaking changes
  • [image.optimization] OptimizePngOptions dataclass to store PNG options
  • [image.optimization] OptimizeJpgOptions dataclass to store JPEG options
  • [image.optimization] OptimizeWebpOptions dataclass to store WebP options
  • [image.optimization] OptimizeGifOptions dataclass to store GIF options
  • [image.optimization] OptimizeOptions dataclass to store cross-formats options
  • [inputs] unique_values() to deduplicate a list while preserving order
  • [logging] DEFAULT_FORMAT_WITH_THREADS as many scrapers use threads
  • [video.encoding] reencode()'s existing_tmp_path param
  • [zim.filesystem] validate_folder_writable() to ensure one can write into a folder (#200)
  • [zim.creator] Creator._get_first_language_metadata_value() to retrieve first language from metadata
  • [zim.items] no_indexing_indexdata() to get an IndexData that disables indexing
  • [zim.items] URLItem.get_mimetype() now only returning str
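
Two of the helpers listed above are easy to picture with stdlib code. The sketch below is an assumption about their behaviour based only on the descriptions in this list, not the library's implementation: `dedupe_preserving_order` and `as_path` are hypothetical names standing in for `unique_values()` and `path_from`.

```python
from __future__ import annotations

import contextlib
import pathlib
import tempfile
from collections.abc import Iterable, Iterator
from typing import TypeVar

T = TypeVar("T")


def dedupe_preserving_order(items: Iterable[T]) -> list[T]:
    """Drop duplicates while keeping first occurrences (items must be hashable)."""
    # dict keys are unique and keep insertion order
    return list(dict.fromkeys(items))


@contextlib.contextmanager
def as_path(
    source: pathlib.Path | tempfile.TemporaryDirectory,
) -> Iterator[pathlib.Path]:
    """Yield a pathlib.Path whether given a Path or a TemporaryDirectory."""
    if isinstance(source, tempfile.TemporaryDirectory):
        # Entering the TemporaryDirectory yields its name; it is removed on exit
        with source as name:
            yield pathlib.Path(name)
    else:
        yield source


languages = dedupe_preserving_order(["eng", "fra", "eng", "deu", "fra"])

with as_path(tempfile.TemporaryDirectory()) as folder:
    existed_inside = folder.is_dir()
existed_after = folder.exists()
```

In this sketch a temporary directory is cleaned up when the block exits, while a plain Path is passed through untouched.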

Changed (Breaking)

  • Entire API is now type-protected using beartype. Any call to scraperlib that doesn't satisfy the annotated types will raise an exception
  • [constants] MANDATORY_ZIM_METADATA_KEYS and DEFAULT_DEV_ZIM_METADATA moved to zim/metadata
  • [download] YoutubeDownloader.download's options parameters now expect a dict[str, Any] instead of dict
  • [download] YoutubeConfig options now limited to str | bool | int | None
  • [download] _get_retry_adapter() now exposed as get_retry_adapter()
  • [download] stream_file's byte_stream param now more flexible, accepting SupportsWrite[bytes] | SupportsSeekableWrite[bytes]
  • [download] stream_file's proxies param now accepting dict[str, str] instead of dict
  • [filesystem] delete_callback() is now a simple callback accepting an fpath and deleting it (it no longer chains another callback).
  • [filesystem] delete_callback() doesn't fail on missing file (#192)
  • [i18n] Redesigned API around a single object:
    • Language, which is initialized with any acceptable code. Raises NotFoundError on ISO 639-3 matching failure
    • find_language_names() is retained but only accepts a query: str
    • added get_language() and get_language_or_none() as shortcuts around Language
    • is_valid_iso_639_3() is retained
  • [image.conversion] convert_image() now accepts io.BytesIO in place of IO[bytes] for src and dst.
  • [image.conversion] convert_svg2png() now accepts io.BytesIO in place of IO[bytes] for src and dst.
  • [image.optimization] optimize_png() now accepts options: OptimizePngOptions instead of individual params.
  • [image.optimization] optimize_jpeg() now accepts options: OptimizeJpgOptions instead of individual params.
  • [image.optimization] optimize_webp() now accepts options: OptimizeWebpOptions instead of individual params.
  • [image.optimization] optimize_gif() now accepts options: OptimizeGifOptions instead of individual params.
  • [image.presets] All presets now use the new options dataclass instead of ClassVar dict
  • [image.probing] format_for() now accepts io.BytesIO in place of IO[bytes] for src.
  • [image.probing] is_valid_image() now accepts io.BytesIO in place of IO[bytes] for image.
  • [image.utils] save_image() now accepts io.BytesIO in place of IO[bytes] for dst.
  • [video.config] Config now uses type annotations (it was mostly not using any).
  • [video.config] Config options only expecting str | None
  • [video.presets] All options only expecting str | None
  • [video.encoding] reencode() now always returning a tuple[bool, CompletedProcess]
  • [zim._libkiwix] MimetypeAndCounter now expects specific types for mimetype: str and value: int
  • [zim.filesystem] make_zim_file()'s publisher param now properly expects a str
  • [zim.filesystem] IncorrectZIMPathError renamed to IncorrectPathError
  • [zim.filesystem] MissingZIMFolderError renamed to MissingFolderError
  • [zim.filesystem] NotADirectoryZIMFolderError renamed to NotADirectoryFolderError
  • [zim.filesystem] NotWritableZIMFolderError renamed to NotWritableFolderError
  • [zim.filesystem] IncorrectZIMFilenameError renamed to IncorrectFilenameError
  • [zim.filesystem] validate_zimfile_creatable() renamed to validate_file_creatable()
  • [zim.items] Item and StaticItem now expecting hints as dict[libzim.writer.Hint, int] instead of dict
  • [zim.items] Item.get_hints() now returning dict[libzim.writer.Hint, int] instead of dict
  • [zim.items] URLItem.download_for_size() now specifying type annotations and reordered params
  • [zim.providers] FileLikeProvider.gen_blob() and URLProvider.gen_blob() now properly annotate their return type (Generator[libzim.writer.Blob, None, None])
  • [zim.providers] URLProvider.get_size_of() param url now explicitly expects a str
  • [zim.creator] Creator.config_metadata() signature changed, now mainly accepting a StandardMetadataList
  • [zim.creator] Creator.config_dev_metadata() signature changed to accept new metadata types
  • [zim.creator] Creator.add_item_for()'s callback renamed to callbacks and accepting Callback
  • [zim.creator] Creator.add_item()'s callback renamed to callbacks and accepting Callback
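
Several entries above replace individual parameters with a single options dataclass. The following sketch shows that pattern in isolation. `PngOptionsSketch`, its fields and `summarize_png_job` are hypothetical and do not reflect the actual fields of OptimizePngOptions.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PngOptionsSketch:
    """Hypothetical stand-in for an options dataclass; fields are illustrative."""

    max_colors: int = 256
    remove_transparency: bool = False
    fast_mode: bool = True


def summarize_png_job(options: PngOptionsSketch) -> str:
    """Hypothetical consumer: takes one options object instead of many params."""
    mode = "fast" if options.fast_mode else "thorough"
    alpha = "dropped" if options.remove_transparency else "kept"
    return f"{options.max_colors} colors, {mode} mode, transparency {alpha}"


# A preset is just an instance; derive a modified copy instead of mutating it
preset = PngOptionsSketch()
custom = replace(preset, max_colors=64, remove_transparency=True)

default_summary = summarize_png_job(preset)
custom_summary = summarize_png_job(custom)
```

Grouping options this way lets a type checker, or a runtime check, validate every field in one place, which fits the type-safety focus of this release.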

Changed

  • [deps] iso639-lang now requires at least v2.4.0
  • [download] stream_file() now returns tuple[int, requests.structures.CaseInsensitiveDict[str]] instead of tuple[int, requests.structures.CaseInsensitiveDict]
  • [download] stream_file() now accepts both fpath and byte_stream params (writes to both)
  • [image.utils] save_image() now accepts Any **params.
  • [zim.archive] Archive.counters now returning CounterMap (compatible with previous dict[str, int])
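
Writing one download to both a file and an in-memory stream can be sketched without any network access. The code below is an illustration under assumptions, not stream_file() itself: `write_chunks` and `WritableBytes` are hypothetical, the chunks are hard-coded, raising ValueError when no destination is given is a choice made for this sketch, and only a byte count is returned (no headers).

```python
from __future__ import annotations

import contextlib
import io
import pathlib
import tempfile
from collections.abc import Iterable
from typing import Protocol


class WritableBytes(Protocol):
    """Anything exposing write(bytes); an illustrative protocol."""

    def write(self, data: bytes, /) -> int: ...


def write_chunks(
    chunks: Iterable[bytes],
    fpath: pathlib.Path | None = None,
    byte_stream: WritableBytes | None = None,
) -> int:
    """Hypothetical helper: copy every chunk to each destination provided."""
    if fpath is None and byte_stream is None:
        raise ValueError("at least one of fpath or byte_stream is required")
    sinks: list[WritableBytes] = []
    total = 0
    with contextlib.ExitStack() as stack:
        if fpath is not None:
            # The file is opened here, so it is also closed here
            sinks.append(stack.enter_context(open(fpath, "wb")))
        if byte_stream is not None:
            # The caller keeps ownership of the stream; it is not closed here
            sinks.append(byte_stream)
        for chunk in chunks:
            for sink in sinks:
                sink.write(chunk)
            total += len(chunk)
    return total


buffer = io.BytesIO()
with tempfile.TemporaryDirectory() as tmp:
    target = pathlib.Path(tmp) / "out.bin"
    written = write_chunks([b"ab", b"cde"], fpath=target, byte_stream=buffer)
    on_disk = target.read_bytes()
```

Typing the stream as a protocol rather than a concrete class is what allows io.BytesIO, an open file or any custom writer to be passed interchangeably.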

Fixed

  • Direct dependencies are now properly referenced: pillow, urllib3, piexif, idna (#226)
  • [download] YoutubeDownloader.download now respects its return type (bool | Future[Any])
  • [image.conversion] convert_image() **params properly declared as accepting None.
  • [logging] getLogger()'s console param now properly accepting TextIO | io.StringIO | None
  • [video.probing] get_media_info() type annotation for src_path
  • [zim.archive] Archive.get_item() return type (libzim.reader.Item)

Removed

  • Support for Python 3.8/3.9/3.10/3.11. Only Python 3.12 is supported now.
  • [i18n] Lang (See breaking changes)
  • [i18n] get_iso_lang_data() (See breaking changes)
  • [i18n] update_with_macro() (See breaking changes)
  • [i18n] get_language_details() (See breaking changes)
  • [uri] rebuild_uri failsafe param (was only handling incorrect types)
  • [video.encoding] reencode()'s with_process param
  • [zim.creator] Creator.validate_metadata()
  • [zim.creator] Creator.convert_and_check_metadata()