Document Docker usage in README #228
end-9214 authored and benoit74 committed Mar 10, 2025
1 parent 0d1614b commit 4651b02
Showing 1 changed file with 66 additions and 53 deletions.
119 changes: 66 additions & 53 deletions README.md
@@ -14,78 +14,40 @@ storing content for offline usage.
> [!WARNING]
> This scraper is now known to have a serious flaw. A critical bug, https://github.com/openzim/gutenberg/issues/219, has been discovered which leads to incomplete archives. Work on https://github.com/openzim/gutenberg/issues/97 (a complete rewrite of the scraper logic) now seems mandatory to fix these problems. However, we currently lack the bandwidth to address these changes. Help is of course welcome, but be warned that this is going to be a significant project (at least 10 person-days to change the scraper logic so that the issue can be fixed, and probably double that, since humans are always bad at estimating).
## Coding guidelines
Main coding guidelines come from the [openZIM Wiki](https://github.com/openzim/overview/wiki)
## Getting Started

### Setting up the environment
The recommended way to run the Gutenberg scraper is using Docker, as it comes with all required dependencies pre-installed.

Here we will set up everything needed to run the source version from your machine, supposing you want to modify it. If you simply want to run the tool, you should either install the PyPI package or use the Docker image. The Docker image can also be used for development but needs a bit of tweaking for live reload of your code modifications.

### Install the dependencies

First, ensure you use the proper Python version, in line with the requirement of `pyproject.toml` (you might, for instance, use `pyenv` to manage multiple Python versions in parallel).

You then need to install the various tools/libraries needed by the scraper.

#### GNU/Linux

```
sudo apt-get install python-pip python-dev libxml2-dev libxslt-dev advancecomp jpegoptim pngquant p7zip-full gifsicle curl zip zim-tools
```

#### macOS
### Running with Docker

```
brew install advancecomp jpegoptim pngquant p7zip gifsicle
```

### Setup the package

First, clone this repository.
1. **Run the scraper with Docker**:

```bash
git clone git@github.com:kiwix/gutenberg.git
cd gutenberg
docker run -it --rm -v $(pwd)/output:/output ghcr.io/openzim/gutenberg:latest gutenberg2zim
```

If you do not already have it on your system, install `hatch` to build the software and manage virtual environments (you might be interested in our detailed [Developer Setup](https://github.com/openzim/_python-bootstrap/blob/main/docs/Developer-Setup.md) as well).
The `-v $(pwd)/output:/output` option mounts the `output` folder in your current directory to the `/output` folder inside the container (which is the working directory). This ensures that the ZIM file is saved to your local machine.

```bash
pip3 install hatch
```
2. **Show available options**:

Start a hatch shell: this will install software including dependencies in an isolated virtual environment.
To view all the available options for `gutenberg2zim`, run:

```bash
hatch shell
docker run ghcr.io/openzim/gutenberg:latest gutenberg2zim --help
```
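
The bind mount described in step 1 can be wrapped in a small shell function. This is a minimal sketch, not something shipped with the repository: `run_gutenberg2zim` and the `OUTPUT_DIR` variable are hypothetical names chosen for illustration. It creates the host directory before starting the container, because Docker creates a missing bind-mount source itself and the resulting directory is then usually owned by root.

```bash
# Hypothetical wrapper, not part of the repository: prepares the host-side
# output directory, then forwards all arguments to gutenberg2zim.
run_gutenberg2zim() {
    # OUTPUT_DIR is an assumed override; default to ./output as in the README.
    local out_dir="${OUTPUT_DIR:-$(pwd)/output}"

    # Create the directory first so it belongs to the current user.
    mkdir -p "$out_dir" || return 1

    # Same image and mount point as in the commands above.
    docker run -it --rm -v "$out_dir:/output" \
        ghcr.io/openzim/gutenberg:latest gutenberg2zim "$@"
}
```

With the function defined, `run_gutenberg2zim --help` lists the options, and any other arguments are passed through unchanged.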

That's it. You can now run `gutenberg2zim` from your terminal.
### Arguments

## Getting started

After setting up the whole environment you can just run the main
script `gutenberg2zim`. It will download, process and export the
content.
Customize the content download with the following options. For example, to download books in English or French with IDs 100 to 200 and only in PDF format:

```bash
./gutenberg2zim
```

#### Arguments

You can also specify parameters to customize the content. Only want
books with IDs 100-200? Books only in French? English? Or both?
No problem! You can also include or exclude book
formats. You can add bookshelves and the option to search books by
title to enrich your user experience.

```bash
./gutenberg2zim -l en,fr -f pdf --books 100-200 --bookshelves --title-search
docker run -it --rm -v $(pwd)/output:/output ghcr.io/openzim/gutenberg:latest gutenberg2zim -l en,fr -f pdf --books 100-200 --bookshelves --title-search
```

This will download books in English and French with IDs 100 to
200 in the HTML (default) and PDF formats.
The `-it` flags attach an interactive terminal so that you can see progress.
The `--rm` flag removes the container after completion.
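
When the same command runs from cron or a CI job there is usually no terminal attached, and `docker run -it` typically fails with an "input device is not a TTY" error. The sketch below, which assumes a POSIX-compatible shell, passes `-it` only when stdin is a terminal. `docker_tty_flags` and `scrape_books` are hypothetical helper names, not part of the repository.

```bash
# Hypothetical helper, not part of the repository.
# Prints "-it" when stdin is a terminal and nothing otherwise.
docker_tty_flags() {
    if [ -t 0 ]; then
        printf '%s\n' "-it"
    fi
}

# Hypothetical wrapper: forwards all arguments to gutenberg2zim.
scrape_books() {
    # $(docker_tty_flags) is deliberately unquoted: it expands to either
    # a single word or to nothing at all.
    docker run $(docker_tty_flags) --rm -v "$(pwd)/output:/output" \
        ghcr.io/openzim/gutenberg:latest gutenberg2zim "$@"
}
```

For example, `scrape_books -l en,fr -f pdf --books 100-200` behaves like the command above in a terminal and drops `-it` when run unattended.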

You can find the full arguments list below:

@@ -119,6 +81,57 @@ You can find the full arguments list below:
--use-any-optimized-version Try to use any optimized version found on optimization cache
```
## Contributing Code
Main coding guidelines are from the [openZIM Wiki](https://github.com/openzim/overview/wiki).
### Setting Up the Environment
Here we will set up everything needed to run the source version from your machine, supposing you want to modify it. If you simply want to run the tool, you should either install the PyPI package or use the Docker image. The Docker image can also be used for development but needs a bit of tweaking for live reload of your code modifications.
### Install the dependencies
First, ensure you use the proper Python version, in line with the requirement of `pyproject.toml` (you might, for instance, use `pyenv` to manage multiple Python versions in parallel).
You then need to install the various tools/libraries needed by the scraper.
#### GNU/Linux
```
sudo apt-get install python-pip python-dev libxml2-dev libxslt-dev advancecomp jpegoptim pngquant p7zip-full gifsicle curl zip zim-tools
```
#### macOS
```
brew install advancecomp jpegoptim pngquant p7zip gifsicle
```
### Setup the package
First, clone this repository.
```bash
git clone git@github.com:kiwix/gutenberg.git
cd gutenberg
```
If you do not already have it on your system, install `hatch` to build the software and manage virtual environments (you might be interested in our detailed [Developer Setup](https://github.com/openzim/_python-bootstrap/blob/main/docs/Developer-Setup.md) as well).
```bash
pip3 install hatch
```
Start a hatch shell: this will install software including dependencies in an isolated virtual environment.
```bash
hatch shell
```
That's it. You can now run `gutenberg2zim` from your terminal.
## Screenshots
![](https://raw.githubusercontent.com/openzim/gutenberg/main/pictures/screenshot_1.png)
@@ -127,4 +140,4 @@ You can find the full arguments list below:
## License
[GPLv3](https://www.gnu.org/licenses/gpl-3.0) or later, see
[LICENSE](LICENSE) for more details.
[LICENSE](LICENSE) for more details.