RoyalRoad watermarks #2544

dipu-bd · 2025-01-07T18:36:37Z

If someone stumbles across this PR, here's a script that removes the current watermarks from EPUBs:

# pip install ebooklib beautifulsoup4 lxml
import os
from ebooklib import epub
from bs4 import BeautifulSoup

watermark_contains_phrase_set = {
    "is not rightfully on Amazon",
    "Royal Road",
    "report the violation.",
    "; please report.",
    "author's consent",
    ". Report it.",
    "is not meant to be on Amazon;",
    ". Please report it.",
    "report the infringement.",
    "without the author's approval.",
    "this story on Amazon",
    "if you see it on Amazon",
    "This story has been",
    "should you find it on Amazon",
    "support the author",
    "support the creat",
    "Ensure the author gets",
    ". Report sightings.",
    "without the author's",
    "authors get the support they deserve.",
    "author's preferred platform",
    "is published on a different ",
    "author. Report any sightings."
}

def is_lowest_level(tag):
    """Check if the tag is a <div> with no child elements."""
    return tag.name == "div" and not any(child.name for child in tag.children)

def filter_epub(input_path, output_path):
    # Load the EPUB
    book = epub.read_epub(input_path)
    
    for item in book.items:
        if item.media_type == "application/xhtml+xml":
            try:
                soup = BeautifulSoup(item.get_content(), "xml")

                # Find and remove <div> blocks at the lowest level containing watermark phrases
                for div in soup.find_all(is_lowest_level):
                    text = div.get_text(strip=True)
                    if text and any(phrase.lower() in text.lower() for phrase in watermark_contains_phrase_set):
                        print(f"Removed {text}")
                        div.decompose()

                # Ensure content is valid before updating
                if soup.body and soup.body.get_text(strip=True):
                    item.set_content(soup.encode('utf-8'))
                else:
                    print(f"Warning: Skipping empty content for {item.file_name}")
            except Exception as e:
                print(f"Error processing {item.file_name}: {e}")
        else:
            print(f"Skipping non-HTML item: {item.file_name} (type: {item.media_type})")

    # Save the modified EPUB
    try:
        epub.write_epub(output_path, book)
        print(f"Filtered EPUB saved as: {output_path}")
    except Exception as e:
        print(f"Failed to save EPUB: {e}")

if __name__ == "__main__":
    # Get all files in the current directory
    current_directory = os.getcwd()
    
    # Loop through all files and process those with .epub extension
    for filename in os.listdir(current_directory):
        if filename.lower().endswith(".epub") and not filename.lower().endswith("_filtered.epub"):
            input_path = os.path.join(current_directory, filename)
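            # Derive the output name from the stem so a mixed-case extension such as ".EPUB" cannot make output_path equal input_path
            stem, ext = os.path.splitext(filename)
            output_path = os.path.join(current_directory, f"{stem}_filtered{ext}")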
            print(f"Processing: {filename}")
            filter_epub(input_path, output_path)
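
The leaf-`<div>` rule used by `is_lowest_level` can be tried on an in-memory snippet without an EPUB file. The sketch below is not the script above: it swaps BeautifulSoup for the standard library's `xml.etree.ElementTree` so it runs with no extra packages, and `strip_watermark_divs`, `PHRASES`, and the sample markup are hypothetical names and data made up for illustration. It also assumes un-namespaced tags; real EPUB XHTML usually carries the XHTML namespace, so the `child.tag == "div"` comparison would need adjusting there.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened phrase list (lowercased up front for matching)
PHRASES = ("royal road", "report the violation.")


def strip_watermark_divs(xhtml: str) -> str:
    """Remove <div> elements that have no child elements and contain a phrase."""
    root = ET.fromstring(xhtml)
    # ElementTree has no parent pointers, so visit each element as a parent
    # and inspect its direct children.
    for parent in list(root.iter()):
        for child in list(parent):
            is_leaf_div = child.tag == "div" and len(child) == 0
            text = (child.text or "").strip().lower()
            if is_leaf_div and any(phrase in text for phrase in PHRASES):
                # Note: remove() also drops any tail text after the element.
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


# Made-up markup: one watermark <div> and one <div> that wraps story text
sample = (
    "<body>"
    "<p>Chapter text.</p>"
    "<div>Taken from Royal Road, this narrative should be reported if found on Amazon.</div>"
    "<div><p>The caravan left Royal Road at dawn.</p></div>"
    "</body>"
)

print(strip_watermark_divs(sample))
```

The second `<div>` survives because it has a child element, even though its text mentions Royal Road. A leaf `<div>` of story text that mentions Royal Road would be removed, which is the trade-off of matching on such a broad phrase.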

Unfortunately, this PR misses a bunch of watermarks. Here's a sample of unique watermarks from ~1k pages:

A case of theft: this story is not rightfully on Amazon; if you spot it, report the violation.
Ensure your favorite authors get the support they deserve. Read this novel on the original website.
If you discover this narrative on Amazon, be aware that it has been unlawfully taken from Royal Road. Please report it.
If you encounter this narrative on Amazon, note that it's taken without the author's consent. Report it.
If you spot this story on Amazon, know that it has been stolen. Report the violation.
If you stumble upon this narrative on Amazon, be aware that it has been stolen from Royal Road. Please report it.
If you stumble upon this tale on Amazon, it's taken without the author's consent. Report it.
Unauthorized content usage: if you discover this narrative on Amazon, report the violation.
Unauthorized duplication: this narrative has been taken without consent. Report sightings.
Unauthorized tale usage: if you spot this story on Amazon, report the violation.
Unauthorized usage: this tale is on Amazon without the author's consent. Report any sightings.
Unauthorized use of content: if you find this story on Amazon, report the violation.
Love this story? Find the genuine version on the author's preferred platform and support their work!
Royal Road is the home of this novel. Visit there to read the original and support the author.
Taken from Royal Road, this narrative should be reported if found on Amazon.
This content has been unlawfully taken from Royal Road; report any instances of this story if found elsewhere.
This story has been unlawfully obtained without the author's consent. Report any appearances on Amazon.
This tale has been unlawfully lifted from Royal Road; report any instances of this story if found elsewhere.
This book was originally published on Royal Road. Check it out there for the real experience.
This content has been misappropriated from Royal Road; report any instances of this story if found elsewhere.
This narrative has been illicitly obtained; should you discover it on Amazon, report the violation.
You could be reading stolen content. Head to Royal Road for the genuine story.
You might be reading a stolen copy. Visit Royal Road for the authentic version.
Support the creativity of authors by visiting Royal Road for this novel and more.
Support creative writers by reading their stories on Royal Road, not stolen versions.
The author's content has been appropriated; report any instances of this story on Amazon.
This text was taken from Royal Road. Help the author by reading the original version there.
This novel is published on a different platform. Support the original author by finding the official source.
Unlawfully taken from Royal Road, this story should be reported if seen on Amazon.
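
A list like the one above can double as a regression check for any phrase set. The following is a minimal sketch, not part of the original script: `uncovered` is a hypothetical helper, and the phrase list is deliberately cut down to three entries from the script's set so that one sample slips through.

```python
def uncovered(samples, phrases):
    """Return the samples that contain none of the phrases (case-insensitive)."""
    lowered = [phrase.lower() for phrase in phrases]
    return [s for s in samples if not any(p in s.lower() for p in lowered)]


# Deliberately reduced phrase list, for illustration only
phrases = ["Royal Road", "report the violation.", "author's consent"]

# Three watermarks copied from the sample list
samples = [
    "You might be reading a stolen copy. Visit Royal Road for the authentic version.",
    "If you spot this story on Amazon, know that it has been stolen. Report the violation.",
    "Unauthorized duplication: this narrative has been taken without consent. Report sightings.",
]

for line in uncovered(samples, phrases):
    print("Not covered:", line)
```

With this reduced list only the third sample is reported, since "without consent" does not contain "author's consent". Running the same check against the full phrase set and the full sample shows at a glance whether a new watermark variant needs a new phrase.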

Originally posted by @xp3xp3 in #2538 (comment)

dipu-bd · 2025-01-08T05:32:03Z

@zGadli @xp3xp3 can you test if the royalroad watermark fix works now?

dipu-bd self-assigned this Jan 7, 2025

dipu-bd added the source-issue label Jan 7, 2025

dipu-bd closed this as completed Jan 8, 2025

dipu-bd added a commit that referenced this issue Jan 8, 2025

RoyalRoad watermarks #2544

fafe991

dipu-bd added a commit that referenced this issue Jan 11, 2025

RoyalRoad watermarks #2544

b0e03ee
