This project provides a Python application to convert images with text to searchable PDFs using Optical Character Recognition (OCR) technology. It is designed to enhance the efficiency of recreating scientific figures while ensuring compliance with publication standards.
- Name: Vishnu Parappulakkal
- Website: vishnu.framer.media
- High Accuracy: Uses advanced OCR models like pytesseract for precise text extraction.
- Confidentiality: Handles sensitive publication-related information with a focus on data privacy.
- Offline Capability: Operates locally without an internet connection, making it suitable for secure environments.
- No Manual Font Handling: Automatically processes text within images, eliminating the need for manual font adjustments.
- PDF Generation: Converts images and extracted text into well-formatted PDFs.
- User-Friendly Interface: Simple GUI for easy image selection and PDF saving.
- Custom Font Handling: Supports True Type fonts to replicate text styles accurately.
- Time-Saving: Reduces manual effort and accelerates the publication process.
To run this application, ensure you have the following:
- Python 3.x
- Required Python packages (listed in
requirements.txt
) - TESSDATA_PREFIX environment variable set to the correct path (refer to
ocr.py
)
- Open the Application: Launch the GUI.
- Open Image: Click "Open Image" to select an image file (.png, .jpg, .jpeg).
- Save PDF: Click "Save PDF" to convert the image to a searchable PDF.
- Opening the Image: Select an image file using the file dialog.
- Performing OCR:
- Extracts text and layout from the image.
- Positions and formats the text based on OCR data.
- Saving the PDF: The processed image and text are saved as a PDF.
PDF format is chosen for its compatibility and ease of use. Although EPS files would allow editable content, PDFs are more widely supported and interactive.
- Multiple Text Boxes: OCR may create multiple text boxes for single paragraphs, complicating editing.
- Font Size Variation: Inconsistent font sizes may occur.
- Undetected Low-Resolution Text: Poor quality images may result in inaccurate text detection.
- Manual Adjustments: Copy-paste text into a new layer in an editor for better alignment and formatting.
- Image Upscaling: Upscale images before processing for improved accuracy.
- Image Segmentation: Segment images to enhance text detection.
- Preprocessing: Techniques like noise reduction and contrast adjustment are crucial for improving OCR accuracy.
- Text Orientation: Correcting text orientation helps in accurate text extraction.
- Text Layout: Managing text layout to address font size and alignment issues is essential.
- Format Selection: Choosing the right document format based on compatibility and usability needs.
- Manual Editing: Manual adjustments can refine the final document to closely match the original content.
If you would like to contribute to this project, please fork the repository and submit a pull request with your proposed changes. Make sure to follow the contribution guidelines provided in the repository.
- pytesseract: GitHub Repository
- Pillow: GitHub Repository
Copyright [2024] Vishnu Parappulakkal. Licensed under the Apache License, Version 2.0. See LICENSE for details.
For further questions, feel free to contact me at vishnu@vishnu.framer.media.
- v1.0.0 - Initial release with core features.