Hindi Language support #1617

chaudhary-mohit · 2024-05-29T14:38:59Z

🚀 The feature

#Hindi Language Support for Indians
For Indian users, Hindi should also be included in the Doctr vocabs.

Motivation, pitch

In India, most documents are in Hindi, which Doctr does not currently support. Hindi support is the one thing we need: with it, it would be easy to build POCs and solutions using Doctr as a first step in OCR-related work.

Alternatives

No response

Additional context

No response

ramSeraph · 2024-08-15T09:04:54Z

Let me know what is required w.r.t. datasets to make this happen. BTW, this exists: https://github.com/iitb-research-code/indic-doctr . And there is this for a corpus: https://huggingface.co/datasets/ai4bharat/sangraha

felixdittrich92 · 2024-08-15T10:13:47Z

The vocab was added in #1687

ramSeraph · 2024-08-15T10:26:43Z

I haven't spent enough time on this yet, so my understanding of the requirements might be lacking. But are you saying that adding the vocab is enough, since others can now use the code base with their own models for Hindi?

If that is the case, maybe we can add the vocabs for the rest of the scripts as well in a separate issue. I see the other Indic vocabs here: https://github.com/iitb-research-code/indic-doctr/blob/main/doctr/datasets/vocabs.py

felixdittrich92 · 2024-08-15T10:45:50Z

If you have a model that was trained with doctr on exactly the added vocab (same character order and length), then yes.
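To illustrate why the character order matters and not just the character set, here is a minimal sketch in plain Python. It assumes (as is typical for recognition models) that the model outputs class indices which are mapped back to characters by their position in the vocab string. The helper names `first_vocab_mismatch` and `decode` are hypothetical and are not part of doctr.

```python
from typing import List, Optional


def first_vocab_mismatch(trained_vocab: str, target_vocab: str) -> Optional[int]:
    """Return the first index where the two vocabs differ, or None if they are identical."""
    for idx, (a, b) in enumerate(zip(trained_vocab, target_vocab)):
        if a != b:
            return idx
    # Same prefix but different lengths: the mismatch starts where the shorter one ends
    if len(trained_vocab) != len(target_vocab):
        return min(len(trained_vocab), len(target_vocab))
    return None


def decode(indices: List[int], vocab: str) -> str:
    """Map class indices to characters by their position in the vocab (hypothetical helper)."""
    return "".join(vocab[i] for i in indices)


# Toy vocabs: same three characters, different order
trained = "कखग"
reordered = "खकग"
predicted_indices = [0, 1, 2]

print(first_vocab_mismatch(trained, trained))
print(first_vocab_mismatch(trained, reordered))
print(decode(predicted_indices, trained))
print(decode(predicted_indices, reordered))
```

With the reordered vocab, the same predicted indices decode to a different string, which is why a checkpoint can only be reused with the exact vocab it was trained on.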

felixdittrich92 · 2024-08-15T10:53:08Z

In general, you should already be able to use one of the provided models here:
https://github.com/iitb-research-code/indic-doctr/releases

For example:

import torch
from doctr.models import ocr_predictor, crnn_vgg16_bn

# Vocab copied from the indic-doctr repo
vocab = 'ॲऽऐथफएऎह८॥ॉम९ुँ१ं।षघठर॓ॼड़गछिॱटऩॄऑवल५ढ़य़अञसऔयण॑क़॒ौॽशऍ॰ूीऒॊख़उज़ॻॅ३ओऌळनॠ०ेढङ४़ॢग़पऊॐज२डैभझकआदबऋखॾ॔ोइ्धतफ़ईृःा६चऱऴ७-'
reco_model = crnn_vgg16_bn(pretrained=False, pretrained_backbone=False, vocab=vocab)
# Download: https://github.com/iitb-research-code/indic-doctr/releases/download/model2/crnn_vgg16_bn_hindi.pt
# torch.load does not expand "~", so resolve the home directory first
import os
local_model_path = os.path.expanduser("~/xyz/crnn_vgg16_bn_hindi.pt")
reco_params = torch.load(local_model_path, map_location="cpu")
reco_model.load_state_dict(reco_params)

predictor = ocr_predictor(reco_arch=reco_model, pretrained=True)

ramSeraph · 2024-08-15T10:59:11Z

I will check that on some sample documents. Thanks for the clarifications.

felixdittrich92 · 2024-10-10T17:11:50Z

Part of #1699

chaudhary-mohit added the type: enhancement Improvement label May 29, 2024

felixdittrich92 linked a pull request May 30, 2024 that will close this issue

[Datasets] Update vocabs.py by hindi chars #1618

Closed

felixdittrich92 closed this as completed Oct 10, 2024
