Hindi Language support #1617

Closed
chaudhary-mohit opened this issue May 29, 2024 · 7 comments

@chaudhary-mohit

chaudhary-mohit commented May 29, 2024

🚀 The feature

Hindi language support for Indian users
Hindi should also be included in the docTR vocabs.

Motivation, pitch

Most documents in India are in Hindi, which docTR does not currently support. Hindi support is the one thing we need: it would make it easy to build POCs and solutions with docTR as a first step in OCR-related work.

Alternatives

No response

Additional context

No response

@felixdittrich92 felixdittrich92 linked a pull request May 30, 2024 that will close this issue
@ramSeraph

Let me know what is required w.r.t. datasets to make this happen. BTW, this already exists: https://github.com/iitb-research-code/indic-doctr and, for a corpus, there is https://huggingface.co/datasets/ai4bharat/sangraha

@felixdittrich92
Contributor

felixdittrich92 commented Aug 15, 2024

The vocab was added in #1687.

@ramSeraph

I haven't spent enough time on this yet, so my understanding of the requirements may be lacking. Are you saying that adding the vocab is enough, since others can now use the code base with their own models for Hindi?

If that is the case, maybe we can add vocabs for the remaining Indic scripts in a separate issue. I see the other Indic vocabs here: https://github.com/iitb-research-code/indic-doctr/blob/main/doctr/datasets/vocabs.py

@felixdittrich92
Contributor

If you have a model that was trained with docTR on exactly the added vocab (same character order and length), then yes.
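To illustrate why the character order matters, here is a minimal pure-Python sketch. It assumes, as a simplification, that a recognition model emits class indices that are mapped back to characters by their position in the vocab string. The `decode` and `vocabs_compatible` helpers and the three-character vocabs are hypothetical and are not part of docTR.

```python
def decode(indices, vocab):
    """Map class indices back to characters by their position in the vocab."""
    return "".join(vocab[i] for i in indices)


def vocabs_compatible(training_vocab, runtime_vocab):
    """True only if both vocabs contain the same characters in the same order."""
    return training_vocab == runtime_vocab


# Hypothetical three-character vocabs, for illustration only
training_vocab = "कखग"
reordered_vocab = "गखक"  # same characters, different order

# Indices a model trained on training_vocab might emit for some word
predicted_indices = [0, 1, 2]

print(decode(predicted_indices, training_vocab))   # text as the model intended it
print(decode(predicted_indices, reordered_vocab))  # same indices, different text
print(vocabs_compatible(training_vocab, reordered_vocab))
```

With the reordered vocab, the same indices decode to different text even though both vocabs contain the same characters. A vocab of a different length would presumably also change the number of output classes, so the trained weights would likely not load at all.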

@felixdittrich92
Contributor

In general you should already be able to use one of the provided models here:
https://github.com/iitb-research-code/indic-doctr/releases

For example:

import os

import torch
from doctr.models import ocr_predictor, crnn_vgg16_bn

# Vocab copied from the indic-doctr repo; it must match the vocab the checkpoint was trained on
vocab = 'ॲऽऐथफएऎह८॥ॉम९ुँ१ं।षघठर॓ॼड़गछिॱटऩॄऑवल५ढ़य़अञसऔयण॑क़॒ौॽशऍ॰ूीऒॊख़उज़ॻॅ३ओऌळनॠ०ेढङ४़ॢग़पऊॐज२डैभझकआदबऋखॾ॔ोइ्धतफ़ईृःा६चऱऴ७-'
# Build the recognition architecture without pretrained weights, using the Hindi vocab
reco_model = crnn_vgg16_bn(pretrained=False, pretrained_backbone=False, vocab=vocab)
# Download: https://github.com/iitb-research-code/indic-doctr/releases/download/model2/crnn_vgg16_bn_hindi.pt
# torch.load does not expand "~", so expand the path explicitly
local_model_path = os.path.expanduser("~/xyz/crnn_vgg16_bn_hindi.pt")
reco_params = torch.load(local_model_path, map_location="cpu")
reco_model.load_state_dict(reco_params)

# Combine the default pretrained detection model with the Hindi recognition model
predictor = ocr_predictor(reco_arch=reco_model, pretrained=True)

@ramSeraph

I will check that on some sample documents. Thanks for the clarifications.

@felixdittrich92
Contributor

Part of #1699
