New Feature: extract from string #210

doncat99 · 2025-01-10T06:38:52Z


# Initialize the extractor
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderPyPdf())
extractor.load_llm("gpt-4o-mini")  # or any other supported model

# Extract data from the document
result = extractor.extract(test_file_path, InvoiceContract)

The above is the standard usage of ExtractThinker.

What if I already have custom processing for the PDF document, such as removing headers and footers and filtering out the target string from the PDF document, and I want the extractor to continue based on my pdf_string?
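For illustration, the kind of pre-processing described here could look like the sketch below. It is a rough example in plain Python, not part of ExtractThinker: the function name `strip_repeated_lines` and the `min_ratio` parameter are hypothetical. It assumes the PDF has already been split into one string per page, and it treats any line that appears on most pages as a header or footer.

```python
from collections import Counter


def strip_repeated_lines(pages: list[str], min_ratio: float = 0.8) -> str:
    """Drop lines that repeat on most pages (likely headers or footers).

    Hypothetical helper: `pages` holds one already-extracted string per page.
    """
    # Count on how many pages each distinct non-empty line appears.
    counts = Counter()
    for page in pages:
        counts.update({line.strip() for line in page.splitlines() if line.strip()})

    # A line must appear on at least two pages to count as repeated.
    threshold = max(2, int(len(pages) * min_ratio))
    repeated = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [
            line
            for line in page.splitlines()
            if line.strip() and line.strip() not in repeated
        ]
        cleaned.append("\n".join(kept))

    # Join the cleaned pages into the single string to hand to the extractor.
    return "\n\n".join(cleaned)
```

The returned value plays the role of pdf_string in the question above.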

enoch3712 · 2025-01-10T10:07:19Z

Maintainer

Hello @doncat99!

Yes, sounds great!

There are two ways to do this:

You can write your own implementation of DocumentLoader that simply returns this content. The only problem here is that the returned structure needs to be defined in a specific way; if you just return the raw string, it will not work.

Another way is to pass the raw content directly in the extract call to enrich the extraction:

We can also make a DocumentLoader just for injection that checks that everything is correct.

I can do that
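As a rough sketch of what such an injection loader might look like: the class below is hypothetical and deliberately does not subclass ExtractThinker's real DocumentLoader, since the required interface and return structure are not spelled out in this thread. The class name `StringInjectionLoader`, the `load` method, and the `{"text": ...}` shape are all assumptions made for illustration.

```python
class StringInjectionLoader:
    """Hypothetical loader that wraps an already-extracted string.

    This is NOT ExtractThinker's actual DocumentLoader interface; the method
    name and the returned structure are assumptions for illustration only.
    """

    def __init__(self, content: str):
        # Validate up front so bad input fails before any LLM call is made.
        if not isinstance(content, str):
            raise TypeError("content must be a string")
        if not content.strip():
            raise ValueError("content must not be empty")
        self._content = content

    def load(self) -> dict:
        # Return the text wrapped in a structure rather than as a bare string
        # (the real structure expected by the library may differ).
        return {"text": self._content}
```

The point is only that validation happens once, inside the loader, so the extractor can trust what it receives.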


doncat99 · 2025-01-11T18:49:58Z

Author

@enoch3712, thanks for the hint!

The code below works.

extractor.set_skip_loading(skip=True)
extractor.extract("string", Contract, instruction)
extractor.set_skip_loading(skip=False)
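One possible refinement: if extract raises, the flag in the snippet above stays set to True. A small context manager can make sure it is always reset. The helper name `skip_loading` is hypothetical; it relies only on the set_skip_loading(skip=...) call shown above.

```python
from contextlib import contextmanager


@contextmanager
def skip_loading(extractor):
    """Temporarily make the extractor treat its input as raw content.

    Hypothetical helper; it only assumes the extractor exposes
    set_skip_loading(skip=...), as used in the snippet above.
    """
    extractor.set_skip_loading(skip=True)
    try:
        yield extractor
    finally:
        # Always restore normal loading, even if extract() raises.
        extractor.set_skip_loading(skip=False)
```

Usage would then be `with skip_loading(extractor): result = extractor.extract(pdf_string, Contract, instruction)`.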


enoch3712 · 2025-01-13T08:47:59Z

Maintainer

Hello @doncat99!

Yes, sounds good!


enoch3712 · 2025-01-17T15:30:53Z

Maintainer

Gonna take a look at this for the next release

