feat(actor): Docling Actor on Apify infrastructure

Open vancura opened this issue 1 year ago • 1 comments

Dear Docling maintainers,

I have wrapped Docling as an Apify Actor by adding the Actor definition in the .actor directory and published the Docling Actor on Apify Store. I've also added the Actor status badge and a brief usage description to the README, including the “Run on Apify” button.

For the full description of the Actor, please see the README file in the .actor directory.

Docling can now be used in the cloud without installation, free of charge. Users can avoid managing Python, OCR libraries, and ML model dependencies locally. The Actor can be run from Apify Console, through the API, or locally from the CLI:

apify call vancura/docling -i '{
    "documentUrl": "https://arxiv.org/pdf/2408.09869.pdf",
    "outputFormat": "json",
    "ocr": true
}'

The Actor processes documents and stores the results in Apify's key-value store under the OUTPUT_RESULT key. It supports multiple output formats:

  • Markdown
  • JSON
  • HTML
  • Plain text
  • Doctags (structured format)
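
As a rough sketch of how a client might locate the stored result, the snippet below builds the record URL for a key-value store entry. It follows the URL pattern that appears later in this thread (`https://api.apify.com/v2/key-value-stores/<store id>/records/<key>`); the helper names `record_url` and `fetch_record` are hypothetical, and whether a given store can be read without an API token is an assumption that is not verified here.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base URL taken from the example output later in this thread.
API_BASE = "https://api.apify.com/v2"


def record_url(store_id: str, key: str = "OUTPUT_RESULT") -> str:
    """Build the URL of one record in an Apify key-value store."""
    return (
        f"{API_BASE}/key-value-stores/{quote(store_id, safe='')}"
        f"/records/{quote(key, safe='')}"
    )


def fetch_record(store_id: str, key: str = "OUTPUT_RESULT") -> bytes:
    """Download the raw record body (assumes the record is readable as-is)."""
    with urlopen(record_url(store_id, key), timeout=30) as response:
        return response.read()
```

The store ID is specific to each run, so it has to be taken from the run's own output rather than hard-coded.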

Technical implementation

The Actor provides:

  • Cloud-based document processing through Apify's infrastructure
  • API access for easy integration
  • Support for multiple output formats
  • OCR capabilities for scanned documents
  • Integration potential with other Apify Actors
  • Clean error handling and input validation
  • Comprehensive output handling:
    • Processed documents in key-value store
    • Detailed processing logs
    • Dataset records with result URLs and status

I've packaged Docling's environment (~6GB Docker image) with all necessary dependencies:

  • Python 3.11
  • OCR libraries
  • ML models
  • Node.js 20.x
  • All required system binaries

Note: Due to the large size of the Docker image, each Actor run may take 3-5 minutes to start. This is normal behavior, and users shouldn't terminate the run prematurely.

Apify will sponsor your project

All the links to Apify in this PR are affiliate links under the Apify open source fair share program with id docling in the passive tier of the program. In the passive tier, Apify commits to sending a monthly commission via the GitHub Sponsor button from all new sign-ups that come through your link. The only action required on your part is to accept the pull request and ensure your GitHub Sponsor button is set up.

You can earn a larger commission and gain insights into traffic by registering directly with Apify, claiming ownership of the Actor on the Apify Store, and maintaining the Actor yourself. Simply contact support after signing up and pass the ownership challenge. The Actor will then be transferred, e.g., to ds4sd/docling, and you’ll see it under your Apify account.

To further increase your income from Apify, you can convert your Actor on Apify Store to the pay-per-event pricing model and join the active developer tier. We offer an individual competitive advantage for the active developer tier in the form of either a significantly reduced Apify margin or discounted compute unit pricing. Feel free to ask for it!

Benefits of the Actor Programming Model

The Web Actor Programming Model is a new concept for building serverless microapps, which are easy to develop, share, integrate, and build upon. Actors are a reincarnation of the UNIX philosophy for programs running in the cloud. Actors are web automation scripts that are easy to integrate and scale up. The main benefit is that even a small piece of software can be turned into a public cloud service in a heartbeat.

Apify is the largest ecosystem where developers build, deploy, and publish data extraction, web automation tools, and AI agents. With over 3,000 Actors on Apify Store and 10 years of experience in the market, Apify makes Docling accessible to over 250,000 developers using the platform monthly. This also enables integration with other Actors on Store, custom Actors, and platform integrations that can create much more powerful workflows than just individual parts.

Full disclosure

I work at Apify. Apify doesn’t sell your software, but we sell the computing resources needed to run your software in the cloud to the end users. Your project is one of the first we selected to pilot Apify's open source fair share program. Please let me know if there’s anything I can do to help you accept this PR! If you do, we’d be pleased to feature your project in our marketing communication.

If you have any questions or need assistance, don’t hesitate to reach out to me (@vancura) or @netmilk, the Apify VP of DX, or just write to us at [email protected].

vancura avatar Feb 03 '25 15:02 vancura

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • [X] title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
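
For anyone who wants to check a title locally before pushing, the rule above can be reproduced with a few lines of Python. This is only a sketch: it assumes the pattern is applied from the start of the title, and the helper name `is_conventional` is hypothetical.

```python
import re

# Same pattern as in the merge-protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True when the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.match(title) is not None


# The title of this pull request passes the check.
print(is_conventional("feat(actor): Docling Actor on Apify infrastructure"))
```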

mergify[bot] avatar Feb 03 '25 15:02 mergify[bot]

@vancura We really love this PR, but one question we have is whether we can synchronize the APIs from https://github.com/DS4SD/docling-serve with the API you will put in place.

PeterStaar-IBM avatar Feb 27 '25 13:02 PeterStaar-IBM

@vancura it doesn't support gpu, does it? what about supporting multipart/form-data next to pdf url?

archasek avatar Feb 28 '25 12:02 archasek

@vancura it doesn't support gpu, does it? what about supporting multipart/form-data next to pdf url?

Yes, it does support GPU, and multipart/form-data is supported as well.

The key questions/proposals for us are:

  1. Let's align the input format of the APIs, so that users can easily switch between the systems.
  2. Would it make sense to run the docling-serve image directly in Apify, or is the usual approach to wrap it as you did?

dolfim-ibm avatar Feb 28 '25 13:02 dolfim-ibm

We really love this PR, but one question we have is whether we can synchronize the APIs from DS4SD/docling-serve with the API you will put in place.

I'd be happy to align with the docling-serve API rather than create a parallel implementation; adapting the Actor to leverage docling-serve directly would make more sense. This would involve:

  1. rewriting the Actor to use docling-serve as the underlying engine instead of calling the Docling CLI directly;
  2. ensuring the Actor's input schema matches docling-serve's API parameters exactly;
  3. adding an adapter layer to connect docling-serve's outputs with Apify's storage system.
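
To make the migration concrete, here is a minimal sketch of translating the Actor's original input (`documentUrl`, `outputFormat`, `ocr`, as in the `apify call` example in the PR description) into a docling-serve style request body with `options` and `http_sources`. The function name `to_serve_request` and the format aliases are assumptions for illustration; only `"json"` is confirmed as an original `outputFormat` value in this thread.

```python
# Assumed spellings for the original Actor's output formats, mapped to the
# docling-serve "to_formats" values. Only "json" appears in the thread.
FORMAT_ALIASES = {
    "markdown": "md",
    "md": "md",
    "json": "json",
    "html": "html",
    "text": "text",
    "doctags": "doctags",
}


def to_serve_request(actor_input: dict) -> dict:
    """Translate the original Actor input into a docling-serve style body."""
    fmt = actor_input.get("outputFormat", "md")
    if fmt not in FORMAT_ALIASES:
        raise ValueError(f"unsupported output format: {fmt!r}")
    return {
        "options": {
            "to_formats": [FORMAT_ALIASES[fmt]],
            "do_ocr": bool(actor_input.get("ocr", True)),
        },
        "http_sources": [{"url": actor_input["documentUrl"]}],
    }
```

A translation layer like this would only be needed if the old input format had to keep working; exposing the docling-serve body directly avoids it.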

This approach would maintain a consistent API across the entire Docling ecosystem while allowing users to benefit from the serverless deployment on Apify. The docling-serve system already has a well-designed API structure and robust error handling which we can leverage, rather than maintaining two parallel implementations.

It doesn't support gpu, does it?

The current Actor implementation runs on Apify's infrastructure, which doesn't support GPUs. It's optimized for CPU-based processing, though Docling can leverage GPUs when available (just not in the Apify case).

What about supporting multipart/form-data next to pdf url?

The Actor currently accepts document URLs but could be extended to support direct file uploads. This would require:

  • modifying the input schema to accept base64-encoded files or implementing a temporary storage solution;
  • updating the processing script to handle the uploaded files;
  • adding a file upload interface in the Actor's web UI.

I can implement these changes if this functionality would be valuable to users.

Let's align the input format of the APIs, so that users can easily switch between the systems.

I agree that aligning the input formats would provide consistency for users. After examining docling-serve, I recommend:

  • API Alignment: I can update the Actor's input schema to match docling-serve's API structure, supporting the same parameters and options. This would create a consistent experience regardless of which system users choose.
  • Container Strategy: While running docling-serve directly on Apify is possible, the wrapper approach I implemented offers several advantages:
    • better integration with Apify's platform features (key-value stores, datasets, webhooks);
    • optimized for Apify's infrastructure constraints;
    • simpler monitoring and logging specific to the Apify environment;
    • easier maintenance and updates.

That said, if you prefer, we could create a hybrid approach where we:

  1. use docling-serve as the base container;
  2. add a thin adapter layer to connect it with Apify's platform;
  3. maintain API compatibility between both systems.

This would allow users to use either system while ensuring consistent behavior and output formats.

I will take a look at these improvements as soon as I can, hopefully next week.

vancura avatar Feb 28 '25 14:02 vancura

We really love this PR,..

Hi @PeterStaar-IBM, @archasek, @dolfim-ibm,

It makes me very happy to see how supportive you are of our work. Thank you for that!

@vancura is no longer with Apify full-time but is still able to help. Due to that, we have temporarily limited availability to work on this. I’ll personally do as much as I can to help you get this PR accepted so it doesn’t get abandoned.

If you eventually accept the PR, we would like to communicate it through our marketing channels to prove the concept internally and see whether there’s any traction in adoption. I’m happy to work on further refactoring and get Apify engineers involved once the concept is proven. If you have any other ideas on how to co-market, just let me know, I'm open to any sort of collaboration.

What would be the minimal increment that would allow the PR to go through? Is it the Apify Input Object <> Docling Serve API interoperability?

netmilk avatar Mar 04 '25 20:03 netmilk

Hi, I just want to say I will work on this PR later this week! I am not going anywhere, no worries :)

vancura avatar Mar 04 '25 21:03 vancura

@netmilk @vancura Good, let's target merging this by March 12th at the latest.

PeterStaar-IBM avatar Mar 05 '25 01:03 PeterStaar-IBM

@PeterStaar-IBM What would be the minimal increment that would allow the PR to go through, so we can prioritize? Is it the Apify Input Object <> Docling Serve API interoperability?

netmilk avatar Mar 05 '25 01:03 netmilk

@netmilk yes, let's align the input/output.

dolfim-ibm avatar Mar 05 '25 01:03 dolfim-ibm

I've completed all the requested changes to the Docling Actor. Switching from the full Docling CLI to the more efficient docling-serve API significantly improved the Actor.

Major improvements since commit df8226fe3208a7e6c9c3c465759677d6ece6a9c1:

  1. Switched to docling-serve API: Now using the official quay.io/ds4sd/docling-serve-cpu Docker image instead of custom installation
  2. Reduced Docker image size: From ~6GB to ~4GB, improving download speed and resource usage
  3. Improved API compatibility: Updated endpoints and payload structure to match docling-serve API format
  4. Enhanced response handling: Added a dedicated Python processor script for reliable API communication
  5. Multi-stage Docker build: More efficient container with only necessary dependencies
  6. Better error handling: Improved error detection, reporting, and recovery
  7. Enhanced startup health checks: Ensures the API is fully functional before processing
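
The startup health check in item 7 can be pictured as a simple polling loop. The sketch below is not the Actor's actual code: `wait_until_ready` and `http_probe` are hypothetical names, and the endpoint URL in the usage comment (`http://localhost:5001/health`) is an assumption about where docling-serve listens.

```python
import time
import urllib.request
from typing import Callable


def http_probe(url: str) -> Callable[[], bool]:
    """Return a probe that reports True when the URL answers with HTTP 200."""
    def probe() -> bool:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    return probe


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll the probe until it succeeds or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Connection refused, timeouts, and HTTP errors mean "not ready yet".
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Usage (assumed endpoint):
# ready = wait_until_ready(http_probe("http://localhost:5001/health"))
```

Passing the probe, sleep, and clock as parameters keeps the loop easy to exercise without a running server.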

These improvements make the Actor more reliable, efficient, and maintainable. The Actor is live on Apify at https://apify.com/vancura/docling and fully functional.

Please let me know if you'd like any further adjustments before merging!

vancura avatar Mar 09 '25 15:03 vancura

(Sorry, these noisy commits above are here to make DCO happy).

vancura avatar Mar 09 '25 15:03 vancura

Thank you @vancura, I've validated it and it works magic. It's 10x to 40x more effective.

@PeterStaar-IBM @dolfim-ibm, would you mind, please, indicating whether there are any outstanding issues that might be a blocker for the PR to be merged? I'm happy to help with anything.

netmilk avatar Mar 11 '25 12:03 netmilk

I see that in the current implementation you switched to calling docling-serve internally but still use the new custom input schema format that you defined. This is not really what we asked for in our comments.

The request was: let's expose only one input schema to the user.

Using docling-serve was only a suggestion, in case you planned to expose it directly. If you have to wrap it anyway, it might just introduce extra iterations.

dolfim-ibm avatar Mar 11 '25 12:03 dolfim-ibm

Thank you for the quick info. I think that it might be quite challenging to make the HTTP/REST design pattern compatible with the Web Actor Programming Model.

Just to make sure, you're asking to convert the Actor input schema (.actor/input_schema.json) in this PR, to the structure of the POST /v1alpha/convert/source request body JSON schema as defined in the Docling openapi.json.

So the intention is to make the input object from the docling-serve curl example compatible with the apify call input object:

$ echo '{
  "options": {
    "from_formats": [
      "docx",
      "pptx",
      "html",
      "image",
      "pdf",
      "asciidoc",
      "md",
      "xlsx"
    ],
    "to_formats": ["md", "json", "html", "text", "doctags"],
    "image_export_mode": "placeholder",
    "do_ocr": true,
    "force_ocr": false,
    "ocr_engine": "easyocr",
    "ocr_lang": [
      "fr",
      "de",
      "es",
      "en"
    ],
    "pdf_backend": "dlparse_v2",
    "table_mode": "fast",
    "abort_on_error": false,
    "return_as_file": false,
    "do_table_structure": true,
    "include_images": true,
    "images_scale": 2
  },
  "http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}]
}' > input.json

$ cat input.json | apify call vancura/docling

The output of the Actor is always going to be a list of links to files saved in the Actor object storage.

@dolfim-ibm Please confirm this is what you desire for us to do or explain where I misunderstood your requirements.

@vancura if you have a better idea for the output compatibility, please suggest it.

netmilk avatar Mar 11 '25 14:03 netmilk

Our point is to reduce the number of different input schemas that users have to deal with. Having the same API that we will promote in docling-serve should also increase adoption of Apify.

Also note that in the payload posted above, 95% of the arguments are optional. Apify could simply rely on their defaults to simplify it.

dolfim-ibm avatar Mar 11 '25 16:03 dolfim-ibm

The input schema is now compatible with the docling-serve API request body. The .options.return_as_file is always overridden to true.
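
A minimal sketch of that override, assuming the Actor receives the docling-serve request body as a plain dictionary. The function name `prepare_request` is hypothetical, and the check on `http_sources` is an added assumption rather than something the PR states.

```python
import copy


def prepare_request(actor_input: dict) -> dict:
    """Copy the input and force options.return_as_file to True."""
    if not actor_input.get("http_sources"):
        raise ValueError("http_sources must list at least one document URL")
    request = copy.deepcopy(actor_input)
    # Missing or null options are treated as "use the server defaults".
    options = dict(request.get("options") or {})
    options["return_as_file"] = True
    request["options"] = options
    return request
```

Leaving every other option untouched means the server-side defaults apply to anything the caller did not specify.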

For the POC, I've decided to support only zip output, stored in the Actor key-value store, with the object URL saved to the dataset. We're happy to iterate to support the single file input and the JSON output as a datastore item later if it gets any traction.

@dolfim-ibm @archasek Please let me know if that's closer to your expectations.

$ echo '
{
  "options": {
    "to_formats": ["md", "json", "html", "text", "doctags"]
  },
  "http_sources": [
    {"url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf"},
    {"url": "https://arxiv.org/pdf/2408.09869"}
  ]
}' | apify call netmilk/docling -s -o --memory=8192
...
[{
  "output_file": "https://api.apify.com/v2/key-value-stores/iOm5kxiePTX47Mrx8/records/OUTPUT",
  "format": "zip",
  "size": "8727774",
  "status": "success"
}]

$ curl -s https://api.apify.com/v2/key-value-stores/iOm5kxiePTX47Mrx8/records/OUTPUT > OUTPUT.zip
$ unzip OUTPUT.zip 
Archive:  OUTPUT.zip
  inflating: facial-hairstyles-and-filtering-facepiece-respirators.json  
  inflating: facial-hairstyles-and-filtering-facepiece-respirators.html  
  inflating: facial-hairstyles-and-filtering-facepiece-respirators.txt  
  inflating: facial-hairstyles-and-filtering-facepiece-respirators.md  
  inflating: facial-hairstyles-and-filtering-facepiece-respirators.doctags  
  inflating: 2408.09869v5.json       
  inflating: 2408.09869v5.html       
  inflating: 2408.09869v5.txt        
  inflating: 2408.09869v5.md         
  inflating: 2408.09869v5.doctags
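
Since the archive holds one file per requested format for each source document, a consumer may want to group its entries by document. The sketch below does that with the standard library only; `formats_by_document` is a hypothetical helper, and it assumes the flat `<document>.<format>` naming seen in the listing above.

```python
import io
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath


def formats_by_document(zip_bytes: bytes) -> dict[str, list[str]]:
    """Map each document name in the archive to its sorted file extensions."""
    grouped: dict[str, list[str]] = defaultdict(list)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            path = PurePosixPath(name)
            # "2408.09869v5.json" -> stem "2408.09869v5", extension "json"
            grouped[path.stem].append(path.suffix.lstrip("."))
    return {doc: sorted(exts) for doc, exts in grouped.items()}
```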

netmilk avatar Mar 13 '25 09:03 netmilk

@netmilk I think it looks ok, but somehow you got into a dirty commit history. Can you please try to resolve it?

dolfim-ibm avatar Mar 13 '25 09:03 dolfim-ibm

@dolfim-ibm @netmilk Okay, the deep force push caused the issues. I will re-sync the branch (and hence the PR) and apply all our changes on top of it. We'll lose our (PR's) history, but it should de-chaos what's happening here, and the resulting PR will be clean. Sorry!

EDIT: We should be good now.

vancura avatar Mar 13 '25 10:03 vancura

I just wanted to express my excitement that the PR made it to main. It makes me very happy. Thank you to everyone involved for working with us on it!

Please feel free to reach out to me or mention me or @souravjain540 in any future Issues or PRs. We'll do our best to support you as much as we can.

netmilk avatar Mar 24 '25 12:03 netmilk

@netmilk Believe us, we are very excited for every PR coming from the community, this is truly the open-source, open community mindset!

PeterStaar-IBM avatar Mar 25 '25 03:03 PeterStaar-IBM