
Sunday, 7 January 2024

DocLLM: JPMorgan’s New AI for Visually Rich Multimodal Document Intelligence

Introduction

Documents are everywhere in our daily lives, from forms and invoices to reports and contracts. They often contain rich and complex information that requires both textual and spatial understanding. However, most existing artificial intelligence (AI) models are not well-equipped to handle such multimodal documents: they either ignore the layout structure or rely on expensive image encoders.

To address this gap, a team of researchers at JPMorgan AI Research, the AI research arm of JPMorgan Chase, one of the largest financial institutions in the world, has developed a new generative language model called DocLLM. The primary goal behind its development was to build a model capable of understanding and reasoning over visual documents, taking into account both textual semantics and spatial layout. DocLLM provides a scalable and robust solution for document intelligence, a key area of interest for JPMorgan and other businesses that deal with large volumes of diverse documents.

What is DocLLM?

DocLLM is a lightweight extension to traditional large language models (LLMs) designed for reasoning over visual documents. It stands out by focusing exclusively on bounding box information to incorporate the spatial layout structure, avoiding the need for expensive image encoders.

Key Features of DocLLM

DocLLM has several key features that make it a unique and powerful model for multimodal document understanding. Some of these features are:

  • Disentangled Spatial Attention Mechanism: One of the standout features of DocLLM is its disentangled spatial attention mechanism. It decomposes the attention computation of a classical transformer into separate terms that capture text-to-text, text-to-layout, layout-to-text, and layout-to-layout interactions, allowing for a more nuanced understanding of the document. A simplified sketch of this idea appears right after this list.
  • Handling of Irregular Layouts: Traditional models often struggle with irregular layouts found in visual documents. However, DocLLM’s unique approach allows it to handle these irregular layouts effectively. This makes it a versatile tool for document understanding.
  • Dealing with Heterogeneous Content: Visual documents often contain heterogeneous content, which can be challenging for many models. DocLLM, with its unique features, is capable of dealing with such content, making it a robust model for multimodal document understanding.
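
To make the disentangled attention idea concrete, here is a minimal, self-contained sketch rather than the authors' implementation: it assumes a single attention head, separate linear projections for the text and spatial (bounding-box) embeddings, and scalar lambda weights for the cross-modal terms, in line with the decomposition described in the paper.

```python
import torch
import torch.nn as nn

class DisentangledSpatialAttention(nn.Module):
    """Minimal single-head sketch of disentangled text/spatial attention.

    Simplified illustration only: the real model uses multi-head attention,
    causal masking, and its own settings for the lambda weights.
    """
    def __init__(self, d_model: int, lambdas=(1.0, 1.0, 1.0)):
        super().__init__()
        # Separate projections for the text and spatial (bounding-box) modalities.
        self.q_t = nn.Linear(d_model, d_model)
        self.k_t = nn.Linear(d_model, d_model)
        self.v_t = nn.Linear(d_model, d_model)
        self.q_s = nn.Linear(d_model, d_model)
        self.k_s = nn.Linear(d_model, d_model)
        # Scalar weights for the cross-modal terms (assumed values).
        self.l_ts, self.l_st, self.l_ss = lambdas

    def forward(self, x_text, x_spatial):
        # x_text, x_spatial: (batch, seq_len, d_model)
        qt, kt, vt = self.q_t(x_text), self.k_t(x_text), self.v_t(x_text)
        qs, ks = self.q_s(x_spatial), self.k_s(x_spatial)
        d = qt.size(-1) ** 0.5
        # Four disentangled score terms: text-text, text-spatial,
        # spatial-text and spatial-spatial interactions.
        scores = (qt @ kt.transpose(-2, -1)
                  + self.l_ts * (qt @ ks.transpose(-2, -1))
                  + self.l_st * (qs @ kt.transpose(-2, -1))
                  + self.l_ss * (qs @ ks.transpose(-2, -1))) / d
        attn = torch.softmax(scores, dim=-1)
        return attn @ vt
```

In a full model, a layer like this would replace the standard self-attention inside each transformer block, with causal masking and multiple heads added.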

These features make DocLLM a powerful tool for understanding and reasoning over visual documents, taking into account both textual semantics and spatial layout. Its ability to handle irregular layouts and heterogeneous content sets it apart from traditional models.

Capabilities/Use Case of DocLLM

  • Fine-Tuning Using a Large-Scale Instruction Dataset: DocLLM is fine-tuned using a large-scale instruction dataset. This dataset covers four core document intelligence tasks (visual question answering, natural language inference, key information extraction, and document classification), providing a comprehensive training ground for the model.
  • Superior Performance on Diverse Datasets: DocLLM has demonstrated its robustness and effectiveness by outperforming state-of-the-art large language models on 14 out of 16 datasets across all tasks. This shows the model’s ability to handle a wide range of document types and layouts.
  • Strong Generalization Capabilities: In addition to its impressive performance on known datasets, DocLLM also generalizes well to previously unseen datasets. It has shown strong performance on 4 out of 5 such datasets, indicating its potential for real-world applications.

Working Mechanism and Architecture of DocLLM

DocLLM is a lightweight extension to standard Large Language Models (LLMs) that excels in visually rich form understanding tasks. It models both spatial layouts and text semantics, making it intrinsically multi-modal. The model incorporates spatial layout information through bounding box coordinates of text tokens, typically obtained using Optical Character Recognition (OCR), without the need for any vision encoder component.

Key elements of DocLLM

source - https://github.com/dswang2011/DocLLM

The architecture of DocLLM is built upon the foundation of an auto-regressive transformer language model. It follows a causal decoder structure and is composed of stacked transformer blocks. Each block contains a multi-head self-attention layer and a fully connected feed-forward network. Unlike standard language models that are typically unimodal and accept only a sequence of text tokens as input, DocLLM is a multi-modal system. It integrates lightweight visual information by utilizing the spatial positions and dimensions of text tokens obtained using OCR.
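
To illustrate what this lightweight visual information might look like in practice, the sketch below pairs each OCR token with a normalized bounding box and embeds the two modalities separately. The vocabulary size, the (x0, y0, x1, y1) box format, and the linear projection of coordinates are assumptions made here for clarity, not the exact encoding used by DocLLM.

```python
import torch
import torch.nn as nn

class DocInputEmbedder(nn.Module):
    """Sketch: embed OCR tokens and their bounding boxes as two modalities."""
    def __init__(self, vocab_size: int = 32000, d_model: int = 512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Bounding box given as (x0, y0, x1, y1), normalized to [0, 1].
        self.bbox_proj = nn.Linear(4, d_model)

    def forward(self, token_ids, bboxes):
        # token_ids: (batch, seq_len); bboxes: (batch, seq_len, 4)
        text_emb = self.token_emb(token_ids)   # text modality
        spatial_emb = self.bbox_proj(bboxes)   # spatial modality
        # The two streams are kept separate so that the attention layers
        # can relate them in a disentangled way.
        return text_emb, spatial_emb
```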

The attention mechanism of LLMs is extended in DocLLM to capture dependencies between text semantics and spatial layouts. This extension allows DocLLM to understand both the textual content and the spatial arrangement of elements in a document, treating the spatial information as a distinct modality. It computes the inter-dependency between the text modality and this spatial modality in a disentangled manner.
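
Concretely, the paper expresses the attention score between tokens i and j as the sum of four disentangled terms, where q^t and k^t are queries and keys projected from text embeddings, q^s and k^s are projected from spatial (bounding-box) embeddings, and the lambda scalars weight the cross-modal interactions (notation lightly adapted here):

```latex
A_{i,j} = q_i^{t} (k_j^{t})^{\top}
        + \lambda_{t,s}\, q_i^{t} (k_j^{s})^{\top}
        + \lambda_{s,t}\, q_i^{s} (k_j^{t})^{\top}
        + \lambda_{s,s}\, q_i^{s} (k_j^{s})^{\top}
```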

DocLLM uses infilling of text blocks as its pre-training objective, allowing the model to better leverage contextual information and handle visual documents more effectively. The pre-trained model is then fine-tuned for several document intelligence tasks on instruction data curated from multiple datasets.
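
A rough sketch of what a block-infilling training example might look like is shown below. The sentinel tokens and the way the masked block is appended as the generation target are assumptions for illustration, in the style of span/block-infilling objectives, rather than the paper's exact recipe.

```python
from typing import List, Tuple

def make_infilling_example(tokens: List[str],
                           block: Tuple[int, int],
                           mask_token: str = "<mask>",
                           sep_token: str = "<infill>") -> Tuple[List[str], List[str]]:
    """Build a (context, target) pair for block infilling.

    Illustrative only: replaces one contiguous text block with a mask
    sentinel and asks the model to generate the missing block after a
    separator token. DocLLM's actual masking strategy may differ.
    """
    start, end = block
    context = tokens[:start] + [mask_token] + tokens[end:]
    target = [sep_token] + tokens[start:end]
    # An autoregressive LM would be trained to continue `context`
    # with `target`, conditioning on the surrounding text (and layout).
    return context, target

# Example usage
tokens = "Invoice Number : 12345 Date : 2024-01-07 Total : $99.00".split()
context, target = make_infilling_example(tokens, block=(3, 4))
print(context)  # ['Invoice', 'Number', ':', '<mask>', 'Date', ...]
print(target)   # ['<infill>', '12345']
```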

In essence, DocLLM modifies the pre-training objective of a standard LLM to better handle visual documents, making it a powerful and practical tool for document intelligence tasks.

Performance Evaluation with Other Models

The performance of DocLLM was evaluated in two experimental settings:

  1. Same Datasets, Different Splits (SDDS): In this setting, DocLLM was evaluated on the unseen test split of each of the 16 datasets used for instruction-tuning. This evaluation aimed to assess how DocLLM performs when tasks and domains remain the same from training to testing.
  2. Same Tasks, Different Datasets (STDD): In this setting, DocLLM was evaluated on held-out datasets. The model was instruction-tuned on prompts from 11 of the 16 datasets considered in SDDS, and then evaluated on the test split of the remaining three datasets. This evaluation aimed to assess the performance of DocLLM when tasks remain unchanged but domains and layouts differ from training to testing.

In both SDDS and STDD settings, DocLLM was benchmarked against comparably-sized and state-of-the-art large language models (LLMs) using ZeroShot (ZS) prompts.

The evaluation metrics used included Average Normalized Levenshtein Similarity (ANLS) for VQA datasets, CIDEr for VisualMRC, accuracy for WTQ, CLS, and NLI datasets, and F1 score for KIE datasets.
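
For reference, ANLS is commonly computed as the mean, over questions, of a thresholded normalized Levenshtein similarity against the best-matching gold answer. The short sketch below follows that standard definition; the 0.5 threshold is the value conventionally used in document-VQA benchmarks.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def anls(predictions, gold_answers, threshold=0.5):
    """Average Normalized Levenshtein Similarity.

    predictions: list of predicted answer strings.
    gold_answers: list of lists of acceptable gold answers per question.
    """
    scores = []
    for pred, golds in zip(predictions, gold_answers):
        best = 0.0
        for gold in golds:
            p, g = pred.strip().lower(), gold.strip().lower()
            denom = max(len(p), len(g)) or 1
            nl = levenshtein(p, g) / denom  # normalized edit distance
            sim = 1.0 - nl
            best = max(best, sim if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0

# Example: a near-match gets partial credit, a distant answer scores zero.
print(anls(["JPMorgan", "London"], [["JP Morgan"], ["New York"]]))
```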

Performance comparison with other multimodal and non-multimodal LLMs

source - https://arxiv.org/pdf/2401.00908.pdf

In the SDDS setting (shown in the figure above), DocLLM-7B excelled in 12 out of 16 datasets and outperformed equivalent models on 14 out of 16, with particularly strong results on layout-intensive tasks such as KIE and CLS. In the STDD setting, DocLLM outperformed Llama2 on four out of five datasets and achieved the best overall score on two of them. However, classification accuracy was notably lower, likely because the model was trained on only one classification dataset, which limits its ability to generalize to new classification datasets.

How to Access and Use This Model?

Source code for DocLLM is available on GitHub. The repository contains the necessary code files, along with instructions on how to set up and use the model, making it a useful resource for developers and researchers who want to use DocLLM in their projects or study its inner workings.

Remember, while the model is freely available, it's important to use it responsibly and ethically, respecting all relevant guidelines and regulations. If you are interested in learning more about this model, all relevant links are provided in the 'Source' section at the end of this article.

Conclusion

DocLLM represents a significant advancement in the field of document understanding. Its unique approach to incorporating spatial layout information and its impressive performance on various datasets make it a promising tool for future applications.

Source
Research paper - https://arxiv.org/abs/2401.00908
GitHub repo - https://github.com/dswang2011/DocLLM
Hugging Face page - https://huggingface.co/papers/2401.00908
