DCGen: Transforming Screenshots into UI Code with Divide-and-Conquer

Introduction

Automatic design-to-code AI models, which turn visual designs into functional code, have recently become accurate enough for practical use. Their aim is to transform visual designs into workable code with fewer errors and in less time. Despite these strides, current models still struggle with element omission, distortion, and misarrangement.

A new AI model has been developed by a team of researchers at The Chinese University of Hong Kong, a university internationally known for its AI research. The model is proposed to address these and other limitations of existing design-to-code models by using a strategy known as divide-and-conquer, which makes the design-to-code conversion more precise. Its development was motivated first by the need to resolve the issues of existing design-to-code models and second by the wish to contribute to the ongoing advancement of AI. This new AI model is called 'DCGen'.

source - https://arxiv.org/pdf/2406.16386

What is DCGen?

DCGen is an AI model that automates the process of generating UI code from screenshots. It breaks a screenshot down into manageable segments, generates a description for each segment, and then reassembles these descriptions into complete UI code.

Key Features of DCGen

Divide-and-Conquer Approach: DCGen’s standout feature is its divide-and-conquer approach. It breaks down screenshots into manageable segments, generates descriptions for each, and then reassembles them into complete UI code. This approach effectively mitigates common issues such as element omission, distortion, and misarrangement.
Segment-Aware Prompt-Based Approach: DCGen is the first model to use a segment-aware prompt-based approach for generating UI code directly from screenshots. This feature further enhances its accuracy and efficiency.
Improved Visual Similarity: DCGen achieves up to a 14% improvement in visual similarity over competing methods, making it a highly effective tool for design-to-code conversion.
Adaptability: DCGen is highly adaptable to different models, such as the Gemini model, enhancing both visual and code-level metrics.

Capabilities/Use Case of DCGen

Enhanced Accuracy: By focusing on small visual segments, DCGen mitigates common issues such as element omission and distortion, which improves the accuracy of the generated code.
Real-world Applications: DCGen has been tested on a dataset of real-world websites, where it produced accurate UI code with high visual similarity. This suggests it is applicable in practice for web developers.
Efficiency: DCGen streamlines the design-to-code process, saving time and reducing the potential for human error. It is a useful tool for web developers, particularly those who are not coding experts. Automating the step from design to functional code can significantly shorten website development time.

How does DCGen work? / Architecture/Workflow

The architecture of DCGen, as shown in the figure below, is motivated by the traditional 'divide and conquer' algorithmic approach. A complex screenshot is first decomposed into smaller, more manageable visual parts. The framework then solves these parts individually and finally combines the solutions to address the original problem.

source - https://arxiv.org/pdf/2406.16386

The workflow of DCGen has two primary stages: division and assembly. In the division stage, the screenshot is split recursively into small pieces. DCGen first searches for horizontal separation lines in the webpage and divides it accordingly. For each horizontal segment, it then searches for vertical separation lines and divides the segment further. This splitting repeats recursively, first horizontally and then vertically, until no more separation lines are found or a user-defined maximum depth is reached. DCGen then applies a multimodal large language model (MLLM) to each resulting image segment to generate code.
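The division stage can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy screenshot given as a grid of 0 (background) and 1 (content) pixels, and it treats any fully blank row or column as a separation line. The names `Segment`, `content_bands`, `split`, and `divide` are hypothetical and introduced only for this illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Grid = List[List[int]]  # toy screenshot: 0 = background pixel, 1 = content pixel


@dataclass
class Segment:
    """A rectangular region of the screenshot plus its sub-segments."""
    top: int
    left: int
    bottom: int  # exclusive
    right: int   # exclusive
    children: List["Segment"] = field(default_factory=list)


def content_bands(is_blank: List[bool]) -> List[Tuple[int, int]]:
    """Return (start, end) ranges of consecutive non-blank lines."""
    bands, start = [], None
    for i, blank in enumerate(is_blank):
        if not blank and start is None:
            start = i
        elif blank and start is not None:
            bands.append((start, i))
            start = None
    if start is not None:
        bands.append((start, len(is_blank)))
    return bands


def split(grid: Grid, seg: Segment, horizontal: bool) -> List[Segment]:
    """Cut seg along blank rows (horizontal) or blank columns (vertical)."""
    rows = range(seg.top, seg.bottom)
    cols = range(seg.left, seg.right)
    if horizontal:
        blank = [all(grid[r][c] == 0 for c in cols) for r in rows]
        return [Segment(seg.top + a, seg.left, seg.top + b, seg.right)
                for a, b in content_bands(blank)]
    blank = [all(grid[r][c] == 0 for r in rows) for c in cols]
    return [Segment(seg.top, seg.left + a, seg.bottom, seg.left + b)
            for a, b in content_bands(blank)]


def divide(grid: Grid, seg: Segment, horizontal: bool = True,
           depth: int = 0, max_depth: int = 4) -> Segment:
    """Recursively split seg, alternating the cut direction at each level."""
    if depth >= max_depth:
        return seg
    # Try the preferred direction first, then the other one.
    for direction in (horizontal, not horizontal):
        parts = split(grid, seg, direction)
        if len(parts) > 1:
            seg.children = [divide(grid, p, not direction, depth + 1, max_depth)
                            for p in parts]
            break
    return seg  # no separation line found: seg stays a leaf
```

Calling `divide(grid, Segment(0, 0, height, width))` returns the root of a segment tree whose leaves are the regions that would be sent to the MLLM. A real screenshot would need a tolerance for near-blank lines and for visible separator lines, which this toy version ignores.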

During assembly, the code generated for smaller segments is incrementally integrated into their parent segments, and this recursive assembly continues until the website's full structure is restored. For each leaf image segment, DCGen shows the MLLM the website screenshot with a red rectangular bounding box marking the region to describe and asks it to describe that segment. For each parent image segment, DCGen shows the MLLM the screenshot with a bounding box together with all the descriptions generated for its child segments. In the final step, DCGen calls a generation MLLM to produce the complete UI code for the full screenshot from the descriptions of its child segments. This approach helps DCGen generate UI code even from webpage screenshots that are difficult to interpret as a whole.
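The bottom-up assembly described above can be sketched roughly as follows. This is an illustration under stated assumptions, not DCGen's actual code: the MLLM is modelled as a plain callable from a prompt string to text, the screenshot with its red bounding box is replaced by the box coordinates inside the prompt, and the prompt wording, `Node`, `describe`, and `generate_ui_code` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right) in pixels

# Hypothetical MLLM interface: takes a prompt string and returns text.
# A real call would also attach the screenshot image.
MLLM = Callable[[str], str]


@dataclass
class Node:
    """One image segment in the tree produced by the division stage."""
    box: Box
    children: List["Node"] = field(default_factory=list)


def describe(node: Node, mllm: MLLM) -> str:
    """Describe a segment bottom-up: children first, then the parent."""
    if not node.children:
        # Leaf: ask only about the region inside the bounding box.
        return mllm(f"Describe the region inside the red box {node.box}.")
    child_descriptions = [describe(child, mllm) for child in node.children]
    prompt = (
        f"Describe the region inside the red box {node.box}, "
        "using these descriptions of its parts:\n"
        + "\n".join(f"- {d}" for d in child_descriptions)
    )
    return mllm(prompt)


def generate_ui_code(root: Node, describer: MLLM, generator: MLLM) -> str:
    """Collect descriptions of the top-level segments, then generate code."""
    parts = root.children or [root]
    descriptions = [describe(part, describer) for part in parts]
    prompt = (
        "Generate the complete UI code for the page from these "
        "segment descriptions:\n"
        + "\n".join(f"- {d}" for d in descriptions)
    )
    return generator(prompt)
```

Keeping the describing model and the generating model as separate arguments mirrors the text's distinction between per-segment description and the final generation call. Both arguments can be the same callable.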

Evaluation of DCGen Performance

The performance of DCGen was evaluated using two high-level similarity metrics, CLIP score and BLEU score, along with fine-grained element-matching metrics.
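To make the two high-level metrics more concrete, here is a small sketch. It is a simplification and not the paper's evaluation code. `cosine_similarity` shows the usual way two embedding vectors are compared, assuming the image embeddings have already been produced by a CLIP image encoder (not included here). `bleu` is an unsmoothed, single-reference BLEU over whitespace-separated tokens; the tokenization is an assumption made for this example.

```python
import math
from collections import Counter
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Unsmoothed single-reference BLEU over whitespace tokens."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or overlap == 0:
            return 0.0  # without smoothing, one zero precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: punish candidates shorter than the reference.
    if len(cand) > len(ref):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)
```

In practice a smoothed BLEU from an established library is usually preferable, because short code snippets often have no matching 4-grams and would score zero here.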

source - https://arxiv.org/pdf/2406.16386

The table above shows that, with GPT-4o as the backbone, DCGen achieved state-of-the-art performance compared with other design-to-code methods on both high-level and fine-grained metrics. Both DCGen and the self-refine technique improved model performance, whereas Chain-of-Thought (CoT) prompting degraded it.

Overall Performance of DCGen on different MLLMs

source - https://arxiv.org/pdf/2406.16386

To test how well DCGen generalizes, the same methodology was applied with various MLLMs as backbones. The two tables above show that DCGen adapts well to the Gemini model, with notable improvements in both visual and code-level metrics. With Claude-3, DCGen achieved significantly higher visual similarity between the original and generated websites than competing methods, which further supports the divide-and-conquer strategy. Overall, DCGen proved effective at generating UI code from designs across different MLLMs.

Techniques and Methods Utilized by DCGen

DCGen employs a combination of advanced Artificial Intelligence and Machine Learning methods to automate the process of generating UI code from screenshots. Here’s a rundown of these techniques:

Multimodal Large Language Models (MLLMs): These AI models can process and generate text as well as understand images, a capability they gain by integrating image processing into large language models. They are integral to DCGen's ability to interpret screenshots and generate UI code.
Divide-and-Conquer Algorithm: This classic algorithmic technique is adapted by DCGen to simplify the complex task of UI code generation. It divides the screenshot into segments, generates code for each segment, and then reassembles them to form the complete UI code.
Image Segmentation: DCGen uses a unique image segmentation algorithm to divide the screenshot into smaller, semantically meaningful pieces. This algorithm identifies both explicit (visible lines) and implicit (blank spaces or borders) separation lines within the screenshot to organize the image into segments containing complete elements.
Prompt Engineering: DCGen employs prompts to guide the MLLMs in generating code. These prompts are meticulously designed to focus the model’s attention on specific parts of the image and describe the layout and elements within each segment.
Hierarchical Structure Storage: The divided image segments are stored in a tree structure to maintain the order and hierarchy of the segments. This allows for the recursive assembly of the code from smaller segments to reconstruct the full website structure.
Evaluation Metrics: DCGen’s performance is assessed using CLIP Score and BLEU Score. The CLIP Score measures image similarity between the original and generated screenshots, while the BLEU Score evaluates the similarity between the generated code and the original code.
Fine-Grained Element Matching: This technique evaluates the generated webpages in terms of text content, position, and color. It involves detecting visual element blocks, matching them, and then evaluating the similarity of the matching blocks across several aspects.
Self-Refine Prompting: This strategy allows the model to refine its own generated code. It is used as a baseline for comparison with DCGen.
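As an illustration of the fine-grained element matching item in the list above, the sketch below pairs visual blocks from the original and generated pages and scores the pairs on text, position, and color. It is a simplified stand-in, not the metric used in the paper: the greedy text-based matching, the 0.5 threshold, the position score, and the plain RGB distance are all assumptions made for this example, and `Block`, `match_blocks`, and `score_blocks` are hypothetical names.

```python
import math
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


@dataclass
class Block:
    """A detected visual element block."""
    text: str
    center: Tuple[float, float]  # (x, y), normalized to [0, 1]
    color: Tuple[int, int, int]  # RGB, each 0-255


def text_sim(a: str, b: str) -> float:
    """String similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def match_blocks(ref: List[Block], gen: List[Block],
                 threshold: float = 0.5) -> List[Tuple[Block, Block]]:
    """Greedily pair each reference block with its most similar unused block."""
    pairs, used = [], set()
    for r in ref:
        best, best_score = None, threshold
        for i, g in enumerate(gen):
            score = text_sim(r.text, g.text)
            if i not in used and score >= best_score:
                best, best_score = i, score
        if best is not None:
            used.add(best)
            pairs.append((r, gen[best]))
    return pairs


def score_blocks(ref: List[Block], gen: List[Block]) -> Dict[str, float]:
    """Average text, position, and color similarity over matched pairs."""
    pairs = match_blocks(ref, gen)
    if not pairs:
        return {"text": 0.0, "position": 0.0, "color": 0.0}
    n = len(pairs)
    max_rgb_dist = math.sqrt(3 * 255 ** 2)
    text = sum(text_sim(r.text, g.text) for r, g in pairs) / n
    # Position: 1 minus the larger of the horizontal and vertical offsets.
    position = sum(
        1 - max(abs(r.center[0] - g.center[0]), abs(r.center[1] - g.center[1]))
        for r, g in pairs
    ) / n
    # Color: 1 minus the Euclidean RGB distance, scaled to [0, 1].
    color = sum(1 - math.dist(r.color, g.color) / max_rgb_dist
                for r, g in pairs) / n
    return {"text": text, "position": position, "color": color}
```

Note that this sketch only scores matched pairs. Blocks that are missing from the generated page are not penalized here, so a fuller metric would also account for the share of unmatched blocks.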

These methods and techniques are synergistically combined in the DCGen framework to effectively translate webpage design screenshots into functional UI code, offering a significant improvement over manual methods and other automated approaches.

Possible Limitations

While DCGen represents a significant advancement in automatic design-to-code AI models, it may face certain limitations:

Dependence on MLLMs: The performance of DCGen could be dependent on the capabilities of the Multimodal Large Language Models (MLLMs) it utilizes. If the MLLMs have certain constraints or limitations, these could impact the effectiveness of DCGen.
Handling Dynamic Websites: DCGen may currently be unable to handle dynamic websites that use server-side scripting. This could limit its use to static pages, potentially restricting its applicability in more complex web development scenarios.
Maximum Context Length: The effectiveness of DCGen may be constrained by the maximum context length that MLLMs can handle. This could limit its use in larger web development projects where the context exceeds that maximum.

Conclusion

DCGen's unique divide-and-conquer approach addresses the common challenges faced by existing models and offers a more efficient solution for converting visual designs into functional code. While there may still be limitations to overcome, the potential of DCGen is undeniable, and it will be exciting to see how this model evolves in the future.

Source
Research Paper: https://arxiv.org/abs/2406.16386
Research document: https://arxiv.org/pdf/2406.16386

SocialViews From TechWorld

Friday, 28 June 2024
