Reader-LM: Efficient HTML to Markdown Conversion with AI

Introduction

Markdown is a language that is used for formatting content. Users able to format text using plain text which later shall be converted to HTML format. A well formatted use of Markdown files is important in order to ensure that the files are easy to read and well organized. It makes the handling of content much easier especially where it is being shared across different groups and teams or when the same content is required to be posted on different social media platforms. There are now several ways of converting HTML to Markdown including HTML2Markdown, Turndown, and even online tools.

Some of the main issues are complex HTML structure, problem in format preservation and noise in HTML. Reader-LM has been developed to flux these problems by applying AI to enhance and full auto the conversion. This means that through AI, enhancements have been made to be able to create models such as Reader-LM, which can easily convert HTML to Markdown as it comprehends and parses the content better.

Who Developed Reader-LM?

Reader-LM is built by Jina AI — the company whose mission is to democratize Artificial Intelligence and make them open for everyone through Open-Source and Open-Science. The model was based on Jina Reader and contributed by different AI researcher and developers. The goal for Reader-LM was to build a fast and cheap tool that takes such raw, noisy HTML and converts it into clean Markdown. The primary purpose of this model is to make the process of converting the content simpler and at the same time enhancing the quality of the converted content.

What is Reader-LM?

Reader-LM is a suite of small language models for converting HTMLs into Markdowns. These models are developed to recognize the structure of HTML tables and generate neat and well-formatted Markdowns.

Model Variants

Reader-LM 0.5B: A new release of better optimized, less powerful version intended for simple tasks.
Reader-LM 1.5B: A version with larger size that allows for additional features focused to parse more complicated structure of HTML tags.

This means that these variants are tailored to suit the different needs of users, 0.5B model has efficiency at the center. while 1.5B model is more powerful and have higher processing capabilities than the other one.

source - https://jina.ai/news/reader-lm-small-language-models-for-cleaning-and-converting-html-to-markdown/

Key Features of Reader-LM

Multilingual Support: It has provision for multilingual support and this makes it ideal for use in different countries.
Long-Context Handling: Effective in handling long documents of up to 256K tokens of context length ; particularly HTML documents.
Efficient Performance: Originally intended for optimization on edge devices with less than 1 billion parameters.
Selective Copying: Concentrate on the transfer of selected HTML content to Markdown without losing much of the information.

Capabilities/Use Cases of Reader-LM

Content Conversion: Translates raw HTML of web pages and cleans it to Markdown format for documentation and content management.
Data Cleaning: Eliminates certain unwanted components such as headers, footers, and sidebars giving out a cleaner input.
Real-World Examples: Other than documentation, blogging, and content management system where clean Markdown is desirable, Reader-LM also has other real time utilization. For instance, it can be applied to build clean feed readers by parsing the raw HTML from various sources and translating them to structured Markdown which are easier to summarise and to identify topics. Due to its information extraction and structuring features, it can be applied in enhancing the quality of web for the visually impaired, developing individualized feeds and constructing content feeds, and extracting data for market research.

How Reader-LM Works

This is unlike most other reader-LM that uses a specific method to transform raw HTML to clean Markdown. Thus, instead of conventional approaches such as headless Chrome, Readability, regex, and Turndown library, Reader-LM makes use of a small language model (SLM) in this regard. This SLM is especially designed to learn how to work with the data input in the HTML format and output the Markdown format without the need for extensive use of rules that define the conversions. The following figure graphically illustrates this transition from a complex linear model that incorporates several stages, to the efficient model of SLM.

Illustration of reader-lm, replacing the pipeline of readability+turndown+regex heuristics using a small language model.

source - https://jina.ai/news/reader-lm-small-language-models-for-cleaning-and-converting-html-to-markdown/

Architecture/Design and Workflow

This SLM has been the key to Reader-LM’s architectural design for dealing with the challenges of converting HTML to Markdown. The HTML to markdown translator is trained on a huge training corpus of HTML and Markdown samples which helped the model learn the full features of HTML, Markdown and their interactions. Whenever a new HTML input is passed to Reader-LM, it moves from left to right and computes the most likely Markdown tokens according to the training set as well as the input HTML. This way, Reader-LM is able to retain the layout and content of the HTML whilst providing the reader with clean, properly formatted Markdown.

Uniqueness in Training Strategy

The training strategy adopted for Reader-LM is very important for it to be effective. This model in particular goes through a two-stage training process, namely on the ‘short-and-simple’ HTML as well as on the ‘long-and-hard’ HTML. It also helps the model to first learn basic concepts of HTML to Markdown then slowly it is trained with real world and long HTML documents. Further, the developers have used some strategies towards the difficulties in the degeneration and the training when the inputs are long such as contrastive search, repetition stop criteria and chunk-wise model forwarding. Combined with the selective copying and long-context policies, these strategies make for a high efficacy of Reader’s LM to convert HTML to Markdown.

Performance Evaluation of Reader-LM

To assess the performance of Reader-LM, the developers benchmarked it against Large Language Models such as GPT-4 and Gemini-1.5, measured by using the four metrics; Recycle Option for Ubiquitous Generation and Evaluation of reference summaries, TER and WER. The ROUGE-L evaluation computes the number of overlapping tokens which provides a measure of the model’s performance in capturing the content. TER, intended to assess hallucination, quantifies the rates of generated Markdown tokens which are unique to the generated output but were not present in the original HTML. WER which is often used in tasks such as OCR targets the word sequence and then gives a breakdown of insertions, deletions and substitution in a detailed manner in order to compare the output Markdown to the actual Markdown that is expected.

source - https://jina.ai/news/reader-lm-small-language-models-for-cleaning-and-converting-html-to-markdown/

Reader-LM, particularly the 1.5B model, offered very promising outcomes, with the highest score, 0.72 of ROUGE-L, as well as the lowest WER which was 1.87 and TER 0.19, which proves that the 1.5B model outperforms much larger ones in its aim to accurately translate HTML into Markdown with the lowest levels of errors and hallucinations can be considered.

source - https://jina.ai/news/reader-lm-small-language-models-for-cleaning-and-converting-html-to-markdown/

In addition, there was a qualitative analysis that received a visual analysis of Markdown-language outcoming from 22 HTML sources that represent diverse language and website types. This evaluation considered four key dimensions: The first four skills include header extraction, main content extraction, rich structure preservation, and Markdown syntax usage, all rated from 1 to 5. The study highlighted Reader-LM-1.5B achieves high awareness in structure preservation and Markdown standard syntax while comparing with it's competitors . It also always can not outperform the Jina Reader API , but it was comparable to bigger models, like Gemini 1.5 Pro.

How to access and Use Reader-LM

Reader-LM is now released to Hugging Face where it is possible to download the latest 0.5B and 1.5B parameter models. For reading the inputs locally using Reader-LM, transformers need to be installed and then the steps listed on the Hugging Face model page of the selected version have to be followed. For the followers, who would rather use an easily understandable approach, there is a link to the Colab notebook to play with the model. Reader-LM is open-source and licensed under the CC BY-NC 4. 0 license. One has to reach out to Jina AI for commercial access.

Limitations and Future Work

Reader-LM is proved to be effective in practice yet it can experience difficulties while dealing with highly nested html structures or the information which contains a lot of noise. Future research could focus on enhancing the capacity for handling of such cases of patient management. Also, it is multilingual to a certain extent, but there is a possibility for development in this direction.

Conclusion

Reader-LM is a considerable improvement in the process of converting HTML to Markdown in comparison with methods that primarily rely on simple pattern matching and heuristics. Hence, Reader-LM that leverages SLMs will offer a more efficient and arguably more accurate solution. By this advancement it becomes easier both in the usage of web content as well as the creation and management of the content hence bringing an improvement in the organization of the environment in the internet.

Source
Jina AI website: https://jina.ai/
reader lm post: https://jina.ai/news/reader-lm-small-language-models-for-cleaning-and-converting-html-to-markdown/
Hugging Face reader-lm 0.5b: https://huggingface.co/jinaai/reader-lm-0.5b
Hugging Face reader-lm 1.5b: https://huggingface.co/jinaai/reader-lm-1.5b
google Colab : https://colab.research.google.com/drive/1wXWyj5hOxEHY6WeHbOwEzYAC0WB1I5uA#scrollTo=lHBHjlwgQesA

Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.

SocialViews From TechWorld

Pages

Tuesday, 17 September 2024

Reader-LM: Efficient HTML to Markdown Conversion with AI

No comments:

Post a Comment

Google's MLE-STAR: Winning with Real-Time Web Search