document-analysis

Here are 104 public repositories matching this topic...

opendatalab / MinerU

A high-quality, one-stop open-source data extraction tool that converts PDF to Markdown and JSON.

python pdf parser ocr pdf-converter extract-data document-analysis pdf-parser layout-analysis ai4science pdf-extractor-rag pdf-extractor-llm pdf-extractor-pretrain

Updated Sep 15, 2025
Python

bytedance / Dolphin

The official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting”, ACL, 2025.

python pdf parser ocr pdf-converter document-analysis pdf-parser layout-analysis vlm-ocr

Updated Aug 29, 2025
Python

ucbepic / docetl

A system for agentic LLM-powered data processing and ETL

python workflow data etl semantic-data elt data-pipelines agents document-analysis document-processing unstructured-data unstructured-data-analysis llm

Updated Sep 15, 2025
Python

NanoNets / docext

An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking toolkit. (https://idp-leaderboard.org/)

Updated Aug 25, 2025
Python

ispras / dedoc

Dedoc is a library (service) for automating document parsing and converting documents to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser

html pdf ocr table-of-contents excel html-parser docx documents doc scanned-documents txt document-analysis odt pdf-parser table-recognition docx-parser document-content-extraction logical-structure-extraction

Updated Sep 12, 2025
Python

wenwenyu / PICK-pytorch

Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)

document-analysis graph-convolutional-network graph-learning graph-neural-networks document-understanding key-information-extraction

Updated Jul 25, 2024
Python

CybercentreCanada / assemblyline

AssemblyLine 4: File triage and malware analysis

framework incident-response malware python3 cybersecurity cert infosec malware-analyzer malware-analysis malware-research automation-framework cyber-security file-analysis document-analysis security-automation security-tools malware-detection assemblyline security-automation-framework

Updated Sep 12, 2025
Python

jpWang / LiLT

Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding (ACL 2022)

nlp information-extraction document-analysis document-understanding multilingual-models document-ai multimodal-pre-trained-model

Updated Oct 31, 2022
Python

pandora-analysis / pandora

Pandora is an analysis framework to discover if a file is suspicious and conveniently show the results

infosec document-analysis malware-detection document-analyzing

Updated Sep 15, 2025
Python

lazyFrogLOL / llmdocparser

A package for parsing PDFs and analyzing their content using LLMs.

nlp ocr chunking document-analysis pdf-parser pdfparser rag llm text-chunking

Updated Aug 6, 2024
Python

masyagin1998 / robin

RObust document image BINarization

python opencv ocr computer-vision deep-learning keras neural-networks document-analysis u-net document-binarization

Updated Aug 2, 2024
Python

ppaanngggg / yolo-doclaynet

YOLO models trained on DocLayNet: power your document intelligence with layout analysis

yolo document-analysis layout-analysis ultralytics yolov8 doclaynet

Updated Aug 3, 2025
Python

mirabdullahyaser / Retrieval-Augmented-Generation-Engine-with-LangChain-and-Streamlit

Powerful web application that combines Streamlit, LangChain, and Pinecone to simplify document analysis. Powered by OpenAI's GPT-3, RAG enables dynamic, interactive document conversations, making it ideal for efficient document retrieval and summarization.

natural-language-processing artificial-intelligence question-answering chat-application document-analysis streamlit gpt-3 large-language-models generative-ai langchain openai-chatgpt retrieval-augmented-generation

Updated Jul 4, 2024
Python

anisha2102 / docvqa

Document Visual Question Answering

computer-vision deep-learning document-analysis visual-question-answering

Updated Jul 30, 2020
Python

aws-samples / amazon-textract-transformer-pipeline

Post-process Amazon Textract results with Hugging Face transformer models for document understanding

ocr document-analysis amazon-textract huggingface-transformers

Updated Dec 14, 2024
Python

monniert / docExtractor

(ICFHR 2020 oral) Code for "docExtractor: An off-the-shelf historical document element extraction" paper

pytorch segmentation historical-data document-analysis

Updated May 25, 2023
Python

Xyntopia / pydoxtools

Effortlessly extract information from unstructured data with this library, utilizing advanced AI techniques. Compose AI in customizable pipelines and diverse sources for your projects.

python nlp pdf information-retrieval extraction document-analysis document-extraction llm chatgpt

Updated Sep 5, 2024
Python

abdur75648 / UTRNet-High-Resolution-Urdu-Text-Recognition

UTRNet: High-Resolution Urdu Text Recognition In Printed Documents (ICDAR'23)

machine-learning ocr computer-vision deep-learning pytorch text-recognition high-resolution text-detection unet document-analysis urdu scene-text-recognition urdu-nlp icdar hrnet icdar2023 urdu-ocr utrnet urdu-synth

Updated Oct 8, 2024
Python

ZeningLin / ViBERTgrid-PyTorch

An unofficial PyTorch implementation of "Lin et al. ViBERTgrid: A Jointly Trained Multi-Modal 2D Document Representation for Key Information Extraction from Documents. ICDAR, 2021"

information-extraction document-analysis key-information-extraction document-ai visual-information-extraction

Updated Jan 9, 2024
Python

JPLeoRX / detectron2-publaynet

Trained Detectron2 object detection models for document layout analysis based on PubLayNet dataset

python machine-learning computer-vision deep-learning neural-network python3 pytorch artificial-intelligence neural-networks faster-rcnn document-classification object-detection document-analysis document-layout instance-segmentation layout-analysis document-layout-analysis detectron2 publaynet

Updated Apr 16, 2023
Python
