  1. docx-corpus: 736K classified Word documents. Open DOCX dataset.

    docx-corpus is the largest open corpus of classified Word documents on the public web. It contains 736,000+ real .docx files collected from Common Crawl, validated, deduplicated, and labeled by …

  2. GitHub - superdoc-dev/docx-corpus: The largest open corpus of ...

    The largest open corpus of classified Word documents. 736K+ real .docx files from the public web, classified into 10 document types and 9 topics across 46+ languages.

  3. Dataset - docx-corpus

    Open corpus of classified Word documents from the public web. https://docxcorp.us. Built by SuperDoc. Schema, coverage, and access methods for docx-corpus. 736K classified Word documents from …

  4. superdoc-dev/docx-corpus · Datasets at Hugging Face

    The largest classified corpus of Word documents. 736K+ .docx files from the public web, classified into 10 document types and 9 topics across 76 languages. This dataset contains metadata for publicly …

  5. GitHub - superdoc-dev/docx-corpus: The largest open corpus of ...

    The largest open corpus of classified Word documents. 736K+ .docx files from the public web, classified into 10 document types and 9 topics across 46+ languages.

  6. Build and Release | superdoc-dev/docx-corpus | DeepWiki

    Jan 10, 2026 · This document covers the build process, release automation, and version management for the docx-corpus project. It explains how the codebase is compiled, tested, and packaged for …

  7. superdoc-dev/docx-corpus | DeepWiki

    Jan 10, 2026 · What is docx-corpus? The docx-corpus project is a document acquisition system that builds a comprehensive public corpus of Microsoft Word .docx files by scraping Common Crawl's …

  8. docx-corpus - The ultimate resource for .docx files | ZonalPlace

    📄 Build the largest open corpus of .docx files for effective document processing and rendering research, addressing real-world challenges in document handling.

  9. superdoc-dev/docx-corpus at main - Hugging Face

    We’re on a journey to advance and democratize artificial intelligence through open source and open science.

  10. Corpus - Orange Data Mining

    The widget also includes a directory with sample corpora that come pre-installed with the add-on. The widget reads data from Excel (.xlsx), comma-separated (.csv) and native tab-delimited (.tab) files.