MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs

Abstract

Metadata extraction is essential for cataloging and preserving datasets, enabling effective research discovery and reproducibility, especially given the current exponential growth in scientific research. While Masader (Alyafeai et al., 2021) laid the groundwork for extracting a wide range of metadata attributes from the scholarly articles of Arabic NLP datasets, it relies heavily on manual annotation. In this paper, we present MOLE, a framework that leverages Large Language Models (LLMs) to automatically extract metadata attributes from scientific papers covering datasets of languages other than Arabic. Our schema-driven methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output. Additionally, we introduce a new benchmark to evaluate research progress on this task. Through systematic analysis of context length, few-shot learning, and web browsing integration, we demonstrate that modern LLMs show promising results in automating this task, while highlighting the need for further improvements to ensure consistent and reliable performance.

🚀 Introduction

MOLE is a framework that leverages Large Language Models (LLMs) to automatically extract metadata attributes from scientific papers covering datasets of languages other than Arabic. Our schema-driven methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output.
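To make the schema-driven idea concrete, here is a minimal sketch of how a metadata schema can drive both the prompt and the output validation. This is not MOLE's actual API; the attribute names, the `call_llm` helper, and the retry policy are illustrative assumptions.

```python
# Minimal sketch of schema-driven extraction with output validation.
# Attribute names and the call_llm helper are illustrative, not MOLE's API.
import json
from pydantic import BaseModel, ValidationError

class DatasetMetadata(BaseModel):
    name: str
    language: str          # e.g. "ar", "en", "jp"
    license: str
    domain: str
    tasks: list[str]

SCHEMA_PROMPT = (
    "Extract the dataset metadata from the paper below. "
    "Reply with JSON matching this schema:\n"
    f"{json.dumps(DatasetMetadata.model_json_schema(), indent=2)}\n\nPAPER:\n"
)

def extract_metadata(paper_text: str, call_llm) -> DatasetMetadata:
    """Query an LLM and validate its output against the schema, retrying once."""
    for _ in range(2):  # one retry if validation fails
        raw = call_llm(SCHEMA_PROMPT + paper_text)
        try:
            return DatasetMetadata.model_validate_json(raw)
        except ValidationError:
            continue
    raise ValueError("LLM output did not conform to the metadata schema")
```

Validating against an explicit schema is what keeps the output consistent across models and input formats: malformed or incomplete responses are rejected rather than silently passed downstream.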

📄 Example

Here is an example of extracted metadata from a given sample paper. Note that this example is simplified; the actual paper is more complex and contains more pages.
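For illustration, the extracted metadata for a paper might look like the following. The dataset name and attribute values here are invented for a hypothetical paper, not taken from a real one.

```python
# Hypothetical extracted metadata for an imagined dataset paper.
example_metadata = {
    "name": "ExampleCorpus",        # invented dataset name
    "language": "ar",
    "license": "CC BY 4.0",
    "domain": "news articles",
    "tasks": ["text classification", "named entity recognition"],
    "year": 2024,
}
```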

📹 Demo

You can use the demo at this link to run inference on a given paper.
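Programmatically, running inference on a local paper could look roughly like the hedged sketch below, reusing the `extract_metadata` helper sketched in the Introduction. All names here are illustrative; the demo's actual entry point may differ.

```python
# Hedged usage sketch: extract metadata from a local paper.
# Reuses the extract_metadata helper sketched in the Introduction.
from pathlib import Path

paper_text = Path("paper.txt").read_text(encoding="utf-8")  # plain-text export of the PDF

def call_llm(prompt: str) -> str:
    # Plug in any chat-completion client here (OpenAI, Gemini, a local model, ...).
    raise NotImplementedError

metadata = extract_metadata(paper_text, call_llm)
print(metadata.model_dump_json(indent=2))
```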

📚 Dataset

We manually annotated 52 papers, split into 6 for validation and 46 for testing. The datasets covered span Arabic, English, Japanese, French, Russian, and multi-lingual datasets. The dataset is available on Hugging Face 🤗.
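The splits can be loaded with the `datasets` library. The repository id below is a placeholder, since the exact Hugging Face path is given by the link above rather than spelled out here.

```python
# Hedged sketch: load the annotated papers from Hugging Face.
# "username/mole-benchmark" is a placeholder id, not the real repository path.
from datasets import load_dataset

ds = load_dataset("username/mole-benchmark")
print(ds)  # expected splits: validation (6 papers), test (46 papers)
```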

📊 Results

We evaluated the performance of 7 LLMs on the 46 annotated test papers spanning datasets of different languages. Our results show that Gemini 2.5 Pro achieves the best average score. A minimal scoring sketch follows the results table.

| Model             | ar    | en    | jp    | fr    | ru    | multi | Average |
|-------------------|-------|-------|-------|-------|-------|-------|---------|
| Gemma 3 27B       | 56.69 | 60.00 | 66.32 | 70.00 | 66.67 | 53.68 | 60.30   |
| Llama 4 Maverick  | 58.28 | 66.67 | 68.42 | 68.89 | 68.89 | 58.95 | 62.67   |
| Qwen 2.5 72B      | 64.17 | 62.22 | 64.21 | 71.11 | 65.56 | 55.79 | 63.96   |
| DeepSeek V3       | 64.17 | 70.00 | 65.26 | 64.44 | 70.00 | 54.74 | 64.56   |
| Claude 3.5 Sonnet | 60.54 | 66.67 | 71.58 | 74.44 | 73.33 | 61.05 | 65.37   |
| GPT 4o            | 64.17 | 71.11 | 69.47 | 70.00 | 73.33 | 60.00 | 66.68   |
| Gemini 2.5 Pro    | 65.31 | 72.22 | 74.74 | 68.89 | 73.33 | 56.84 | 67.42   |
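As a rough illustration of how such scores can be computed, the sketch below measures exact-match accuracy between predicted and gold metadata attributes, averaged over papers. The matching rule is a simplifying assumption for illustration, not the paper's exact metric.

```python
# Hedged sketch: score predicted metadata against gold annotations by
# exact match per attribute, averaged over attributes and papers.
def attribute_accuracy(pred: dict, gold: dict) -> float:
    keys = gold.keys()
    return sum(pred.get(k) == gold[k] for k in keys) / len(keys)

def benchmark_score(preds: list[dict], golds: list[dict]) -> float:
    # Scale to 0-100 to match the table above.
    return 100 * sum(map(attribute_accuracy, preds, golds)) / len(golds)
```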

📝 Citation

If you find this work useful, please cite it as follows:

@misc{mole,
    title={MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs}, 
    author={Zaid Alyafeai and Maged S. Al-Shaibani and Bernard Ghanem},
    year={2025},
    eprint={2505.19800},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2505.19800}, 
}

📑 References

1. Alyafeai, Zaid, et al. "Masader: Metadata sourcing for Arabic text and speech data resources." arXiv preprint arXiv:2110.06744 (2021).