🚀 Introduction
MOLE is a framework that leverages Large Language Models (LLMs) to automatically extract metadata attributes from scientific papers covering datasets of languages other than Arabic. Our schema-driven methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output. The metadata schema builds on Masader [1], which targeted Arabic datasets.
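
To make the workflow concrete, here is a minimal sketch (not MOLE's actual code) of schema-driven extraction with output validation. The `query_llm` helper and the small schema subset shown are placeholders for illustration only.

```python
# Minimal sketch of schema-driven metadata extraction with validation.
# `query_llm` is a hypothetical helper that sends a prompt to any LLM
# and returns its raw text response.
import json
from jsonschema import validate

# Illustrative subset of a metadata schema; the real schema has many more attributes.
SCHEMA = {
    "type": "object",
    "properties": {
        "Name": {"type": "string"},
        "Language": {"type": "string"},
        "License": {"type": "string"},
        "Volume": {"type": "number"},
        "Tasks": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["Name", "Language"],
}

def extract_metadata(paper_text: str, query_llm) -> dict:
    """Prompt an LLM with the full paper and the target schema, then validate the returned JSON."""
    prompt = (
        "Extract the dataset metadata from the paper below as JSON matching this schema:\n"
        f"{json.dumps(SCHEMA, indent=2)}\n\nPaper:\n{paper_text}"
    )
    raw = query_llm(prompt)
    metadata = json.loads(raw)                   # raises if the model returns malformed JSON
    validate(instance=metadata, schema=SCHEMA)   # raises ValidationError on schema violations
    return metadata
```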

📄 Example
Here is an example of extracted metadata from a sample paper. Note that this example is simplified; the actual paper is more complex and contains more pages.
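
For illustration, a simplified extracted record might look like the following. The attribute names follow a Masader-style schema and the values are invented.

```python
# Illustrative only: a simplified extracted-metadata record with made-up values.
example_metadata = {
    "Name": "SampleCorpus",
    "Language": "jp",
    "License": "CC BY 4.0",
    "Volume": 120000,
    "Unit": "sentences",
    "Tasks": ["machine translation", "language modeling"],
}
```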
📹 Demo
You can find the demo at this link to run inference on a given paper.
📚 Dataset
We manually annotated 52 papers, split into 6 for validation and 46 for testing. The annotated papers cover datasets in Arabic, English, Japanese, French, and Russian, as well as multilingual datasets. The dataset is available on Hugging Face 🤗.
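
The annotations can be loaded with the Hugging Face `datasets` library. In this sketch the dataset id is a placeholder; use the repository linked above.

```python
# Sketch: loading the annotated papers from Hugging Face.
from datasets import load_dataset

dataset_id = "ORG/mole-annotated-papers"  # placeholder, not the actual repository id
ds = load_dataset(dataset_id)
print(ds)
```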
📊 Results
We evaluated the performance of 7 LLMs on the 46 annotated test papers spanning datasets of different languages. Our results show that Gemini 2.5 Pro achieves the best average score.
| Model | ar | en | jp | fr | ru | multi | Average |
|---|---|---|---|---|---|---|---|
| Gemma 3 27B | 56.69 | 60.00 | 66.32 | 70.00 | 66.67 | 53.68 | 60.30 |
| Llama 4 Maverick | 58.28 | 66.67 | 68.42 | 68.89 | 68.89 | 58.95 | 62.67 |
| Qwen 2.5 72B | 64.17 | 62.22 | 64.21 | 71.11 | 65.56 | 55.79 | 63.96 |
| DeepSeek V3 | 64.17 | 70.00 | 65.26 | 64.44 | 70.00 | 54.74 | 64.56 |
| Claude 3.5 Sonnet | 60.54 | 66.67 | 71.58 | 74.44 | 73.33 | 61.05 | 65.37 |
| GPT 4o | 64.17 | 71.11 | 69.47 | 70.00 | 73.33 | 60.00 | 66.68 |
| Gemini 2.5 Pro | 65.31 | 72.22 | 74.74 | 68.89 | 73.33 | 56.84 | 67.42 |
📝 Citation
If you find this work useful, please cite it as follows:
    @misc{mole,
      title={MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs},
      author={Zaid Alyafeai and Maged S. Al-Shaibani and Bernard Ghanem},
      year={2025},
      eprint={2505.19800},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.19800},
    }
📑 References
1. Alyafeai, Zaid, et al. "Masader: Metadata sourcing for Arabic text and speech data resources." arXiv preprint arXiv:2110.06744 (2021).