This project provides a powerful toolkit for extracting structured information from resume files. It supports various file formats and offers two distinct parsing methodologies: a highly customizable rule-based approach using Regex, and a modern, AI-driven approach using a Large Language Model (LLM).
The primary objective is to convert unstructured resume text from formats like .pdf, .docx, and .txt into a clean, structured JSON output.
- Multi-Format Support: Parses resumes from .pdf, .docx, and .txt files seamlessly.
- Dual Parsing Engines:
  - Regex-Based Parser: Offers granular control over data extraction through a simple and powerful XML configuration. Ideal for resumes with consistent formatting.
  - LLM-Based Parser: Leverages a pre-trained Large Language Model (google/flan-t5-large) to intelligently identify and extract information, adapting well to varied resume layouts.
- Structured Output: Consistently outputs extracted data in a clean, easy-to-use JSON format.
- Customizable Extraction:
  - For the Regex parser, you can define your own extraction rules in regex_config.xml without changing any Python code.
  - For the LLM parser, the list of attributes to be extracted can be easily modified in the script.
.
├── data/
│   └── YogeshKulkarniLinkedInProfile.pdf   # Place your resume files here
├── llm_resume_parser.py                    # Main script for LLM-based parsing
├── regex_resume_parser.py                  # Main script for Regex-based parsing
├── regex_config.xml                        # Configuration file for the Regex parser
└── README.md
Follow these instructions to set up and run the project on your local machine.
- Python 3.8+
- For the LLM Parser: A Hugging Face API Token
- Clone the repository:
    git clone <your-repository-url>
    cd <your-repository-name>
- Create and activate a virtual environment (recommended):
    # For Windows
    python -m venv venv
    .\venv\Scripts\activate
    # For macOS/Linux
    python3 -m venv venv
    source venv/bin/activate
- Install the required dependencies:
  Create a requirements.txt file with the following content:
    PyPDF2
    docx2txt
    langchain
    langchain-community
    langchain-huggingface
    transformers
    torch
    sentencepiece
  Then, install the packages:
    pip install -r requirements.txt
- Set up Environment Variables (for LLM Parser):
  The LLM parser requires a Hugging Face API token to interact with the model hub.
  Create a .env file in the root directory and add your token:
    HUGGINGFACEHUB_API_TOKEN="your_hf_token_here"
  The script will load this variable automatically.
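This README does not say how the script reads the .env file. A common choice is the python-dotenv package, which is not in the dependency list above. As a rough illustration only, the stdlib-only sketch below shows one way such loading could work; `load_env_file` is a hypothetical helper, not a function from this project.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Hypothetical helper: copy KEY=VALUE pairs from a .env file into os.environ."""
    env_path = Path(path)
    if not env_path.is_file():
        return {}
    loaded = {}
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        # Variables already set in the environment take precedence.
        os.environ.setdefault(key, value)
        loaded[key] = value
    return loaded


if __name__ == "__main__":
    load_env_file()
    token = os.environ.get("HUGGINGFACEHUB_API_TOKEN")
    print("Token found" if token else "HUGGINGFACEHUB_API_TOKEN is not set")
```

Running this from the project root is a quick way to confirm that the token is visible to Python before starting the LLM parser.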
Before running either parser, place the resume files you want to process inside the data folder. The scripts are pre-configured to run with a sample file named YogeshKulkarniLinkedInProfile.pdf.
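Both parsers need plain text before any extraction can happen. The sketch below shows one way to pick a reader by file extension, using PyPDF2 and docx2txt from the dependency list. It assumes PyPDF2 2.x or later (for `PdfReader` and `extract_text`). The names `extract_text`, `READERS`, `read_pdf`, `read_docx`, and `read_txt` are hypothetical; the project's scripts may be organized differently.

```python
from pathlib import Path


def read_pdf(path):
    # Imported lazily so .txt files can be read without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(str(path))
    # extract_text() may return None for pages without a text layer.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def read_docx(path):
    # Imported lazily for the same reason as PyPDF2 above.
    import docx2txt

    return docx2txt.process(str(path))


def read_txt(path):
    return Path(path).read_text(encoding="utf-8", errors="replace")


# Map each supported extension to the function that reads it.
READERS = {".pdf": read_pdf, ".docx": read_docx, ".txt": read_txt}


def extract_text(path):
    """Hypothetical helper: return the plain text of a resume file."""
    path = Path(path)
    reader = READERS.get(path.suffix.lower())
    if reader is None:
        raise ValueError(f"Unsupported file type: {path.suffix!r}")
    return reader(path)


if __name__ == "__main__":
    sample = Path("data") / "YogeshKulkarniLinkedInProfile.pdf"
    if sample.exists():
        print(extract_text(sample)[:500])
```

Keeping the extension-to-reader mapping in one dictionary means a new format only needs one new function and one new entry.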
This method uses the patterns defined in regex_config.xml to extract information.
- Customize (Optional): Open regex_config.xml to review or modify the regex patterns for each field you want to extract.
- Run the script:
    python regex_resume_parser.py
The script will process the sample resume and print the extracted structured data as a JSON object to the console.
This method uses a Hugging Face model to understand the context and extract the required fields.
- Ensure your API token is set as described in the installation steps.
- Run the script:
    python llm_resume_parser.py
The script will download the model (on the first run), process the sample resume, and print the extracted data in JSON format to the console.
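The README does not describe how the script prompts the model. One plausible approach, sketched below purely as an assumption, is to ask one question per attribute and collect the answers into a dictionary. `ATTRIBUTES`, `PROMPT_TEMPLATE`, `extract_attributes`, and `fake_generate` are hypothetical names. `fake_generate` stands in for the real model so the sketch runs offline; in practice a callable wrapping google/flan-t5-large would be passed instead.

```python
import json

# Assumption: illustrative attribute names; edit this list to change what is extracted.
ATTRIBUTES = ["Name", "Email", "PhoneNumber"]

PROMPT_TEMPLATE = (
    "Extract the {attribute} from the resume below. "
    "Answer with the value only, or 'None' if it is absent.\n\n"
    "Resume:\n{resume}"
)


def extract_attributes(resume_text, generate, attributes=ATTRIBUTES):
    """Query the model once per attribute and collect the answers in a dict.

    `generate` is any callable that maps a prompt string to a response string.
    """
    result = {}
    for attribute in attributes:
        prompt = PROMPT_TEMPLATE.format(attribute=attribute, resume=resume_text)
        answer = generate(prompt).strip()
        # Treat an empty answer or the literal 'None' as a missing value.
        result[attribute] = None if answer.lower() in ("", "none") else answer
    return result


def fake_generate(prompt):
    """Stand-in for the real model so the sketch runs without a download."""
    if "Extract the Email" in prompt:
        return "jane.doe@example.com"
    return "None"


if __name__ == "__main__":
    resume = "Jane Doe\njane.doe@example.com"
    print(json.dumps(extract_attributes(resume, fake_generate), indent=2))
```

Passing the model in as a callable keeps the prompt and JSON assembly logic testable without a token or network access.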
The Regex-Based Parser is controlled by the regex_config.xml file. This file allows you to define:
- Terms: The specific fields to extract (e.g., Name, Email, PhoneNumber).
- Methods: The extraction logic to use (e.g., univalue_extractor for single values).
- Patterns: The specific regex patterns used to find the information.
This design allows for easy adaptation to different resume formats or extraction requirements without modifying the Python source code.
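To make the term / method / pattern idea concrete, here is a minimal sketch. The XML layout in `SAMPLE_CONFIG` is a guess for illustration and is not the actual schema of regex_config.xml. `parse_resume` and `METHODS` are hypothetical names, and the `univalue_extractor` shown here is an assumed implementation of the single-value method mentioned above.

```python
import json
import re
import xml.etree.ElementTree as ET

# Hypothetical schema: the real regex_config.xml may be organized differently.
SAMPLE_CONFIG = r"""
<config>
  <term name="Email">
    <method name="univalue_extractor">
      <pattern>[\w.+-]+@[\w-]+\.[\w.-]+</pattern>
    </method>
  </term>
  <term name="PhoneNumber">
    <method name="univalue_extractor">
      <pattern>\+?\d[\d\s()-]{7,}\d</pattern>
    </method>
  </term>
</config>
"""


def univalue_extractor(text, patterns):
    """Return the first match of the first pattern that hits, else None."""
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match.group(0)
    return None


# Map method names used in the config to Python functions.
METHODS = {"univalue_extractor": univalue_extractor}


def parse_resume(text, config_xml):
    """Apply every term's method and patterns to the text; return a dict."""
    root = ET.fromstring(config_xml.strip())
    result = {}
    for term in root.findall("term"):
        method = term.find("method")
        patterns = [p.text for p in method.findall("pattern")]
        extractor = METHODS[method.get("name")]
        result[term.get("name")] = extractor(text, patterns)
    return result


if __name__ == "__main__":
    sample = "Jane Doe\njane.doe@example.com\n+1 555 123 4567"
    print(json.dumps(parse_resume(sample, SAMPLE_CONFIG), indent=2))
```

Because the patterns live in the XML string rather than in the functions, changing what is extracted only requires editing the configuration.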
The author provides no guarantee for the program's results. This is a utility script with room for improvement. Do not depend on it entirely for critical applications.
Copyright (C) 2025 Yogesh H Kulkarni