This project is a Tkinter-based GUI application that helps researchers and data analysts automatically classify scientific articles using Google's Gemini AI. It also retrieves articles from Scopus and generates visualizations from the results.
- Gemini AI Classification: Advanced topic classification using Google's Gemini model
- Scopus Integration: Fetch articles from Scopus using ISSN and year range
- Interactive Visualizations:
  - Word cloud generation from article abstracts
  - Interactive graphs showing article distribution over time
  - Citation analysis dashboards
- Simplified Workflow: Streamlined UI with fewer clicks for better user experience
- Python: Version 3.7 or higher
- Operating System: Windows, macOS, or Linux
- Internet Connection: Required for API calls to Scopus and Gemini
- Python 3.7+: Download from python.org
- Git: Download from git-scm.com
- pip: Usually comes with Python installation
- Scopus API Key: Free registration at Elsevier Developer Portal
- Gemini API Key: Free registration at Google AI Studio
git clone https://github.com/JakobSertcanli04/DataFiltrationProject
cd DataFiltrationProject
# On Windows
python -m venv venv
venv\Scripts\activate
# On macOS/Linux
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python -m spacy download en_core_web_sm
- Get Scopus API Key:
  - Visit the Elsevier Developer Portal
  - Create a free account
  - Generate an API key
  - Update the API key in source/scopus_data.py (line 15)
- Get Gemini API Key:
  - Visit Google AI Studio
  - Create a free account
  - Generate an API key
  - Update the API key in source/gemini.py (line 35)
python source/main.py
If the GUI window opens, installation is successful!
The application supports standard ISSN (International Standard Serial Number) format:
- Format: 8 digits with optional hyphen (e.g., 1879-0690 or 18790690)
- Validation: The program automatically verifies that the ISSN exists in the Scopus database
- Database: All ISSN queries are performed against the Scopus database via Elsevier API
- Journal Identification: ISSN is used to uniquely identify journals in the Scopus database
- Article Retrieval: All articles from the specified journal are fetched using the ISSN
- Data Validation: The program first verifies the ISSN exists before proceeding with article retrieval
# Example from the codebase
journal = scopus_instance.getJournal("1879-0690", years, citation_limit)
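The format rule can also be checked locally before any Scopus request is made. The sketch below is not taken from the project: `normalize_issn` is a hypothetical helper that accepts both spellings, verifies the standard ISSN check digit (the last character, which may be `X`), and returns the hyphenated form. It only confirms that the string is well formed; whether the journal exists in Scopus still has to be confirmed through the API.

```python
import re

# Hypothetical helper, not part of the project's code base.
_ISSN_PATTERN = re.compile(r"^(\d{4})-?(\d{3}[\dXx])$")


def normalize_issn(raw: str) -> str:
    """Return the ISSN as 'NNNN-NNNC' or raise ValueError if it is malformed."""
    match = _ISSN_PATTERN.match(raw.strip())
    if match is None:
        raise ValueError(f"not an 8-character ISSN: {raw!r}")
    chars = (match.group(1) + match.group(2)).upper()

    # Standard ISSN check digit: weight the first seven digits 8 down to 2;
    # the check character brings the weighted total to a multiple of 11.
    total = sum(int(d) * w for d, w in zip(chars[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    if chars[7] != expected:
        raise ValueError(f"ISSN check digit mismatch: {raw!r}")
    return f"{chars[:4]}-{chars[4:]}"


print(normalize_issn("18790690"))
# 1879-0690
```

Running this kind of check first catches typing mistakes without spending API quota on a request that cannot succeed.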
- Journal Websites: Most journals display their ISSN on their homepage
- Scopus Database: Search for journals directly on scopus.com
- ISSN International Centre: Visit issn.org for official ISSN database
python source/main.py
- Enter the ISSN of the journal you want to retrieve (e.g., 1879-0690)
- Specify the start and end years (e.g., 2020-2024)
- Set a citation limit (optional, default: 0)
- Choose where to save the CSV file
- Click "Fetch Articles"
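For orientation, these inputs map naturally onto a Scopus Search API request. The sketch below only builds the request parameters and sends nothing. `build_scopus_params` is a hypothetical helper, and the project's own `getJournal` may construct its request differently. It assumes the public Scopus Search endpoint and its `ISSN(...)` and `PUBYEAR` query fields.

```python
# Assumed endpoint: the public Scopus Search API.
SCOPUS_SEARCH_URL = "https://api.elsevier.com/content/search/scopus"


def build_scopus_params(issn: str, start_year: int, end_year: int,
                        start: int = 0, count: int = 25) -> dict:
    """Build query parameters for one page of Scopus Search results."""
    if start_year > end_year:
        raise ValueError("start_year must not be later than end_year")
    # The query uses strict comparisons on PUBYEAR, so the bounds are widened
    # by one year on each side to make the requested range inclusive.
    query = (f"ISSN({issn}) AND PUBYEAR > {start_year - 1} "
             f"AND PUBYEAR < {end_year + 1}")
    return {"query": query, "start": start, "count": count}


params = build_scopus_params("1879-0690", 2020, 2024)
print(params["query"])
# ISSN(1879-0690) AND PUBYEAR > 2019 AND PUBYEAR < 2025
```

The resulting dictionary could be passed as `params=` to `requests.get(SCOPUS_SEARCH_URL, ...)` together with an `X-ELS-APIKey` header carrying the Scopus key. Paging through results would mean increasing `start` by `count` on each request.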
- Select your CSV file using the browse button
- Enter topics for classification (comma-separated)
  - Example: Semiconductor,Battery,Printed Circuit Board,Electrical Waste,Water Refinement,Emission
- Set minimum citation count (default: 10)
- Click "Run Gemini Classification"
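The two classification inputs can be pictured as a small pre-processing step. The following sketch uses hypothetical helpers, `parse_topics` and `filter_by_citations`, which are not the project's code. It assumes that the minimum citation count is inclusive and that each article is a dictionary keyed by the CSV column names.

```python
from typing import Dict, List


def parse_topics(raw: str) -> List[str]:
    """Split a comma-separated topic string, dropping blanks and duplicates."""
    topics: List[str] = []
    for part in raw.split(","):
        topic = part.strip()
        if topic and topic not in topics:
            topics.append(topic)
    return topics


def filter_by_citations(rows: List[Dict[str, str]],
                        minimum: int = 10) -> List[Dict[str, str]]:
    """Keep rows whose CitationCount is at least `minimum` (assumed inclusive)."""
    return [row for row in rows if int(row["CitationCount"]) >= minimum]


print(parse_topics("Semiconductor, Battery,,Printed Circuit Board"))
# ['Semiconductor', 'Battery', 'Printed Circuit Board']
```

Stripping whitespace around each topic means `Battery` and ` Battery` are treated as the same label, which keeps the classification categories consistent.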
- Word Cloud: Click "Generate Word Cloud" to create a word cloud from article abstracts
- Graph: Click "Generate Graph" to create an interactive chart showing article distribution over time
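The data behind the distribution graph is a count of articles per publication year. A minimal sketch of that aggregation is shown below. `articles_per_year` is a hypothetical helper, and it assumes that values in the `Date` column start with a four-digit year (for example `2020-01-15`); the application's actual date handling may differ.

```python
from collections import Counter
from typing import Dict, Iterable


def articles_per_year(dates: Iterable[str]) -> Dict[int, int]:
    """Count articles per publication year from date strings like 'YYYY-MM-DD'."""
    counts: Counter = Counter()
    for value in dates:
        year = value.strip()[:4]
        if year.isdigit():  # skip empty or malformed dates
            counts[int(year)] += 1
    return dict(sorted(counts.items()))


print(articles_per_year(["2020-01-15", "2021-06-01", "2020-11-30"]))
# {2020: 2, 2021: 1}
```

The sorted dictionary can be handed directly to a plotting library, with the keys as the x-axis and the values as bar heights.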
Your input CSV must have the following column headers:
- DOI
- Title
- Abstract
- Date
- Link
- CitationCount
- Label (will be added after classification)
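Before starting a classification run, it can be worth confirming that a CSV file has the expected header row. The sketch below is a hypothetical standalone check using only the standard library; it is not part of the application. `Label` is left out of the required set because it is only added after classification.

```python
import csv
import io
from typing import List

# Column names taken from the list above; Label is added later by the classifier.
REQUIRED_COLUMNS = ["DOI", "Title", "Abstract", "Date", "Link", "CitationCount"]


def missing_columns(csv_file) -> List[str]:
    """Return the required column names that are absent from the CSV header row."""
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return [name for name in REQUIRED_COLUMNS if name not in present]


sample = io.StringIO(
    "DOI,Title,Abstract,Date,Link\n"
    "10.1000/x,T,A,2020-01-01,http://example.org\n"
)
print(missing_columns(sample))
# ['CitationCount']
```

For a file on disk, the same function can be called inside `with open(path, newline="", encoding="utf-8") as f:`. An empty list means every required column is present.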
- numpy: Numerical computing
- pandas: Data manipulation and analysis
- requests: HTTP library for API calls
- matplotlib: Basic plotting library
- plotly: Interactive visualizations
- wordcloud: Word cloud generation
- google-generativeai: Gemini AI integration
- tqdm: Progress bars
- scikit-learn: Machine learning utilities
- pillow: Image processing
- spacy: Natural language processing
- tkinter: GUI framework (usually comes with Python)
- API Key Errors:
  - Ensure the API keys are correctly updated in the respective files
  - Verify that the API keys are active and have sufficient quota
- ISSN Not Found:
  - Verify the ISSN format (8 digits, optional hyphen)
  - Check whether the journal exists in the Scopus database
  - Ensure the ISSN is active and not discontinued
- Installation Issues:
  - Make sure Python 3.7+ is installed
  - Use a virtual environment to avoid dependency conflicts
  - Update pip: pip install --upgrade pip
- Memory Issues:
  - For large datasets, increase system RAM
  - Process smaller year ranges at a time
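Processing smaller year ranges amounts to splitting one long range into several short ones and fetching them one after another. The sketch below shows one way to compute those sub-ranges. `split_year_range` is a hypothetical helper, and the chunk size of four years is an arbitrary illustration rather than a recommendation from the project.

```python
from typing import List, Tuple


def split_year_range(start_year: int, end_year: int,
                     chunk_size: int = 2) -> List[Tuple[int, int]]:
    """Split an inclusive year range into consecutive inclusive sub-ranges."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if start_year > end_year:
        raise ValueError("start_year must not be later than end_year")
    chunks = []
    year = start_year
    while year <= end_year:
        last = min(year + chunk_size - 1, end_year)
        chunks.append((year, last))
        year = last + 1
    return chunks


print(split_year_range(2015, 2024, chunk_size=4))
# [(2015, 2018), (2019, 2022), (2023, 2024)]
```

Each pair can then be entered as the start and end year of a separate fetch, with each run saved to its own CSV file.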
- Check the log output in the application for detailed error messages
- Ensure all prerequisites are properly installed
- Verify internet connectivity for API calls
- For large datasets, processing may take time - monitor the log output
- The Gemini classifier requires an internet connection
- The Scopus fetching feature requires API access (follow Scopus Terms & Conditions)
- Visualizations are saved as files and opened in your default browser
- Word clouds are generated for articles with sufficient citations (default: 15+ citations)
- ISSN queries are performed against the Scopus database via Elsevier's API