Synthetic Data Generation for Learning Analytics

Bachelor thesis project of Tobias Hartel

Supervisor: Dr. Jakub Kuzilek
Reviewer 1: Prof. Dr. Niels Pinkwart
Reviewer 2: Prof. Dr. Gergana Vladova

Project Overview

In this thesis project, six synthetic data generation (SDG) methods are evaluated using the three-dimensional evaluation approach proposed by Liu et al. [1], which covers resemblance, utility, and privacy.

The selected SDG methods are Synthpop (non-parametric) [2], DataSynthesizer [3], and four methods from the Synthetic Data Vault (SDV) [4]: GaussianCopula, CopulaGAN, TVAE, and CTGAN. The methods are evaluated on five educational datasets of different sizes; see the original_data folder for details on each dataset. Each dataset has its own Jupyter notebook that carries out the evaluation, so the results accumulate one dataset at a time.

Usage

To rerun the generation and evaluation from scratch for one dataset, first delete the existing generated datasets and results for that dataset. This includes the following files and entries:

in scripts/data/original_data, inside the directory for the respective dataset: test_data.csv and train_data.csv
in scripts/data/synthetic_data, inside the directory for the respective dataset: all six synthetic datasets (e.g. ctgan.csv)
in scripts/data/results/plots, for both dcr and mia: all plots for the respective dataset
in scripts/data/results/tables, inside the CSV file for each evaluation metric: the six entries of the respective dataset (e.g. for dataset 1_university_of_jordan, all entries where Dataset is 1)

Then the respective notebook can be run.
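
The cleanup steps listed above could be scripted roughly as follows. This is a minimal sketch, not part of the repository: the function name clean_dataset is hypothetical, the paths follow the Repository Structure section (everything under scripts/data/), and it assumes that plot file names start with the dataset number followed by an underscore and that each table has a Dataset column holding that number. Adjust these assumptions to the actual file names before use.

```python
import csv
from pathlib import Path


def clean_dataset(scripts_dir, dataset_dir_name):
    """Remove generated files and result entries for one dataset (hypothetical helper)."""
    data = Path(scripts_dir) / "data"
    # "1_university_of_jordan" -> "1"
    dataset_id = dataset_dir_name.split("_", 1)[0]

    # 1. Train and test splits of the original data
    for name in ("train_data.csv", "test_data.csv"):
        (data / "original_data" / dataset_dir_name / name).unlink(missing_ok=True)

    # 2. All synthetic datasets generated for this dataset (e.g. ctgan.csv)
    for path in (data / "synthetic_data" / dataset_dir_name).glob("*.csv"):
        path.unlink()

    # 3. Plots for dcr and mia.
    #    ASSUMPTION: plot file names start with "<dataset number>_".
    for kind in ("dcr", "mia"):
        for path in (data / "results" / "plots" / kind).glob(f"{dataset_id}_*"):
            path.unlink()

    # 4. Rows of this dataset in every per-metric CSV table.
    #    ASSUMPTION: each table has a "Dataset" column holding the dataset number.
    for table in (data / "results" / "tables").glob("*.csv"):
        with table.open(newline="") as fh:
            reader = csv.DictReader(fh)
            fieldnames = reader.fieldnames
            kept = [row for row in reader
                    if (row.get("Dataset") or "").strip() != dataset_id]
        if fieldnames is None:  # empty file, nothing to rewrite
            continue
        with table.open("w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(kept)
```

For example, `clean_dataset("scripts", "1_university_of_jordan")` would prepare dataset 1 for a fresh run of its notebook.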

Once the evaluation has been run for all five datasets, the results can be merged with the merge_results notebook, which presents them as a dataframe and a LaTeX table and saves them in CSV format inside final_results.
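
The merge_results notebook is the reference for this step. Purely as an illustration of the idea, the sketch below stacks the per-metric CSV files into one long table using only the standard library. The function name merge_tables, the added Metric column, and the output file name in the usage note are assumptions, not the notebook's actual behavior.

```python
import csv
from pathlib import Path


def merge_tables(tables_dir, out_file):
    """Stack every CSV in tables_dir into one CSV; return the number of rows written."""
    merged = []
    columns = ["Metric"]  # ASSUMPTION: the metric is identified by the file name
    for table in sorted(Path(tables_dir).glob("*.csv")):
        with table.open(newline="") as fh:
            for row in csv.DictReader(fh):
                merged.append({"Metric": table.stem, **row})
                # Collect columns in first-seen order so differing tables still fit
                for col in row:
                    if col not in columns:
                        columns.append(col)

    out_file = Path(out_file)
    out_file.parent.mkdir(parents=True, exist_ok=True)
    with out_file.open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(merged)
    return len(merged)
```

A call such as `merge_tables("scripts/data/results/tables", "scripts/final_results/combined_tables/all_metrics.csv")` would write one combined file; the file name all_metrics.csv is hypothetical.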

Repository Structure

resources/: contains all the papers that were used as references.
scripts/: contains the original datasets, synthetic datasets, the evaluation notebooks and all of the generated data
- data/: contains both the real and synthetic datasets and the evaluation results
  - original_data/: contains the original datasets and the respective train and test splits
    - 1_university_of_jordan/
    - 2_fictional_students_perfomance/
    - 3_edge_hill_university/
    - 4_open_university/
    - 5_portuguese_school/
  - results/: contains the temporary results
    - plots/
      - dcr/
      - mia/
    - tables/
  - split_ratio_test/: contains results on dataset 1 testing different split ratios
    - 60_40/
    - 70_30/
    - 80_20/
    - 90_10/
  - synthetic_data/: contains the generated synthetic data
    - 1_university_of_jordan/
    - 2_fictional_students_perfomance/
    - 3_edge_hill_university/
    - 4_open_university/
    - 5_portuguese_school/
- final_results/: contains the final results of all datasets after being merged
  - combined_tables/
  - plots/
    - dcr/
    - mia/
    - utility/
- notebooks/: contains the Jupyter notebooks for each dataset and one notebook for merging the final results
- src/: contains Python scripts for evaluation and SDG
  - evaluation/
  - generation/

References

[1] Qinyi Liu, Mohammad Khalil, Ronas Shakya, and Jelena Jovanovic. 2024. Scaling While Privacy Preserving: A Comprehensive Synthetic Tabular Data Generation and Evaluation in Learning Analytics. In The 14th Learning Analytics and Knowledge Conference (LAK ’24), March 18–22, 2024, Kyoto, Japan. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3636555.3636921

[2] Nowok, B., G.M. Raab & C. Dibben (2016), synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74:1-26; DOI:10.18637/jss.v074.i11. Available at: https://www.jstatsoft.org/article/view/v074i11

[3] Haoyue Ping, Julia Stoyanovich, and Bill Howe. 2017. DataSynthesizer: Privacy-Preserving Synthetic Datasets. In Proceedings of SSDBM ’17, Chicago, IL, USA, June 27-29, 2017, 5 pages. DOI: http://dx.doi.org/10.1145/3085504.3091117

[4] N. Patki, R. Wedge and K. Veeramachaneni, "The Synthetic Data Vault," 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Montreal, QC, Canada, 2016, pp. 399-410, doi: 10.1109/DSAA.2016.49.
