Synthetic Data Generation for Learning Analytics

Bachelor thesis project of Tobias Hartel

  • Supervisor: Dr. Jakub Kuzilek
  • Reviewer 1: Prof. Dr. Niels Pinkwart
  • Reviewer 2: Prof. Dr. Gergana Vladova

Project Overview

This thesis project evaluates six synthetic data generation (SDG) methods using the three-dimensional evaluation approach proposed by Liu et al. [1], which covers resemblance, utility, and privacy.

The selected SDG methods are Synthpop non-parametric [2], DataSynthesizer [3], and four methods from the Synthetic Data Vault (SDV) [4]: GaussianCopula, CopulaGAN, TVAE, and CTGAN. The methods are evaluated on five educational datasets of different sizes. For more information on the specific datasets, see the original_data folder. Each dataset has its own Jupyter notebook that carries out the evaluation, and the results are accumulated dataset by dataset.

Usage

To rerun the generation and evaluation from scratch for one dataset, first delete the existing generated datasets and results for that dataset. This includes the following files and entries:

  • in scripts/data/original_data, inside the directory of the respective dataset: test_data.csv and train_data.csv
  • in scripts/data/synthetic_data, inside the directory of the respective dataset: all six synthetic datasets (e.g. ctgan.csv)
  • in scripts/data/results/plots, for both dcr and mia: all plots of the respective dataset
  • in scripts/data/results/tables, in the CSV file of each evaluation metric: the six entries of the respective dataset (e.g. for dataset 1_university_of_jordan, all entries where Dataset is 1)

Then the respective notebook can be run.
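
The cleanup steps above could be scripted roughly as follows. This is a minimal sketch, not part of the repository: the function name reset_dataset is hypothetical, the folder layout is taken from the Repository Structure section (data folder at scripts/data), and the Dataset column is assumed to store the leading number of the dataset folder name as plain text. Plot file names are not documented here, so the plots in dcr and mia still have to be removed by hand.

```python
import csv
from pathlib import Path


def reset_dataset(data_dir, dataset):
    """Delete the generated files of one dataset and drop its table rows.

    data_dir: path to the data folder (assumed to be scripts/data).
    dataset:  dataset folder name, e.g. "1_university_of_jordan".
    Returns the list of paths that were deleted or rewritten.
    """
    data_dir = Path(data_dir)
    # Assumption: the "Dataset" column holds the folder's leading number.
    dataset_id = dataset.split("_", 1)[0]
    touched = []

    # 1. Train and test splits of the original data.
    for name in ("train_data.csv", "test_data.csv"):
        path = data_dir / "original_data" / dataset / name
        if path.exists():
            path.unlink()
            touched.append(path)

    # 2. All synthetic datasets generated for this dataset (e.g. ctgan.csv).
    for path in sorted((data_dir / "synthetic_data" / dataset).glob("*.csv")):
        path.unlink()
        touched.append(path)

    # 3. Rows of this dataset in the CSV file of every evaluation metric.
    for table in sorted((data_dir / "results" / "tables").glob("*.csv")):
        with table.open(newline="") as handle:
            reader = csv.DictReader(handle)
            fieldnames = reader.fieldnames
            rows = list(reader)
        kept = [r for r in rows if (r.get("Dataset") or "").strip() != dataset_id]
        if len(kept) != len(rows):
            with table.open("w", newline="") as handle:
                writer = csv.DictWriter(handle, fieldnames=fieldnames)
                writer.writeheader()
                writer.writerows(kept)
            touched.append(table)

    return touched
```

Called as reset_dataset("scripts/data", "1_university_of_jordan") from the repository root, it would leave the original dataset file untouched and only remove the generated artifacts listed above.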

Once the evaluation has been run for all five datasets, the results can be merged with the merge_results notebook, which presents them as a dataframe and a LaTeX table and saves them as a CSV file inside final_results.
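
For illustration only, the merging step might look like the sketch below. It is not the code of the merge_results notebook: the function name merge_tables, the added Metric column, and the output file names merged.csv and merged.tex are hypothetical, and it uses only the standard library rather than a dataframe. It assumes one CSV file per evaluation metric in the tables folder, as described in the Usage section.

```python
import csv
from pathlib import Path


def merge_tables(tables_dir, out_dir):
    """Combine every metric CSV in tables_dir into one long table.

    Each row gains a "Metric" column holding the source file's stem.
    Writes merged.csv and merged.tex to out_dir and returns the rows.
    """
    tables_dir, out_dir = Path(tables_dir), Path(out_dir)
    rows, columns = [], ["Metric"]
    for table in sorted(tables_dir.glob("*.csv")):
        with table.open(newline="") as handle:
            for row in csv.DictReader(handle):
                rows.append({"Metric": table.stem, **row})
                for name in row:
                    if name is not None and name not in columns:
                        columns.append(name)

    out_dir.mkdir(parents=True, exist_ok=True)
    with (out_dir / "merged.csv").open("w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=columns, restval="", extrasaction="ignore"
        )
        writer.writeheader()
        writer.writerows(rows)

    # Minimal LaTeX tabular. Cell values are written verbatim, so special
    # characters such as "_" or "%" are not escaped here.
    lines = ["\\begin{tabular}{" + "l" * len(columns) + "}"]
    lines.append(" & ".join(columns) + " \\\\ \\hline")
    for row in rows:
        lines.append(" & ".join(row.get(c) or "" for c in columns) + " \\\\")
    lines.append("\\end{tabular}")
    (out_dir / "merged.tex").write_text("\n".join(lines) + "\n")
    return rows
```

A long format with one row per metric entry keeps the sketch independent of the actual column names in the metric tables, which are not documented here apart from Dataset.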

Repository Structure

  • resources/: contains all the papers that were used as references.

  • scripts/: contains the original datasets, synthetic datasets, the evaluation notebooks and all of the generated data

    • data/: contains both the real and synthetic datasets and the evaluation results
      • original_data/: contains the original datasets and the respective train and test splits
        • 1_university_of_jordan/
        • 2_fictional_students_perfomance/
        • 3_edge_hill_university/
        • 4_open_university/
        • 5_portuguese_school/
      • results/: contains the temporary results
        • plots/
          • dcr/
          • mia/
        • tables/
      • split_ratio_test/: contains results on dataset 1 testing different split ratios
        • 60_40/
        • 70_30/
        • 80_20/
        • 90_10/
      • synthetic_data/: contains the generated synthetic data
        • 1_university_of_jordan/
        • 2_fictional_students_perfomance/
        • 3_edge_hill_university/
        • 4_open_university/
        • 5_portuguese_school/
    • final_results/: contains the final results of all datasets after being merged
      • combined_tables/
      • plots/
        • dcr/
        • mia/
        • utility/
    • notebooks/: contains the Jupyter notebooks for each dataset and one notebook for merging the final results
    • src/: contains Python scripts for evaluation and SDG
      • evaluation/
      • generation/

References

[1] Qinyi Liu, Mohammad Khalil, Ronas Shakya, and Jelena Jovanovic. 2024. Scaling While Privacy Preserving: A Comprehensive Synthetic Tabular Data Generation and Evaluation in Learning Analytics. In The 14th Learning Analytics and Knowledge Conference (LAK ’24), March 18–22, 2024, Kyoto, Japan. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3636555.3636921

[2] Nowok, B., G.M. Raab & C. Dibben (2016), synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74:1-26; DOI:10.18637/jss.v074.i11. Available at: https://www.jstatsoft.org/article/view/v074i11

[3] Haoyue Ping, Julia Stoyanovich, and Bill Howe. 2017. DataSynthesizer: Privacy-Preserving Synthetic Datasets. In Proceedings of SSDBM ’17, Chicago, IL, USA, June 27-29, 2017, 5 pages. DOI: http://dx.doi.org/10.1145/3085504.3091117

[4] N. Patki, R. Wedge and K. Veeramachaneni, "The Synthetic Data Vault," 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Montreal, QC, Canada, 2016, pp. 399-410, doi: 10.1109/DSAA.2016.49.
