Anonymizing proprietary data is crucial when publishing datasets for session-based recommendation with Graph Neural Networks (GNNs), both to comply with GDPR requirements and to protect business secrets. Using proprietary data from an industry partner, this thesis will create a publishable, anonymized dataset that retains the characteristics needed for effective session-based recommendation while preventing traceability to individual users and disclosure of sensitive business information. Anonymization techniques such as data aggregation, noise addition, and generalization will be employed. In addition, synthetic data generation methods will be explored to produce a dataset that mirrors the statistical properties of the original data without compromising confidentiality. Preserving graph structure and session patterns is essential to maintaining the dataset’s utility for GNN research. The project will systematically evaluate anonymization and synthetic data generation techniques with respect to how well they balance data utility against anonymity. The expected outcome is a robustly anonymized benchmark dataset, created with state-of-the-art methodology and suitable for research and publication, that contributes to machine learning on graphs in general and to sequential recommendation in particular.
Requirements: Python, interest in data anonymization, interest in graph neural networks.
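
As a rough illustration of the kind of processing described above, the following minimal Python sketch applies three simple steps to a click log: keyed hashing of identifiers, generalization of timestamps to coarse buckets, and suppression of rare items. It then counts item-to-item transitions, which form the edges of a session graph, so that structure before and after can be compared. Everything here is an assumption for illustration: the event layout `(session_id, item_id, unix_timestamp)`, the function names, and the parameters `min_item_count` and `time_bucket` are hypothetical and not taken from the partner data. Note that keyed hashing is only pseudonymization and would not by itself make a dataset anonymous.

```python
import hashlib
import hmac
from collections import Counter


def pseudonymize(value: str, secret: bytes) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256), truncated to 12 hex chars."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()[:12]


def anonymize_sessions(events, secret, min_item_count=2, time_bucket=3600):
    """Sketch of pseudonymization, generalization, and suppression.

    events: iterable of (session_id, item_id, unix_timestamp) tuples,
            ordered by time within each session (assumed layout).
    """
    events = list(events)
    item_counts = Counter(item_id for _, item_id, _ in events)
    result = []
    for session_id, item_id, ts in events:
        # Suppression: drop items that occur too rarely to hide in a crowd.
        if item_counts[item_id] < min_item_count:
            continue
        result.append((
            pseudonymize(session_id, secret),
            pseudonymize(item_id, secret),
            ts - ts % time_bucket,  # generalization: round down to the bucket start
        ))
    return result


def transition_counts(events):
    """Count directed item-to-item transitions within each session."""
    last_item = {}
    edges = Counter()
    for session_id, item_id, _ in events:
        if session_id in last_item:
            edges[(last_item[session_id], item_id)] += 1
        last_item[session_id] = item_id
    return edges
```

Comparing `transition_counts` on the original and the processed events gives a first, crude utility check. One caveat of this sketch is that suppressing an item in the middle of a session links its two neighbours directly, which introduces a transition that was not present in the original data.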