Project Introduction
Project Summary:
Synthetify is an advanced ML application built with Streamlit that generates synthetic data and imputes missing values for time series datasets. It provides an interactive interface where users upload their data, run preprocessing steps, and generate synthetic data using a range of statistical and machine learning algorithms.
Core Functionalities:
- Data Upload and Preprocessing:
Users can upload CSV files containing time series data.
Automatic identification of timestamp columns and frequency inference.
- Preprocessing steps include:
Handling missing values (using KNN Imputation).
Outlier detection and replacement (using Z-score and IQR methods).
Removal of duplicate values and timestamps.
Handling null columns (dropping columns with over 50% null values).
Option to reintegrate missing timestamps.
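The preprocessing steps listed above can be sketched with pandas alone. This is a minimal illustration, not the code in data_preprocess.py: the function name `preprocess`, its parameters, and the choice to mark IQR outliers as missing (so that a later imputer such as the KNN imputer can fill them) are assumptions made for the example. Only the IQR rule is shown; the Z-score variant and the imputation step itself are omitted.

```python
import pandas as pd


def preprocess(df, ts_col, freq="D", max_null_frac=0.5, iqr_k=1.5):
    """Hypothetical helper: clean a time series DataFrame step by step."""
    out = df.copy()
    out[ts_col] = pd.to_datetime(out[ts_col])

    # Drop columns whose share of null values exceeds the threshold (50% by default).
    out = out.loc[:, out.isna().mean() <= max_null_frac]

    # Remove fully duplicated rows, then rows that repeat a timestamp.
    out = out.drop_duplicates().drop_duplicates(subset=ts_col, keep="first")
    out = out.sort_values(ts_col).set_index(ts_col)

    # Mark IQR outliers in numeric columns as missing so an imputer can replace them.
    numeric = out.select_dtypes(include="number").columns
    if len(numeric) > 0:
        q1 = out[numeric].quantile(0.25)
        q3 = out[numeric].quantile(0.75)
        iqr = q3 - q1
        too_low = out[numeric].lt(q1 - iqr_k * iqr, axis="columns")
        too_high = out[numeric].gt(q3 + iqr_k * iqr, axis="columns")
        out[numeric] = out[numeric].mask(too_low | too_high)

    # Reintegrate missing timestamps: rows for absent dates appear as all-NaN.
    full_index = pd.date_range(out.index.min(), out.index.max(), freq=freq)
    return out.reindex(full_index)
```

Marking outliers as missing rather than overwriting them keeps the replacement policy in one place, since the same imputer then handles both original gaps and removed outliers.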
- Dataset Analysis and Visualization:
- Provides detailed dataset information including:
Dataset shape and data types.
Statistical summary (using df.describe()).
Identification of timestamp column and inferred frequency.
Overview of statistical properties (mean, median, mode, etc.).
Visualization of null values and outliers.
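Timestamp identification and frequency inference, mentioned above, might look roughly like the following. The helper names `find_timestamp_column` and `infer_frequency` and the 90% parse threshold are assumptions for illustration; the project's own detection logic may differ. The sketch relies on `pd.to_datetime` and `pd.infer_freq`, which returns None when no regular frequency fits the data.

```python
import pandas as pd


def find_timestamp_column(df, min_parsed=0.9):
    """Hypothetical helper: return the first column that looks like timestamps."""
    for name in df.columns:
        col = df[name]
        if pd.api.types.is_datetime64_any_dtype(col):
            return name
        # Numeric columns are skipped: to_datetime would read them as epoch offsets.
        if pd.api.types.is_numeric_dtype(col):
            continue
        parsed = pd.to_datetime(col, errors="coerce")
        if parsed.notna().mean() >= min_parsed:
            return name
    return None


def infer_frequency(df, ts_col):
    """Hypothetical helper: infer a pandas frequency string such as 'D'."""
    stamps = pd.DatetimeIndex(pd.to_datetime(df[ts_col]).dropna())
    stamps = stamps.unique().sort_values()
    if len(stamps) < 3:
        # pd.infer_freq needs at least three timestamps.
        return None
    return pd.infer_freq(stamps)
```

If the inferred frequency is None, for example because timestamps are missing, one option is to reintegrate the missing timestamps first and infer again.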
- Synthetic Data Generation:
- Offers two main functionalities:
- Imputation of Missing Values:
Evaluates various imputation models to find the best fit.
- Models include:
KDE, Inverse Transform Sampling, Copula, Monte Carlo Simulation, Markov Chain Monte Carlo.
Forward Fill, Backward Fill, Linear Interpolation.
KNN Imputer, MICE Imputer, Random Forest Imputer, Iterative Imputer.
Applies the selected model to impute missing values.
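One way to compare imputation models is to hide a share of the known values, impute them, and measure the error on the hidden points. The sketch below does this for the three simplest models in the list (forward fill, backward fill, linear interpolation). The holdout scheme, the mean absolute error score, and the names `IMPUTERS` and `score_imputers` are assumptions for illustration; the project describes its own selection in terms of statistical properties and percentage change, so treat this as an alternative check rather than the app's method.

```python
import numpy as np
import pandas as pd

# Candidate imputers; each fills a Series and falls back to the opposite
# direction so that leading or trailing gaps are also filled.
IMPUTERS = {
    "forward_fill": lambda s: s.ffill().bfill(),
    "backward_fill": lambda s: s.bfill().ffill(),
    "linear_interpolation": lambda s: s.interpolate(
        method="linear", limit_direction="both"
    ),
}


def score_imputers(series, holdout_frac=0.2, seed=0):
    """Hypothetical helper: mean absolute error of each imputer on hidden values."""
    rng = np.random.default_rng(seed)
    known = np.flatnonzero(series.notna().to_numpy())
    n_hidden = max(1, int(len(known) * holdout_frac))
    hidden = rng.choice(known, size=n_hidden, replace=False)

    # Hide a random subset of known values, keeping the truth for scoring.
    masked = series.copy()
    masked.iloc[hidden] = np.nan

    scores = {}
    for name, impute in IMPUTERS.items():
        filled = impute(masked)
        error = (filled.iloc[hidden] - series.iloc[hidden]).abs().mean()
        scores[name] = float(error)
    return scores
```

The model with the lowest score can then be applied to the real gaps, for example with `IMPUTERS[min(scores, key=scores.get)](series)`.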
- Generation of Future Data:
Determines the optimal model for generating future data based on data characteristics.
- Models include:
KDE, Inverse Transform Sampling, Copula, Monte Carlo Simulation, Markov Chain Monte Carlo.
Generates synthetic data for a user-defined number of future days.
Option to calculate confidence intervals (range) for generated data.
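As a rough illustration of future data generation, the sketch below applies inverse transform sampling to the empirical distribution of the observed values: uniform random numbers are mapped through the empirical quantile function. It assumes a daily `DatetimeIndex` and treats observations as independent, ignoring temporal dependence, so it is a simplification and not the logic in ITS.py. The function name, the number of simulated paths, and the 2.5 to 97.5 percentile range are all assumptions.

```python
import numpy as np
import pandas as pd


def generate_future_its(series, n_days, n_paths=200, seed=0):
    """Hypothetical helper: sample future daily values with a percentile range."""
    rng = np.random.default_rng(seed)
    observed = series.dropna().to_numpy()

    # Inverse transform sampling: u ~ Uniform(0, 1) mapped through the
    # empirical quantile function of the observed values.
    u = rng.uniform(0.0, 1.0, size=(n_paths, n_days))
    paths = np.quantile(observed, u.ravel()).reshape(u.shape)

    # Future timestamps start the day after the last observation.
    start = series.index[-1] + pd.Timedelta(days=1)
    future_index = pd.date_range(start, periods=n_days, freq="D")

    # Range for each future day, taken across all simulated paths.
    lower, upper = np.percentile(paths, [2.5, 97.5], axis=0)
    return pd.DataFrame(
        {"generated": paths[0], "lower": lower, "upper": upper},
        index=future_index,
    )
```

Because every sample is drawn from the observed distribution, generated values never fall outside the historical minimum and maximum; models such as copulas or Markov chain methods are needed to capture dependence between days or columns.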
- Model Evaluation and Selection:
Utilizes statistical properties and percentage change calculations to evaluate the performance of different models.
Selects the model that best preserves the original data characteristics.
Provides transparency by showing the selected model and the reasoning behind the selection.
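The selection idea described above, comparing statistical properties and their percentage change, can be sketched as follows. The set of properties, the averaging of absolute percentage changes into a single score, and the function names are assumptions for illustration; evaluate_models.py may weigh properties differently.

```python
import numpy as np
import pandas as pd


def summary_properties(series):
    """Hypothetical helper: a few statistical properties of a Series."""
    s = series.dropna()
    return pd.Series(
        {
            "mean": s.mean(),
            "median": s.median(),
            "std": s.std(),
            "min": s.min(),
            "max": s.max(),
        }
    )


def percentage_change(original, candidate):
    """Absolute percentage change of each property relative to the original."""
    base = summary_properties(original)
    new = summary_properties(candidate)
    # Properties that are zero in the original are skipped (NaN) to avoid
    # division by zero.
    return (new - base).abs() / base.abs().replace(0, np.nan) * 100.0


def select_best_model(original, candidates):
    """Pick the candidate whose properties drift least from the original."""
    scores = {
        name: float(percentage_change(original, data).mean())
        for name, data in candidates.items()
    }
    best = min(scores, key=scores.get)
    return best, scores
```

Returning the full score dictionary alongside the winner supports the transparency goal: the interface can show each model's average drift and why the selected one ranked first.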
- User Interface and Interaction:
Streamlit-based interactive interface for easy data upload and processing.
Progress bars and success/error messages to enhance user experience.
Download options for preprocessed, imputed, and generated datasets.
File Structure:
AppConfig.py: Configures the Streamlit app’s layout and settings.
Copula.py: Implements synthetic data generation and imputation using the Copula method.
data_desc.py: Contains functions for detailed data analysis and description.
data_preprocess.py: Provides functions for preprocessing the dataset.
data_properties_utils.py: Includes functions to calculate statistical properties of data.
evaluate_models.py: Evaluates the performance of different synthetic data generation models.
Imputation.py: Implements various imputation techniques for handling missing values.
ITS.py: Implements Inverse Transform Sampling (ITS) for synthetic data generation and imputation.
KDE.py: Implements Kernel Density Estimation (KDE) for synthetic data generation and imputation.
MonteCarlo.py: Implements Monte Carlo Simulation and Markov Chain Monte Carlo for synthetic data generation and imputation.
preprocessing_button.py: Provides the Streamlit button and logic for triggering data preprocessing.
Synthesis_main.py: Contains the main logic for selecting and executing the best model for data generation/imputation.
Synthetify.py: Main entry point for the Streamlit app; handles user interactions and application flow.
Potential Applications:
Forecasting and prediction in time series data.
Handling missing data in real-world datasets.
Generating realistic synthetic datasets for research and development.
Augmenting existing datasets for machine learning model training.
Future Enhancements:
Support for more advanced time series models.
Integration with visualization libraries for interactive data exploration.
Implementation of additional evaluation metrics.
Support for categorical time series data.
Prediction and forecasting capabilities on time series data.
Generation of text data using NLP techniques.
Flowchart Link: Flowchart & Sequence Flow