models package

Submodules

models.Copula module

This file contains functions to generate synthetic data using copula-based methods.

Functions:
  • generate_data_copula: Function to fit a Gaussian copula to original data and generate new samples.

  • generate_synthetic_data_for_copula: Function to run comparison of original and generated data distributions using copula.

  • generate_future_data_copula: Function to generate future data using copula for imputation.

  • impute_missing_data_copula: Function to impute missing values using copula-based generated data.

Dependencies:
  • numpy

  • pandas

  • GaussianMultivariate (copulas)

  • data_preprocess

  • data_desc

Usage:
  • To generate synthetic data using copula-based methods, use the functions in this file.

models.Copula.generate_data_copula(original_data, n_samples=10000, hyperparameters=None)

Function to fit a Gaussian copula to original data and generate new samples.

Parameters:

original_data : numpy array

Original data to fit the copula.

n_samples : int, optional (default=10000)

Number of samples to generate.

hyperparameters : dict, optional

Hyperparameters for copula fitting.

Returns:

numpy array

Generated synthetic data.
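The fit-then-sample idea behind a Gaussian copula can be sketched as follows. This is a minimal illustration, not the module's implementation: the module lists `GaussianMultivariate` from the copulas package as a dependency, whereas this sketch uses only NumPy and the standard library, models the marginals empirically, and ignores `hyperparameters`. The function name `gaussian_copula_sample` and the `seed` argument are hypothetical.

```python
import numpy as np
from statistics import NormalDist


def gaussian_copula_sample(original_data, n_samples=10000, seed=0):
    """Fit a Gaussian copula with empirical marginals and draw new rows.

    Hypothetical helper for illustration; expects a 2-D array (rows x columns).
    """
    x = np.asarray(original_data, dtype=float)
    n, d = x.shape
    rng = np.random.default_rng(seed)
    std_normal = NormalDist()
    to_z = np.vectorize(std_normal.inv_cdf)  # uniform -> normal score
    to_u = np.vectorize(std_normal.cdf)      # normal score -> uniform

    # 1. Map each column to (0, 1) by rank, then to standard-normal scores.
    ranks = x.argsort(axis=0).argsort(axis=0) + 1
    z = to_z(ranks / (n + 1))

    # 2. The copula parameter is the correlation matrix of the normal scores.
    corr = np.atleast_2d(np.corrcoef(z, rowvar=False))

    # 3. Sample correlated normals and map them back to uniforms.
    z_new = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    u_new = to_u(z_new)

    # 4. Invert each empirical marginal with the column's quantile function.
    columns = [np.quantile(x[:, j], u_new[:, j]) for j in range(d)]
    return np.column_stack(columns)
```

Because the marginals are inverted through empirical quantiles, generated values stay within the observed range of each column; a parametric marginal would be needed to extrapolate beyond it.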

models.Copula.generate_future_data_copula(df, generated_df, future_timestamps, hyperparameters=None)

Function to generate future data using the copula method.

Parameters:

df : pandas DataFrame

DataFrame containing original data.

generated_df : pandas DataFrame

DataFrame containing generated data.

future_timestamps : list

List of future timestamps.

hyperparameters : dict, optional

Hyperparameters for copula fitting.

Returns:

pandas DataFrame

DataFrame containing the generated future data.

models.Copula.generate_synthetic_data_for_copula(df, n_samples=10000, hyperparameters=None)

Function to run comparison of original and generated data distributions using copula.

Parameters:

df : pandas DataFrame

DataFrame containing original data.

n_samples : int, optional (default=10000)

Number of samples to generate.

hyperparameters : dict, optional

Hyperparameters for copula fitting.

Returns:

pandas DataFrame

DataFrame containing the generated synthetic data.

models.Copula.impute_missing_data_copula(df, hyperparameters=None)

Function to impute missing values using copula-based generated data.

Parameters:

df : pandas DataFrame

DataFrame containing original data.

hyperparameters : dict, optional

Hyperparameters for copula fitting.

Returns:

pandas DataFrame

DataFrame containing the imputed data.

models.ITS module

This module contains functions to generate synthetic data using Inverse Transform Sampling (ITS).

Functions:
  • generate_data_inverse_transform: Function to generate synthetic data using Inverse Transform Sampling.

  • generate_synthetic_data_for_ITS: Function to generate synthetic data using ITS and check for acceptable error.

  • generate_future_data_ITS: Function to generate synthetic data for the future period using ITS.

  • impute_missing_data_ITS: Function to impute missing data using ITS.

Dependencies:
  • numpy

  • pandas

  • scipy

  • data_preprocess

  • streamlit

models.ITS.generate_data_inverse_transform(data, n_samples=10000)

Function to generate synthetic data using Inverse Transform Sampling.

Parameters:

data : numpy array

Original data to generate synthetic data from.

n_samples : int, optional

Number of samples to generate. Default is 10000.

Returns:

numpy array

Generated synthetic data.
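Inverse Transform Sampling draws uniform random numbers and pushes them through the inverse of the data's cumulative distribution function. The sketch below assumes an empirical CDF with linear interpolation between observed values; the module may build its inverse CDF differently. The name `inverse_transform_sample` and the `seed` argument are hypothetical.

```python
import numpy as np


def inverse_transform_sample(data, n_samples=10000, seed=0):
    """Draw samples from the empirical distribution of `data` (illustrative)."""
    values = np.asarray(data, dtype=float)
    values = np.sort(values[~np.isnan(values)])  # drop NaNs, then order

    # Empirical CDF level reached at each sorted observation.
    cdf = np.arange(1, values.size + 1) / values.size

    # Uniform draws on [0, 1) play the role of CDF levels to invert.
    rng = np.random.default_rng(seed)
    u = rng.uniform(0.0, 1.0, size=n_samples)

    # Inverse CDF: interpolate from CDF level back to data value.
    return np.interp(u, cdf, values)
```

Samples produced this way never fall outside the minimum and maximum of the original data.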

models.ITS.generate_future_data_ITS(df, generated_df, future_timestamps)

Function to generate synthetic data for the future period using ITS.

Parameters:

df : pandas DataFrame

Input DataFrame with original data.

generated_df : pandas DataFrame

DataFrame containing generated synthetic data.

future_timestamps : numpy array

Timestamps for the future period.

Returns:

pandas DataFrame

DataFrame containing the generated future data.

models.ITS.generate_synthetic_data_for_ITS(df, n_samples=10000)

Function to generate synthetic data using ITS and check for acceptable error.

Parameters:

df : pandas DataFrame

Input DataFrame with original data.

n_samples : int, optional

Number of samples to generate. Default is 10000.

Returns:

pandas DataFrame

DataFrame containing the generated synthetic data.

models.ITS.impute_missing_data_ITS(df)

Function to impute missing data using Inverse Transform Sampling.

Parameters:

df : pandas DataFrame

Input DataFrame with missing data.

Returns:

pandas DataFrame

DataFrame with imputed missing data.
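One plausible way to impute with ITS is to replace each missing entry with a draw from the empirical distribution of that column's observed values. The sketch below works under that assumption and is not taken from the module; `impute_by_sampling` and `seed` are hypothetical names.

```python
import numpy as np
import pandas as pd


def impute_by_sampling(df, seed=0):
    """Fill NaNs in numeric columns with draws from each column's
    empirical distribution (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        missing = out[col].isna()
        observed = out.loc[~missing, col].to_numpy()
        if missing.any() and observed.size > 0:
            # Uniform draws inverted through the empirical quantile function.
            u = rng.uniform(size=int(missing.sum()))
            out.loc[missing, col] = np.quantile(observed, u)
    return out
```

Observed values are left untouched; only the missing positions receive sampled values.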

models.Imputation module

This module is used to find the best imputation method for a given dataset. It contains functions to train the imputation models and to generate synthetic data using the trained models.

Imputation methods:
  • Forward Fill

  • Backward Fill

  • Linear Interpolation

  • KNN Imputer

  • MICE Imputer

  • Random Forest Imputer

  • Iterative Imputer

Functions:
  • generate_synthetic_data_for_imputation: Function to generate synthetic data using the specified imputation method.

  • impute_missing_data_imputation: Function to impute missing data using the specified imputation method.

Dependencies:
  • pandas

  • numpy

  • scikit-learn

models.Imputation.generate_synthetic_data_for_imputation(original_data, method='ffill')

Function to generate synthetic data using the specified imputation method.

Parameters:

original_data : pandas DataFrame

Input DataFrame with original data.

method : str, optional

Imputation method to use. Default is 'ffill'.

Returns:

pandas DataFrame

DataFrame containing the synthetic data generated using the specified imputation method.

models.Imputation.impute_missing_data_imputation(original_data, method='ffill')

Function to impute missing data using the specified imputation method.

Parameters:

original_data : pandas DataFrame

Input DataFrame with original data.

method : str, optional

Imputation method to use. Default is 'ffill'.

Returns:

pandas DataFrame

DataFrame containing the data imputed using the specified imputation method.
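The simplest of the listed methods map directly onto pandas calls, as the following sketch shows. Only the default `'ffill'` is confirmed by the signatures above; the strings `'bfill'` and `'linear'` and the function name `impute_simple` are assumptions made for illustration, and the model-based imputers (KNN, MICE, Random Forest, Iterative) are not covered.

```python
import numpy as np
import pandas as pd


def impute_simple(original_data, method="ffill"):
    """Dispatch to a basic pandas fill strategy (illustrative sketch)."""
    if method == "ffill":
        # Carry the last observed value forward.
        return original_data.ffill()
    if method == "bfill":
        # Carry the next observed value backward.
        return original_data.bfill()
    if method == "linear":
        # Interpolate linearly between neighbouring observations.
        return original_data.interpolate(method="linear")
    raise ValueError(f"Unsupported method: {method!r}")
```

Forward fill leaves leading gaps unfilled and backward fill leaves trailing gaps unfilled, so a caller may need to chain two strategies to remove every missing value.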

models.KDE module

This module contains functions to generate synthetic data using the Kernel Density Estimation (KDE) method.

Functions:
  • train_kde_model_with_hyperparameter_tuning: Function to train the KDE model on the data with hyperparameter tuning.

  • generate_synthetic_data_for_KDE: Function to generate synthetic data using the KDE model.

  • generate_future_data_KDE: Generate synthetic future data using KDE.

  • impute_missing_data_KDE: Impute missing data in the DataFrame using KDE.

Dependencies:
  • numpy

  • pandas

  • sklearn

  • streamlit

  • data_preprocess

models.KDE.generate_future_data_KDE(df, generated_df, future_timestamps, bandwidths=None)

Generate synthetic future data using KDE.

Parameters:

df : pandas DataFrame

Input DataFrame with original data.

generated_df : pandas DataFrame

DataFrame to store the generated synthetic data.

future_timestamps : pandas DatetimeIndex

Index containing future timestamps for data generation.

bandwidths : dict, optional

Dictionary of bandwidths for each column. If None, bandwidths will be determined using hyperparameter tuning.

Returns:

pandas DataFrame

DataFrame containing the generated future data.

models.KDE.generate_synthetic_data_for_KDE(df, n_samples=10000, bandwidths=None)

Function to generate synthetic data using the KDE model.

Parameters:

df : pandas DataFrame

Input DataFrame with original data.

n_samples : int, optional

Number of synthetic samples to generate for each column. Default is 10000.

bandwidths : dict, optional

Dictionary of bandwidths for each column. If None, bandwidths will be determined using hyperparameter tuning.

Returns:

pandas DataFrame

DataFrame containing the generated synthetic data.

dict

Dictionary containing the best bandwidths for each column.

models.KDE.impute_missing_data_KDE(df, bandwidths=None)

Impute missing data in the DataFrame using KDE.

Parameters:

df : pandas DataFrame

Input DataFrame with missing values.

bandwidths : dict, optional

Dictionary of bandwidths for each column. If None, bandwidths will be determined using hyperparameter tuning.

Returns:

pandas DataFrame

DataFrame with missing values imputed using KDE.

models.KDE.train_kde_model_with_hyperparameter_tuning(df, bandwidths=None)

Train the KDE model on the data with hyperparameter tuning.

Parameters:

df : pandas DataFrame

Input DataFrame with original data.

bandwidths : dict, optional

Dictionary of bandwidths for each column. If None, bandwidths will be determined using hyperparameter tuning.

Returns:

dict

Dictionary containing the trained KDE models for each column.

dict

Dictionary containing the best bandwidths for each column.
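Bandwidth tuning for a single column can be illustrated as below. The module lists sklearn as a dependency, but how it tunes the bandwidth is not documented here; this sketch swaps in a NumPy-only Gaussian KDE and picks the bandwidth by leave-one-out log-likelihood over a small candidate grid. The function names, the candidate grid, and the `seed` argument are all hypothetical.

```python
import numpy as np


def loo_log_likelihood(x, h):
    """Leave-one-out log-likelihood of a Gaussian KDE with bandwidth h."""
    diff = (x[:, None] - x[None, :]) / h
    kernel = np.exp(-0.5 * diff**2) / (h * np.sqrt(2.0 * np.pi))
    np.fill_diagonal(kernel, 0.0)  # exclude each point from its own estimate
    density = kernel.sum(axis=1) / (x.size - 1)
    return float(np.log(density + 1e-300).sum())  # guard against log(0)


def fit_and_sample_kde(x, candidates=(0.05, 0.1, 0.2, 0.5, 1.0),
                       n_samples=10000, seed=0):
    """Select a bandwidth from `candidates`, then sample from the KDE."""
    x = np.asarray(x, dtype=float)
    best_h = max(candidates, key=lambda h: loo_log_likelihood(x, h))

    # Sampling a Gaussian KDE: pick an observation at random,
    # then add kernel noise with standard deviation equal to the bandwidth.
    rng = np.random.default_rng(seed)
    centers = rng.choice(x, size=n_samples, replace=True)
    samples = centers + rng.normal(0.0, best_h, size=n_samples)
    return samples, best_h
```

Applying such a routine column by column and collecting `best_h` per column would yield a dictionary shaped like the `bandwidths` argument described above.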

models.MonteCarlo module

Monte Carlo Simulation

Monte Carlo Simulation is a method used to generate synthetic data. It is based on the principle of random sampling and is used to estimate the distribution of a variable by generating a large number of random samples.

The Monte Carlo Simulation model generates synthetic data by sampling from a normal distribution with a given mean and standard deviation.

Functions:
  • tune_parameters: Function to tune the parameters of the Monte Carlo Simulation model.

  • generate_data_monte_carlo: Function to generate synthetic data using the Monte Carlo Simulation model.

  • generate_synthetic_data_for_MCS: Function to generate synthetic data for all columns in the DataFrame using the Monte Carlo Simulation model.

  • generate_future_data_MCS: Function to generate future synthetic data using the Monte Carlo Simulation model.

  • impute_missing_data_MC: Function to impute missing data using the Monte Carlo Simulation model.

Dependencies:
  • numpy

  • pandas

  • scipy

  • skopt

  • streamlit

  • data_preprocess

Markov Chain Monte Carlo

Markov Chain Monte Carlo (MCMC) is a method used to generate synthetic data by sampling from a probability distribution. It is based on the Markov chain principle, where the next state of the chain depends only on the current state.

The Markov Chain Monte Carlo model generates synthetic data by sampling from a normal distribution with a given mean and standard deviation.

Functions:
  • generate_data_mcmc: Function to generate synthetic data using the Markov Chain Monte Carlo (MCMC) method.

  • generate_synthetic_data_for_MCMC: Function to generate synthetic data for all columns in the DataFrame using the Markov Chain Monte Carlo (MCMC) method.

  • generate_future_data_MCMC: Function to generate future synthetic data using the Markov Chain Monte Carlo (MCMC) method.

  • impute_missing_data_MCMC: Function to impute missing data using the Markov Chain Monte Carlo (MCMC) method.

Dependencies:
  • numpy

  • pandas

  • streamlit

models.MonteCarlo.generate_data_mcmc(initial_state, proposal_std, n_samples=10000, burn_in=1000)

Function to generate synthetic data using the Markov Chain Monte Carlo (MCMC) method.

Parameters:

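initial_state : float

Initial state of the Markov chain.

proposal_std : float

Standard deviation of the proposal distribution.

n_samples : int, optional (default=10000)

Number of samples to generate.

burn_in : int, optional (default=1000)

Number of initial samples to discard.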

Returns:

numpy array

Array containing the generated synthetic data.
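A random-walk Metropolis sampler matching the documented parameters might look like the sketch below. The documented signature does not show how the target distribution is supplied, so `target_mean`, `target_std`, and `seed` are hypothetical arguments, and the normal target follows the module description above.

```python
import numpy as np


def metropolis_normal(initial_state, proposal_std, n_samples=10000,
                      burn_in=1000, target_mean=0.0, target_std=1.0, seed=0):
    """Random-walk Metropolis sampler for a normal target (illustrative)."""
    rng = np.random.default_rng(seed)

    def log_target(x):
        # Log-density of the normal target, up to an additive constant.
        return -0.5 * ((x - target_mean) / target_std) ** 2

    state = float(initial_state)
    samples = np.empty(n_samples)
    for i in range(burn_in + n_samples):
        # Propose a move from a symmetric normal around the current state.
        proposal = state + rng.normal(0.0, proposal_std)
        # Accept with probability min(1, target(proposal) / target(state)).
        if np.log(rng.uniform()) < log_target(proposal) - log_target(state):
            state = proposal
        # Keep the state only after the burn-in period has passed.
        if i >= burn_in:
            samples[i - burn_in] = state
    return samples
```

Successive samples are correlated, so the chain needs more draws than independent sampling to reach the same accuracy; `proposal_std` controls the trade-off between acceptance rate and step size.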

models.MonteCarlo.generate_data_monte_carlo(mean, std, n_samples=10000)

Function to generate synthetic data using the Monte Carlo Simulation model.

Parameters:

mean : float

Mean of the data distribution.

std : float

Standard deviation of the data distribution.

n_samples : int, optional (default=10000)

Number of samples to generate.

Returns:

numpy array

Array containing the generated synthetic data.
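Since the module description states that the Monte Carlo model samples from a normal distribution with a given mean and standard deviation, the core step reduces to a single NumPy call. The sketch below assumes that reading; `monte_carlo_normal` and `seed` are hypothetical names.

```python
import numpy as np


def monte_carlo_normal(mean, std, n_samples=10000, seed=0):
    """Draw independent samples from N(mean, std**2) (illustrative)."""
    rng = np.random.default_rng(seed)  # seeded for reproducibility
    return rng.normal(loc=mean, scale=std, size=n_samples)
```

With a fixed seed the output is reproducible, which is convenient when comparing parameter choices against the same original data.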

models.MonteCarlo.generate_future_data_MCMC(df, generated_df, future_timestamps)

Function to generate future synthetic data using the Markov Chain Monte Carlo (MCMC) method.

Parameters:

df : pandas DataFrame

DataFrame containing the original data.

generated_df : pandas DataFrame

DataFrame containing the generated synthetic data.

future_timestamps : numpy array

Array containing the future timestamps.

Returns:

pandas DataFrame

DataFrame containing the generated future synthetic data.

models.MonteCarlo.generate_future_data_MCS(df, generated_df, future_timestamps)

Function to generate future synthetic data using the Monte Carlo Simulation model.

Parameters:

df : pandas DataFrame

DataFrame containing the original data.

generated_df : pandas DataFrame

DataFrame containing the generated synthetic data.

future_timestamps : numpy array

Array containing the future timestamps.

Returns:

pandas DataFrame

DataFrame containing the generated future synthetic data.

models.MonteCarlo.generate_synthetic_data_for_MCMC(df, n_samples=10000)

Function to generate synthetic data for all columns in the DataFrame using the Markov Chain Monte Carlo (MCMC) method.

Parameters:

df : pandas DataFrame

DataFrame containing the original data.

n_samples : int, optional (default=10000)

Number of samples to generate.

Returns:

pandas DataFrame

DataFrame containing the generated synthetic data.

models.MonteCarlo.generate_synthetic_data_for_MCS(df, n_samples=10000)

Function to generate synthetic data for all columns in the DataFrame using the Monte Carlo Simulation model.

Parameters:

df : pandas DataFrame

DataFrame containing the original data.

n_samples : int, optional (default=10000)

Number of samples to generate.

Returns:

pandas DataFrame

DataFrame containing the generated synthetic data.

models.MonteCarlo.impute_missing_data_MC(df)

Function to impute missing data using the Monte Carlo Simulation model.

Parameters:

df : pandas DataFrame

DataFrame containing the original data.

Returns:

pandas DataFrame

DataFrame containing the imputed synthetic data.

models.MonteCarlo.impute_missing_data_MCMC(df)

Function to impute missing data using the Markov Chain Monte Carlo (MCMC) method.

Parameters:

df : pandas DataFrame

DataFrame containing the original data.

Returns:

pandas DataFrame

DataFrame containing the imputed synthetic data.

models.MonteCarlo.objective(params, original_data)

Objective function for hyperparameter tuning of the Monte Carlo Simulation model.

Parameters:

params : tuple

Tuple containing the mean and standard deviation of the data distribution.

original_data : numpy array

Array containing the original data.

Returns:

float

Kolmogorov-Smirnov statistic between the original and generated data distributions.
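The two-sample Kolmogorov-Smirnov statistic is the largest vertical gap between the empirical CDFs of the two samples. The sketch below computes it with NumPy alone and wraps it in an objective of the documented shape; SciPy's `scipy.stats.ks_2samp` reports the same statistic. The names `ks_statistic` and `ks_objective`, and the `n_samples` and `seed` arguments, are hypothetical.

```python
import numpy as np


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max gap between the empirical CDFs."""
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    # The gap can only change at observed points, so evaluate there.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


def ks_objective(params, original_data, n_samples=10000, seed=0):
    """Score a (mean, std) pair by how closely normal draws match the data."""
    mean, std = params
    rng = np.random.default_rng(seed)
    generated = rng.normal(mean, std, size=n_samples)
    return ks_statistic(original_data, generated)
```

A lower value means the generated distribution is closer to the original, so a tuner would minimise this objective over candidate `(mean, std)` pairs.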

models.MonteCarlo.tune_parameters(original_data)

Function to tune the parameters of the Monte Carlo Simulation model.

Parameters:

original_data : numpy array

Array containing the original data.

Returns:

tuple

Tuple containing the optimal mean and standard deviation for the Monte Carlo Simulation model.

Module contents