models package
Submodules
models.Copula module
This file contains functions to generate synthetic data using copula-based methods.
- Functions:
generate_data_copula: Function to fit a Gaussian copula to original data and generate new samples.
generate_synthetic_data_for_copula: Function to run comparison of original and generated data distributions using copula.
generate_future_data_copula: Function to generate future data using copula for imputation.
impute_missing_data_copula: Function to impute missing values using copula-based generated data.
- Dependencies:
numpy
pandas
GaussianMultivariate (copulas)
data_preprocess
data_desc
- Usage:
To generate synthetic data using copula-based methods, use the functions in this file.
- models.Copula.generate_data_copula(original_data, n_samples=10000, hyperparameters=None)
Function to fit a Gaussian copula to original data and generate new samples.
Parameters:
- original_data : numpy array
Original data to fit the copula.
- n_samples : int, optional (default=10000)
Number of samples to generate.
- hyperparameters : dict, optional
Hyperparameters for copula fitting.
Returns:
- numpy array
Generated synthetic data.
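The fit-then-sample idea behind a Gaussian copula can be sketched in a few lines. This is a minimal two-column illustration using only the standard library, not the module's implementation (which depends on GaussianMultivariate from copulas). The function names `to_normal_scores`, `correlation`, `empirical_quantile`, and `sample_gaussian_copula_2d` are hypothetical, and the marginals are assumed to be empirical.

```python
import random
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()  # standard normal used for the copula transform


def to_normal_scores(column):
    """Map a column to normal scores through its empirical ranks."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        # rank / (n + 1) keeps the pseudo-uniform strictly inside (0, 1)
        scores[idx] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def correlation(a, b):
    """Pearson correlation of two equally long sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def empirical_quantile(sorted_column, u):
    """Invert the empirical CDF: value at probability u."""
    idx = min(int(u * len(sorted_column)), len(sorted_column) - 1)
    return sorted_column[idx]


def sample_gaussian_copula_2d(x, y, n_samples, seed=0):
    """Fit a 2-D Gaussian copula to (x, y) and draw n_samples pairs."""
    rng = random.Random(seed)
    rho = correlation(to_normal_scores(x), to_normal_scores(y))
    residual_scale = max(0.0, 1.0 - rho ** 2) ** 0.5
    xs, ys = sorted(x), sorted(y)
    samples = []
    for _ in range(n_samples):
        z1 = rng.gauss(0.0, 1.0)
        # Second normal correlated with the first (2x2 Cholesky by hand)
        z2 = rho * z1 + residual_scale * rng.gauss(0.0, 1.0)
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)
        samples.append((empirical_quantile(xs, u1), empirical_quantile(ys, u2)))
    return samples
```

The dependence structure is carried entirely by `rho`, while each marginal is reproduced by its own empirical quantile function, which is why the two can be fitted separately.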
- models.Copula.generate_future_data_copula(df, generated_df, future_timestamps, hyperparameters=None)
Function to generate future data using the copula method.
Parameters:
- df : pandas DataFrame
DataFrame containing original data.
- generated_df : pandas DataFrame
DataFrame containing generated data.
- future_timestamps : list
List of future timestamps.
- hyperparameters : dict, optional
Hyperparameters for copula fitting.
Returns:
- pandas DataFrame
DataFrame containing the generated future data.
- models.Copula.generate_synthetic_data_for_copula(df, n_samples=10000, hyperparameters=None)
Function to run comparison of original and generated data distributions using copula.
Parameters:
- df : pandas DataFrame
DataFrame containing original data.
- n_samples : int, optional (default=10000)
Number of samples to generate.
- hyperparameters : dict, optional
Hyperparameters for copula fitting.
Returns:
- pandas DataFrame
DataFrame containing the generated synthetic data.
- models.Copula.impute_missing_data_copula(df, hyperparameters=None)
Function to impute missing values using copula-based generated data.
Parameters:
- df : pandas DataFrame
DataFrame containing original data.
- hyperparameters : dict, optional
Hyperparameters for copula fitting.
Returns:
- pandas DataFrame
DataFrame containing the imputed data.
models.ITS module
This module contains functions to generate synthetic data using Inverse Transform Sampling (ITS).
- Functions:
generate_data_inverse_transform: Function to generate synthetic data using Inverse Transform Sampling.
generate_synthetic_data_for_ITS: Function to generate synthetic data using ITS and check for acceptable error.
generate_future_data_ITS: Function to generate synthetic data for the future period using ITS.
impute_missing_data_ITS: Function to impute missing data using ITS.
- Dependencies:
numpy
pandas
scipy
data_preprocess
streamlit
- models.ITS.generate_data_inverse_transform(data, n_samples=10000)
Function to generate synthetic data using Inverse Transform Sampling.
Parameters:
- data : numpy array
Original data to generate synthetic data from.
- n_samples : int, optional
Number of samples to generate. Default is 10000.
Returns:
- numpy array
Generated synthetic data.
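Inverse Transform Sampling draws a uniform number and maps it through the inverse of the data's cumulative distribution function. The sketch below assumes a piecewise-linear empirical CDF built from the sorted observations. The name `inverse_transform_sample` and the `seed` parameter are hypothetical, and the module's own construction of the CDF may differ.

```python
import bisect
import random


def inverse_transform_sample(data, n_samples=10000, seed=0):
    """Draw samples by inverting a piecewise-linear empirical CDF of data."""
    if len(data) < 2:
        raise ValueError("need at least two observations")
    rng = random.Random(seed)
    xs = sorted(data)
    n = len(xs)
    # CDF value attached to each sorted observation: 0 at the min, 1 at the max
    cdf = [i / (n - 1) for i in range(n)]
    samples = []
    for _ in range(n_samples):
        u = rng.random()                  # uniform draw in [0, 1)
        hi = bisect.bisect_right(cdf, u)  # first CDF knot strictly above u
        lo = hi - 1
        # Linear interpolation between the two neighbouring knots
        frac = (u - cdf[lo]) / (cdf[hi] - cdf[lo])
        samples.append(xs[lo] + frac * (xs[hi] - xs[lo]))
    return samples
```

Because samples are interpolated between observed values, they never fall outside the range of the original data.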
- models.ITS.generate_future_data_ITS(df, generated_df, future_timestamps)
Function to generate synthetic data for the future period using ITS.
Parameters:
- df : pandas DataFrame
Input DataFrame with original data.
- generated_df : pandas DataFrame
DataFrame containing generated synthetic data.
- future_timestamps : numpy array
Timestamps for the future period.
Returns:
- pandas DataFrame
DataFrame containing the generated future data.
- models.ITS.generate_synthetic_data_for_ITS(df, n_samples=10000)
Function to generate synthetic data using ITS and check for acceptable error.
Parameters:
- df : pandas DataFrame
Input DataFrame with original data.
- n_samples : int, optional
Number of samples to generate. Default is 10000.
Returns:
- pandas DataFrame
DataFrame containing the generated synthetic data.
models.Imputation module
This module is used to find the best imputation method for a given dataset. It contains functions to train the imputation models and to generate synthetic data using the trained models.
- Imputation methods:
Forward Fill
Backward Fill
Linear Interpolation
KNN Imputer
MICE Imputer
Random Forest Imputer
Iterative Imputer
- Functions:
generate_synthetic_data_for_imputation: Function to generate synthetic data using the specified imputation method.
impute_missing_data_imputation: Function to impute missing data using the specified imputation method.
- Dependencies:
pandas
numpy
scikit-learn
- models.Imputation.generate_synthetic_data_for_imputation(original_data, method='ffill')
Function to generate synthetic data using the specified imputation method.
Parameters:
- original_data : pandas DataFrame
Input DataFrame with original data.
- method : str, optional
Imputation method to use. Default is 'ffill'.
Returns:
- pandas DataFrame
DataFrame containing the synthetic data generated using the specified imputation method.
- models.Imputation.impute_missing_data_imputation(original_data, method='ffill')
Function to impute missing data using the specified imputation method.
Parameters:
- original_data : pandas DataFrame
Input DataFrame with original data.
- method : str, optional
Imputation method to use. Default is 'ffill'.
Returns:
- pandas DataFrame
DataFrame containing the data imputed using the specified imputation method.
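To make the three simplest methods concrete, here is a dependency-free sketch of forward fill, backward fill, and linear interpolation on a single column, with `None` standing in for a missing value. The function names are hypothetical. The module itself works on pandas DataFrames, where the corresponding calls would be `DataFrame.ffill()`, `DataFrame.bfill()`, and `DataFrame.interpolate(method="linear")`.

```python
def forward_fill(values):
    """Replace each missing entry with the last observed value before it."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def backward_fill(values):
    """Replace each missing entry with the next observed value after it."""
    return forward_fill(values[::-1])[::-1]


def linear_interpolate(values):
    """Fill gaps between two observed values along a straight line."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled
```

Note that each method leaves some edge gaps untouched: forward fill cannot fill a leading gap, backward fill cannot fill a trailing one, and this interpolation sketch fills only interior gaps.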
models.KDE module
This module contains functions to generate synthetic data using Kernel Density Estimation (KDE) method.
- Functions:
train_kde_model_with_hyperparameter_tuning: Function to train the KDE model on the data with hyperparameter tuning.
generate_synthetic_data_for_KDE: Function to generate synthetic data using the KDE model.
generate_future_data_KDE: Generate synthetic future data using KDE.
impute_missing_data_KDE: Impute missing data in the DataFrame using KDE.
- Dependencies:
numpy
pandas
sklearn
streamlit
data_preprocess
- models.KDE.generate_future_data_KDE(df, generated_df, future_timestamps, bandwidths=None)
Generate synthetic future data using KDE.
Parameters:
- df : pandas DataFrame
Input DataFrame with original data.
- generated_df : pandas DataFrame
DataFrame to store the generated synthetic data.
- future_timestamps : pandas DatetimeIndex
Index containing future timestamps for data generation.
- bandwidths : dict, optional
Dictionary of bandwidths for each column. If None, bandwidths will be determined using hyperparameter tuning.
Returns:
- pandas DataFrame
DataFrame containing the generated future data.
- models.KDE.generate_synthetic_data_for_KDE(df, n_samples=10000, bandwidths=None)
Function to generate synthetic data using the KDE model.
Parameters:
- df : pandas DataFrame
Input DataFrame with original data.
- n_samples : int, optional
Number of synthetic samples to generate for each column. Default is 10000.
- bandwidths : dict, optional
Dictionary of bandwidths for each column. If None, bandwidths will be determined using hyperparameter tuning.
Returns:
- pandas DataFrame
DataFrame containing the generated synthetic data.
- dict
Dictionary containing the best bandwidths for each column.
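Sampling from a Gaussian KDE has a simple interpretation: pick one of the observed points at random and add normal noise whose standard deviation is the bandwidth. The sketch below shows this for a single column, assuming a Gaussian kernel. `sample_gaussian_kde` and its `seed` parameter are hypothetical names, not part of the module.

```python
import random


def sample_gaussian_kde(data, bandwidth, n_samples=10000, seed=0):
    """Draw samples from a Gaussian KDE fitted to a single column."""
    rng = random.Random(seed)
    # Each draw: choose a kernel centre uniformly, then jitter it by N(0, bandwidth)
    return [rng.choice(data) + rng.gauss(0.0, bandwidth) for _ in range(n_samples)]
```

A small bandwidth keeps samples close to the observed values, while a large one smooths the distribution and can hide structure such as multiple modes.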
- models.KDE.impute_missing_data_KDE(df, bandwidths=None)
Impute missing data in the DataFrame using KDE.
Parameters:
- df : pandas DataFrame
Input DataFrame with missing values.
- bandwidths : dict, optional
Dictionary of bandwidths for each column. If None, bandwidths will be determined using hyperparameter tuning.
Returns:
- pandas DataFrame
DataFrame with missing values imputed using KDE.
- models.KDE.train_kde_model_with_hyperparameter_tuning(df, bandwidths=None)
Train the KDE model on the data with hyperparameter tuning.
Parameters:
- df : pandas DataFrame
Input DataFrame with original data.
- bandwidths : dict, optional
Dictionary of bandwidths for each column. If None, bandwidths will be determined using hyperparameter tuning.
Returns:
- dict
Dictionary containing the trained KDE models for each column.
- dict
Dictionary containing the best bandwidths for each column.
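One common way to tune a KDE bandwidth is to score each candidate by its leave-one-out log-likelihood and keep the best. The sketch below illustrates that idea for one column with a Gaussian kernel. It is an assumption about what "hyperparameter tuning" could look like, not a description of the module's actual search, and `loo_log_likelihood` and `best_bandwidth` are hypothetical names.

```python
import math
from statistics import NormalDist


def loo_log_likelihood(data, bandwidth):
    """Leave-one-out log-likelihood of a Gaussian KDE with the given bandwidth."""
    kernel = NormalDist(0.0, bandwidth)
    n = len(data)
    total = 0.0
    for i, x in enumerate(data):
        # Density at x estimated from every point except x itself
        density = sum(kernel.pdf(x - y) for j, y in enumerate(data) if j != i) / (n - 1)
        total += math.log(max(density, 1e-300))  # guard against log(0)
    return total


def best_bandwidth(data, candidates):
    """Return the candidate bandwidth with the highest leave-one-out score."""
    return max(candidates, key=lambda h: loo_log_likelihood(data, h))
```

Leaving each point out of its own density estimate is what penalizes very small bandwidths; without it, the likelihood would always favor the narrowest kernel.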
models.MonteCarlo module
Monte Carlo Simulation
Monte Carlo Simulation is a method used to generate synthetic data. It is based on the principle of random sampling and is used to estimate the distribution of a variable by generating a large number of random samples.
The Monte Carlo Simulation model generates synthetic data by sampling from a normal distribution with a given mean and standard deviation.
- Functions:
tune_parameters: Function to tune the parameters of the Monte Carlo Simulation model.
generate_data_monte_carlo: Function to generate synthetic data using the Monte Carlo Simulation model.
generate_synthetic_data_for_MCS: Function to generate synthetic data for all columns in the DataFrame using the Monte Carlo Simulation model.
generate_future_data_MCS: Function to generate future synthetic data using the Monte Carlo Simulation model.
impute_missing_data_MC: Function to impute missing data using the Monte Carlo Simulation model.
- Dependencies:
numpy
pandas
scipy
skopt
streamlit
data_preprocess
Markov Chain Monte Carlo
Markov Chain Monte Carlo (MCMC) is a method used to generate synthetic data by sampling from a probability distribution. It is based on the Markov chain principle, where the next state of the chain depends only on the current state.
The Markov Chain Monte Carlo model generates synthetic data by sampling from a normal distribution with a given mean and standard deviation.
- Functions:
generate_data_mcmc: Function to generate synthetic data using the Markov Chain Monte Carlo (MCMC) method.
generate_synthetic_data_for_MCMC: Function to generate synthetic data for all columns in the DataFrame using the Markov Chain Monte Carlo (MCMC) method.
generate_future_data_MCMC: Function to generate future synthetic data using the Markov Chain Monte Carlo (MCMC) method.
impute_missing_data_MCMC: Function to impute missing data using the Markov Chain Monte Carlo (MCMC) method.
- Dependencies:
numpy
pandas
streamlit
- models.MonteCarlo.generate_data_mcmc(initial_state, proposal_std, n_samples=10000, burn_in=1000)
Function to generate synthetic data using the Markov Chain Monte Carlo (MCMC) method.
Parameters:
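- initial_state : float
Initial state of the Markov chain.
- proposal_std : float
Standard deviation of the proposal distribution.
- n_samples : int, optional (default=10000)
Number of samples to generate.
- burn_in : int, optional (default=1000)
Number of initial samples to discard as burn-in.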
Returns:
- numpy array
Array containing the generated synthetic data.
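The parameters above match a random-walk Metropolis-Hastings sampler, which can be sketched as follows. This is an illustration under assumptions: the target density is passed in explicitly as `log_target`, whereas the documented signature does not show how the module defines its target. `metropolis_hastings`, `normal_log_density`, and `seed` are hypothetical names.

```python
import math
import random


def normal_log_density(mean, std):
    """Unnormalised log-density of a normal distribution (example target)."""
    return lambda x: -0.5 * ((x - mean) / std) ** 2


def metropolis_hastings(log_target, initial_state, proposal_std,
                        n_samples=10000, burn_in=1000, seed=0):
    """Random-walk Metropolis-Hastings; returns n_samples post-burn-in states."""
    rng = random.Random(seed)
    state = initial_state
    log_p = log_target(state)
    samples = []
    for step in range(n_samples + burn_in):
        # Propose a move centred on the current state
        proposal = state + rng.gauss(0.0, proposal_std)
        log_p_new = log_target(proposal)
        # Accept with probability min(1, p(proposal) / p(state))
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            state, log_p = proposal, log_p_new
        # States visited during burn-in are discarded
        if step >= burn_in:
            samples.append(state)
    return samples
```

The next state depends only on the current one, and discarding the burn-in reduces the influence of `initial_state` on the returned samples.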
- models.MonteCarlo.generate_data_monte_carlo(mean, std, n_samples=10000)
Function to generate synthetic data using the Monte Carlo Simulation model.
Parameters:
- mean : float
Mean of the data distribution.
- std : float
Standard deviation of the data distribution.
- n_samples : int, optional (default=10000)
Number of samples to generate.
Returns:
- numpy array
Array containing the generated synthetic data.
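As described for this module, the Monte Carlo Simulation model samples from a normal distribution with a given mean and standard deviation. A minimal standard-library sketch of that step is shown below. `monte_carlo_normal` and `seed` are hypothetical names, and with NumPy the same draw would typically be `numpy.random.default_rng().normal(mean, std, n_samples)`.

```python
import random


def monte_carlo_normal(mean, std, n_samples=10000, seed=0):
    """Draw n_samples independent values from a normal(mean, std) distribution."""
    rng = random.Random(seed)
    return [rng.gauss(mean, std) for _ in range(n_samples)]
```

Unlike the MCMC sampler, these draws are independent of one another, so no burn-in is needed.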
- models.MonteCarlo.generate_future_data_MCMC(df, generated_df, future_timestamps)
Function to generate future synthetic data using the Markov Chain Monte Carlo (MCMC) method.
Parameters:
- df : pandas DataFrame
DataFrame containing the original data.
- generated_df : pandas DataFrame
DataFrame containing the generated synthetic data.
- future_timestamps : numpy array
Array containing the future timestamps.
Returns:
- pandas DataFrame
DataFrame containing the generated future synthetic data.
- models.MonteCarlo.generate_future_data_MCS(df, generated_df, future_timestamps)
Function to generate future synthetic data using the Monte Carlo Simulation model.
Parameters:
- df : pandas DataFrame
DataFrame containing the original data.
- generated_df : pandas DataFrame
DataFrame containing the generated synthetic data.
- future_timestamps : numpy array
Array containing the future timestamps.
Returns:
- pandas DataFrame
DataFrame containing the generated future synthetic data.
- models.MonteCarlo.generate_synthetic_data_for_MCMC(df, n_samples=10000)
Function to generate synthetic data for all columns in the DataFrame using the Markov Chain Monte Carlo (MCMC) method.
Parameters:
- df : pandas DataFrame
DataFrame containing the original data.
- n_samples : int, optional (default=10000)
Number of samples to generate.
Returns:
- pandas DataFrame
DataFrame containing the generated synthetic data.
- models.MonteCarlo.generate_synthetic_data_for_MCS(df, n_samples=10000)
Function to generate synthetic data for all columns in the DataFrame using the Monte Carlo Simulation model.
Parameters:
- df : pandas DataFrame
DataFrame containing the original data.
- n_samples : int, optional (default=10000)
Number of samples to generate.
Returns:
- pandas DataFrame
DataFrame containing the generated synthetic data.
- models.MonteCarlo.impute_missing_data_MC(df)
Function to impute missing data using the Monte Carlo Simulation model.
Parameters:
- df : pandas DataFrame
DataFrame containing the original data.
Returns:
- pandas DataFrame
DataFrame containing the imputed synthetic data.
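How the module fills the gaps is not spelled out here. One plausible reading, sketched below for a single column, is to fit a normal distribution to the observed values and replace each missing entry with a random draw from it. `impute_with_normal_draws` and `seed` are hypothetical names, and `None` stands in for a missing value.

```python
import random
from statistics import mean, stdev


def impute_with_normal_draws(values, seed=0):
    """Replace missing entries with draws from a normal fitted to the observed ones."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    mu, sigma = mean(observed), stdev(observed)
    # Observed entries are kept unchanged; only the gaps receive simulated values
    return [v if v is not None else rng.gauss(mu, sigma) for v in values]
```

Drawing random values preserves the column's spread, whereas filling every gap with the mean would shrink its variance.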
- models.MonteCarlo.impute_missing_data_MCMC(df)
Function to impute missing data using the Markov Chain Monte Carlo (MCMC) method.
Parameters:
- df : pandas DataFrame
DataFrame containing the original data.
Returns:
- pandas DataFrame
DataFrame containing the imputed synthetic data.
- models.MonteCarlo.objective(params, original_data)
Objective function for hyperparameter tuning of the Monte Carlo Simulation model.
Parameters:
- params : tuple
Tuple containing the mean and standard deviation of the data distribution.
- original_data : numpy array
Array containing the original data.
Returns:
- float
Kolmogorov-Smirnov statistic between the original and generated data distributions.
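The two-sample Kolmogorov-Smirnov statistic is the largest vertical gap between the empirical CDFs of the two samples. A small standard-library sketch follows, with `ks_statistic` as a hypothetical name. In practice `scipy.stats.ks_2samp` computes the same statistic along with a p-value.

```python
import bisect


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The supremum is attained at one of the observed points
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

A value of 0 means the two empirical distributions coincide and 1 means they do not overlap at all, so a tuner minimizes this quantity.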
- models.MonteCarlo.tune_parameters(original_data)
Function to tune the parameters of the Monte Carlo Simulation model.
Parameters:
- original_data : numpy array
Array containing the original data.
Returns:
- tuple
Tuple containing the optimal mean and standard deviation for the Monte Carlo Simulation model.
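The tuning step can be illustrated as a search for the (mean, std) pair whose normal distribution is closest to the data in Kolmogorov-Smirnov distance. The sketch below swaps in a plain grid search and a one-sample KS distance against the normal CDF, which keeps it deterministic. The module lists skopt as a dependency, so its actual optimizer is likely different. `ks_against_normal`, `tune_normal_parameters`, and `grid_size` are hypothetical names.

```python
from statistics import NormalDist, mean, stdev


def ks_against_normal(data, mu, sigma):
    """One-sample KS distance between data and a normal(mu, sigma) CDF."""
    xs = sorted(data)
    n = len(xs)
    dist = NormalDist(mu, sigma)
    d = 0.0
    for i, x in enumerate(xs):
        f = dist.cdf(x)
        # Compare the model CDF with the empirical CDF just before and at x
        d = max(d, abs(f - i / n), abs((i + 1) / n - f))
    return d


def tune_normal_parameters(data, grid_size=21):
    """Grid-search (mean, std) around the sample estimates, minimising KS distance."""
    m, s = mean(data), stdev(data)
    # Candidate means span m +/- 0.5 s; candidate stds span 0.5 s to 1.5 s
    mus = [m + s * (k / (grid_size - 1) - 0.5) for k in range(grid_size)]
    sigmas = [s * (0.5 + k / (grid_size - 1)) for k in range(grid_size)]
    return min(((mu, sigma) for mu in mus for sigma in sigmas),
               key=lambda p: ks_against_normal(data, *p))
```

With the default `grid_size` of 21 the grid includes the sample mean and standard deviation themselves, so the tuned pair is never worse than those plain estimates under this objective.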