utilities package

Submodules

utilities.data_desc module

This module contains utility functions for data description and analysis.

Functions:
  • infer_frequency: Infers the frequency of timestamps in a dataset using the mode of time differences.

  • identify_timestamp_column: Identifies the first column in the DataFrame that is likely a timestamp.

  • determine_null_values: Determines the null/missing values in the DataFrame.

  • display_100_percent_null_columns: Identifies columns that are 100% null in the DataFrame and returns their names.

  • convert_timestamp_columns: Converts all columns that are likely to be timestamps to datetime format.

  • find_outliers: Finds outliers in numerical columns of the DataFrame using IQR and Z-score methods.

  • color_null_red: Colors null values red and leaves others unchanged.

  • highlight_null_columns: Highlights the entire column if it’s completely null.

  • find_missing_timestamps: Identifies missing timestamps in a DataFrame, optionally within groups.

  • find_duplicate_values: Identifies duplicate values in the DataFrame and returns the count of duplicates for each row.

  • convert_string_to_number: Converts string values in the DataFrame to numeric values where possible.

  • convert_df: Converts the DataFrame to a CSV file and returns the CSV file as bytes.

  • highlight_new_values: Highlights new values in the DataFrame compared to the original DataFrame.

Dependencies:
  • pandas

  • numpy

  • mode from scipy.stats

  • zscore from scipy.stats

  • re

  • streamlit

  • warnings

utilities.data_desc.color_null_red(value)

Colors null values red and leaves others unchanged.

Parameters:

value : any

Value to be checked for null.

Returns:

str

CSS style string to color null values red.
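
Example (a minimal sketch; the sample frame is hypothetical). The returned CSS string is intended to be applied cell-wise through pandas’ Styler:

>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1.0, None, 3.0]})
>>> styled = df.style.applymap(color_null_red)  # nulls rendered in red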

utilities.data_desc.convert_df(df)

Converts the DataFrame to a CSV file and returns the CSV file as bytes to be downloaded.

Parameters:

df : pandas DataFrame

Input DataFrame to be converted to CSV.

Returns:

bytes

CSV file as bytes.
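
Example. Since the module targets Streamlit, the returned bytes are typically fed to a download widget; a sketch (label and file name are illustrative):

>>> import streamlit as st
>>> csv_bytes = convert_df(df)
>>> st.download_button("Download CSV", data=csv_bytes,
...                    file_name="data.csv", mime="text/csv")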

utilities.data_desc.convert_string_to_number(df)

Converts string values in the DataFrame to numeric values where possible.

Parameters:

df : pandas DataFrame or Series

Input DataFrame or Series containing string values to be converted.

Returns:

pandas DataFrame or Series

DataFrame or Series with string values converted to numeric values where possible.
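
Example (the frame is hypothetical; exact handling of non-numeric strings is up to the implementation):

>>> import pandas as pd
>>> df = pd.DataFrame({"x": ["1", "2.5", "abc"]})
>>> converted = convert_string_to_number(df)  # "1" and "2.5" become numeric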

utilities.data_desc.convert_timestamp_columns(df)

Converts all columns that are likely to be timestamps to datetime format.

Parameters:

df : pandas DataFrame

Input DataFrame.

Returns:

pandas DataFrame

DataFrame with columns that are likely timestamps converted to datetime format.

utilities.data_desc.determine_null_values(df)

Determines the null/missing values in the DataFrame.

Parameters:

df : pandas DataFrame

Input DataFrame.

Returns:

pandas DataFrame

DataFrame containing the count and percentage of missing values for each column and the total percentage of missing values.
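
Example (hypothetical frame; the output columns follow the description above):

>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, None], "b": [None, None]})
>>> null_summary = determine_null_values(df)  # per-column count and percentage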

utilities.data_desc.display_100_percent_null_columns(df)

Identifies columns that are 100% null in the DataFrame and returns their names.

Parameters:

df : pandas DataFrame

Input DataFrame.

Returns:

str

String containing the names of columns that are 100% null, separated by commas.

utilities.data_desc.find_duplicate_values(df)

Identifies duplicate values in the DataFrame and returns the count of duplicates for each row.

Parameters:

df : pandas DataFrame

Input DataFrame.

Returns:

pandas DataFrame

DataFrame containing the count of duplicates for each row.

utilities.data_desc.find_missing_timestamps(df, group_col=None)

Identifies missing timestamps in a DataFrame, optionally within groups.

Parameters:

df : pandas DataFrame

Input DataFrame containing timestamps.

group_col : str, optional

Name of the column to group by when identifying missing timestamps within groups. Default is None.

Returns:

pandas DataFrame

DataFrame containing the missing timestamps.

Example Usage:

>>> # Find missing timestamps across the entire DataFrame
>>> missing_timestamps = find_missing_timestamps(df)
>>> # Find missing timestamps within each group
>>> missing_timestamps_grouped = find_missing_timestamps(df, group_col='GroupID')

utilities.data_desc.find_outliers(df)

Finds outliers in numerical columns of the DataFrame using IQR and Z-score methods.

Parameters:

df : pandas DataFrame

Input DataFrame.

Returns:

pandas DataFrame

DataFrame containing the percentage of outliers for each numerical column and the total percentage of outliers.
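
For orientation, the two detection rules conventionally look like the following (the thresholds shown are the standard 1.5×IQR and |z| > 3; the module’s exact cutoffs may differ, and "value" is a hypothetical numeric column):

>>> import numpy as np
>>> from scipy.stats import zscore
>>> s = df["value"].dropna()
>>> q1, q3 = s.quantile([0.25, 0.75])
>>> iqr = q3 - q1
>>> iqr_outliers = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
>>> z_outliers = np.abs(zscore(s)) > 3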

utilities.data_desc.find_outliers_and_highlight(df)

utilities.data_desc.highlight_new_values(df, original_df)

Highlights new values in the DataFrame compared to the original DataFrame.

Parameters:

df : pandas DataFrame

DataFrame to be compared.

original_df : pandas DataFrame

Original DataFrame to compare against.

Returns:

pandas.io.formats.style.Styler

Styler object with new values highlighted in blue.

utilities.data_desc.highlight_null_columns(column)

Highlights the entire column if it’s completely null.

Parameters:

column : pandas Series

Input column to be checked for null values.

Returns:

list

List of CSS style strings to highlight the entire column if it’s completely null.

utilities.data_desc.identify_timestamp_column(df)

Identifies the first column in the DataFrame that is likely a timestamp.

Parameters:

df : pandas DataFrame

Input DataFrame.

Returns:

str

Name of the column that is likely a timestamp, or None if no timestamp column is identified.

utilities.data_desc.infer_frequency(df, timestamp_col=None)

Infers the frequency of timestamps in a dataset using the mode of time differences.

Parameters:

df : pandas DataFrame

Input DataFrame containing timestamps.

timestamp_col : str, optional

Name of the column containing timestamps. If None, the function will attempt to identify the timestamp column.

Returns:

str

Inferred frequency of timestamps in the dataset (e.g., ‘Hourly’, ‘Daily’, ‘Weekly’, ‘Monthly’, ‘Yearly’, ‘Unknown’).
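
Example (the column name and the result shown are illustrative):

>>> infer_frequency(df, timestamp_col="timestamp")
'Daily'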

utilities.data_preprocess module

This module contains functions to preprocess data, including handling missing values, outliers, and duplicates.

Functions:
  • detect_outliers: Function to detect outliers in the DataFrame.

  • handle_outliers: Function to handle outliers in the DataFrame.

  • handle_duplicate_values: Function to handle duplicate values in the DataFrame.

  • handle_missing_timestamps: Function to handle missing timestamps in the DataFrame.

  • handle_null_columns: Function to handle null columns in the DataFrame.

  • drop_duplicate_timestamp_columns: Function to drop duplicate timestamp columns from the DataFrame.

  • handle_duplicate_timestamps: Function to handle duplicate timestamps in the DataFrame.

  • is_acceptable_error: Function to check if the generated data is within an acceptable error range.

Dependencies:
  • pandas

  • numpy

  • scipy

  • sklearn

  • data_desc

utilities.data_preprocess.detect_outliers(df, col_name)

Function to detect outliers in the DataFrame using Z-Score and IQR methods.

Parameters:

df : DataFrame

DataFrame to be processed.

col_name : str

Name of the column to detect outliers.

Returns:

DataFrame

DataFrame with columns indicating outliers.

utilities.data_preprocess.drop_duplicate_timestamp_columns(df)

Function to drop duplicate timestamp columns from the DataFrame.

Parameters:

df : DataFrame

DataFrame to be processed.

Returns:

DataFrame

DataFrame with duplicate timestamp columns dropped.

utilities.data_preprocess.handle_duplicate_timestamps(df, timestamp_col, agg_method='mean')

Function to handle duplicate timestamps in the DataFrame by aggregating the values.

Parameters:

df : DataFrame

DataFrame to be processed.

timestamp_col : str

Name of the timestamp column.

agg_method : str, optional

Aggregation method to use for duplicate timestamps. Default is ‘mean’.

Returns:

DataFrame

DataFrame with duplicate timestamps handled.
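
Example (the column name is illustrative); conceptually this behaves like a group-by aggregation on the timestamp column:

>>> deduped = handle_duplicate_timestamps(df, timestamp_col="timestamp",
...                                       agg_method="median")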

utilities.data_preprocess.handle_duplicate_values(df)

Function to handle duplicate values in the DataFrame.

Parameters:

df : DataFrame

DataFrame to be processed.

Returns:

DataFrame

DataFrame with duplicate values handled.

utilities.data_preprocess.handle_missing_timestamps(df)

Function to handle missing timestamps in the DataFrame by resampling the data.

Parameters:

df : DataFrame

DataFrame to be processed.

Returns:

DataFrame

DataFrame with missing timestamps handled.

utilities.data_preprocess.handle_null_columns(df)

Function to handle columns with null values in the DataFrame.

Parameters:

df : DataFrame

DataFrame to be processed.

Returns:

DataFrame

DataFrame with null columns handled.

utilities.data_preprocess.handle_outliers(df)

Function to handle outliers in the DataFrame by setting the values of the outliers to NaN.

Parameters:

df : DataFrame

DataFrame to be processed.

Returns:

DataFrame

DataFrame with outliers handled.

utilities.data_preprocess.is_acceptable_error(original_data, generated_data, acceptable_error_margin=5)

Function to check if the generated data is within an acceptable error range.

Parameters:

original_data : pandas DataFrame

Original data used to train the models.

generated_data : pandas DataFrame

Synthetic data generated by the models.

acceptable_error_margin : float, optional

Acceptable error margin in percentage. Default is 5.

Returns:

bool

True if the error percentage is within the acceptable error margin, False otherwise.

Example:

>>> is_acceptable_error(original_data, generated_data, acceptable_error_margin=5)

utilities.data_properties_utils module

This module contains utility functions to calculate the statistical properties of the data in the DataFrame.

Functions:
  • calculate_stats: Function to calculate the statistical properties of the data in the DataFrame.

  • data_properties: Function to calculate the properties of the data in the DataFrames.

  • calculate_percentage_change: Function to calculate percentage change between same attributes in different DataFrames.

Dependencies:
  • pandas

  • scipy

  • sklearn

  • numpy

Intrinsic Properties used:
  • Distribution (Normal, Exponential, Binomial)

  • Mean

  • Median

  • Mode

  • Min

  • Max

  • Range (Max - Min)

  • IQR (Interquartile Range)

  • Skewness (Symmetry of the data distribution)

  • Kurtosis (Tails of the data distribution)

  • Outliers (Number of outliers)

  • Missing Values

  • Variance (Spread of the data distribution)

  • Std

  • Slope

  • Intercept (Y-Intercept)

  • Score (R^2 value)

  • Normality (Shapiro-Wilk test)

  • Correlation (Pearson correlation coefficient)

  • Confidence Interval (95% confidence interval)

  • Hypothesis Test (T-test)
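
Most of these properties map onto standard pandas/scipy calls; a minimal sketch for a single numeric Series (illustrative, not necessarily the module’s exact implementation):

>>> import pandas as pd
>>> from scipy import stats
>>> s = pd.Series([1.0, 2.0, 2.0, 3.0, 8.0])
>>> s.mean(), s.median(), s.var(), s.std()      # central tendency / spread
>>> s.skew(), s.kurtosis()                      # symmetry and tails
>>> stats.shapiro(s)                            # normality (Shapiro-Wilk)
>>> stats.linregress(range(len(s)), s)          # slope, intercept, R value
>>> stats.ttest_1samp(s, popmean=s.mean())      # hypothesis test (t-test)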

utilities.data_properties_utils.calculate_percentage_change(df)

Function to calculate percentage change between same attributes in different DataFrames.

Parameters:

df (DataFrame) – DataFrame containing the results of the data properties analysis.

Returns:

pct_change – DataFrame with the percentage change between the two DataFrames.

Return type:

DataFrame
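
The percentage change between two values is conventionally (new - old) / old * 100; a hypothetical worked example:

>>> old, new = 10.0, 12.5
>>> (new - old) / old * 100
25.0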

utilities.data_properties_utils.calculate_stats(df)

Function to calculate the statistical properties of the data in the DataFrame.

Parameters:

df (DataFrame) – DataFrame to be processed.

Returns:

results – DataFrame with the results of the data properties analysis.

Return type:

DataFrame

utilities.data_properties_utils.data_properties(dfs)

Function to calculate the properties of the data in the DataFrames.

Parameters:

dfs (list) – List of DataFrames to be processed.

Returns:

all_stats_results – DataFrame with the results of the data properties analysis.

Return type:

DataFrame
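
Example (the frame names are illustrative; typically one original and one generated DataFrame are compared):

>>> all_stats_results = data_properties([original_df, generated_df])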

utilities.evaluate_models module

This module contains functions to evaluate the models used to generate synthetic data.

Functions:
  • generate_data: Function to generate synthetic data using the given model.

  • find_best_model_parallel_generation: Function to find the best model for parallel generation.

  • find_best_model_parallel_imputation: Function to find the best model for parallel imputation.

  • evaluate_all_models: Function to evaluate the models.

Dependencies:
  • numpy

  • pandas

  • concurrent.futures

  • traceback

  • streamlit

  • data_desc

  • data_properties_utils

  • KDE

  • ITS

  • Copula

  • MonteCarlo

  • Imputation

utilities.evaluate_models.evaluate_all_models(original_data, generated_data_dict)

Function to evaluate the models.

Parameters:

original_data : pandas DataFrame

Original data used to train the models.

generated_data_dict : dict

Dictionary containing the generated data by the models.

Returns:

str

Name of the best model based on the evaluation.
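
Example (the model names and frames are illustrative; the dictionary keys identify each model’s generated data):

>>> best_model = evaluate_all_models(original_data,
...     {"KDE": kde_df, "Copula": copula_df})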

utilities.evaluate_models.find_best_model_parallel_generation(original_data)

Function to find the best model for parallel generation.

Parameters:

original_data : pandas DataFrame

Original data used to train the models.

Returns:

str

Name of the best model for parallel generation.

utilities.evaluate_models.find_best_model_parallel_imputation(original_data)

Function to find the best model for parallel imputation.

Parameters:

original_data : pandas DataFrame

Original data used to train the models.

Returns:

str

Name of the best model for parallel imputation.

utilities.evaluate_models.generate_data(model_function, original_data, **kwargs)

Function to generate synthetic data using the given model.

Parameters:

model_function : function

Function to generate synthetic data.

original_data : pandas DataFrame

Original data used to train the model.

kwargs : dict

Additional arguments to pass to the model function.

Returns:

pandas DataFrame

Synthetic data generated by the model.

Example:

>>> generate_data(generate_synthetic_data_for_KDE, original_data)

Module contents