utilities package
Submodules
utilities.data_desc module
This module contains utility functions for data description and analysis.
- Functions:
infer_frequency: Infers the frequency of timestamps in a dataset using the mode of time differences.
identify_timestamp_column: Identifies the first column in the DataFrame that is likely a timestamp.
determine_null_values: Determines the null / missing values in the DataFrame.
display_100_percent_null_columns: Identifies columns that are 100% null in the DataFrame and returns their names.
convert_timestamp_columns: Converts all columns that are likely to be timestamps to datetime format.
find_outliers: Finds outliers in numerical columns of the DataFrame using IQR and Z-score methods.
color_null_red: Colors null values red and leaves others unchanged.
highlight_null_columns: Highlights an entire column if it is completely null.
find_missing_timestamps: Identifies missing timestamps in a DataFrame, optionally within groups.
find_duplicate_values: Identifies duplicate values in the DataFrame and returns the count of duplicates for each row.
convert_string_to_number: Converts string values in the DataFrame to numeric values where possible.
convert_df: Converts the DataFrame to a CSV file and returns the CSV file as bytes.
highlight_new_values: Highlights new values in the DataFrame compared to the original DataFrame.
- Dependencies:
pandas
numpy
mode from scipy.stats
zscore from scipy.stats
re
streamlit
warnings
- utilities.data_desc.color_null_red(value)
Colors null values red and leaves others unchanged.
Parameters:
- value : any
Value to be checked for null.
Returns:
- str
CSS style string to color null values red.
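A minimal sketch of how color_null_red might be implemented, following the pandas Styler convention of returning one CSS string per cell; the exact style string is an assumption:

import pandas as pd

def color_null_red(value):
    # Red text for null values, empty style string otherwise.
    return "color: red" if pd.isnull(value) else ""

# Typical usage: apply element-wise through a Styler.
df = pd.DataFrame({"a": [1, None, 3]})
styled = df.style.applymap(color_null_red)  # Styler.map in pandas >= 2.1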
- utilities.data_desc.convert_df(df)
Converts the DataFrame to a CSV file and returns it as bytes for download.
Parameters:
- df : pandas DataFrame
Input DataFrame to be converted to CSV.
Returns:
- bytes
CSV file as bytes.
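A common Streamlit download pattern that matches this description; whether the index is included is an assumption:

def convert_df(df):
    # Encode the DataFrame as CSV bytes, e.g. for st.download_button.
    return df.to_csv(index=False).encode("utf-8")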
- utilities.data_desc.convert_string_to_number(df)
Converts string values in the DataFrame to numeric values where possible.
Parameters:
- df : pandas DataFrame or Series
Input DataFrame or Series containing string values to be converted.
Returns:
- pandas DataFrame or Series
DataFrame or Series with string values converted to numeric values where possible.
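An illustrative sketch using pd.to_numeric; with errors="ignore" a column is returned unchanged whenever any of its values fails to convert (note this option is deprecated in recent pandas):

import pandas as pd

def convert_string_to_number(df):
    # Convert column by column, leaving non-convertible columns untouched.
    if isinstance(df, pd.Series):
        return pd.to_numeric(df, errors="ignore")
    return df.apply(lambda col: pd.to_numeric(col, errors="ignore"))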
- utilities.data_desc.convert_timestamp_columns(df)
Converts all columns that are likely to be timestamps to datetime format.
Parameters:
- df : pandas DataFrame
Input DataFrame.
Returns:
- pandas DataFrame
DataFrame with columns that are likely timestamps converted to datetime format.
- utilities.data_desc.determine_null_values(df)
Determines the null / missing values in the DataFrame.
Parameters:
- df : pandas DataFrame
Input DataFrame.
Returns:
- pandas DataFrame
DataFrame containing the count and percentage of missing values for each column and the total percentage of missing values.
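A sketch of the kind of summary described; the column names and the "Total" row layout are assumptions:

import pandas as pd

def determine_null_values(df):
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "Missing Count": counts,
        "Missing %": (100 * counts / len(df)).round(2),
    })
    # Overall percentage across all cells of the DataFrame.
    summary.loc["Total"] = [counts.sum(), round(100 * counts.sum() / df.size, 2)]
    return summary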
- utilities.data_desc.display_100_percent_null_columns(df)
Identifies columns that are 100% null in the DataFrame and returns their names.
Parameters:
- df : pandas DataFrame
Input DataFrame.
Returns:
- str
String containing the names of columns that are 100% null, separated by commas.
- utilities.data_desc.find_duplicate_values(df)
Identifies duplicate values in the DataFrame and returns the count of duplicates for each row.
Parameters:
- df : pandas DataFrame
Input DataFrame.
Returns:
- pandas DataFrame
DataFrame containing the count of duplicates for each row.
- utilities.data_desc.find_missing_timestamps(df, group_col=None)
Identifies missing timestamps in a DataFrame, optionally within groups.
Parameters:
- df : pandas DataFrame
Input DataFrame containing timestamps.
- group_col : str, optional
Name of the column to group by when identifying missing timestamps within groups. Default is None.
Returns:
- pandas DataFrame
DataFrame containing the missing timestamps.
Example Usage:
>>> # Find missing timestamps across the entire DataFrame
>>> missing_timestamps = find_missing_timestamps(df)
>>> # Find missing timestamps within each group
>>> missing_timestamps_grouped = find_missing_timestamps(df, group_col='GroupID')
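A sketch of the underlying logic: build the complete pd.date_range between the first and last timestamp and take the set difference. The explicit timestamp_col and freq parameters are simplifications; the real function presumably infers them via identify_timestamp_column and infer_frequency:

import pandas as pd

def find_missing_timestamps(df, timestamp_col, freq, group_col=None):
    def missing_in(frame):
        full = pd.date_range(frame[timestamp_col].min(),
                             frame[timestamp_col].max(), freq=freq)
        return full.difference(frame[timestamp_col])

    if group_col is None:
        return pd.DataFrame({timestamp_col: missing_in(df)})
    parts = [pd.DataFrame({group_col: key, timestamp_col: missing_in(grp)})
             for key, grp in df.groupby(group_col)]
    return pd.concat(parts, ignore_index=True)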
- utilities.data_desc.find_outliers(df)
Finds outliers in numerical columns of the DataFrame using IQR and Z-score methods.
Parameters:
- df : pandas DataFrame
Input DataFrame.
Returns:
- pandas DataFrame
DataFrame containing the percentage of outliers for each numerical column and the total percentage of outliers.
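A sketch of the two tests named above; the 1.5 x IQR fences, the z-score threshold of 3, and treating the total as the mean across columns are all assumptions:

import numpy as np
import pandas as pd
from scipy.stats import zscore

def find_outliers(df, z_thresh=3.0):
    results = {}
    for col in df.select_dtypes(include=np.number).columns:
        s = df[col].dropna()
        q1, q3 = s.quantile([0.25, 0.75])
        iqr = q3 - q1
        iqr_mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
        z_mask = np.abs(zscore(s)) > z_thresh
        # A value counts as an outlier if either method flags it.
        results[col] = 100 * (iqr_mask | z_mask).mean()
    out = pd.Series(results, name="Outlier %").to_frame()
    out.loc["Total"] = out["Outlier %"].mean()
    return out.round(2)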
- utilities.data_desc.find_outliers_and_highlight(df)
Finds outliers in the DataFrame and highlights them.
- utilities.data_desc.highlight_new_values(df, original_df)
Highlights new values in the DataFrame compared to the original DataFrame.
Parameters:
- df : pandas DataFrame
DataFrame to be compared.
- original_df : pandas DataFrame
Original DataFrame to compare against.
Returns:
- pandas.io.formats.style.Styler
Styler object with new values highlighted in blue.
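A plausible shape for this function: compute a boolean mask of changed cells and map it to CSS strings through Styler.apply with axis=None. The light-blue shade and the NaN handling are assumptions:

import numpy as np
import pandas as pd

def highlight_new_values(df, original_df):
    def _styles(data):
        base = original_df.reindex_like(data)
        # New = value differs, excluding cells that are NaN in both frames.
        is_new = data.ne(base) & ~(data.isna() & base.isna())
        return pd.DataFrame(np.where(is_new, "background-color: lightblue", ""),
                            index=data.index, columns=data.columns)
    return df.style.apply(_styles, axis=None)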
- utilities.data_desc.highlight_null_columns(column)
Highlights an entire column if it is completely null.
Parameters:
- column : pandas Series
Input column to be checked for null values.
Returns:
- list
List of CSS style strings to highlight the entire column if it’s completely null.
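A minimal sketch; the background color is an assumption. The function returns one style string per cell so it can be applied column-wise:

def highlight_null_columns(column):
    # Style every cell if (and only if) the whole column is null.
    style = "background-color: red" if column.isnull().all() else ""
    return [style] * len(column)

# Applied per column, e.g. df.style.apply(highlight_null_columns, axis=0)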
- utilities.data_desc.identify_timestamp_column(df)
Identifies the first column in the DataFrame that is likely a timestamp.
Parameters:
- df : pandas DataFrame
Input DataFrame.
Returns:
- str
Name of the column that is likely a timestamp, or None if no timestamp column is identified.
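One plausible heuristic, assuming the check is "already datetime, or an object column that parses cleanly"; the module's re dependency suggests name-pattern matching may also play a part:

import pandas as pd

def identify_timestamp_column(df):
    for col in df.columns:
        if pd.api.types.is_datetime64_any_dtype(df[col]):
            return col
        if df[col].dtype == object:
            try:
                # Only object columns: numeric columns would parse as epochs.
                pd.to_datetime(df[col], errors="raise")
                return col
            except (ValueError, TypeError):
                continue
    return None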
- utilities.data_desc.infer_frequency(df, timestamp_col=None)
Infers the frequency of timestamps in a dataset using the mode of time differences.
Parameters:
- df : pandas DataFrame
Input DataFrame containing timestamps.
- timestamp_col : str, optional
Name of the column containing timestamps. If None, the function will attempt to identify the timestamp column.
Returns:
- str
Inferred frequency of timestamps in the dataset (e.g., ‘Hourly’, ‘Daily’, ‘Weekly’, ‘Monthly’, ‘Yearly’, ‘Unknown’).
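A sketch of the mode-of-differences idea, taking the timestamp values directly rather than locating them inside df; pandas' Series.mode stands in for scipy.stats.mode, and the gap-to-label tolerance bands are assumptions:

import pandas as pd

def infer_frequency(timestamps):
    diffs = pd.Series(timestamps).sort_values().diff().dropna()
    seconds = diffs.mode().iloc[0].total_seconds()  # modal time difference
    if seconds == 3600:
        return "Hourly"
    if seconds == 86400:
        return "Daily"
    if seconds == 7 * 86400:
        return "Weekly"
    if 28 * 86400 <= seconds <= 31 * 86400:
        return "Monthly"
    if 365 * 86400 <= seconds <= 366 * 86400:
        return "Yearly"
    return "Unknown"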
utilities.data_preprocess module
This module contains functions to preprocess data, including handling missing values, outliers, and duplicates.
- Functions:
detect_outliers: Function to detect outliers in the DataFrame.
handle_outliers: Function to handle outliers in the DataFrame.
handle_duplicate_values: Function to handle duplicate values in the DataFrame.
handle_missing_timestamps: Function to handle missing timestamps in the DataFrame.
handle_null_columns: Function to handle null columns in the DataFrame.
drop_duplicate_timestamp_columns: Function to drop duplicate timestamp columns from the DataFrame.
handle_duplicate_timestamps: Function to handle duplicate timestamps in the DataFrame.
is_acceptable_error: Function to check if the generated data is within an acceptable error range.
- Dependencies:
pandas
numpy
scipy
sklearn
data_desc
- utilities.data_preprocess.detect_outliers(df, col_name)
Function to detect outliers in the DataFrame using Z-Score and IQR methods.
Parameters:
- df : DataFrame
DataFrame to be processed.
- col_name : str
Name of the column to detect outliers.
Returns:
- DataFrame
DataFrame with columns indicating outliers.
- utilities.data_preprocess.drop_duplicate_timestamp_columns(df)
Function to drop duplicate timestamp columns from the DataFrame.
Parameters:
- df : DataFrame
DataFrame to be processed.
Returns:
- DataFrame
DataFrame with duplicate timestamp columns dropped.
- utilities.data_preprocess.handle_duplicate_timestamps(df, timestamp_col, agg_method='mean')
Function to handle duplicate timestamps in the DataFrame by aggregating the values.
Parameters:
- df : DataFrame
DataFrame to be processed.
- timestamp_col : str
Name of the timestamp column.
- agg_method : str, optional
Aggregation method to use for duplicate timestamps. Default is ‘mean’.
Returns:
- DataFrame
DataFrame with duplicate timestamps handled.
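The natural pandas expression of "aggregate duplicate timestamps"; this sketch assumes the remaining columns are numeric so the aggregation method applies to all of them:

def handle_duplicate_timestamps(df, timestamp_col, agg_method="mean"):
    # One row per timestamp; duplicated stamps collapse via agg_method.
    return df.groupby(timestamp_col, as_index=False).agg(agg_method)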
- utilities.data_preprocess.handle_duplicate_values(df)
Function to handle duplicate values in the DataFrame.
Parameters:
- df : DataFrame
DataFrame to be processed.
Returns:
- DataFrame
DataFrame with duplicate values handled.
- utilities.data_preprocess.handle_missing_timestamps(df)
Function to handle missing timestamps in the DataFrame by resampling the data.
Parameters:
- df : DataFrame
DataFrame to be processed.
Returns:
- DataFrame
DataFrame with missing timestamps handled.
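A sketch of resampling onto a regular grid; the frequency would presumably come from infer_frequency, and time-based interpolation of the inserted gaps is an assumption:

def handle_missing_timestamps(df, timestamp_col, freq):
    # Assumes unique, sorted timestamps (see handle_duplicate_timestamps).
    out = df.set_index(timestamp_col).asfreq(freq)  # NaN rows at missing stamps
    return out.interpolate(method="time").reset_index()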
- utilities.data_preprocess.handle_null_columns(df)
Function to handle columns with null values in the DataFrame.
Parameters:
- df : DataFrame
DataFrame to be processed.
Returns:
- DataFrame
DataFrame with null columns handled.
- utilities.data_preprocess.handle_outliers(df)
Function to handle outliers in the DataFrame by setting the values of the outliers to NaN.
Parameters:
- df : DataFrame
DataFrame to be processed.
Returns:
- DataFrame
DataFrame with outliers handled.
- utilities.data_preprocess.is_acceptable_error(original_data, generated_data, acceptable_error_margin=5)
Function to check if the generated data is within an acceptable error range.
Parameters:
- original_data : pandas DataFrame
Original data used to train the models.
- generated_data : pandas DataFrame
Synthetic data generated by the models.
- acceptable_error_margin : float, optional
Acceptable error margin in percentage. Default is 5.
Returns:
- bool
True if the error percentage is within the acceptable error margin, False otherwise.
Example:
>>> is_acceptable_error(original_data, generated_data, acceptable_error_margin=5)
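The actual error metric is not documented; the sketch below assumes a mean absolute percentage error over column means:

import numpy as np

def is_acceptable_error(original_data, generated_data, acceptable_error_margin=5):
    orig = original_data.mean(numeric_only=True)
    gen = generated_data.mean(numeric_only=True)
    # Mean absolute percentage error across numeric columns (assumed metric).
    error_pct = (np.abs(gen - orig) / np.abs(orig)).mean() * 100
    return bool(error_pct <= acceptable_error_margin)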
utilities.data_properties_utils module
This module contains utility functions to calculate the statistical properties of the data in the DataFrame.
- Functions:
calculate_stats: Function to calculate the statistical properties of the data in the DataFrame.
data_properties: Function to calculate the properties of the data in the DataFrames.
calculate_percentage_change: Function to calculate percentage change between same attributes in different DataFrames.
- Dependencies:
pandas
scipy
sklearn
numpy
- Intrinsic Properties used:
Distribution (Normal, Exponential, Binomial)
Mean
Median
Mode
Min
Max
Range (Max - Min)
IQR (Interquartile Range)
Skewness (Symmetry of the data distribution)
Kurtosis (Tails of the data distribution)
Outliers (Number of outliers)
Missing Values
Variance (Spread of the data distribution)
Std
Slope
Intercept (Y-Intercept)
Score (R^2 value)
Normality (Shapiro-Wilk test)
Correlation (Pearson correlation coefficient)
Confidence Interval (95% confidence interval)
Hypothesis Test (T-test)
- utilities.data_properties_utils.calculate_percentage_change(df)
Function to calculate percentage change between same attributes in different DataFrames.
- Parameters:
df (DataFrame) – DataFrame containing the results of the data properties analysis.
- Returns:
pct_change – DataFrame with the percentage change between the two DataFrames.
- Return type:
DataFrame
- utilities.data_properties_utils.calculate_stats(df)
Function to calculate the statistical properties of the data in the DataFrame.
- Parameters:
df (DataFrame) – DataFrame to be processed.
- Returns:
results – DataFrame with the results of the data properties analysis.
- Return type:
DataFrame
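An abbreviated sketch covering a few of the intrinsic properties listed above; the full function presumably computes all of them:

import pandas as pd
from scipy import stats

def calculate_stats(df):
    results = {}
    for col in df.select_dtypes(include="number").columns:
        s = df[col].dropna()
        results[col] = {
            "Mean": s.mean(),
            "Median": s.median(),
            "Std": s.std(),
            "Skewness": stats.skew(s),
            "Kurtosis": stats.kurtosis(s),
            # Shapiro-Wilk needs at least 3 observations.
            "Normality p-value": stats.shapiro(s)[1] if len(s) >= 3 else None,
        }
    return pd.DataFrame(results).T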
- utilities.data_properties_utils.data_properties(dfs)
Function to calculate the properties of the data in the DataFrames.
- Parameters:
dfs (list) – List of DataFrames to be processed.
- Returns:
all_stats_results – DataFrame with the results of the data properties analysis.
- Return type:
DataFrame
utilities.evaluate_models module
This module contains functions to evaluate the models used to generate synthetic data.
- Functions:
generate_data: Function to generate synthetic data using the given model.
find_best_model_parallel_generation: Function to find the best model for parallel generation.
find_best_model_parallel_imputation: Function to find the best model for parallel imputation.
evaluate_all_models: Function to evaluate the models.
- Dependencies:
numpy
pandas
concurrent.futures
traceback
streamlit
data_desc
data_properties_utils
KDE
ITS
Copula
MonteCarlo
Imputation
- utilities.evaluate_models.evaluate_all_models(original_data, generated_data_dict)
Function to evaluate the models.
Parameters:
- original_data : pandas DataFrame
Original data used to train the models.
- generated_data_dict : dict
Dictionary containing the generated data by the models.
Returns:
- str
Name of the best model based on the evaluation.
- utilities.evaluate_models.find_best_model_parallel_generation(original_data)
Function to find the best model for parallel generation.
Parameters:
- original_data : pandas DataFrame
Original data used to train the models.
Returns:
- str
Name of the best model for parallel generation.
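A sketch of the concurrent pattern implied by the name and the concurrent.futures dependency; the model_functions mapping and the evaluate callback are illustrative stand-ins for the module's hard-wired models (KDE, ITS, Copula, MonteCarlo) and evaluate_all_models:

import concurrent.futures

def find_best_model_parallel_generation(original_data, model_functions, evaluate):
    generated = {}
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {pool.submit(fn, original_data): name
                   for name, fn in model_functions.items()}
        for fut in concurrent.futures.as_completed(futures):
            generated[futures[fut]] = fut.result()
    # Pick the best model given original vs. generated data.
    return evaluate(original_data, generated)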
- utilities.evaluate_models.find_best_model_parallel_imputation(original_data)
Function to find the best model for parallel imputation.
Parameters:
- original_data : pandas DataFrame
Original data used to train the models.
Returns:
- str
Name of the best model for parallel imputation.
- utilities.evaluate_models.generate_data(model_function, original_data, **kwargs)
Function to generate synthetic data using the given model.
Parameters:
- model_function : function
Function to generate synthetic data.
- original_data : pandas DataFrame
Original data used to train the model.
- kwargs : dict
Additional arguments to pass to the model function.
Returns:
- pandas DataFrame
Synthetic data generated by the model.
Example:
>>> generate_data(generate_synthetic_data_for_KDE, original_data)