utilities package
Submodules
utilities.data_desc module
This module contains utility functions for data description and analysis.
- Functions:
infer_frequency: Infers the frequency of timestamps in a dataset using the mode of time differences.
identify_timestamp_column: Identifies the first column in the DataFrame that is likely a timestamp.
determine_null_values: Determines the null / missing values in the DataFrame.
display_100_percent_null_columns: Identifies columns that are 100% null in the DataFrame and returns their names.
convert_timestamp_columns: Converts all columns that are likely to be timestamps to datetime format.
find_outliers: Finds outliers in numerical columns of the DataFrame using IQR and Z-score methods.
color_null_red: Colors null values red and leaves others unchanged.
highlight_null_columns: Highlights an entire column if it is completely null.
find_missing_timestamps: Identifies missing timestamps in a DataFrame, optionally within groups.
find_duplicate_values: Identifies duplicate values in the DataFrame and returns the count of duplicates for each row.
convert_string_to_number: Converts string values in the DataFrame to numeric values where possible.
convert_df: Converts the DataFrame to a CSV file and returns the CSV file as bytes.
highlight_new_values: Highlights new values in the DataFrame compared to the original DataFrame.
- Dependencies:
pandas
numpy
mode from scipy.stats
zscore from scipy.stats
re
streamlit
warnings
- utilities.data_desc.color_null_red(value)
Colors null values red and leaves others unchanged.
Parameters:
- value : any
Value to be checked for null.
Returns:
- str
CSS style string to color null values red.
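A minimal sketch of how color_null_red might be implemented, following the pandas Styler convention of returning one CSS string per cell; the exact style string is an assumption:

import pandas as pd

def color_null_red(value):
    # Red text for null values, empty style string otherwise.
    return "color: red" if pd.isnull(value) else ""

# Typical usage: apply element-wise through a Styler.
df = pd.DataFrame({"a": [1, None, 3]})
styled = df.style.applymap(color_null_red)  # Styler.map in pandas >= 2.1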
- utilities.data_desc.convert_df(df)
Converts the DataFrame to a CSV file and returns it as bytes for download.
Parameters:
- df : pandas DataFrame
Input DataFrame to be converted to CSV.
Returns:
- bytes
CSV file as bytes.
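A common Streamlit download pattern that matches this description; whether the index is included is an assumption:

def convert_df(df):
    # Encode the DataFrame as CSV bytes, e.g. for st.download_button.
    return df.to_csv(index=False).encode("utf-8")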
- utilities.data_desc.convert_string_to_number(df)
Converts string values in the DataFrame to numeric values where possible.
Parameters:
- df : pandas DataFrame or Series
Input DataFrame or Series containing string values to be converted.
Returns:
- pandas DataFrame or Series
DataFrame or Series with string values converted to numeric values where possible.
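An illustrative sketch using pd.to_numeric; with errors="ignore" a column is returned unchanged whenever any of its values fails to convert (note this option is deprecated in recent pandas):

import pandas as pd

def convert_string_to_number(df):
    # Convert column by column, leaving non-convertible columns untouched.
    if isinstance(df, pd.Series):
        return pd.to_numeric(df, errors="ignore")
    return df.apply(lambda col: pd.to_numeric(col, errors="ignore"))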
- utilities.data_desc.convert_timestamp_columns(df)
Converts all columns that are likely to be timestamps to datetime format.
Parameters:
- df : pandas DataFrame
Input DataFrame.
Returns:
- pandas DataFrame
DataFrame with columns that are likely timestamps converted to datetime format.
- utilities.data_desc.determine_null_values(df)
Determines the null / missing values in the DataFrame.
Parameters:
- df : pandas DataFrame
Input DataFrame.
Returns:
- pandas DataFrame
DataFrame containing the count and percentage of missing values for each column and the total percentage of missing values.
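A sketch of the kind of summary described; the column names and the "Total" row layout are assumptions:

import pandas as pd

def determine_null_values(df):
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "Missing Count": counts,
        "Missing %": (100 * counts / len(df)).round(2),
    })
    # Overall percentage across all cells of the DataFrame.
    summary.loc["Total"] = [counts.sum(), round(100 * counts.sum() / df.size, 2)]
    return summary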
- utilities.data_desc.display_100_percent_null_columns(df)
Identifies columns that are 100% null in the DataFrame and returns their names.
Parameters:
- df : pandas DataFrame
Input DataFrame.
Returns:
- str
String containing the names of columns that are 100% null, separated by commas.
- utilities.data_desc.find_duplicate_values(df)
Identifies duplicate values in the DataFrame and returns the count of duplicates for each row.
Parameters:
- df : pandas DataFrame
Input DataFrame.
Returns:
- pandas DataFrame
DataFrame containing the count of duplicates for each row.
- utilities.data_desc.find_missing_timestamps(df, group_col=None)
Identifies missing timestamps in a DataFrame, optionally within groups.
Parameters:
- df : pandas DataFrame
Input DataFrame containing timestamps.
- group_col : str, optional
Name of the column to group by when identifying missing timestamps within groups. Default is None.
Returns:
- pandas DataFrame
DataFrame containing the missing timestamps.
Example Usage:
>>> # Find missing timestamps across the entire DataFrame
>>> missing_timestamps = find_missing_timestamps(df)
>>> # Find missing timestamps within each group
>>> missing_timestamps_grouped = find_missing_timestamps(df, group_col='GroupID')
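A sketch of the underlying logic: build the complete pd.date_range between the first and last timestamp and take the set difference. The explicit timestamp_col and freq parameters are simplifications; the real function presumably infers them via identify_timestamp_column and infer_frequency:

import pandas as pd

def find_missing_timestamps(df, timestamp_col, freq, group_col=None):
    def missing_in(frame):
        full = pd.date_range(frame[timestamp_col].min(),
                             frame[timestamp_col].max(), freq=freq)
        return full.difference(frame[timestamp_col])

    if group_col is None:
        return pd.DataFrame({timestamp_col: missing_in(df)})
    parts = [pd.DataFrame({group_col: key, timestamp_col: missing_in(grp)})
             for key, grp in df.groupby(group_col)]
    return pd.concat(parts, ignore_index=True)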
- utilities.data_desc.find_outliers(df)
Finds outliers in numerical columns of the DataFrame using IQR and Z-score methods.
Parameters:
- df : pandas DataFrame
Input DataFrame.
Returns:
- pandas DataFrame
DataFrame containing the percentage of outliers for each numerical column and the total percentage of outliers.
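A sketch of the two tests named above; the 1.5 x IQR fences, the z-score threshold of 3, and treating the total as the mean across columns are all assumptions:

import numpy as np
import pandas as pd
from scipy.stats import zscore

def find_outliers(df, z_thresh=3.0):
    results = {}
    for col in df.select_dtypes(include=np.number).columns:
        s = df[col].dropna()
        q1, q3 = s.quantile([0.25, 0.75])
        iqr = q3 - q1
        iqr_mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
        z_mask = np.abs(zscore(s)) > z_thresh
        # A value counts as an outlier if either method flags it.
        results[col] = 100 * (iqr_mask | z_mask).mean()
    out = pd.Series(results, name="Outlier %").to_frame()
    out.loc["Total"] = out["Outlier %"].mean()
    return out.round(2)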
- utilities.data_desc.find_outliers_and_highlight(df)
Finds outliers in the DataFrame and highlights them.
- utilities.data_desc.highlight_new_values(df, original_df)
Highlights new values in the DataFrame compared to the original DataFrame.
Parameters:
- df : pandas DataFrame
DataFrame to be compared.
- original_df : pandas DataFrame
Original DataFrame to compare against.
Returns:
- pandas.io.formats.style.Styler
Styler object with new values highlighted in blue.
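A plausible shape for this function: compute a boolean mask of changed cells and map it to CSS strings through Styler.apply with axis=None. The light-blue shade and the NaN handling are assumptions:

import numpy as np
import pandas as pd

def highlight_new_values(df, original_df):
    def _styles(data):
        base = original_df.reindex_like(data)
        # New = value differs, excluding cells that are NaN in both frames.
        is_new = data.ne(base) & ~(data.isna() & base.isna())
        return pd.DataFrame(np.where(is_new, "background-color: lightblue", ""),
                            index=data.index, columns=data.columns)
    return df.style.apply(_styles, axis=None)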
- utilities.data_desc.highlight_null_columns(column)
Highlights an entire column if it is completely null.
Parameters:
- column : pandas Series
Input column to be checked for null values.
Returns:
- list
List of CSS style strings to highlight the entire column if it’s completely null.
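A minimal sketch; the background color is an assumption. The function returns one style string per cell so it can be applied column-wise:

def highlight_null_columns(column):
    # Style every cell if (and only if) the whole column is null.
    style = "background-color: red" if column.isnull().all() else ""
    return [style] * len(column)

# Applied per column, e.g. df.style.apply(highlight_null_columns, axis=0)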
- utilities.data_desc.identify_timestamp_column(df)
Identifies the first column in the DataFrame that is likely a timestamp.
Parameters:
- df : pandas DataFrame
Input DataFrame.
Returns:
- str
Name of the column that is likely a timestamp, or None if no timestamp column is identified.
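One plausible heuristic, assuming the check is "already datetime, or an object column that parses cleanly"; the module's re dependency suggests name-pattern matching may also play a part:

import pandas as pd

def identify_timestamp_column(df):
    for col in df.columns:
        if pd.api.types.is_datetime64_any_dtype(df[col]):
            return col
        if df[col].dtype == object:
            try:
                # Only object columns: numeric columns would parse as epochs.
                pd.to_datetime(df[col], errors="raise")
                return col
            except (ValueError, TypeError):
                continue
    return None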
- utilities.data_desc.infer_frequency(df, timestamp_col=None)
Infers the frequency of timestamps in a dataset using the mode of time differences.
Parameters:
- df : pandas DataFrame
Input DataFrame containing timestamps.
- timestamp_col : str, optional
Name of the column containing timestamps. If None, the function will attempt to identify the timestamp column.
Returns:
- str
Inferred frequency of timestamps in the dataset (e.g., ‘Hourly’, ‘Daily’, ‘Weekly’, ‘Monthly’, ‘Yearly’, ‘Unknown’).
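A sketch of the mode-of-differences idea, taking the timestamp values directly rather than locating them inside df; pandas' Series.mode stands in for scipy.stats.mode, and the gap-to-label tolerance bands are assumptions:

import pandas as pd

def infer_frequency(timestamps):
    diffs = pd.Series(timestamps).sort_values().diff().dropna()
    seconds = diffs.mode().iloc[0].total_seconds()  # modal time difference
    if seconds == 3600:
        return "Hourly"
    if seconds == 86400:
        return "Daily"
    if seconds == 7 * 86400:
        return "Weekly"
    if 28 * 86400 <= seconds <= 31 * 86400:
        return "Monthly"
    if 365 * 86400 <= seconds <= 366 * 86400:
        return "Yearly"
    return "Unknown"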
utilities.data_preprocess module
This module contains functions to preprocess data, including handling missing values, outliers, and duplicates.
- Functions:
detect_outliers: Function to detect outliers in the DataFrame.
handle_outliers: Function to handle outliers in the DataFrame.
handle_duplicate_values: Function to handle duplicate values in the DataFrame.
handle_missing_timestamps: Function to handle missing timestamps in the DataFrame.
handle_null_columns: Function to handle null columns in the DataFrame.
drop_duplicate_timestamp_columns: Function to drop duplicate timestamp columns from the DataFrame.
handle_duplicate_timestamps: Function to handle duplicate timestamps in the DataFrame.
is_acceptable_error: Function to check if the generated data is within an acceptable error range.
- Dependencies:
pandas
numpy
scipy
sklearn
data_desc
- utilities.data_preprocess.detect_outliers(df, col_name)
Function to detect outliers in the DataFrame using Z-Score and IQR methods.
Parameters:
- df : DataFrame
DataFrame to be processed.
- col_name : str
Name of the column to detect outliers.
Returns:
- DataFrame
DataFrame with columns indicating outliers.
- utilities.data_preprocess.drop_duplicate_timestamp_columns(df)
Function to drop duplicate timestamp columns from the DataFrame.
Parameters:
- df : DataFrame
DataFrame to be processed.
Returns:
- DataFrame
DataFrame with duplicate timestamp columns dropped.
- utilities.data_preprocess.handle_duplicate_timestamps(df, timestamp_col, agg_method='mean')
Function to handle duplicate timestamps in the DataFrame by aggregating the values.
Parameters:
- df : DataFrame
DataFrame to be processed.
- timestamp_col : str
Name of the timestamp column.
- agg_method : str, optional
Aggregation method to use for duplicate timestamps. Default is ‘mean’.
Returns:
- DataFrame
DataFrame with duplicate timestamps handled.
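The natural pandas expression of "aggregate duplicate timestamps"; this sketch assumes the remaining columns are numeric so the aggregation method applies to all of them:

def handle_duplicate_timestamps(df, timestamp_col, agg_method="mean"):
    # One row per timestamp; duplicated stamps collapse via agg_method.
    return df.groupby(timestamp_col, as_index=False).agg(agg_method)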
- utilities.data_preprocess.handle_duplicate_values(df)
Function to handle duplicate values in the DataFrame.
Parameters:
- df : DataFrame
DataFrame to be processed.
Returns:
- DataFrame
DataFrame with duplicate values handled.
- utilities.data_preprocess.handle_missing_timestamps(df)
Function to handle missing timestamps in the DataFrame by resampling the data.
Parameters:
- df : DataFrame
DataFrame to be processed.
Returns:
- DataFrame
DataFrame with missing timestamps handled.
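A sketch of resampling onto a regular grid; the frequency would presumably come from infer_frequency, and time-based interpolation of the inserted gaps is an assumption:

def handle_missing_timestamps(df, timestamp_col, freq):
    # Assumes unique, sorted timestamps (see handle_duplicate_timestamps).
    out = df.set_index(timestamp_col).asfreq(freq)  # NaN rows at missing stamps
    return out.interpolate(method="time").reset_index()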
- utilities.data_preprocess.handle_null_columns(df)
Function to handle columns with null values in the DataFrame.
Parameters:
- df : DataFrame
DataFrame to be processed.
Returns:
- DataFrame
DataFrame with null columns handled.
- utilities.data_preprocess.handle_outliers(df)
Function to handle outliers in the DataFrame by setting the values of the outliers to NaN.
Parameters:
- df : DataFrame
DataFrame to be processed.
Returns:
- DataFrame
DataFrame with outliers handled.
- utilities.data_preprocess.is_acceptable_error(original_data, generated_data, acceptable_error_margin=5)
Function to check if the generated data is within an acceptable error range.
Parameters:
- original_data : pandas DataFrame
Original data used to train the models.
- generated_data : pandas DataFrame
Synthetic data generated by the models.
- acceptable_error_margin : float, optional
Acceptable error margin in percentage. Default is 5.
Returns:
- bool
True if the error percentage is within the acceptable error margin, False otherwise.
Example:
>>> is_acceptable_error(original_data, generated_data, acceptable_error_margin=5)
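The actual error metric is not documented; the sketch below assumes a mean absolute percentage error over column means:

import numpy as np

def is_acceptable_error(original_data, generated_data, acceptable_error_margin=5):
    orig = original_data.mean(numeric_only=True)
    gen = generated_data.mean(numeric_only=True)
    # Mean absolute percentage error across numeric columns (assumed metric).
    error_pct = (np.abs(gen - orig) / np.abs(orig)).mean() * 100
    return bool(error_pct <= acceptable_error_margin)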
utilities.data_properties_utils module
This module contains utility functions to calculate the statistical properties of the data in the DataFrame.
- Functions:
calculate_stats: Function to calculate the statistical properties of the data in the DataFrame.
data_properties: Function to calculate the properties of the data in the DataFrames.
calculate_percentage_change: Function to calculate percentage change between same attributes in different DataFrames.
- Dependencies:
pandas
scipy
sklearn
numpy
- Intrinsic Properties used:
Distribution (Normal, Exponential, Binomial)
Mean
Median
Mode
Min
Max
Range (Max - Min)
IQR (Interquartile Range)
Skewness (Symmetry of the data distribution)
Kurtosis (Tails of the data distribution)
Outliers (Number of outliers)
Missing Values
Variance (Spread of the data distribution)
Std
Slope
Intercept (Y-Intercept)
Score (R^2 value)
Normality (Shapiro-Wilk test)
Correlation (Pearson correlation coefficient)
Confidence Interval (95% confidence interval)
Hypothesis Test (T-test)
- utilities.data_properties_utils.calculate_percentage_change(df)
Function to calculate percentage change between same attributes in different DataFrames.
- Parameters:
df (DataFrame) – DataFrame containing the results of the data properties analysis.
- Returns:
pct_change – DataFrame with the percentage change between the two DataFrames.
- Return type:
DataFrame
- utilities.data_properties_utils.calculate_stats(df)
Function to calculate the statistical properties of the data in the DataFrame.
- Parameters:
df (DataFrame) – DataFrame to be processed.
- Returns:
results – DataFrame with the results of the data properties analysis.
- Return type:
DataFrame
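An abbreviated sketch covering a few of the intrinsic properties listed above; the full function presumably computes all of them:

import pandas as pd
from scipy import stats

def calculate_stats(df):
    results = {}
    for col in df.select_dtypes(include="number").columns:
        s = df[col].dropna()
        results[col] = {
            "Mean": s.mean(),
            "Median": s.median(),
            "Std": s.std(),
            "Skewness": stats.skew(s),
            "Kurtosis": stats.kurtosis(s),
            # Shapiro-Wilk needs at least 3 observations.
            "Normality p-value": stats.shapiro(s)[1] if len(s) >= 3 else None,
        }
    return pd.DataFrame(results).T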
- utilities.data_properties_utils.data_properties(dfs)
Function to calculate the properties of the data in the DataFrames.
- Parameters:
dfs (list) – List of DataFrames to be processed.
- Returns:
all_stats_results – DataFrame with the results of the data properties analysis.
- Return type:
DataFrame
utilities.evaluate_models module
This module contains functions to evaluate the models used to generate synthetic data.
- Functions:
generate_data: Function to generate synthetic data using the given model.
find_best_model_parallel_generation: Function to find the best model for parallel generation.
find_best_model_parallel_imputation: Function to find the best model for parallel imputation.
evaluate_all_models: Function to evaluate the models.
- Dependencies:
numpy
pandas
concurrent.futures
traceback
streamlit
data_desc
data_properties_utils
KDE
ITS
Copula
MonteCarlo
Imputation
- utilities.evaluate_models.evaluate_all_models(original_data, generated_data_dict)
Function to evaluate the models.
Parameters:
- original_data : pandas DataFrame
Original data used to train the models.
- generated_data_dict : dict
Dictionary containing the generated data by the models.
Returns:
- str
Name of the best model based on the evaluation.
- utilities.evaluate_models.find_best_model_parallel_generation(original_data)
Function to find the best model for parallel generation.
Parameters:
- original_data : pandas DataFrame
Original data used to train the models.
Returns:
- str
Name of the best model for parallel generation.
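A sketch of the concurrent pattern implied by the name and the concurrent.futures dependency; the model_functions mapping and the evaluate callback are illustrative stand-ins for the module's hard-wired models (KDE, ITS, Copula, MonteCarlo) and evaluate_all_models:

import concurrent.futures

def find_best_model_parallel_generation(original_data, model_functions, evaluate):
    generated = {}
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {pool.submit(fn, original_data): name
                   for name, fn in model_functions.items()}
        for fut in concurrent.futures.as_completed(futures):
            generated[futures[fut]] = fut.result()
    # Pick the best model given original vs. generated data.
    return evaluate(original_data, generated)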
- utilities.evaluate_models.find_best_model_parallel_imputation(original_data)
Function to find the best model for parallel imputation.
Parameters:
- original_data : pandas DataFrame
Original data used to train the models.
Returns:
- str
Name of the best model for parallel imputation.
- utilities.evaluate_models.generate_data(model_function, original_data, **kwargs)
Function to generate synthetic data using the given model.
Parameters:
- model_function : function
Function to generate synthetic data.
- original_data : pandas DataFrame
Original data used to train the model.
- kwargs : dict
Additional arguments to pass to the model function.
Returns:
- pandas DataFrame
Synthetic data generated by the model.
Example:
>>> generate_data(generate_synthetic_data_for_KDE, original_data)