Returns the Axes object with the plot for further tweaking.

missingval_plot(data, cmap: str = 'PuBuGn', figsize: Tuple = (20, 20), sort: bool = False, spine_color: str = '#EEEEEE')

Two-dimensional visualization of the missing values in a dataset.
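As a rough illustration of what a two-dimensional view of missing values summarizes, the sketch below counts missing entries per column and per row in plain Python. It is not klib's implementation: the `missing_summary` helper and the `rows` sample data are hypothetical names introduced for illustration, and `None` stands in for a missing value.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[float]]


def missing_summary(rows: List[Row]) -> Tuple[Dict[str, int], List[int]]:
    """Count missing (None) entries per column and per row.

    Hypothetical helper for illustration only; not part of klib.
    """
    columns = list(rows[0].keys()) if rows else []
    per_column = {col: 0 for col in columns}
    per_row = []
    for row in rows:
        missing_in_row = 0
        for col in columns:
            if row.get(col) is None:
                per_column[col] += 1
                missing_in_row += 1
        per_row.append(missing_in_row)
    return per_column, per_row


# Made-up sample data: three records, three features.
rows = [
    {"age": 31.0, "income": None, "score": 0.7},
    {"age": None, "income": None, "score": 0.4},
    {"age": 45.0, "income": 52000.0, "score": None},
]

per_column, per_row = missing_summary(rows)
print(per_column)  # missing entries for each feature
print(per_row)     # missing entries for each record
```

The two results correspond to the two axes of such a plot: one summary along the features and one along the records.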
    klib.cat_plot(data, top=4, bottom=4)  # representation of the 4 most & least common values in each categorical column

Further examples, as well as applications of the functions in klib.clean(), can be found here.

Pull requests and ideas, especially for further functions, are welcome. For major changes or feedback, please open an issue first to discuss what you would like to change.

Functions for descriptive analytics.

cat_plot

Two-dimensional visualization of the number and frequency of categorical features.

Parameters:
    data
        If a Pandas DataFrame is provided, the index/column information is used to label the plots
    figsize : Tuple, optional
        Use to control the figure size, by default (18, 18)
    top : int, optional
        Show the "top" most frequent values in a column, by default 3
    bottom : int, optional
        Show the "bottom" most frequent values in a column, by default 3
    bar_color_top : str, optional
        Use to control the color of the bars indicating the most common values, by default "#5ab4ac"
    bar_color_bottom : str, optional
        Use to control the color of the bars indicating the least common values, by default "#d8b365"

corr_mat(data, split: Optional = None, threshold: float = 0, target: Union = None, method: str = 'pearson', colored: bool = True) → Union

Returns a color-encoded correlation matrix.

Parameters:
    data
        2D dataset that can be coerced into Pandas DataFrame. If a Pandas DataFrame is provided, the index/column information is used to label the plots
    split : Optional, optional
        Type of split to be performed, by default None
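To make the `top` and `bottom` parameters of cat_plot more concrete, here is a small sketch of what the most and least frequent values of a categorical column are, using `collections.Counter`. It only illustrates the idea and is not how klib computes or draws the plot; the helper `top_bottom_values` and the sample column are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple

ValueCount = Tuple[Hashable, int]


def top_bottom_values(
    values: Sequence[Hashable], top: int = 3, bottom: int = 3
) -> Tuple[List[ValueCount], List[ValueCount]]:
    """Return the `top` most and `bottom` least frequent values with counts.

    Hypothetical helper for illustration only; not part of klib.
    """
    # most_common() sorts by descending frequency.
    counts = Counter(values).most_common()
    most_common = counts[:top] if top > 0 else []
    least_common = counts[-bottom:] if bottom > 0 else []
    return most_common, least_common


# Made-up categorical column.
colors = ["red"] * 5 + ["blue"] * 3 + ["green"] * 2 + ["black"]

most, least = top_bottom_values(colors, top=2, bottom=2)
print(most)
print(least)
```

In this sketch, a column with fewer distinct values than `top + bottom` yields overlapping lists, since the same value can be among both the most and the least frequent.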
Usage

    import klib
    import pandas as pd

    df = pd.DataFrame(data)

    # klib.describe - functions for visualizing datasets
    klib.cat_plot(df)  # returns a visualization of the number and frequency of categorical features
    klib.corr_mat(df)  # returns a color-encoded correlation matrix
    klib.corr_plot(df)  # returns a color-encoded heatmap, ideal for correlations
    klib.dist_plot(df)  # returns a distribution plot for every numeric feature
    klib.missingval_plot(df)  # returns a figure containing information about missing values

    # klib.clean - functions for cleaning datasets
    klib.data_cleaning(df)  # performs data cleaning (drop duplicates & empty rows/cols, adjust dtypes, ...)
    klib.clean_column_names(df)  # cleans and standardizes column names, also called inside data_cleaning()
    klib.convert_datatypes(df)  # converts existing to more efficient dtypes, also called inside data_cleaning()
    klib.drop_missing(df)  # drops missing values, also called in data_cleaning()
    klib.mv_col_handling(df)  # drops features with high ratio of missing vals based on informational content
    klib.pool_duplicate_subsets(df)  # pools subset of cols based on duplicates with min. loss of information

Examples

Find all available examples as well as applications of the functions in klib.clean() with detailed descriptions here.

    klib.missingval_plot(df)  # default representation of missing values in a DataFrame, plenty of settings are available
    klib.corr_plot(df, split='pos')  # displaying only positive correlations, other settings include threshold, cmap
    klib.corr_plot(df, split='neg')  # displaying only negative correlations
    klib.corr_plot(df, target='wine')  # default representation of correlations with the feature column
    klib.dist_plot(df)  # default representation of a distribution plot, other settings include fill_range, histogram
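The `split='pos'` and `split='neg'` options of corr_plot restrict the view to positive or negative correlations. A minimal sketch of that filtering idea in plain Python might look as follows. It is not klib's code: `pearson` and `split_correlations` are hypothetical helpers, the sample columns are made up, and the `threshold` setting is not modelled.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Optional, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def split_correlations(
    columns: Dict[str, List[float]], split: Optional[str] = None
) -> Dict[Tuple[str, str], float]:
    """Pairwise correlations, optionally keeping only one sign.

    Hypothetical helper: split=None keeps every pair, "pos" keeps only
    positive and "neg" only negative correlations.
    """
    result = {}
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if split == "pos" and r <= 0:
            continue
        if split == "neg" and r >= 0:
            continue
        result[(name_a, name_b)] = r
    return result


# Made-up numeric features.
columns = {
    "alcohol": [1.0, 2.0, 3.0, 4.0],
    "sugar": [2.0, 4.0, 6.0, 8.0],
    "acidity": [4.0, 3.0, 2.0, 1.0],
}

positive = split_correlations(columns, split="pos")
negative = split_correlations(columns, split="neg")
print(sorted(positive))
print(sorted(negative))
```

Filtering by sign keeps a dense correlation matrix readable when only one direction of association is of interest.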
klib is a Python library for importing, cleaning, analyzing and preprocessing data. Explanations on key functionalities can be found on Medium / TowardsDataScience, in the examples section, or on YouTube (Data Professor).

Use the package manager pip to install klib:

    pip install klib

Alternatively, the package can be installed with conda.