handyspark.sql package

Submodules

handyspark.sql.dataframe module

class handyspark.sql.dataframe.Bucket(colname, bins=5)[source]

Bases: object

Bucketizes a column of continuous values into equal-sized bins to perform stratification.

Parameters:
  • colname (string) – Column containing continuous values
  • bins (integer) – Number of equal-sized bins to map original values to.
Returns:

bucket – Bucket object to be used as column in stratification.

Return type:

Bucket

colname
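As an illustration, the equal-width bin edges Bucket relies on can be sketched in plain Python (a hypothetical helper for illustration only; the real Bucket computes its bins in Spark):

```python
# Illustrative sketch of equal-width binning (not handyspark's implementation).
def equal_width_edges(values, bins=5):
    """Return bins + 1 edges splitting [min, max] into equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(bins + 1)]

edges = equal_width_edges([0.0, 3.0, 7.0, 10.0], bins=5)
# edges -> [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
```

Each original value is then mapped to the interval it falls into, and those intervals become the strata.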
class handyspark.sql.dataframe.Handy(df)[source]

Bases: object

boxplot(colnames, ax=None, showfliers=True, k=1.5, **kwargs)[source]
classes
corr(colnames=None, method='pearson')[source]
disassemble(colname, new_colnames=None)[source]
fence(colnames, k=1.5)[source]
fences_
fill(*args, continuous=None, categorical=None, strategy=None)[source]
hist(colname, bins=10, ax=None, **kwargs)[source]
is_classification
isnull(ratio=False)[source]
max(colnames)[source]
mean(colnames)[source]
median(colnames)[source]
min(colnames)[source]
mode(colname)[source]
nclasses
ncols
nrows
nunique(colnames=None)[source]
outliers(colnames=None, ratio=False, method='tukey', **kwargs)[source]
q1(colnames)[source]
q3(colnames)[source]
response
scatterplot(colnames, ax=None, **kwargs)[source]
set_response(colname)[source]
shape
stages
statistics_
stddev(colnames)[source]
strata
to_metrics_RDD(prob_col, label)[source]
value_counts(colname, dropna=True)[source]
var(colnames)[source]
class handyspark.sql.dataframe.HandyColumns(df, handy, strata=None)[source]

Bases: object

HandyColumn(s) in a HandyFrame.

numerical

list of string – List of numerical columns (integer, float, double)

categorical

list of string – List of categorical columns (string, integer)

continuous

list of string – List of continuous columns (float, double)

string

list of string – List of string columns (string)

array

list of string – List of array columns (array, map)

array

Returns list of array or map columns in the HandyFrame.

boxplot(ax=None, showfliers=True, k=1.5)[source]

Makes a box plot from a HandyFrame column.

Parameters:
  • ax (matplotlib axes object, default None) –
  • showfliers (bool, optional (True)) – Show the outliers beyond the caps.
  • k (float, optional) – Constant multiplier for the IQR. Default is 1.5 (corresponding to Tukey’s “outlier” fences; use 3 for “far out” values)
categorical

Returns list of categorical columns in the HandyFrame.

continuous

Returns list of continuous columns in the HandyFrame.

corr(method='pearson')[source]

Compute pairwise correlation of columns, excluding NA/null values.

Parameters:method ({'pearson', 'spearman'}) –
  • pearson : standard correlation coefficient
  • spearman : Spearman rank correlation
Returns:corr
Return type:DataFrame
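The difference between the two methods can be illustrated with a small plain-Python sketch (handyspark computes correlations in Spark; the functions below are for illustration only and ignore rank ties):

```python
from math import sqrt

# Illustrative sketch of the two correlation methods (not handyspark's code).
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    # Spearman is Pearson applied to the ranks of the data (ties ignored here).
    def rank(v):
        order = sorted(v)
        return [order.index(a) + 1 for a in v]
    return pearson(rank(x), rank(y))

# A monotonic but non-linear relationship: Spearman sees it as perfect,
# Pearson does not.
x, y = [1.0, 2.0, 3.0], [1.0, 4.0, 9.0]
```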
hist(bins=10, ax=None)[source]

Draws a histogram of the HandyFrame’s column using matplotlib / pylab.

Parameters:
  • bins (integer, default 10) – Number of histogram bins to be used
  • ax (matplotlib axes object, default None) –
max()[source]
mean()[source]
median()[source]
min()[source]
mode()[source]

Returns same-type modal (most common) value for each column.

Returns:mode
Return type:Series
numerical

Returns list of numerical columns in the HandyFrame.

nunique()[source]

Return Series with number of distinct observations for specified columns.

Returns:nunique
Return type:Series
outliers(ratio=False, method='tukey')[source]

Return Series with number of outlier observations according to the specified method for all columns.

Parameters:
  • ratio (boolean, optional) – If True, returns proportion instead of counts. Default is False.
  • method (string, optional) – Method used to detect outliers. Currently, only Tukey’s method is supported. Default is tukey.
Returns:

outliers

Return type:

Series
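Tukey’s outlier count can be sketched in plain Python (illustrative only: handyspark computes quartiles with Spark’s approximate quantiles, and statistics.quantiles uses the “exclusive” convention, so exact cut points may differ):

```python
from statistics import quantiles

# Illustrative sketch of Tukey's outlier counting for a single column.
def tukey_outlier_count(values, k=1.5, ratio=False):
    q1, _, q3 = quantiles(values, n=4)      # first and third quartiles
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    n_out = sum(1 for v in values if v < lower or v > upper)
    return n_out / len(values) if ratio else n_out

data = [1, 2, 3, 4, 5, 6, 7, 8, 100]   # 100 falls beyond the upper fence
```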

q1()[source]
q3()[source]
scatterplot(ax=None)[source]

Makes a scatter plot of two HandyFrame columns.

Parameters:ax (matplotlib axes object, default None) –
stddev()[source]
string

Returns list of string columns in the HandyFrame.

value_counts(dropna=True)[source]

Returns object containing counts of unique values.

The resulting object will be in descending order so that the first element is the most frequently occurring element. Excludes NA values by default.

Parameters:dropna (boolean, default True) – Don’t include counts of missing values.
Returns:counts
Return type:Series
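The semantics can be sketched with collections.Counter (illustrative only; the actual counting happens in Spark):

```python
from collections import Counter

# Illustrative sketch of value_counts: descending counts, with missing
# values excluded when dropna=True.
def value_counts(values, dropna=True):
    if dropna:
        values = [v for v in values if v is not None]
    return Counter(values).most_common()   # (value, count) pairs, descending

counts = value_counts(["a", "b", "a", None, "a"])
# counts -> [("a", 3), ("b", 1)]
```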
var()[source]
class handyspark.sql.dataframe.HandyFrame(df, handy=None)[source]

Bases: pyspark.sql.dataframe.DataFrame

HandySpark version of DataFrame.

cols

HandyColumns – class to access pandas-like column based methods implemented in Spark

pandas

HandyPandas – class to access pandas-like column based methods through pandas UDFs

transformers

HandyTransformers – class to generate Handy transformers

stages

integer – number of stages in the execution plan

response

string – name of the response column

is_classification

boolean – True if response is a categorical variable

classes

list – list of classes for a classification problem

nclasses

integer – number of classes for a classification problem

ncols

integer – number of columns of the HandyFrame

nrows

integer – number of rows of the HandyFrame

shape

tuple – tuple representing dimensionality of the HandyFrame

statistics_

dict – imputation fill value for each feature. If stratified, first level keys are filter clauses for stratification

fences_

dict – fence values for each feature. If stratified, first level keys are filter clauses for stratification

is_stratified

boolean – True if HandyFrame was stratified

values

ndarray – Numpy representation of HandyFrame.

Available methods
- notHandy

makes it a plain Spark dataframe

- stratify

used to perform stratified operations

- isnull

checks for missing values

- fill

fills missing values

- outliers

checks for outliers

- fence

fences outliers

- set_safety_limit

defines new safety limit for collect operations

- safety_off

disables safety limit for a single operation

- assign

appends new columns based on expressions

- nunique

returns number of unique values in each column

- set_response

sets column to be used as response / label

- disassemble

turns a vector / array column into multiple columns

- to_metrics_RDD

turns probability and label columns into a tuple RDD

apply(f, name=None, args=None, returnType=None)[source]

INTERNAL USE

assign(**kwargs)[source]

Assign new columns to a HandyFrame, returning a new object (a copy) with all the original columns in addition to the new ones.

Parameters:kwargs (keyword, value pairs) – keywords are the column names. If the values are callable, they are computed on the DataFrame and assigned to the new columns. If the values are not callable (e.g., a scalar or string), they are simply assigned.
Returns:df – A new HandyFrame with the new columns in addition to all the existing columns.
Return type:HandyFrame
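The callable-versus-scalar behavior can be sketched on a toy columns-as-lists frame (a hypothetical helper; HandyFrame.assign operates on Spark columns):

```python
# Illustrative sketch of assign semantics on a dict-of-lists "frame".
def assign(frame, **kwargs):
    new = dict(frame)                        # copy: original columns are kept
    n = len(next(iter(frame.values())))
    for name, value in kwargs.items():
        if callable(value):
            new[name] = value(new)           # computed on the frame
        else:
            new[name] = [value] * n          # scalar broadcast to every row
    return new

df = {"a": [1, 2, 3]}
df2 = assign(df, b=lambda d: [v * 10 for v in d["a"]], c=0)
```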
classes

Returns list of classes for a classification problem.

collect()[source]

Returns all the records as a list of Row.

By default, its output is limited by the safety limit. To get original collect behavior, call safety_off method first.

cols

Returns a class to access pandas-like column based methods implemented in Spark

Available methods: - min - max - median - q1 - q3 - stddev - value_counts - mode - corr - nunique - hist - boxplot - scatterplot

disassemble(colname, new_colnames=None)[source]

Disassembles a Vector or Array column into multiple columns.

Parameters:
  • colname (string) – Column containing Vector or Array elements.
  • new_colnames (list of string, optional) – Default is None; column names are generated by appending a sequential suffix (e.g., _0, _1, etc.) to colname. If informed, it must have as many column names as elements in the shortest vector/array of colname.
Returns:

df – A new HandyFrame with the new disassembled columns in addition to all the existing columns.

Return type:

HandyFrame
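The disassembling can be sketched on a toy rows-of-dicts representation (illustrative; the real method works on Spark Vector/Array columns):

```python
# Illustrative sketch of disassemble: spread an array column into columns.
def disassemble(rows, colname, new_colnames=None):
    n = min(len(r[colname]) for r in rows)   # shortest array bounds the output
    names = new_colnames or [f"{colname}_{i}" for i in range(n)]
    out = []
    for r in rows:
        new_row = dict(r)                    # original columns are kept
        for i, name in enumerate(names):
            new_row[name] = r[colname][i]
        out.append(new_row)
    return out

rows = [{"v": [1.0, 2.0]}, {"v": [3.0, 4.0]}]
result = disassemble(rows, "v")
# result[0] -> {"v": [1.0, 2.0], "v_0": 1.0, "v_1": 2.0}
```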

fence(colnames, k=1.5)[source]

Caps outliers using the lower and upper fences given by Tukey’s method: k times the interquartile range (IQR) below the first quartile and above the third quartile (k=1.5 by default).

The fence values used for capping outliers are kept in fences_ property and can later be used to generate a corresponding HandyFencer transformer.

For more information, check: https://en.wikipedia.org/wiki/Outlier#Tukey's_fences

Parameters:
  • colnames (list of string) – Column names to apply fencing.
  • k (float, optional) – Constant multiplier for the IQR. Default is 1.5 (corresponding to Tukey’s “outlier” fences; use 3 for “far out” values)
Returns:

df – A new HandyFrame with capped outliers.

Return type:

HandyFrame
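The capping that fence applies can be sketched in plain Python (an illustrative approximation: handyspark computes the fences with Spark’s approximate quantiles, so exact values may differ):

```python
from statistics import quantiles

# Illustrative sketch of Tukey fencing: cap values at the fences, don't drop.
def fence(values, k=1.5):
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]

data = [1, 2, 3, 4, 5, 6, 7, 8, 100]   # 100 gets capped at the upper fence
```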

fences_

Returns dictionary with fence values for each feature. If stratified, first level keys are filter clauses for stratification.

fill(*args, categorical=None, continuous=None, strategy=None)[source]

Fill NA/NaN values using the specified methods.

The values used for imputation are kept in statistics_ property and can later be used to generate a corresponding HandyImputer transformer.

Parameters:
  • categorical ('all' or list of string, optional) – List of categorical columns. These columns are filled with their corresponding modes (most common values).
  • continuous ('all' or list of string, optional) – List of continuous value columns. By default, these columns are filled with their corresponding means. If a same-sized list is provided in the strategy argument, it uses the corresponding strategy for each column.
  • strategy (list of string, optional) – If informed, it must contain a strategy - either mean or median - for each one of the continuous columns.
Returns:

df – A new HandyFrame with filled missing values.

Return type:

HandyFrame
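The imputation strategies can be sketched per column in plain Python (illustrative helpers; handyspark fits the fill values in Spark and stores them in statistics_):

```python
from statistics import mean, median, mode

# Illustrative sketch of fill: continuous columns use mean or median,
# categorical columns use the mode.
def fill_continuous(values, strategy="mean"):
    observed = [v for v in values if v is not None]
    stat = mean(observed) if strategy == "mean" else median(observed)
    return [stat if v is None else v for v in values]

def fill_categorical(values):
    observed = [v for v in values if v is not None]
    fill = mode(observed)                  # most common observed value
    return [fill if v is None else v for v in values]
```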

is_classification

Returns True if response is a categorical variable.

isnull(ratio=False)[source]

Returns array with counts of missing values for each column in the HandyFrame.

Parameters:ratio (boolean, default False) – If True, returns ratios instead of absolute counts.
Returns:counts
Return type:Series
nclasses

Returns the number of classes for a classification problem.

ncols

Returns the number of columns of the HandyFrame.

notHandy()[source]

Converts HandyFrame back into Spark’s DataFrame

nrows

Returns the number of rows of the HandyFrame.

nunique()[source]

Return Series with number of distinct observations for all columns.

Returns:nunique
Return type:Series
outliers(ratio=False, method='tukey', **kwargs)[source]

Return Series with number of outlier observations according to the specified method for all columns.

Parameters:
  • ratio (boolean, optional) – If True, returns proportion instead of counts. Default is False.
  • method (string, optional) – Method used to detect outliers. Currently, only Tukey’s method is supported. Default is tukey.
Returns:

outliers

Return type:

Series

pandas

Returns a class to access pandas-like column based methods through pandas UDFs

Available methods: - between / between_time - isin - isna / isnull - notna / notnull - abs - clip / clip_lower / clip_upper - replace - round / truncate - tz_convert / tz_localize

response

Returns the name of the response column.

safety_off()[source]

Disables safety limit for a single call of collect method.

set_response(colname)[source]

Sets column to be used as response in supervised learning algorithms.

Parameters:colname (string) –
Returns:
Return type:self
set_safety_limit(limit)[source]

Sets safety limit used for collect method.

shape

Return a tuple representing the dimensionality of the HandyFrame.

stages

Returns the number of stages in the execution plan.

statistics_

Returns dictionary with imputation fill value for each feature. If stratified, first level keys are filter clauses for stratification.

stratify(strata)[source]

Stratify the HandyFrame.

Stratified operations should be more efficient than group by operations, as they rely on three steps: filtering the underlying HandyFrame, performing the operation, and aggregating the results.
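Those three steps can be sketched in plain Python for a mean over strata (illustrative only; handyspark performs them with Spark filters):

```python
from statistics import mean

# Illustrative sketch of a stratified operation: filter, operate, aggregate.
def stratified_mean(rows, strata_col, value_col):
    results = {}
    for stratum in sorted({r[strata_col] for r in rows}):
        subset = [r for r in rows if r[strata_col] == stratum]   # 1. filter
        results[stratum] = mean(r[value_col] for r in subset)    # 2. operate
    return results                                               # 3. aggregate

rows = [{"g": "a", "x": 1}, {"g": "a", "x": 3}, {"g": "b", "x": 10}]
```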

take(num)[source]

Returns the first num rows as a list of Row.

to_metrics_RDD(prob_col='probability', label_col='label')[source]

Converts a DataFrame containing predicted probabilities and classification labels into an RDD suited for use with a BinaryClassificationMetrics object.

Parameters:
  • prob_col (string, optional) – Column containing Vectors of probabilities. Default is ‘probability’.
  • label_col (string, optional) – Column containing labels. Default is ‘label’.
Returns:

rdd – RDD of tuples (probability, label)

Return type:

RDD
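The shape of the resulting pairs can be sketched in plain Python. One assumption is made here: as is conventional for Spark ML binary classifiers, index 1 of the probability vector holds the positive-class probability:

```python
# Illustrative sketch of the (probability, label) pairs the RDD carries.
def to_metrics_pairs(rows, prob_col="probability", label_col="label"):
    # Keep the positive-class probability together with the label.
    return [(r[prob_col][1], float(r[label_col])) for r in rows]

rows = [
    {"probability": [0.8, 0.2], "label": 0},
    {"probability": [0.1, 0.9], "label": 1},
]
pairs = to_metrics_pairs(rows)
# pairs -> [(0.2, 0.0), (0.9, 1.0)]
```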

transform(f, name=None, args=None, returnType=None)[source]

INTERNAL USE

transformers

Returns a class to generate Handy transformers

Available transformers: - HandyImputer - HandyFencer

values

Numpy representation of HandyFrame.

class handyspark.sql.dataframe.HandyGrouped(jgd, df, *args)[source]

Bases: pyspark.sql.group.GroupedData

agg(*exprs)[source]
class handyspark.sql.dataframe.HandyStrata(handy, strata)[source]

Bases: object

class handyspark.sql.dataframe.Quantile(colname, bins=5)[source]

Bases: handyspark.sql.dataframe.Bucket

Bucketizes a column of continuous values into quantiles to perform stratification.

Parameters:
  • colname (string) – Column containing continuous values
  • bins (integer) – Number of quantiles to map original values to.
Returns:

quantile – Quantile object to be used as column in stratification.

Return type:

Quantile
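In contrast to Bucket’s equal-width edges, quantile-based edges can be sketched with statistics.quantiles (illustrative; Spark’s approximate quantiles may produce slightly different cut points):

```python
from statistics import quantiles

# Illustrative sketch of quantile binning: cut points come from the data's
# distribution, so each bin holds roughly the same number of observations.
def quantile_edges(values, bins=5):
    inner = quantiles(values, n=bins)          # bins - 1 interior cut points
    return [min(values)] + inner + [max(values)]

edges = quantile_edges(list(range(1, 101)), bins=4)
# edges -> [1, 25.25, 50.5, 75.75, 100]
```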

handyspark.sql.dataframe.notHandy(self)[source]
handyspark.sql.dataframe.toHandy(self)[source]

Converts Spark DataFrame into HandyFrame.

handyspark.sql.datetime module

class handyspark.sql.datetime.HandyDatetime(df, colname)[source]

Bases: object

handyspark.sql.pandas module

class handyspark.sql.pandas.HandyPandas(df)[source]

Bases: object

dt

Returns a class to access pandas-like datetime column based methods through pandas UDFs

Available methods: - is_leap_year / is_month_end / is_month_start / is_quarter_end / is_quarter_start / is_year_end / is_year_start - strftime - tz / time / tz_convert / tz_localize - day / dayofweek / dayofyear / days_in_month / daysinmonth - hour / microsecond / minute / nanosecond / second - week / weekday / weekday_name - month / quarter / year / weekofyear - date - ceil / floor / round - normalize

str

Returns a class to access pandas-like string column based methods through pandas UDFs

Available methods: - contains - startswith / endswith - match - isalpha / isnumeric / isalnum / isdigit / isdecimal / isspace - islower / isupper / istitle - replace - repeat - join - pad - slice / slice_replace - strip / lstrip / rstrip - wrap / center / ljust / rjust - translate - get - normalize - lower / upper / capitalize / swapcase / title - zfill - count - find / rfind - len

handyspark.sql.schema module

handyspark.sql.schema.generate_schema(colnames, coltypes, nullables=None)[source]
Parameters:
  • colnames (list of string) –
  • coltypes (list of type) –
  • nullables (list of boolean, optional) –
Returns:

schema – Spark DataFrame schema corresponding to Python/numpy types.

Return type:

StructType
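The Python-type to Spark-type correspondence such a function relies on can be sketched as a lookup table. Note the type names below are an assumption for illustration, not handyspark’s actual mapping:

```python
# Hypothetical sketch of a Python-type -> Spark SQL type-name mapping
# (the names are assumptions; generate_schema builds a real StructType).
TYPE_MAP = {int: "bigint", float: "double", str: "string", bool: "boolean"}

def sketch_schema(colnames, coltypes, nullables=None):
    nullables = nullables or [True] * len(colnames)   # nullable by default
    return [
        (name, TYPE_MAP[ctype], nullable)
        for name, ctype, nullable in zip(colnames, coltypes, nullables)
    ]

schema = sketch_schema(["age", "name"], [int, str])
# schema -> [("age", "bigint", True), ("name", "string", True)]
```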

handyspark.sql.string module

class handyspark.sql.string.HandyString(df, colname)[source]

Bases: object

remove_accents()[source]

handyspark.sql.transform module

class handyspark.sql.transform.HandyTransform[source]

Bases: object

static apply(sdf, f, name=None, args=None, returnType=None)[source]
static assign(sdf, **kwargs)[source]
static gen_grouped_pandas_udf(sdf, f, args=None, returnType=None)[source]
static gen_pandas_udf(f, args=None, returnType=None)[source]
static transform(sdf, f, name=None, args=None, returnType=None)[source]

Module contents

class handyspark.sql.HandyFrame(df, handy=None)[source]

Bases: pyspark.sql.dataframe.DataFrame

HandySpark version of DataFrame.

cols

HandyColumns – class to access pandas-like column based methods implemented in Spark

pandas

HandyPandas – class to access pandas-like column based methods through pandas UDFs

transformers

HandyTransformers – class to generate Handy transformers

stages

integer – number of stages in the execution plan

response

string – name of the response column

is_classification

boolean – True if response is a categorical variable

classes

list – list of classes for a classification problem

nclasses

integer – number of classes for a classification problem

ncols

integer – number of columns of the HandyFrame

nrows

integer – number of rows of the HandyFrame

shape

tuple – tuple representing dimensionality of the HandyFrame

statistics_

dict – imputation fill value for each feature. If stratified, first level keys are filter clauses for stratification

fences_

dict – fence values for each feature. If stratified, first level keys are filter clauses for stratification

is_stratified

boolean – True if HandyFrame was stratified

values

ndarray – Numpy representation of HandyFrame.

Available methods
- notHandy

makes it a plain Spark dataframe

- stratify

used to perform stratified operations

- isnull

checks for missing values

- fill

fills missing values

- outliers

checks for outliers

- fence

fences outliers

- set_safety_limit

defines new safety limit for collect operations

- safety_off

disables safety limit for a single operation

- assign

appends new columns based on expressions

- nunique

returns number of unique values in each column

- set_response

sets column to be used as response / label

- disassemble

turns a vector / array column into multiple columns

- to_metrics_RDD

turns probability and label columns into a tuple RDD

apply(f, name=None, args=None, returnType=None)[source]

INTERNAL USE

assign(**kwargs)[source]

Assign new columns to a HandyFrame, returning a new object (a copy) with all the original columns in addition to the new ones.

Parameters:kwargs (keyword, value pairs) – keywords are the column names. If the values are callable, they are computed on the DataFrame and assigned to the new columns. If the values are not callable (e.g., a scalar or string), they are simply assigned.
Returns:df – A new HandyFrame with the new columns in addition to all the existing columns.
Return type:HandyFrame
classes

Returns list of classes for a classification problem.

collect()[source]

Returns all the records as a list of Row.

By default, its output is limited by the safety limit. To get original collect behavior, call safety_off method first.

cols

Returns a class to access pandas-like column based methods implemented in Spark

Available methods: - min - max - median - q1 - q3 - stddev - value_counts - mode - corr - nunique - hist - boxplot - scatterplot

disassemble(colname, new_colnames=None)[source]

Disassembles a Vector or Array column into multiple columns.

Parameters:
  • colname (string) – Column containing Vector or Array elements.
  • new_colnames (list of string, optional) – Default is None; column names are generated by appending a sequential suffix (e.g., _0, _1, etc.) to colname. If informed, it must have as many column names as elements in the shortest vector/array of colname.
Returns:

df – A new HandyFrame with the new disassembled columns in addition to all the existing columns.

Return type:

HandyFrame

fence(colnames, k=1.5)[source]

Caps outliers using the lower and upper fences given by Tukey’s method: k times the interquartile range (IQR) below the first quartile and above the third quartile (k=1.5 by default).

The fence values used for capping outliers are kept in fences_ property and can later be used to generate a corresponding HandyFencer transformer.

For more information, check: https://en.wikipedia.org/wiki/Outlier#Tukey's_fences

Parameters:
  • colnames (list of string) – Column names to apply fencing.
  • k (float, optional) – Constant multiplier for the IQR. Default is 1.5 (corresponding to Tukey’s “outlier” fences; use 3 for “far out” values)
Returns:

df – A new HandyFrame with capped outliers.

Return type:

HandyFrame

fences_

Returns dictionary with fence values for each feature. If stratified, first level keys are filter clauses for stratification.

fill(*args, categorical=None, continuous=None, strategy=None)[source]

Fill NA/NaN values using the specified methods.

The values used for imputation are kept in statistics_ property and can later be used to generate a corresponding HandyImputer transformer.

Parameters:
  • categorical ('all' or list of string, optional) – List of categorical columns. These columns are filled with their corresponding modes (most common values).
  • continuous ('all' or list of string, optional) – List of continuous value columns. By default, these columns are filled with their corresponding means. If a same-sized list is provided in the strategy argument, it uses the corresponding strategy for each column.
  • strategy (list of string, optional) – If informed, it must contain a strategy - either mean or median - for each one of the continuous columns.
Returns:

df – A new HandyFrame with filled missing values.

Return type:

HandyFrame

is_classification

Returns True if response is a categorical variable.

isnull(ratio=False)[source]

Returns array with counts of missing values for each column in the HandyFrame.

Parameters:ratio (boolean, default False) – If True, returns ratios instead of absolute counts.
Returns:counts
Return type:Series
nclasses

Returns the number of classes for a classification problem.

ncols

Returns the number of columns of the HandyFrame.

notHandy()[source]

Converts HandyFrame back into Spark’s DataFrame

nrows

Returns the number of rows of the HandyFrame.

nunique()[source]

Return Series with number of distinct observations for all columns.

Returns:nunique
Return type:Series
outliers(ratio=False, method='tukey', **kwargs)[source]

Return Series with number of outlier observations according to the specified method for all columns.

Parameters:
  • ratio (boolean, optional) – If True, returns proportion instead of counts. Default is False.
  • method (string, optional) – Method used to detect outliers. Currently, only Tukey’s method is supported. Default is tukey.
Returns:

outliers

Return type:

Series

pandas

Returns a class to access pandas-like column based methods through pandas UDFs

Available methods: - between / between_time - isin - isna / isnull - notna / notnull - abs - clip / clip_lower / clip_upper - replace - round / truncate - tz_convert / tz_localize

response

Returns the name of the response column.

safety_off()[source]

Disables safety limit for a single call of collect method.

set_response(colname)[source]

Sets column to be used as response in supervised learning algorithms.

Parameters:colname (string) –
Returns:
Return type:self
set_safety_limit(limit)[source]

Sets safety limit used for collect method.

shape

Return a tuple representing the dimensionality of the HandyFrame.

stages

Returns the number of stages in the execution plan.

statistics_

Returns dictionary with imputation fill value for each feature. If stratified, first level keys are filter clauses for stratification.

stratify(strata)[source]

Stratify the HandyFrame.

Stratified operations should be more efficient than group by operations, as they rely on three steps: filtering the underlying HandyFrame, performing the operation, and aggregating the results.

take(num)[source]

Returns the first num rows as a list of Row.

to_metrics_RDD(prob_col='probability', label_col='label')[source]

Converts a DataFrame containing predicted probabilities and classification labels into an RDD suited for use with a BinaryClassificationMetrics object.

Parameters:
  • prob_col (string, optional) – Column containing Vectors of probabilities. Default is ‘probability’.
  • label_col (string, optional) – Column containing labels. Default is ‘label’.
Returns:

rdd – RDD of tuples (probability, label)

Return type:

RDD

transform(f, name=None, args=None, returnType=None)[source]

INTERNAL USE

transformers

Returns a class to generate Handy transformers

Available transformers: - HandyImputer - HandyFencer

values

Numpy representation of HandyFrame.

class handyspark.sql.Bucket(colname, bins=5)[source]

Bases: object

Bucketizes a column of continuous values into equal-sized bins to perform stratification.

Parameters:
  • colname (string) – Column containing continuous values
  • bins (integer) – Number of equal-sized bins to map original values to.
Returns:

bucket – Bucket object to be used as column in stratification.

Return type:

Bucket

colname
class handyspark.sql.Quantile(colname, bins=5)[source]

Bases: handyspark.sql.dataframe.Bucket

Bucketizes a column of continuous values into quantiles to perform stratification.

Parameters:
  • colname (string) – Column containing continuous values
  • bins (integer) – Number of quantiles to map original values to.
Returns:

quantile – Quantile object to be used as column in stratification.

Return type:

Quantile

handyspark.sql.generate_schema(colnames, coltypes, nullables=None)[source]
Parameters:
  • colnames (list of string) –
  • coltypes (list of type) –
  • nullables (list of boolean, optional) –
Returns:

schema – Spark DataFrame schema corresponding to Python/numpy types.

Return type:

StructType