handyspark.sql package
Submodules
handyspark.sql.dataframe module
class handyspark.sql.dataframe.Bucket(colname, bins=5)
Bases: object
Bucketizes a column of continuous values into equal-sized bins to perform stratification.
Parameters:
- colname (string) – Column containing continuous values.
- bins (integer) – Number of equal-sized bins to map original values to.
Returns: bucket – Bucket object to be used as a column in stratification.
Return type: Bucket

Properties:
- colname
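Conceptually, an equal-width bucketizer maps each value to one of `bins` intervals spanning the column's range. A minimal pure-Python sketch of this idea, for illustration only (handyspark performs the binning in Spark, and edge-case handling may differ):

```python
def bucketize(values, bins=5):
    """Map each continuous value to an equal-width bin index in [0, bins-1]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    # The maximum value is clamped into the last bin instead of opening a new one.
    return [min(int((v - lo) / width), bins - 1) for v in values]

buckets = bucketize([0.0, 1.0, 4.9, 5.0, 9.9, 10.0], bins=5)
# -> [0, 0, 2, 2, 4, 4]; each index identifies the stratum a row belongs to
```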
class handyspark.sql.dataframe.Handy(df)
Bases: object

Properties:
- classes
- fences_
- is_classification
- nclasses
- ncols
- nrows
- response
- shape
- stages
- statistics_
- strata
class handyspark.sql.dataframe.HandyColumns(df, handy, strata=None)
Bases: object
HandyColumn(s) in a HandyFrame.

Attributes:
- numerical (list of string) – List of numerical columns (integer, float, double)
- categorical (list of string) – List of categorical columns (string, integer)
- continuous (list of string) – List of continuous columns (float, double)
- string (list of string) – List of string columns (string)
- array (list of string) – List of array columns (array, map)
array
Returns list of array or map columns in the HandyFrame.

boxplot(ax=None, showfliers=True, k=1.5)
Makes a box plot from a HandyFrame column.
Parameters:
- ax (matplotlib axes object, default None)
- showfliers (boolean, default True)
- k (float, default 1.5)

categorical
Returns list of categorical columns in the HandyFrame.

continuous
Returns list of continuous columns in the HandyFrame.

corr(method='pearson')
Compute pairwise correlation of columns, excluding NA/null values.
Parameters: method ({'pearson', 'spearman'}) –
- pearson: standard correlation coefficient
- spearman: Spearman rank correlation
Returns: y
Return type: DataFrame
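For reference, the standard (Pearson) correlation coefficient between two equally sized numeric sequences can be sketched in plain Python as below. This is only an illustration of the statistic; corr computes it distributed in Spark:

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equally sized sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = pearson([1, 2, 3, 4], [2, 4, 6, 8])  # perfectly linear -> 1.0 (up to float rounding)
```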
hist(bins=10, ax=None)
Draws a histogram of the HandyFrame's column using matplotlib / pylab.
Parameters:
- bins (integer, default 10) – Number of histogram bins to be used.
- ax (matplotlib axes object, default None)

mode()
Returns same-type modal (most common) value for each column.
Returns: mode
Return type: Series

numerical
Returns list of numerical columns in the HandyFrame.

nunique()
Return Series with number of distinct observations for specified columns.
Returns: nunique
Return type: Series

outliers(ratio=False, method='tukey')
Return Series with the number of outlier observations for all columns, according to the specified method.
Parameters:
- ratio (boolean, optional) – If True, returns proportions instead of counts. Default is False.
- method (string, optional) – Method used to detect outliers. Currently, only Tukey's method is supported. Default is 'tukey'.
Returns: outliers
Return type: Series
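Tukey's method flags values outside the fences [Q1 - k*IQR, Q3 + k*IQR]. A minimal pure-Python sketch of the counting logic, assuming linearly interpolated quantiles (handyspark computes quantiles in Spark, typically approximately, so results on real data may differ slightly):

```python
def tukey_outliers(values, k=1.5):
    """Count values falling outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    s = sorted(values)

    def quantile(q):
        # Linear interpolation between closest ranks.
        pos = q * (len(s) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (pos - lo)

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return sum(1 for v in values if v < lower or v > upper)

n = tukey_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # only 100 is flagged -> 1
```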
scatterplot(ax=None)
Makes a scatter plot of two HandyFrame columns.
Parameters: ax (matplotlib axes object, default None)

string
Returns list of string columns in the HandyFrame.

value_counts(dropna=True)
Returns object containing counts of unique values.
The resulting object is in descending order, so the first element is the most frequently occurring one. Excludes NA values by default.
Parameters: dropna (boolean, default True) – Don't include counts of missing values.
Returns: counts
Return type: Series
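The semantics mirror pandas' value_counts. A small pure-Python sketch over a list, with None standing in for missing values (for illustration only; the real method runs on a Spark column):

```python
from collections import Counter

def value_counts(values, dropna=True):
    """Counts of unique values in descending order; None is treated as missing."""
    if dropna:
        values = [v for v in values if v is not None]
    return Counter(values).most_common()

counts = value_counts(["a", "b", "a", None, "a", "b"])
# -> [('a', 3), ('b', 2)]
```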
class handyspark.sql.dataframe.HandyFrame(df, handy=None)
Bases: pyspark.sql.dataframe.DataFrame
HandySpark version of DataFrame.

Attributes:
- cols (HandyColumns) – class to access pandas-like column-based methods implemented in Spark
- pandas (HandyPandas) – class to access pandas-like column-based methods through pandas UDFs
- transformers (HandyTransformers) – class to generate Handy transformers
- stages (integer) – number of stages in the execution plan
- response (string) – name of the response column
- is_classification (boolean) – True if response is a categorical variable
- classes (list) – list of classes for a classification problem
- nclasses (integer) – number of classes for a classification problem
- ncols (integer) – number of columns of the HandyFrame
- nrows (integer) – number of rows of the HandyFrame
- shape (tuple) – tuple representing dimensionality of the HandyFrame
- statistics_ (dict) – imputation fill value for each feature; if stratified, first-level keys are filter clauses for stratification
- fences_ (dict) – fence values for each feature; if stratified, first-level keys are filter clauses for stratification
- is_stratified (boolean) – True if the HandyFrame was stratified
- values (ndarray) – NumPy representation of the HandyFrame
Available methods:
- notHandy: makes it a plain Spark DataFrame
- stratify: used to perform stratified operations
- isnull: checks for missing values
- fill: fills missing values
- outliers: checks for outliers
- fence: fences outliers
- set_safety_limit: defines a new safety limit for collect operations
- safety_off: disables the safety limit for a single operation
- assign: appends new columns based on expressions
- nunique: returns the number of unique values in each column
- set_response: sets the column to be used as response / label
- disassemble: turns a vector / array column into multiple columns
- to_metrics_RDD: turns probability and label columns into a tuple RDD
assign
(**kwargs)[source]¶ Assign new columns to a HandyFrame, returning a new object (a copy) with all the original columns in addition to the new ones.
Parameters: kwargs (keyword, value pairs) – keywords are the column names. If the values are callable, they are computed on the DataFrame and assigned to the new columns. If the values are not callable, (e.g. a scalar, or string), they are simply assigned. Returns: df – A new HandyFrame with the new columns in addition to all the existing columns. Return type: HandyFrame
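The callable-vs-scalar semantics follow pandas' assign. They can be sketched over a plain dict of columns standing in for a frame (for illustration only; the hypothetical `assign` helper below is not handyspark code):

```python
def assign(df, **kwargs):
    """Return a copy of a dict-of-columns 'frame' with new columns appended.

    Callable values are computed on the frame; other values are assigned as-is.
    """
    out = dict(df)  # shallow copy: original frame is left untouched
    for name, value in kwargs.items():
        out[name] = value(out) if callable(value) else value
    return out

df = {"x": [1, 2, 3]}
df2 = assign(df, y=lambda d: [v * 2 for v in d["x"]], z=0)
# df2 == {"x": [1, 2, 3], "y": [2, 4, 6], "z": 0}; df is unchanged
```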
classes
Returns list of classes for a classification problem.

collect()
Returns all the records as a list of Row. By default, its output is limited by the safety limit. To get the original collect behavior, call the safety_off method first.

cols
Returns a class to access pandas-like column-based methods implemented in Spark.
Available methods: min, max, median, q1, q3, stddev, value_counts, mode, corr, nunique, hist, boxplot, scatterplot
disassemble(colname, new_colnames=None)
Disassembles a Vector or Array column into multiple columns.
Parameters:
- colname (string) – Column containing Vector or Array elements.
- new_colnames (list of string, optional) – Default is None: column names are generated using a sequential suffix (e.g., _0, _1, etc.) appended to colname. If informed, it must have as many column names as there are elements in the shortest vector/array of colname.
Returns: df – A new HandyFrame with the new disassembled columns in addition to all the existing columns.
Return type: HandyFrame
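The transformation can be pictured on a list of dict rows, where the shortest array bounds how many columns are produced (a sketch under those assumptions, not handyspark's Spark implementation):

```python
def disassemble(rows, colname, new_colnames=None):
    """Split an array-valued column into one scalar column per element."""
    n = min(len(r[colname]) for r in rows)  # shortest array bounds the output
    names = new_colnames or ["{}_{}".format(colname, i) for i in range(n)]
    out = []
    for r in rows:
        new_r = dict(r)
        for i, name in enumerate(names[:n]):
            new_r[name] = r[colname][i]
        out.append(new_r)
    return out

rows = disassemble([{"v": [1.0, 2.0]}, {"v": [3.0, 4.0]}], "v")
# each row keeps "v" and gains scalar columns "v_0" and "v_1"
```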
fence(colnames, k=1.5)
Caps outliers using lower and upper fences given by Tukey's method, using k times the interquartile range (IQR).
The fence values used for capping outliers are kept in the fences_ property and can later be used to generate a corresponding HandyFencer transformer.
For more information, check: https://en.wikipedia.org/wiki/Outlier#Tukey’s_fences
Parameters:
- colnames (list of string) – Column names to apply fencing to.
- k (float, optional) – Constant multiplier for the IQR. Default is 1.5 (corresponding to Tukey's outliers; use 3 for "far out" values).
Returns: df – A new HandyFrame with capped outliers.
Return type: HandyFrame
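Capping clamps each value into [Q1 - k*IQR, Q3 + k*IQR]. A pure-Python sketch, again assuming linearly interpolated quantiles (Spark's quantiles may be approximate, so exact fence values can differ):

```python
def fence(values, k=1.5):
    """Cap values at Tukey's lower and upper fences; return (capped, fences)."""
    s = sorted(values)

    def quantile(q):
        pos = q * (len(s) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (pos - lo)

    q1, q3 = quantile(0.25), quantile(0.75)
    lower, upper = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [min(max(v, lower), upper) for v in values], (lower, upper)

capped, fences = fence([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
# 100 is pulled down to the upper fence; in-range values are untouched
```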
fences_
Returns dictionary with fence values for each feature. If stratified, first-level keys are filter clauses for stratification.

fill(*args, categorical=None, continuous=None, strategy=None)
Fill NA/NaN values using the specified methods.
The values used for imputation are kept in the statistics_ property and can later be used to generate a corresponding HandyImputer transformer.
Parameters:
- categorical ('all' or list of string, optional) – List of categorical columns. These columns are filled with their corresponding modes (most common values).
- continuous ('all' or list of string, optional) – List of continuous value columns. By default, these columns are filled with their corresponding means. If a same-sized list is provided in the strategy argument, the corresponding strategy is used for each column.
- strategy (list of string, optional) – If informed, it must contain a strategy (either mean or median) for each one of the continuous columns.
Returns: df – A new HandyFrame with filled missing values.
Return type: HandyFrame
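The per-column imputation logic (mean or median for continuous columns, mode for categorical ones) can be sketched on a single list, with None standing in for NA (for illustration only; the real method computes the statistics in Spark):

```python
from statistics import mean, median, mode

def fill_column(column, strategy="mean"):
    """Impute missing (None) entries with the column's mean, median, or mode."""
    present = [v for v in column if v is not None]
    value = {"mean": mean, "median": median, "mode": mode}[strategy](present)
    return [value if v is None else v for v in column]

filled = fill_column([1.0, None, 3.0], strategy="mean")  # -> [1.0, 2.0, 3.0]
```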
is_classification
Returns True if response is a categorical variable.

isnull(ratio=False)
Returns array with counts of missing values for each column in the HandyFrame.
Parameters: ratio (boolean, default False) – If True, returns ratios instead of absolute counts.
Returns: counts
Return type: Series

nclasses
Returns the number of classes for a classification problem.

ncols
Returns the number of columns of the HandyFrame.

nrows
Returns the number of rows of the HandyFrame.

nunique()
Return Series with number of distinct observations for all columns.
Returns: nunique
Return type: Series

outliers(ratio=False, method='tukey', **kwargs)
Return Series with the number of outlier observations for all columns, according to the specified method.
Parameters:
- ratio (boolean, optional) – If True, returns proportions instead of counts. Default is False.
- method (string, optional) – Method used to detect outliers. Currently, only Tukey's method is supported. Default is 'tukey'.
Returns: outliers
Return type: Series

pandas
Returns a class to access pandas-like column-based methods through pandas UDFs.
Available methods: between / between_time, isin, isna / isnull, notna / notnull, abs, clip / clip_lower / clip_upper, replace, round / truncate, tz_convert / tz_localize
response
Returns the name of the response column.

set_response(colname)
Sets the column to be used as response in supervised learning algorithms.
Parameters: colname (string)
Returns: self

shape
Returns a tuple representing the dimensionality of the HandyFrame.

stages
Returns the number of stages in the execution plan.

statistics_
Returns dictionary with imputation fill value for each feature. If stratified, first-level keys are filter clauses for stratification.

stratify(strata)
Stratify the HandyFrame.
Stratified operations should be more efficient than group-by operations, as they rely on three iterative steps: filtering the underlying HandyFrame, performing the operation, and aggregating the results.
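The filter-operate-aggregate pattern described above can be sketched over a list of dict rows (a conceptual illustration only; handyspark applies the same pattern to Spark DataFrames, where each filtered subset stays distributed):

```python
def stratified(rows, strata_col, op):
    """Filter-operate-aggregate: apply `op` to each stratum independently."""
    strata = sorted({r[strata_col] for r in rows})
    results = {}
    for value in strata:
        subset = [r for r in rows if r[strata_col] == value]  # filter
        results[value] = op(subset)                           # operate
    return results                                            # aggregate

rows = [{"g": "a", "x": 1}, {"g": "a", "x": 3}, {"g": "b", "x": 10}]
means = stratified(rows, "g", lambda rs: sum(r["x"] for r in rs) / len(rs))
# -> {"a": 2.0, "b": 10.0}
```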
to_metrics_RDD(prob_col='probability', label_col='label')
Converts a DataFrame containing predicted probabilities and classification labels into an RDD suited for use with a BinaryClassificationMetrics object.
Parameters:
- prob_col (string, default 'probability') – Column containing predicted probabilities.
- label_col (string, default 'label') – Column containing classification labels.
Returns: rdd – RDD of tuples (probability, label)
Return type: RDD
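The row-level conversion can be sketched in plain Python, assuming (as is conventional for Spark binary classifiers) that the probability column holds a [P(class 0), P(class 1)] vector and the positive-class probability is the one extracted; the helper name below is hypothetical:

```python
def to_metrics_pairs(rows, prob_col="probability", label_col="label"):
    """Extract (probability of the positive class, label) tuples from rows.

    Assumes prob_col holds a [P(class 0), P(class 1)] vector per row.
    """
    return [(float(r[prob_col][1]), float(r[label_col])) for r in rows]

pairs = to_metrics_pairs([{"probability": [0.3, 0.7], "label": 1}])
# -> [(0.7, 1.0)]
```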
transformers
Returns a class to generate Handy transformers.
Available transformers: HandyImputer, HandyFencer

values
NumPy representation of the HandyFrame.
class handyspark.sql.dataframe.HandyGrouped(jgd, df, *args)
Bases: pyspark.sql.group.GroupedData

class handyspark.sql.dataframe.Quantile(colname, bins=5)
Bases: handyspark.sql.dataframe.Bucket
Bucketizes a column of continuous values into quantiles to perform stratification.
Parameters:
- colname (string) – Column containing continuous values.
- bins (integer) – Number of quantiles to map original values to.
Returns: quantile – Quantile object to be used as a column in stratification.
Return type: Quantile
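Unlike Bucket's equal-width bins, quantile bins are chosen so each stratum holds a similar number of rows. A pure-Python sketch of the idea, using simple rank-based cut points (handyspark computes quantiles in Spark, typically approximately, so cut points may differ):

```python
def quantile_bucketize(values, bins=5):
    """Map each value to a quantile-based bin so bins hold similar counts."""
    s = sorted(values)
    # Interior cut points at the 1/bins, 2/bins, ... sample quantiles.
    cuts = [s[int(len(s) * i / bins)] for i in range(1, bins)]
    out = []
    for v in values:
        b = 0
        while b < len(cuts) and v >= cuts[b]:
            b += 1
        out.append(b)
    return out

q = quantile_bucketize([1, 2, 3, 4, 5, 6, 7, 8], bins=2)
# lower half -> bin 0, upper half -> bin 1
```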
handyspark.sql.datetime module
handyspark.sql.pandas module
class handyspark.sql.pandas.HandyPandas(df)
Bases: object

dt
Returns a class to access pandas-like datetime column-based methods through pandas UDFs.
Available methods: is_leap_year / is_month_end / is_month_start / is_quarter_end / is_quarter_start / is_year_end / is_year_start, strftime, tz / time / tz_convert / tz_localize, day / dayofweek / dayofyear / days_in_month / daysinmonth, hour / microsecond / minute / nanosecond / second, week / weekday / weekday_name, month / quarter / year / weekofyear, date, ceil / floor / round, normalize

str
Returns a class to access pandas-like string column-based methods through pandas UDFs.
Available methods: contains, startswith / endswith, match, isalpha / isnumeric / isalnum / isdigit / isdecimal / isspace, islower / isupper / istitle, replace, repeat, join, pad, slice / slice_replace, strip / lstrip / rstrip, wrap / center / ljust / rjust, translate, get, normalize, lower / upper / capitalize / swapcase / title, zfill, count, find / rfind, len
handyspark.sql.schema module
handyspark.sql.string module
handyspark.sql.transform module
Module contents
The handyspark.sql package re-exports HandyFrame, Bucket, and Quantile from handyspark.sql.dataframe under the shorter paths handyspark.sql.HandyFrame, handyspark.sql.Bucket, and handyspark.sql.Quantile. Their documentation is identical to the corresponding entries in the handyspark.sql.dataframe module above.