handyspark package

Submodules

handyspark.plot module

handyspark.plot.boxplot(sdf, colnames, ax=None, showfliers=True, k=1.5)[source]
handyspark.plot.consolidate_plots(fig, axs, title, clauses)[source]
handyspark.plot.correlations(sdf, colnames, method='pearson', ax=None, plot=True)[source]
handyspark.plot.draw_boxplot(ax, stats)[source]
handyspark.plot.histogram(sdf, colname, bins=10, categorical=False, ax=None)[source]
handyspark.plot.post_boxplot(axs, stats, clauses)[source]
handyspark.plot.scatterplot(sdf, col1, col2, n=30, ax=None)[source]
handyspark.plot.strat_histogram(sdf, colname, bins=10, categorical=False)[source]
handyspark.plot.strat_scatterplot(sdf, col1, col2, n=30)[source]
handyspark.plot.stratified_histogram(sdf, colname, strat_colname, strat_values, ax=None)[source]
handyspark.plot.title_fom_clause(clause)[source]

handyspark.stats module

handyspark.stats.KolmogorovSmirnovTest(sdf, colname, dist='normal', *params)[source]

Performs a Kolmogorov-Smirnov test comparing the distribution of values in a column to a named canonical distribution.
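
A minimal usage sketch, assuming a SparkSession named spark and that *params carries the parameters of the named distribution (here, the mean and standard deviation of a normal distribution):

>>> from handyspark.stats import KolmogorovSmirnovTest
>>> sdf = spark.range(1000).selectExpr('randn(42) as value')
>>> # Compare 'value' to a normal distribution with mean 0.0 and stddev 1.0
>>> KolmogorovSmirnovTest(sdf, 'value', 'normal', 0.0, 1.0)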

handyspark.stats.StatisticalSummaryValues(sdf, colnames)[source]

Builds a Java StatisticalSummaryValues object for each column

handyspark.stats.distribution(sdf, colname)[source]
handyspark.stats.entropy(sdf, colnames)[source]
handyspark.stats.mahalanobis(sdf, colnames)[source]

Computes the Mahalanobis distance of each row from the origin and compares it to critical values of the Chi-squared distribution to identify possible outliers.
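
A short sketch, assuming sdf has numeric columns x, y and z; the exact shape of the returned value is not documented here:

>>> from handyspark.stats import mahalanobis
>>> # Distances are compared to Chi-squared critical values to flag outliers
>>> mahalanobis(sdf, colnames=['x', 'y', 'z'])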

handyspark.stats.mutual_info(sdf, colnames)[source]
handyspark.stats.tTest(jvm, *ssvs)[source]

Performs a t-test for the difference of means using StatisticalSummaryValues objects.
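
A sketch combining both helpers; obtaining the JVM handle through the DataFrame's SparkContext and unpacking the summaries as positional arguments are assumptions:

>>> from handyspark.stats import StatisticalSummaryValues, tTest
>>> ssvs = StatisticalSummaryValues(sdf, ['col_a', 'col_b'])  # one summary per column
>>> jvm = sdf.sql_ctx._sc._jvm  # assumed way to reach the JVM
>>> tTest(jvm, *ssvs)  # assumes the summaries unpack into tTest's varargs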

handyspark.util module

exception handyspark.util.HandyException(*args, **kwargs)[source]

Bases: Exception

static colortext(text, color_code)[source]
static errortext(text)[source]
static exception_summary()[source]
class handyspark.util.bcolors[source]

Bases: object

BOLD = '\x1b[1m'
ENDC = '\x1b[0m'
FAIL = '\x1b[91m'
HEADER = '\x1b[95m'
OKBLUE = '\x1b[94m'
OKGREEN = '\x1b[92m'
UNDERLINE = '\x1b[4m'
WARNING = '\x1b[93m'
handyspark.util.call_scala_method(py_class, scala_method, df, *args)[source]

Given a Python class, calls a method from its Scala equivalent

handyspark.util.check_columns(df, colnames)[source]
handyspark.util.counts_to_df(value_counts, colnames, n_points)[source]

DO NOT USE IT!

handyspark.util.dense_to_array(sdf, colname, new_colname)[source]

Casts a Vector column into a new Array column.

handyspark.util.disassemble(sdf, colname, new_colnames=None)[source]

Disassembles a Vector/Array column into multiple columns

handyspark.util.ensure_list(value)[source]
handyspark.util.get_buckets(rdd, buckets)[source]

Extracted from pyspark.rdd.RDD.histogram function

handyspark.util.get_jvm_class(cl)[source]

Builds JVM class name from Python class

handyspark.util.none2default(value, default)[source]
handyspark.util.none2zero(value)[source]

Module contents

class handyspark.HandyFrame(df, handy=None)[source]

Bases: pyspark.sql.dataframe.DataFrame

HandySpark version of DataFrame.

cols

HandyColumns – class to access pandas-like column-based methods implemented in Spark

pandas

HandyPandas – class to access pandas-like column-based methods through pandas UDFs

transformers

HandyTransformers – class to generate Handy transformers

stages

integer – number of stages in the execution plan

response

string – name of the response column

is_classification

boolean – True if response is a categorical variable

classes

list – list of classes for a classification problem

nclasses

integer – number of classes for a classification problem

ncols

integer – number of columns of the HandyFrame

nrows

integer – number of rows of the HandyFrame

shape

tuple – tuple representing dimensionality of the HandyFrame

statistics_

dict – imputation fill value for each feature. If stratified, first-level keys are the filter clauses used for stratification

fences_

dict – fence values for each feature. If stratified, first-level keys are the filter clauses used for stratification

is_stratified

boolean – True if HandyFrame was stratified

values

ndarray – NumPy representation of the HandyFrame.

Available methods:
- notHandy

makes it a plain Spark dataframe

- stratify

used to perform stratified operations

- isnull

checks for missing values

- fill

fills missing values

- outliers

checks for outliers

- fence

fences outliers

- set_safety_limit

defines new safety limit for collect operations

- safety_off

disables safety limit for a single operation

- assign

appends a new column based on an expression

- nunique

returns number of unique values in each column

- set_response

sets column to be used as response / label

- disassemble

turns a vector / array column into multiple columns

- to_metrics_RDD

turns probability and label columns into a tuple RDD

apply(f, name=None, args=None, returnType=None)[source]

INTERNAL USE

assign(**kwargs)[source]

Assign new columns to a HandyFrame, returning a new object (a copy) with all the original columns in addition to the new ones.

Parameters:kwargs (keyword, value pairs) – keywords are the column names. If the values are callable, they are computed on the DataFrame and assigned to the new columns. If the values are not callable (e.g., a scalar or string), they are simply assigned.
Returns:df – A new HandyFrame with the new columns in addition to all the existing columns.
Return type:HandyFrame
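
A short sketch, assuming a HandyFrame hdf with a numeric fare column; the convention that a callable's argument name selects the column it receives is an assumption here:

>>> import numpy as np
>>> # Callable: computed on the DataFrame and assigned to the new column
>>> hdf = hdf.assign(log_fare=lambda fare: np.log(fare + 1))
>>> # Non-callable: assigned as-is
>>> hdf = hdf.assign(source='train')
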
classes

Returns list of classes for a classification problem.

collect()[source]

Returns all the records as a list of Row.

By default, its output is limited by the safety limit. To get the original collect behavior, call the safety_off method first.
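
For example (chaining safety_off directly before collect is an assumption):

>>> rows = hdf.collect()               # capped at the safety limit
>>> rows = hdf.safety_off().collect()  # full collect, for this call only
>>> hdf.set_safety_limit(100000)       # raise the limit for later calls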

cols

Returns a class to access pandas-like column-based methods implemented in Spark.

Available methods:
- min
- max
- median
- q1
- q3
- stddev
- value_counts
- mode
- corr
- nunique
- hist
- boxplot
- scatterplot
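
A few illustrative calls, assuming a HandyFrame hdf with fare, age and embarked columns; indexing cols with a list of names for corr is an assumption:

>>> hdf.cols['fare'].median()
>>> hdf.cols['embarked'].value_counts()
>>> hdf.cols['fare'].boxplot()
>>> hdf.cols[['fare', 'age']].corr()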

disassemble(colname, new_colnames=None)[source]

Disassembles a Vector or Array column into multiple columns.

Parameters:
  • colname (string) – Column containing Vector or Array elements.
  • new_colnames (list of string, optional) – Default is None: column names are generated by appending a sequential suffix (e.g., _0, _1, etc.) to colname. If informed, it must contain as many column names as there are elements in the shortest vector/array of colname.
Returns:

df – A new HandyFrame with the new disassembled columns in addition to all the existing columns.

Return type:

HandyFrame
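
A sketch using a vector column produced by VectorAssembler (an assumption about the input):

>>> from pyspark.ml.feature import VectorAssembler
>>> assembled = VectorAssembler(inputCols=['x', 'y'], outputCol='features').transform(hdf)
>>> # Default naming: features_0, features_1
>>> HandyFrame(assembled).disassemble('features')
>>> # Explicit names for the new columns
>>> HandyFrame(assembled).disassemble('features', new_colnames=['fx', 'fy'])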

fence(colnames, k=1.5)[source]

Caps outliers using the lower and upper fences given by Tukey's method: by default, 1.5 times the interquartile range (IQR) beyond the first and third quartiles.

The fence values used for capping outliers are kept in fences_ property and can later be used to generate a corresponding HandyFencer transformer.

For more information, check: https://en.wikipedia.org/wiki/Outlier#Tukey's_fences

Parameters:
  • colnames (list of string) – Column names to apply fencing.
  • k (float, optional) – Constant multiplier for the IQR. Default is 1.5, corresponding to Tukey's fences; use 3 for "far out" values.
Returns:

df – A new HandyFrame with capped outliers.

Return type:

HandyFrame
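
For example:

>>> hdf_fenced = hdf.fence(['fare'])         # cap at Q1 - 1.5*IQR and Q3 + 1.5*IQR
>>> hdf_fenced = hdf.fence(['fare'], k=3.0)  # cap only "far out" values
>>> hdf_fenced.fences_                       # fence values kept for a HandyFencer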

fences_

Returns dictionary with fence values for each feature. If stratified, first level keys are filter clauses for stratification.

fill(*args, categorical=None, continuous=None, strategy=None)[source]

Fill NA/NaN values using the specified methods.

The values used for imputation are kept in statistics_ property and can later be used to generate a corresponding HandyImputer transformer.

Parameters:
  • categorical ('all' or list of string, optional) – List of categorical columns. These columns are filled with their corresponding modes (most common values).
  • continuous ('all' or list of string, optional) – List of continuous value columns. By default, these columns are filled with their corresponding means. If a same-sized list is provided in the strategy argument, the corresponding strategy is used for each column.
  • strategy (list of string, optional) – If informed, it must contain a strategy - either 'mean' or 'median' - for each one of the continuous columns.
Returns:

df – A new HandyFrame with filled missing values.

Return type:

HandyFrame
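
For example:

>>> hdf_filled = hdf.fill(categorical=['embarked'],
...                       continuous=['age', 'fare'],
...                       strategy=['median', 'mean'])
>>> hdf_filled.statistics_  # imputation values kept for a HandyImputer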

is_classification

Returns True if response is a categorical variable.

isnull(ratio=False)[source]

Returns Series with the count of missing values for each column in the HandyFrame.

Parameters:ratio (boolean, default False) – If True, returns ratios instead of absolute counts.
Returns:counts
Return type:Series
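
For example:

>>> hdf.isnull()            # absolute counts per column
>>> hdf.isnull(ratio=True)  # ratios per column
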
nclasses

Returns the number of classes for a classification problem.

ncols

Returns the number of columns of the HandyFrame.

notHandy()[source]

Converts HandyFrame back into Spark’s DataFrame

nrows

Returns the number of rows of the HandyFrame.

nunique()[source]

Return Series with number of distinct observations for all columns.

Returns:nunique
Return type:Series
outliers(ratio=False, method='tukey', **kwargs)[source]

Return Series with number of outlier observations according to the specified method for all columns.

Parameters:
  • ratio (boolean, optional) – If True, returns proportions instead of counts. Default is False.
  • method (string, optional) – Method used to detect outliers. Currently, only Tukey's method is supported. Default is 'tukey'.
Returns:

outliers

Return type:

Series
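
For example (forwarding Tukey's k constant through kwargs is an assumption):

>>> hdf.outliers()                   # counts per column, Tukey's fences
>>> hdf.outliers(ratio=True, k=3.0)  # ratios, counting only "far out" values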

pandas

Returns a class to access pandas-like column-based methods through pandas UDFs.

Available methods:
- between / between_time
- isin
- isna / isnull
- notna / notnull
- abs
- clip / clip_lower / clip_upper
- replace
- round / truncate
- tz_convert / tz_localize
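
A couple of illustrative calls, assuming the same column-indexing syntax as cols; each call is executed as a pandas UDF under the hood:

>>> hdf.pandas['age'].isna()
>>> hdf.pandas['fare'].clip(lower=0.0, upper=512.0)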

response

Returns the name of the response column.

safety_off()[source]

Disables the safety limit for a single call of the collect method.

set_response(colname)[source]

Sets column to be used as response in supervised learning algorithms.

Parameters:colname (string) – Name of the column to be used as response.
Returns:
Return type:self
set_safety_limit(limit)[source]

Sets safety limit used for collect method.

shape

Return a tuple representing the dimensionality of the HandyFrame.

stages

Returns the number of stages in the execution plan.

statistics_

Returns dictionary with imputation fill value for each feature. If stratified, first level keys are filter clauses for stratification.

stratify(strata)[source]

Stratify the HandyFrame.

Stratified operations should be more efficient than group-by operations, as they rely on three steps: filtering the underlying HandyFrame, performing the operation, and aggregating the results.
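
For example, assuming a pclass column to stratify on:

>>> hdf.stratify(['pclass']).cols['embarked'].value_counts()
>>> hdf.stratify(['pclass']).fill(continuous=['age'], strategy=['median'])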

take(num)[source]

Returns the first num rows as a list of Row.

to_metrics_RDD(prob_col='probability', label_col='label')[source]

Converts a DataFrame containing predicted probabilities and classification labels into an RDD suited for use with a BinaryClassificationMetrics object.

Parameters:
  • prob_col (string, optional) – Column containing Vectors of probabilities. Default is ‘probability’.
  • label_col (string, optional) – Column containing labels. Default is ‘label’.
Returns:

rdd – RDD of tuples (probability, label)

Return type:

RDD
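
A sketch, assuming predictions is the output of a fitted binary classifier's transform():

>>> rdd = HandyFrame(predictions).to_metrics_RDD(prob_col='probability', label_col='label')
>>> bcm = BinaryClassificationMetrics(rdd)
>>> bcm.areaUnderROC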

transform(f, name=None, args=None, returnType=None)[source]

INTERNAL USE

transformers

Returns a class to generate Handy transformers

Available transformers:
- HandyImputer
- HandyFencer

values

NumPy representation of the HandyFrame.

class handyspark.Bucket(colname, bins=5)[source]

Bases: object

Bucketizes a column of continuous values into equal-sized bins to perform stratification.

Parameters:
  • colname (string) – Column containing continuous values
  • bins (integer) – Number of equal-sized bins to map original values to.
Returns:

bucket – Bucket object to be used as column in stratification.

Return type:

Bucket

colname
class handyspark.Quantile(colname, bins=5)[source]

Bases: handyspark.sql.dataframe.Bucket

Bucketizes a column of continuous values into quantiles to perform stratification.

Parameters:
  • colname (string) – Column containing continuous values
  • bins (integer) – Number of quantiles to map original values to.
Returns:

quantile – Quantile object to be used as column in stratification.

Return type:

Quantile
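
Both are used as columns in stratify; for example:

>>> from handyspark import Bucket, Quantile
>>> # Five equal-width bins of 'age'
>>> hdf.stratify([Bucket('age', bins=5)]).cols['fare'].median()
>>> # Quintiles instead, so strata hold roughly the same number of rows
>>> hdf.stratify([Quantile('age', bins=5)]).cols['fare'].median()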

class handyspark.BinaryClassificationMetrics(scoreAndLabels)[source]

Bases: pyspark.mllib.common.JavaModelWrapper

Evaluator for binary classification.

Parameters:scoreAndLabels – an RDD of (score, label) pairs
>>> scoreAndLabels = sc.parallelize([
...     (0.1, 0.0), (0.1, 1.0), (0.4, 0.0), (0.6, 0.0), (0.6, 1.0), (0.6, 1.0), (0.8, 1.0)], 2)
>>> metrics = BinaryClassificationMetrics(scoreAndLabels)
>>> metrics.areaUnderROC
0.70...
>>> metrics.areaUnderPR
0.83...
>>> metrics.unpersist()

New in version 1.4.0.

areaUnderPR

Computes the area under the precision-recall curve.

New in version 1.4.0.

areaUnderROC

Computes the area under the receiver operating characteristic (ROC) curve.

New in version 1.4.0.

confusionMatrix(threshold=0.5)

Returns the confusion matrix: predicted classes are in columns, ordered by ascending class label, as in "labels".

Predicted classes are computed according to the informed threshold.

Parameters:threshold (double, optional) – Threshold probability for the positive class. Default is 0.5.
Returns:confusionMatrix
Return type:DenseMatrix
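
For example, reusing the metrics object from the doctest above:

>>> metrics.confusionMatrix().toArray()               # default 0.5 threshold
>>> metrics.confusionMatrix(threshold=0.8).toArray()  # stricter positive class
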
fMeasureByThreshold(beta=1.0)

Calls the fMeasureByThreshold method from the Java class.

Returns the (threshold, F-Measure) curve as an RDD of (threshold, F-Measure) pairs.

Parameters:beta (double, optional) – The beta factor in F-Measure computation. Default is 1.0.

See also: F1 score (Wikipedia): http://en.wikipedia.org/wiki/F1_score
getMetricsByThreshold()
pr()

Calls the pr method from the Java class.

Returns the precision-recall curve: an RDD of (recall, precision) pairs, NOT (precision, recall), with (0.0, p) prepended to it, where p is the precision associated with the lowest recall on the curve.

See also: Precision and recall (Wikipedia): http://en.wikipedia.org/wiki/Precision_and_recall
precisionByThreshold()

Calls the precisionByThreshold method from the Java class.

Returns the (threshold, precision) curve.
recallByThreshold()

Calls the recallByThreshold method from the Java class.

Returns the (threshold, recall) curve.
roc()

Calls the roc method from the Java class.

Returns the receiver operating characteristic (ROC) curve: an RDD of (false positive rate, true positive rate) pairs, with (0.0, 0.0) prepended and (1.0, 1.0) appended to it.

See also: Receiver operating characteristic (Wikipedia): http://en.wikipedia.org/wiki/Receiver_operating_characteristic
thresholds()

Returns thresholds in descending order.
unpersist()[source]

Unpersists intermediate RDDs used in the computation.

New in version 1.4.0.