handyspark package

Submodules

handyspark.plot module

handyspark.plot.boxplot(sdf, colnames, ax=None, showfliers=True, k=1.5)[source]
handyspark.plot.consolidate_plots(fig, axs, title, clauses)[source]
handyspark.plot.correlations(sdf, colnames, method='pearson', ax=None, plot=True)[source]
handyspark.plot.draw_boxplot(ax, stats)[source]
handyspark.plot.histogram(sdf, colname, bins=10, categorical=False, ax=None)[source]
handyspark.plot.post_boxplot(axs, stats, clauses)[source]
handyspark.plot.scatterplot(sdf, col1, col2, n=30, ax=None)[source]
handyspark.plot.strat_histogram(sdf, colname, bins=10, categorical=False)[source]
handyspark.plot.strat_scatterplot(sdf, col1, col2, n=30)[source]
handyspark.plot.stratified_histogram(sdf, colname, strat_colname, strat_values, ax=None)[source]
handyspark.plot.title_fom_clause(clause)[source]

handyspark.stats module

handyspark.stats.KolmogorovSmirnovTest(sdf, colname, dist='normal', *params)[source]

Performs a Kolmogorov-Smirnov test comparing the distribution of values in a column to a named canonical distribution.
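
A minimal usage sketch, assuming a SparkSession named spark and that *params carries the parameters of the named distribution (here, the mean and standard deviation of a normal distribution):

>>> from handyspark.stats import KolmogorovSmirnovTest
>>> sdf = spark.range(1000).selectExpr('randn(42) as value')
>>> # Compare 'value' to a normal distribution with mean 0.0 and stddev 1.0
>>> KolmogorovSmirnovTest(sdf, 'value', 'normal', 0.0, 1.0)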

handyspark.stats.StatisticalSummaryValues(sdf, colnames)[source]

Builds a Java StatisticalSummaryValues object for each column

handyspark.stats.distribution(sdf, colname)[source]
handyspark.stats.entropy(sdf, colnames)[source]
handyspark.stats.mahalanobis(sdf, colnames)[source]

Computes the Mahalanobis distance of each row from the origin and compares it to critical values of the Chi-squared distribution to identify possible outliers.
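
A short sketch, assuming sdf has numeric columns x, y and z; the exact shape of the returned value is not documented here:

>>> from handyspark.stats import mahalanobis
>>> # Distances are compared to Chi-squared critical values to flag outliers
>>> mahalanobis(sdf, colnames=['x', 'y', 'z'])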

handyspark.stats.mutual_info(sdf, colnames)[source]
handyspark.stats.tTest(jvm, *ssvs)[source]

Performs a t-test for the difference of means using StatisticalSummaryValues objects.
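
A sketch combining both helpers; obtaining the JVM handle through the DataFrame's SparkContext and unpacking the summaries as positional arguments are assumptions:

>>> from handyspark.stats import StatisticalSummaryValues, tTest
>>> ssvs = StatisticalSummaryValues(sdf, ['col_a', 'col_b'])  # one summary per column
>>> jvm = sdf.sql_ctx._sc._jvm  # assumed way to reach the JVM
>>> tTest(jvm, *ssvs)  # assumes the summaries unpack into tTest's varargs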

handyspark.util module

exception handyspark.util.HandyException(*args, **kwargs)[source]

Bases: Exception

static colortext(text, color_code)[source]
static errortext(text)[source]
static exception_summary()[source]
class handyspark.util.bcolors[source]

Bases: object

BOLD = '\x1b[1m'
ENDC = '\x1b[0m'
FAIL = '\x1b[91m'
HEADER = '\x1b[95m'
OKBLUE = '\x1b[94m'
OKGREEN = '\x1b[92m'
UNDERLINE = '\x1b[4m'
WARNING = '\x1b[93m'
handyspark.util.call_scala_method(py_class, scala_method, df, *args)[source]

Given a Python class, calls a method from its Scala equivalent

handyspark.util.check_columns(df, colnames)[source]
handyspark.util.counts_to_df(value_counts, colnames, n_points)[source]

DO NOT USE IT!

handyspark.util.dense_to_array(sdf, colname, new_colname)[source]

Casts a Vector column into a new Array column.

handyspark.util.disassemble(sdf, colname, new_colnames=None)[source]

Disassembles a Vector/Array column into multiple columns

handyspark.util.ensure_list(value)[source]
handyspark.util.get_buckets(rdd, buckets)[source]

Extracted from pyspark.rdd.RDD.histogram function

handyspark.util.get_jvm_class(cl)[source]

Builds JVM class name from Python class

handyspark.util.none2default(value, default)[source]
handyspark.util.none2zero(value)[source]

Module contents

class handyspark.HandyFrame(df, handy=None)[source]

Bases: pyspark.sql.dataframe.DataFrame

HandySpark version of DataFrame.

cols

HandyColumns – class to access pandas-like column-based methods implemented in Spark

pandas

HandyPandas – class to access pandas-like column-based methods through pandas UDFs

transformers

HandyTransformers – class to generate Handy transformers

stages

integer – number of stages in the execution plan

response

string – name of the response column

is_classification

boolean – True if response is a categorical variable

classes

list – list of classes for a classification problem

nclasses

integer – number of classes for a classification problem

ncols

integer – number of columns of the HandyFrame

nrows

integer – number of rows of the HandyFrame

shape

tuple – tuple representing dimensionality of the HandyFrame

statistics_

dict – imputation fill value for each feature. If stratified, first-level keys are the filter clauses used for stratification

fences_

dict – fence values for each feature. If stratified, first-level keys are the filter clauses used for stratification

is_stratified

boolean – True if HandyFrame was stratified

values

ndarray – NumPy representation of the HandyFrame.

Available methods:
- notHandy

makes it a plain Spark dataframe

- stratify

used to perform stratified operations

- isnull

checks for missing values

- fill

fills missing values

- outliers

checks for outliers

- fence

fences outliers

- set_safety_limit

defines new safety limit for collect operations

- safety_off

disables safety limit for a single operation

- assign

appends a new column based on an expression

- nunique

returns number of unique values in each column

- set_response

sets column to be used as response / label

- disassemble

turns a vector / array column into multiple columns

- to_metrics_RDD

turns probability and label columns into a tuple RDD

apply(f, name=None, args=None, returnType=None)[source]

INTERNAL USE

assign(**kwargs)[source]

Assign new columns to a HandyFrame, returning a new object (a copy) with all the original columns in addition to the new ones.

Parameters:kwargs (keyword, value pairs) – keywords are the column names. If the values are callable, they are computed on the DataFrame and assigned to the new columns. If the values are not callable (e.g., a scalar or string), they are simply assigned.
Returns:df – A new HandyFrame with the new columns in addition to all the existing columns.
Return type:HandyFrame
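
A short sketch, assuming a HandyFrame hdf with a numeric fare column; the convention that a callable's argument name selects the column it receives is an assumption here:

>>> import numpy as np
>>> # Callable: computed on the DataFrame and assigned to the new column
>>> hdf = hdf.assign(log_fare=lambda fare: np.log(fare + 1))
>>> # Non-callable: assigned as-is
>>> hdf = hdf.assign(source='train')
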
classes

Returns list of classes for a classification problem.

collect()[source]

Returns all the records as a list of Row.

By default, its output is limited by the safety limit. To get the original collect behavior, call the safety_off method first.
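
For example (chaining safety_off directly before collect is an assumption):

>>> rows = hdf.collect()               # capped at the safety limit
>>> rows = hdf.safety_off().collect()  # full collect, for this call only
>>> hdf.set_safety_limit(100000)       # raise the limit for later calls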

cols

Returns a class to access pandas-like column-based methods implemented in Spark.

Available methods:
- min
- max
- median
- q1
- q3
- stddev
- value_counts
- mode
- corr
- nunique
- hist
- boxplot
- scatterplot
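
A few illustrative calls, assuming a HandyFrame hdf with fare, age and embarked columns; indexing cols with a list of names for corr is an assumption:

>>> hdf.cols['fare'].median()
>>> hdf.cols['embarked'].value_counts()
>>> hdf.cols['fare'].boxplot()
>>> hdf.cols[['fare', 'age']].corr()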

disassemble(colname, new_colnames=None)[source]

Disassembles a Vector or Array column into multiple columns.

Parameters:
  • colname (string) – Column containing Vector or Array elements.
  • new_colnames (list of string, optional) – Default is None: column names are generated by appending a sequential suffix (e.g., _0, _1, etc.) to colname. If informed, it must contain as many column names as there are elements in the shortest vector/array of colname.
Returns:

df – A new HandyFrame with the new disassembled columns in addition to all the existing columns.

Return type:

HandyFrame
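
A sketch using a vector column produced by VectorAssembler (an assumption about the input):

>>> from pyspark.ml.feature import VectorAssembler
>>> assembled = VectorAssembler(inputCols=['x', 'y'], outputCol='features').transform(hdf)
>>> # Default naming: features_0, features_1
>>> HandyFrame(assembled).disassemble('features')
>>> # Explicit names for the new columns
>>> HandyFrame(assembled).disassemble('features', new_colnames=['fx', 'fy'])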

fence(colnames, k=1.5)[source]

Caps outliers using the lower and upper fences given by Tukey's method: by default, 1.5 times the interquartile range (IQR) beyond the first and third quartiles.

The fence values used for capping outliers are kept in fences_ property and can later be used to generate a corresponding HandyFencer transformer.

For more information, check: https://en.wikipedia.org/wiki/Outlier#Tukey's_fences

Parameters:
  • colnames (list of string) – Column names to apply fencing.
  • k (float, optional) – Constant multiplier for the IQR. Default is 1.5, corresponding to Tukey's fences; use 3 for "far out" values.
Returns:

df – A new HandyFrame with capped outliers.

Return type:

HandyFrame
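
For example:

>>> hdf_fenced = hdf.fence(['fare'])         # cap at Q1 - 1.5*IQR and Q3 + 1.5*IQR
>>> hdf_fenced = hdf.fence(['fare'], k=3.0)  # cap only "far out" values
>>> hdf_fenced.fences_                       # fence values kept for a HandyFencer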

fences_

Returns dictionary with fence values for each feature. If stratified, first level keys are filter clauses for stratification.

fill(*args, categorical=None, continuous=None, strategy=None)[source]

Fill NA/NaN values using the specified methods.

The values used for imputation are kept in statistics_ property and can later be used to generate a corresponding HandyImputer transformer.

Parameters:
  • categorical ('all' or list of string, optional) – List of categorical columns. These columns are filled with their corresponding modes (most common values).
  • continuous ('all' or list of string, optional) – List of continuous value columns. By default, these columns are filled with their corresponding means. If a same-sized list is provided in the strategy argument, the corresponding strategy is used for each column.
  • strategy (list of string, optional) – If informed, it must contain a strategy - either 'mean' or 'median' - for each one of the continuous columns.
Returns:

df – A new HandyFrame with filled missing values.

Return type:

HandyFrame
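
For example:

>>> hdf_filled = hdf.fill(categorical=['embarked'],
...                       continuous=['age', 'fare'],
...                       strategy=['median', 'mean'])
>>> hdf_filled.statistics_  # imputation values kept for a HandyImputer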

is_classification

Returns True if response is a categorical variable.

isnull(ratio=False)[source]

Returns Series with the count of missing values for each column in the HandyFrame.

Parameters:ratio (boolean, default False) – If True, returns ratios instead of absolute counts.
Returns:counts
Return type:Series
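
For example:

>>> hdf.isnull()            # absolute counts per column
>>> hdf.isnull(ratio=True)  # ratios per column
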
nclasses

Returns the number of classes for a classification problem.

ncols

Returns the number of columns of the HandyFrame.

notHandy()[source]

Converts HandyFrame back into Spark’s DataFrame

nrows

Returns the number of rows of the HandyFrame.

nunique()[source]

Return Series with number of distinct observations for all columns.

Returns:nunique
Return type:Series
outliers(ratio=False, method='tukey', **kwargs)[source]

Return Series with number of outlier observations according to the specified method for all columns.

Parameters:
  • ratio (boolean, optional) – If True, returns proportions instead of counts. Default is False.
  • method (string, optional) – Method used to detect outliers. Currently, only Tukey's method is supported. Default is 'tukey'.
Returns:

outliers

Return type:

Series
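
For example (forwarding Tukey's k constant through kwargs is an assumption):

>>> hdf.outliers()                   # counts per column, Tukey's fences
>>> hdf.outliers(ratio=True, k=3.0)  # ratios, counting only "far out" values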

pandas

Returns a class to access pandas-like column-based methods through pandas UDFs.

Available methods:
- between / between_time
- isin
- isna / isnull
- notna / notnull
- abs
- clip / clip_lower / clip_upper
- replace
- round / truncate
- tz_convert / tz_localize
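
A couple of illustrative calls, assuming the same column-indexing syntax as cols; each call is executed as a pandas UDF under the hood:

>>> hdf.pandas['age'].isna()
>>> hdf.pandas['fare'].clip(lower=0.0, upper=512.0)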

response

Returns the name of the response column.

safety_off()[source]

Disables the safety limit for a single call of the collect method.

set_response(colname)[source]

Sets column to be used as response in supervised learning algorithms.

Parameters:colname (string) – Name of the column to be used as response.
Returns:
Return type:self
set_safety_limit(limit)[source]

Sets safety limit used for collect method.

shape

Return a tuple representing the dimensionality of the HandyFrame.

stages

Returns the number of stages in the execution plan.

statistics_

Returns dictionary with imputation fill value for each feature. If stratified, first level keys are filter clauses for stratification.

stratify(strata)[source]

Stratify the HandyFrame.

Stratified operations should be more efficient than group-by operations, as they rely on three steps: filtering the underlying HandyFrame, performing the operation, and aggregating the results.
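
For example, assuming a pclass column to stratify on:

>>> hdf.stratify(['pclass']).cols['embarked'].value_counts()
>>> hdf.stratify(['pclass']).fill(continuous=['age'], strategy=['median'])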

take(num)[source]

Returns the first num rows as a list of Row.

to_metrics_RDD(prob_col='probability', label_col='label')[source]

Converts a DataFrame containing predicted probabilities and classification labels into an RDD suited for use with a BinaryClassificationMetrics object.

Parameters:
  • prob_col (string, optional) – Column containing Vectors of probabilities. Default is ‘probability’.
  • label_col (string, optional) – Column containing labels. Default is ‘label’.
Returns:

rdd – RDD of tuples (probability, label)

Return type:

RDD
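
A sketch, assuming predictions is the output of a fitted binary classifier's transform():

>>> rdd = HandyFrame(predictions).to_metrics_RDD(prob_col='probability', label_col='label')
>>> bcm = BinaryClassificationMetrics(rdd)
>>> bcm.areaUnderROC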

transform(f, name=None, args=None, returnType=None)[source]

INTERNAL USE

transformers

Returns a class to generate Handy transformers

Available transformers:
- HandyImputer
- HandyFencer

values

NumPy representation of the HandyFrame.

class handyspark.Bucket(colname, bins=5)[source]

Bases: object

Bucketizes a column of continuous values into equal-sized bins to perform stratification.

Parameters:
  • colname (string) – Column containing continuous values
  • bins (integer) – Number of equal-sized bins to map original values to.
Returns:

bucket – Bucket object to be used as column in stratification.

Return type:

Bucket

colname
class handyspark.Quantile(colname, bins=5)[source]

Bases: handyspark.sql.dataframe.Bucket

Bucketizes a column of continuous values into quantiles to perform stratification.

Parameters:
  • colname (string) – Column containing continuous values
  • bins (integer) – Number of quantiles to map original values to.
Returns:

quantile – Quantile object to be used as column in stratification.

Return type:

Quantile
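
Both are used as columns in stratify; for example:

>>> from handyspark import Bucket, Quantile
>>> # Five equal-width bins of 'age'
>>> hdf.stratify([Bucket('age', bins=5)]).cols['fare'].median()
>>> # Quintiles instead, so strata hold roughly the same number of rows
>>> hdf.stratify([Quantile('age', bins=5)]).cols['fare'].median()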

class handyspark.BinaryClassificationMetrics(scoreAndLabels)[source]

Bases: pyspark.mllib.common.JavaModelWrapper

Evaluator for binary classification.

Parameters:scoreAndLabels – an RDD of (score, label) pairs
>>> scoreAndLabels = sc.parallelize([
...     (0.1, 0.0), (0.1, 1.0), (0.4, 0.0), (0.6, 0.0), (0.6, 1.0), (0.6, 1.0), (0.8, 1.0)], 2)
>>> metrics = BinaryClassificationMetrics(scoreAndLabels)
>>> metrics.areaUnderROC
0.70...
>>> metrics.areaUnderPR
0.83...
>>> metrics.unpersist()

New in version 1.4.0.

areaUnderPR

Computes the area under the precision-recall curve.

New in version 1.4.0.

areaUnderROC

Computes the area under the receiver operating characteristic (ROC) curve.

New in version 1.4.0.

confusionMatrix(threshold=0.5)

Returns the confusion matrix: predicted classes are in columns, ordered by ascending class label, as in "labels".

Predicted classes are computed according to the informed threshold.

Parameters:threshold (double, optional) – Threshold probability for the positive class. Default is 0.5.
Returns:confusionMatrix
Return type:DenseMatrix
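
For example, reusing the metrics object from the doctest above:

>>> metrics.confusionMatrix().toArray()               # default 0.5 threshold
>>> metrics.confusionMatrix(threshold=0.8).toArray()  # stricter positive class
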
fMeasureByThreshold(beta=1.0)

Calls the fMeasureByThreshold method from the Java class.

Returns the (threshold, F-Measure) curve as an RDD of (threshold, F-Measure) pairs.

Parameters:beta (double, optional) – The beta factor in F-Measure computation. Default is 1.0.

See also: F1 score (Wikipedia): http://en.wikipedia.org/wiki/F1_score
getMetricsByThreshold()
pr()

Calls the pr method from the Java class.

Returns the precision-recall curve: an RDD of (recall, precision) pairs, NOT (precision, recall), with (0.0, p) prepended to it, where p is the precision associated with the lowest recall on the curve.

See also: Precision and recall (Wikipedia): http://en.wikipedia.org/wiki/Precision_and_recall
precisionByThreshold()

Calls the precisionByThreshold method from the Java class.

Returns the (threshold, precision) curve.
recallByThreshold()

Calls the recallByThreshold method from the Java class.

Returns the (threshold, recall) curve.
roc()

Calls the roc method from the Java class.

Returns the receiver operating characteristic (ROC) curve: an RDD of (false positive rate, true positive rate) pairs, with (0.0, 0.0) prepended and (1.0, 1.0) appended to it.

See also: Receiver operating characteristic (Wikipedia): http://en.wikipedia.org/wiki/Receiver_operating_characteristic
thresholds()

Returns thresholds in descending order.
unpersist()[source]

Unpersists intermediate RDDs used in the computation.

New in version 1.4.0.