handyspark package¶
Submodules¶
handyspark.plot module¶
handyspark.stats module¶
handyspark.stats.KolmogorovSmirnovTest(sdf, colname, dist='normal', *params)[source]¶ Performs a Kolmogorov-Smirnov test comparing the distribution of values in a column to a named canonical distribution.
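A minimal usage sketch (assuming a running SparkSession named spark; the column name and distribution parameters are hypothetical, and the trailing positional arguments are assumed to be the distribution's parameters passed through *params):

from handyspark.stats import KolmogorovSmirnovTest

# Hypothetical DataFrame with a numeric column `value`.
sdf = spark.range(100).selectExpr('cast(id as double) as value')

# Compare `value` against a normal distribution; here 49.5 and 29.0 are
# assumed to be the mean and standard deviation of the reference normal.
result = KolmogorovSmirnovTest(sdf, 'value', 'normal', 49.5, 29.0)
print(result)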
handyspark.stats.StatisticalSummaryValues(sdf, colnames)[source]¶ Builds a Java StatisticalSummaryValues object for each column.
handyspark.util module¶
class handyspark.util.bcolors[source]¶ Bases: object
BOLD = '\x1b[1m'¶
ENDC = '\x1b[0m'¶
FAIL = '\x1b[91m'¶
HEADER = '\x1b[95m'¶
OKBLUE = '\x1b[94m'¶
OKGREEN = '\x1b[92m'¶
UNDERLINE = '\x1b[4m'¶
WARNING = '\x1b[93m'¶
handyspark.util.call_scala_method(py_class, scala_method, df, *args)[source]¶ Given a Python class, calls a method from its Scala equivalent.
handyspark.util.dense_to_array(sdf, colname, new_colname)[source]¶ Casts a Vector column into a new Array column.
handyspark.util.disassemble(sdf, colname, new_colnames=None)[source]¶ Disassembles a Vector/Array column into multiple columns.
Module contents¶
class handyspark.HandyFrame(df, handy=None)[source]¶ Bases: pyspark.sql.dataframe.DataFrame
HandySpark version of DataFrame.
cols¶ HandyColumns – class to access pandas-like column-based methods implemented in Spark
pandas¶ HandyPandas – class to access pandas-like column-based methods through pandas UDFs
transformers¶ HandyTransformers – class to generate Handy transformers
stages¶ integer – number of stages in the execution plan
response¶ string – name of the response column
is_classification¶ boolean – True if the response is a categorical variable
classes¶ list – list of classes for a classification problem
nclasses¶ integer – number of classes for a classification problem
ncols¶ integer – number of columns of the HandyFrame
nrows¶ integer – number of rows of the HandyFrame
shape¶ tuple – tuple representing the dimensionality of the HandyFrame
statistics_¶ dict – imputation fill value for each feature; if stratified, first-level keys are filter clauses for stratification
fences_¶ dict – fence values for each feature; if stratified, first-level keys are filter clauses for stratification
is_stratified¶ boolean – True if the HandyFrame was stratified
values¶ ndarray – NumPy representation of the HandyFrame
Available methods:
- notHandy: makes it a plain Spark DataFrame
- stratify: used to perform stratified operations
- isnull: checks for missing values
- fill: fills missing values
- outliers: checks for outliers
- fence: fences outliers
- set_safety_limit: defines a new safety limit for collect operations
- safety_off: disables the safety limit for a single operation
- assign: appends new columns based on an expression
- nunique: returns the number of unique values in each column
- set_response: sets the column to be used as response / label
- disassemble: turns a vector / array column into multiple columns
- to_metrics_RDD: turns probability and label columns into a tuple RDD
assign(**kwargs)[source]¶ Assign new columns to a HandyFrame, returning a new object (a copy) with all the original columns in addition to the new ones.
Parameters: kwargs (keyword, value pairs) – keywords are the column names. If the values are callable, they are computed on the DataFrame and assigned to the new columns. If the values are not callable (e.g., a scalar or string), they are simply assigned.
Returns: df – A new HandyFrame with the new columns in addition to all the existing columns.
Return type: HandyFrame
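A short sketch of assign (the column name is hypothetical; the callable is assumed to receive the frame and return a Spark Column, mirroring pandas' assign):

from pyspark.sql import functions as F
from handyspark import HandyFrame

hdf = HandyFrame(df)  # `df` is an existing Spark DataFrame with a `price` column

# Callable values are computed on the frame; non-callables are assigned as-is.
hdf2 = hdf.assign(log_price=lambda d: F.log(d['price']))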
classes¶ Returns the list of classes for a classification problem.
collect()[source]¶ Returns all the records as a list of Row.
By default, its output is limited by the safety limit. To get the original collect behavior, call the safety_off method first.
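A minimal sketch of the safety limit around collect (the limit value shown is hypothetical):

hdf = HandyFrame(df)

hdf.set_safety_limit(1000)             # cap collect output at 1,000 records
rows = hdf.collect()                   # truncated to the safety limit
all_rows = hdf.safety_off().collect()  # limit disabled for this single operation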
cols¶ Returns a class to access pandas-like column-based methods implemented in Spark.
Available methods: min, max, median, q1, q3, stddev, value_counts, mode, corr, nunique, hist, boxplot, scatterplot
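A brief sketch of the accessor (assuming it is indexed by column name; `fare` is a hypothetical column):

hdf = HandyFrame(df)

print(hdf.cols['fare'].median())
print(hdf.cols['fare'].value_counts())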
disassemble(colname, new_colnames=None)[source]¶ Disassembles a Vector or Array column into multiple columns.
Parameters:
- colname (string) – Column containing Vector or Array elements.
- new_colnames (list of string, optional) – Default is None: column names are generated by appending a sequential suffix (e.g., _0, _1, etc.) to colname. If informed, it must have as many column names as there are elements in the shortest vector/array of colname.
Returns: df – A new HandyFrame with the new disassembled columns in addition to all the existing columns.
Return type: HandyFrame
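A short sketch (assuming a running SparkSession named spark):

from pyspark.ml.linalg import Vectors
from handyspark import HandyFrame

df = spark.createDataFrame([(Vectors.dense([1.0, 2.0, 3.0]),)], ['features'])
hdf = HandyFrame(df)

# Produces features_0, features_1 and features_2 alongside `features`.
hdf2 = hdf.disassemble('features')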
fence(colnames, k=1.5)[source]¶ Caps outliers using lower and upper fences given by Tukey's method, using 1.5 times the interquartile range (IQR).
The fence values used for capping outliers are kept in the fences_ property and can later be used to generate a corresponding HandyFencer transformer.
For more information, check: https://en.wikipedia.org/wiki/Outlier#Tukey’s_fences
Parameters:
- colnames (list of string) – Column names to apply fencing to.
- k (float, optional) – Constant multiplier for the IQR. Default is 1.5 (corresponding to Tukey's fences; use 3 for "far out" values).
Returns: df – A new HandyFrame with capped outliers.
Return type: HandyFrame
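For reference, Tukey's fences are Q1 - k*IQR (lower) and Q3 + k*IQR (upper); values outside them are capped at the fence values. A short sketch (`fare` is a hypothetical column):

hdf = HandyFrame(df)

fenced = hdf.fence(['fare'], k=1.5)
print(fenced.fences_)  # fence values, reusable in a HandyFencer transformer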
fences_¶ Returns a dictionary with fence values for each feature. If stratified, first-level keys are filter clauses for stratification.
fill(*args, categorical=None, continuous=None, strategy=None)[source]¶ Fill NA/NaN values using the specified methods.
The values used for imputation are kept in the statistics_ property and can later be used to generate a corresponding HandyImputer transformer.
Parameters:
- categorical ('all' or list of string, optional) – List of categorical columns. These columns are filled with their corresponding modes (most common values).
- continuous ('all' or list of string, optional) – List of continuous value columns. By default, these columns are filled with their corresponding means. If a same-sized list is provided in the strategy argument, the corresponding strategy is used for each column.
- strategy (list of string, optional) – If informed, it must contain a strategy (either mean or median) for each one of the continuous columns.
Returns: df – A new HandyFrame with filled missing values.
Return type: HandyFrame
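A short sketch (column names are hypothetical):

hdf = HandyFrame(df)

# `embarked` filled with its mode; `age` with its median, `fare` with its mean.
filled = hdf.fill(categorical=['embarked'],
                  continuous=['age', 'fare'],
                  strategy=['median', 'mean'])
print(filled.statistics_)  # imputation values, reusable in a HandyImputer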
is_classification¶ Returns True if the response is a categorical variable.
isnull(ratio=False)[source]¶ Returns an array with counts of missing values for each column in the HandyFrame.
Parameters: ratio (boolean, default False) – If True, returns ratios instead of absolute counts.
Returns: counts
Return type: Series
nclasses¶ Returns the number of classes for a classification problem.
ncols¶ Returns the number of columns of the HandyFrame.
nrows¶ Returns the number of rows of the HandyFrame.
nunique()[source]¶ Returns a Series with the number of distinct observations for all columns.
Returns: nunique
Return type: Series
outliers(ratio=False, method='tukey', **kwargs)[source]¶ Returns a Series with the number of outlier observations for all columns, according to the specified method.
Parameters:
- ratio (boolean, optional) – If True, returns proportions instead of counts. Default is False.
- method (string, optional) – Method used to detect outliers. Currently, only Tukey's method is supported. Default is tukey.
Returns: outliers
Return type: Series
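A one-line sketch, continuing from a HandyFrame `hdf`:

print(hdf.outliers(ratio=False, method='tukey'))  # Tukey outlier counts per column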
pandas¶ Returns a class to access pandas-like column-based methods through pandas UDFs.
Available methods: between / between_time, isin, isna / isnull, notna / notnull, abs, clip / clip_lower / clip_upper, replace, round / truncate, tz_convert / tz_localize
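A brief sketch (assuming the accessor is indexed by column name like cols, and that the result behaves like a Spark column expression; `age` is a hypothetical column):

hdf = HandyFrame(df)

adult = hdf.pandas['age'].between(18, 65)  # evaluated as a pandas UDF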
response¶ Returns the name of the response column.
set_response(colname)[source]¶ Sets the column to be used as response in supervised learning algorithms.
Parameters: colname (string) – Column to be used as response.
Return type: self
shape¶ Returns a tuple representing the dimensionality of the HandyFrame.
stages¶ Returns the number of stages in the execution plan.
statistics_¶ Returns a dictionary with the imputation fill value for each feature. If stratified, first-level keys are filter clauses for stratification.
stratify(strata)[source]¶ Stratifies the HandyFrame.
Stratified operations should be more efficient than group-by operations, as they rely on three iterative steps: filtering the underlying HandyFrame, performing the operation, and aggregating the results.
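A short sketch (assuming the stratified frame exposes the same cols accessor; column names are hypothetical):

hdf = HandyFrame(df)

# Median age computed separately for each value of `sex`.
hdf.stratify(['sex']).cols['age'].median()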
to_metrics_RDD(prob_col='probability', label_col='label')[source]¶ Converts a DataFrame containing predicted probabilities and classification labels into an RDD suited for use with a BinaryClassificationMetrics object.
Parameters:
- prob_col (string, optional) – Column containing predicted probabilities. Default is 'probability'.
- label_col (string, optional) – Column containing classification labels. Default is 'label'.
Returns: rdd – RDD of tuples (probability, label)
Return type: RDD
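A short sketch chaining into the evaluator documented below (`predictions` is a hypothetical DataFrame produced by a fitted classifier, with the default column names):

from handyspark import BinaryClassificationMetrics, HandyFrame

rdd = HandyFrame(predictions).to_metrics_RDD('probability', 'label')
metrics = BinaryClassificationMetrics(rdd)
print(metrics.areaUnderROC)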
transformers¶ Returns a class to generate Handy transformers.
Available transformers: HandyImputer, HandyFencer
values¶ NumPy representation of the HandyFrame.
class handyspark.Bucket(colname, bins=5)[source]¶ Bases: object
Bucketizes a column of continuous values into equal-sized bins to perform stratification.
Parameters:
- colname (string) – Column containing continuous values.
- bins (integer) – Number of equal-sized bins to map original values to.
Returns: bucket – Bucket object to be used as a column in stratification.
Return type: Bucket
colname¶
class handyspark.Quantile(colname, bins=5)[source]¶ Bases: handyspark.sql.dataframe.Bucket
Bucketizes a column of continuous values into quantiles to perform stratification.
Parameters:
- colname (string) – Column containing continuous values.
- bins (integer) – Number of quantiles to map original values to.
Returns: quantile – Quantile object to be used as a column in stratification.
Return type: Quantile
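A short sketch of stratifying on a continuous column (column names are hypothetical, continuing from a HandyFrame `hdf`):

from handyspark import Bucket, Quantile

# Equal-width bins vs. equal-frequency bins for stratification.
hdf.stratify([Bucket('fare', bins=5)]).cols['age'].median()
hdf.stratify([Quantile('fare', bins=5)]).cols['age'].median()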
class handyspark.BinaryClassificationMetrics(scoreAndLabels)[source]¶ Bases: pyspark.mllib.common.JavaModelWrapper
Evaluator for binary classification.
Parameters: scoreAndLabels – an RDD of (score, label) pairs
>>> scoreAndLabels = sc.parallelize([
...     (0.1, 0.0), (0.1, 1.0), (0.4, 0.0), (0.6, 0.0), (0.6, 1.0), (0.6, 1.0), (0.8, 1.0)], 2)
>>> metrics = BinaryClassificationMetrics(scoreAndLabels)
>>> metrics.areaUnderROC
0.70...
>>> metrics.areaUnderPR
0.83...
>>> metrics.unpersist()
New in version 1.4.0.
areaUnderPR¶ Computes the area under the precision-recall curve.
New in version 1.4.0.
areaUnderROC¶ Computes the area under the receiver operating characteristic (ROC) curve.
New in version 1.4.0.
confusionMatrix(threshold=0.5)¶ Returns the confusion matrix: predicted classes are in columns, ordered by class label ascending, as in "labels".
Predicted classes are computed according to the informed threshold.
Parameters: threshold (double, optional) – Threshold probability for the positive class. Default is 0.5.
Returns: confusionMatrix
Return type: DenseMatrix
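Continuing the doctest above, a brief sketch:

cm = metrics.confusionMatrix(threshold=0.5)
print(cm.toArray())  # 2x2 array, predicted classes in columns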
fMeasureByThreshold(beta=1.0)¶ Calls the fMeasureByThreshold method from the Java class.
Returns the (threshold, F-Measure) curve as an RDD of (threshold, F-Measure) pairs.
Parameters: beta – the beta factor in the F-Measure computation.
See also: F1 score (Wikipedia): http://en.wikipedia.org/wiki/F1_score
getMetricsByThreshold()¶
pr()¶ Calls the pr method from the Java class.
Returns the precision-recall curve: an RDD of (recall, precision) pairs, NOT (precision, recall), with (0.0, p) prepended to it, where p is the precision associated with the lowest recall on the curve.
See also: Precision and recall (Wikipedia): http://en.wikipedia.org/wiki/Precision_and_recall
precisionByThreshold()¶ Calls the precisionByThreshold method from the Java class.
Returns the (threshold, precision) curve.
recallByThreshold()¶ Calls the recallByThreshold method from the Java class.
Returns the (threshold, recall) curve.
roc()¶ Calls the roc method from the Java class.
Returns the receiver operating characteristic (ROC) curve: an RDD of (false positive rate, true positive rate) pairs, with (0.0, 0.0) prepended and (1.0, 1.0) appended to it.
See also: Receiver operating characteristic (Wikipedia): http://en.wikipedia.org/wiki/Receiver_operating_characteristic
thresholds()¶ Returns thresholds in descending order.