handyspark.extensions package

Submodules

handyspark.extensions.common module

handyspark.extensions.common.call2(self, name, *a)[source]

An alternative call method for JavaModelWrapper. Use it whenever the JavaModel returns a Scala Tuple that needs to be deserialized before being converted to Python.

handyspark.extensions.evaluation module

handyspark.extensions.evaluation.confusionMatrix(self, threshold=0.5)[source]

Returns the confusion matrix. Predicted classes are in columns, ordered by class label ascending, as in “labels”.

Predicted classes are computed according to the given threshold.

Parameters: threshold (double, optional) – Threshold probability for the positive class. Default is 0.5.
Returns: confusionMatrix
Return type: DenseMatrix
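
To illustrate the layout described above, here is a plain-Python sketch that counts the four cells for a list of (score, label) pairs. It does not call HandySpark or Spark. The helper name `confusion_matrix_sketch`, the placement of actual classes in rows, and the choice to treat a score equal to the threshold as positive are assumptions made for this example.

```python
def confusion_matrix_sketch(score_and_labels, threshold=0.5):
    """Hypothetical helper: 2x2 counts with actual classes in rows and
    predicted classes in columns, both ordered by label ascending (0, 1)."""
    matrix = [[0, 0], [0, 0]]
    for score, label in score_and_labels:
        # Assumption: a score equal to the threshold counts as positive.
        predicted = 1 if score >= threshold else 0
        matrix[int(label)][predicted] += 1
    return matrix


# Same (score, label) pairs as the BinaryClassificationMetrics class example.
pairs = [(0.1, 0.0), (0.1, 1.0), (0.4, 0.0), (0.6, 0.0),
         (0.6, 1.0), (0.6, 1.0), (0.8, 1.0)]
matrix = confusion_matrix_sketch(pairs, threshold=0.5)
print(matrix)
```

Raising the threshold moves counts from the right-hand (predicted positive) column to the left-hand one, so the same scores can give very different matrices.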
handyspark.extensions.evaluation.fMeasureByThreshold(self, beta=1.0)[source]

Calls the fMeasureByThreshold method from the Java class.

  • Returns the (threshold, F-Measure) curve.
  • See also: F1 score (Wikipedia), http://en.wikipedia.org/wiki/F1_score

Parameters: beta – the beta factor in the F-Measure computation.
Returns: an RDD of (threshold, F-Measure) pairs.
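
For reference, the role of beta can be sketched with the standard F-beta formula. The function below is a standalone illustration, not the code Spark runs. The name `f_measure` and the convention of returning 0.0 when both inputs are zero are assumptions.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score: weighted harmonic mean of precision and recall."""
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator == 0.0:
        # Assumption: define the score as 0.0 when both inputs are 0.
        return 0.0
    return (1.0 + beta_sq) * precision * recall / denominator


f1 = f_measure(0.5, 1.0)            # beta = 1 gives the usual F1 score
f2 = f_measure(0.5, 1.0, beta=2.0)  # beta > 1 weights recall more heavily
print(f1, f2)
```

Because recall is the larger of the two inputs here, the beta=2.0 score comes out higher than the beta=1.0 score.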
handyspark.extensions.evaluation.getMetricsByThreshold(self)[source]
handyspark.extensions.evaluation.pr(self)[source]

Calls the pr method from the Java class.

  • Returns the precision-recall curve, which is an RDD of (recall, precision), NOT (precision, recall), with (0.0, p) prepended to it, where p is the precision associated with the lowest recall on the curve.
  • See also: Precision and recall (Wikipedia), http://en.wikipedia.org/wiki/Precision_and_recall
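
The point ordering and the prepended (0.0, p) point are easy to get wrong, so here is a small plain-Python sketch of the same construction. It is an illustration only: the helper name `pr_curve_sketch`, the use of every distinct score as a threshold, and the rule that a score at or above the threshold is predicted positive are assumptions. The data must contain at least one positive label.

```python
def pr_curve_sketch(score_and_labels):
    """Hypothetical helper: list of (recall, precision) points, one per
    distinct score used as a threshold, in descending threshold order."""
    total_pos = sum(1 for _, label in score_and_labels if label == 1.0)
    points = []
    for t in sorted({s for s, _ in score_and_labels}, reverse=True):
        # Assumption: scores at or above the threshold are predicted positive.
        predicted = [label for s, label in score_and_labels if s >= t]
        tp = sum(1 for label in predicted if label == 1.0)
        points.append((tp / total_pos, tp / len(predicted)))
    # The highest threshold has the lowest recall; reuse its precision
    # for the prepended (0.0, p) point.
    return [(0.0, points[0][1])] + points


pairs = [(0.1, 0.0), (0.1, 1.0), (0.4, 0.0), (0.6, 0.0),
         (0.6, 1.0), (0.6, 1.0), (0.8, 1.0)]
pr_points = pr_curve_sketch(pairs)
for recall, precision in pr_points:
    print(recall, precision)
```

Each tuple is (recall, precision), so recall goes on the x-axis when plotting.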
handyspark.extensions.evaluation.precisionByThreshold(self)[source]

Calls the precisionByThreshold method from the Java class

  • Returns the (threshold, precision) curve.
handyspark.extensions.evaluation.recallByThreshold(self)[source]

Calls the recallByThreshold method from the Java class

  • Returns the (threshold, recall) curve.
handyspark.extensions.evaluation.roc(self)[source]

Calls the roc method from the Java class.

  • Returns the receiver operating characteristic (ROC) curve, which is an RDD of (false positive rate, true positive rate) with (0.0, 0.0) prepended and (1.0, 1.0) appended to it.
  • See also: Receiver operating characteristic (Wikipedia), http://en.wikipedia.org/wiki/Receiver_operating_characteristic
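
The curve described above can be sketched in plain Python, without Spark, to show where the prepended and appended points sit. The helper names `roc_curve_sketch` and `trapezoid_area`, the use of every distinct score as a threshold, and the rule that a score at or above the threshold is predicted positive are assumptions. The data must contain both classes.

```python
def roc_curve_sketch(score_and_labels):
    """Hypothetical helper: (false positive rate, true positive rate) points
    with (0.0, 0.0) prepended and (1.0, 1.0) appended."""
    total_pos = sum(1 for _, label in score_and_labels if label == 1.0)
    total_neg = len(score_and_labels) - total_pos
    points = [(0.0, 0.0)]
    for t in sorted({s for s, _ in score_and_labels}, reverse=True):
        # Assumption: scores at or above the threshold are predicted positive.
        tp = sum(1 for s, label in score_and_labels if s >= t and label == 1.0)
        fp = sum(1 for s, label in score_and_labels if s >= t and label != 1.0)
        points.append((fp / total_neg, tp / total_pos))
    points.append((1.0, 1.0))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


pairs = [(0.1, 0.0), (0.1, 1.0), (0.4, 0.0), (0.6, 0.0),
         (0.6, 1.0), (0.6, 1.0), (0.8, 1.0)]
roc_points = roc_curve_sketch(pairs)
roc_area = trapezoid_area(roc_points)
print(roc_points)
print(roc_area)
```

On this sample the trapezoidal area is 17/24, about 0.708, which is consistent with the `0.70...` shown for `areaUnderROC` in the BinaryClassificationMetrics class example.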
handyspark.extensions.evaluation.thresholds(self)[source]
  • Returns thresholds in descending order.

handyspark.extensions.types module

handyspark.extensions.types.ret(self, expr)[source]

Assigns a return type to the expression when used inside an assign method.

Module contents

class handyspark.extensions.BinaryClassificationMetrics(scoreAndLabels)[source]

Bases: pyspark.mllib.common.JavaModelWrapper

Evaluator for binary classification.

Parameters: scoreAndLabels – an RDD of (score, label) pairs
>>> scoreAndLabels = sc.parallelize([
...     (0.1, 0.0), (0.1, 1.0), (0.4, 0.0), (0.6, 0.0), (0.6, 1.0), (0.6, 1.0), (0.8, 1.0)], 2)
>>> metrics = BinaryClassificationMetrics(scoreAndLabels)
>>> metrics.areaUnderROC
0.70...
>>> metrics.areaUnderPR
0.83...
>>> metrics.unpersist()

New in version 1.4.0.

areaUnderPR

Computes the area under the precision-recall curve.

New in version 1.4.0.

areaUnderROC

Computes the area under the receiver operating characteristic (ROC) curve.

New in version 1.4.0.

confusionMatrix(threshold=0.5)

Returns the confusion matrix. Predicted classes are in columns, ordered by class label ascending, as in “labels”.

Predicted classes are computed according to the given threshold.

Parameters: threshold (double, optional) – Threshold probability for the positive class. Default is 0.5.
Returns: confusionMatrix
Return type: DenseMatrix
fMeasureByThreshold(beta=1.0)

Calls the fMeasureByThreshold method from the Java class.

  • Returns the (threshold, F-Measure) curve.
  • See also: F1 score (Wikipedia), http://en.wikipedia.org/wiki/F1_score

Parameters: beta – the beta factor in the F-Measure computation.
Returns: an RDD of (threshold, F-Measure) pairs.
getMetricsByThreshold()
pr()

Calls the pr method from the Java class.

  • Returns the precision-recall curve, which is an RDD of (recall, precision), NOT (precision, recall), with (0.0, p) prepended to it, where p is the precision associated with the lowest recall on the curve.
  • See also: Precision and recall (Wikipedia), http://en.wikipedia.org/wiki/Precision_and_recall
precisionByThreshold()

Calls the precisionByThreshold method from the Java class

  • Returns the (threshold, precision) curve.
recallByThreshold()

Calls the recallByThreshold method from the Java class

  • Returns the (threshold, recall) curve.
roc()

Calls the roc method from the Java class.

  • Returns the receiver operating characteristic (ROC) curve, which is an RDD of (false positive rate, true positive rate) with (0.0, 0.0) prepended and (1.0, 1.0) appended to it.
  • See also: Receiver operating characteristic (Wikipedia), http://en.wikipedia.org/wiki/Receiver_operating_characteristic
thresholds()
  • Returns thresholds in descending order.
unpersist()[source]

Unpersists intermediate RDDs used in the computation.

New in version 1.4.0.