Predicting Cancer Recurrence Outcome in Python

Predicting Breast Cancer Recurrence Outcome

In this post we will build a model for predicting cancer recurrence outcome with Logistic Regression in Python based on a real data set. Following the previous post about the concept of logistic regression (classification), let’s build a machine learning model that takes real data on breast cancer patients (using features like the patient’s age, the size of the tumor, etc.) and tries to predict whether or not the cancer will come back. Because the answer to the problem will be discrete (yes or no), we will use logistic regression as the algorithm.

Follow this Example on Colab

You can follow the step-by-step code and the visualizations on our public Google Colab notebook here. You can also try running this code with a few tweaks or a different data source on Google Colab. If you’re new to Google Colab, you can follow the step-by-step guide here.


Step 1: Choosing the Data Set

The data set that we will use to build the logistic regression model is the Breast Cancer Wisconsin (Prognostic) Data Set from the UC Irvine Machine Learning Repository. It is the same data set that we used in our previous post to build the linear regression model to predict the cancer’s time (in months) to recurrence. Thus, for the model we are trying to build, the elements are:

Logistic Regression Model Elements

  1. Predictors or Independent Variables or Inputs – The attributes present in the data set apart from ID, Time and Outcome, like Radius Mean, Area Mean, Perimeter Mean, etc.
  2. Response or Dependent Variable or Output – The Outcome column: a binary label indicating whether or not the disease recurred.
  3. Residual – The difference between the actual outcome (real value) and the probability of recurrence predicted by the model (predicted value of the dependent variable).
  4. Weights or Coefficients – These are calculated to determine the decision boundary that best “fits” the data. The coefficients will be computed by minimizing the cross-entropy function.
  5. Intercept or Bias – It helps offset the effects of missing relevant predictors for the response and helps make the mean of the residuals 0. The intercept or bias acts as the default value for the function, i.e. when all independent variables are zero.

The hypothesis will look like:

P(Outcome = 1) = σ(W0 + W1∙Radius Mean + W2∙Texture Mean + W3∙Perimeter Mean + …)

where σ is the sigmoid function and we determine the weights (coefficients) that generate the best predictions.
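To make the binary output concrete: logistic regression passes the weighted sum through the sigmoid function, which squashes any real number into a probability between 0 and 1. Here is a minimal sketch of that idea (the weights and feature values below are made up purely for illustration):

import numpy as np

def sigmoid(z):
    # Squash any real number into the (0, 1) range
    return 1 / (1 + np.exp(-z))

# Hypothetical weights and feature values, for illustration only
bias = -10.0                             # W0
w = np.array([0.5, -0.02, 0.1])          # W1, W2, W3
x = np.array([17.99, 10.38, 122.8])      # Radius Mean, Texture Mean, Perimeter Mean

print(sigmoid(bias + np.dot(w, x)))      # P(Outcome = 1)

A prediction of 1 (recurrence) is made when this probability crosses a threshold, typically 0.5.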

We will use the Python programming language, which has a plethora of libraries for data science. Let’s start by loading the data set into a data frame using Pandas, a data analysis and manipulation library.

import pandas as pd

# dataset_path points at the downloaded data file and column_names lists its
# column names (both are defined in the accompanying notebook)
raw_dataset = pd.read_csv(dataset_path, names=column_names, na_values="?", sep=",")
dataset = raw_dataset.copy()

What the data looks like after loading it into a Pandas data frame.
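A view like this can be reproduced with the data frame’s head() method, which returns the first few rows:

dataset.head()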

Step 2: Cleaning the Data

We will now remove data that is not useful to our model, such as rows with NA values and the ID and Time columns. We also remove and store the target labels, which are present in the Outcome column. This step leaves us with 194 rows of data with 32 attributes each.

dataset = dataset.dropna()
y = dataset['Outcome']
drop_list = ['ID', 'Time', 'Outcome']
dataset = dataset.drop(drop_list, axis=1)
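To confirm that the cleaning left us with the row and column counts mentioned above, we can check the data frame’s shape:

print(dataset.shape)  # should print (194, 32)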

The Pandas library also gives us an easy way to view key statistics for each column/attribute in a data frame. Let’s have a look at these statistics:
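These statistics (count, mean, standard deviation, quartiles and so on) come from the data frame’s describe() method:

dataset.describe()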

Step 3: Feature Engineering

As in the Linear Regression in Healthcare post, we perform feature engineering before building and evaluating our model. Since we are using the same attributes as in the Linear Regression example, we’ll use the same approach; to understand how this feature engineering is done, please refer to the post above. A sketch of the kind of step involved follows below.
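The exact transformations are described in the linked post; purely as an illustration (and as an assumption on our part, since the original pipeline may differ), standardizing every attribute to zero mean and unit variance looks like this:

from sklearn.preprocessing import StandardScaler

# Rescale each attribute to zero mean and unit variance
scaler = StandardScaler()
x = scaler.fit_transform(dataset)

The resulting array x, together with the labels y from Step 2, is what we feed into the model below.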

Step 4: Split Data into Training and Testing

Let’s construct the training set and the testing set with a split of 80% for the training set and 20% for the test set. We use the Logistic Regression algorithm provided by scikit-learn, a machine learning library, to build the model and then fit it to our training data. We will also use a new metric for evaluating the model, known as a confusion matrix.

Step 5: Running the Logistic Regression Model in Python

This function will build and evaluate the Logistic Regression model.

from sklearn import linear_model
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
import seaborn as sns

def build_and_evaluate_model(x, y):
    # 80/20 train/test split
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
    regr = linear_model.LogisticRegression(solver="lbfgs", max_iter=2000)
    regr.fit(x_train, y_train)
    y_pred = regr.predict(x_test)
    accuracy = regr.score(x_test, y_test)
    print("The accuracy of the model is ", accuracy * 100)
    # Plot the confusion matrix as an annotated heatmap
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt="d")
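A minimal invocation, assuming x holds the feature engineered attributes from Step 3 and y the Outcome labels from Step 2:

build_and_evaluate_model(x, y)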

Let’s now analyze it and see how it performs!

Step 6: Evaluating the Model

If you take a closer look at the code above, you’ll notice we created a confusion matrix. This matrix will help us evaluate our model. Essentially, the confusion matrix gives us an overview of all the correct and incorrect cases classified by our model. This is what a general confusion matrix looks like for a classifier trying to categorize between two classes (0 and 1):


                                Actual Value of Label: 0    Actual Value of Label: 1
Predicted Value of Label: 0     True Negative               False Negative
Predicted Value of Label: 1     False Positive              True Positive

Now, let’s see the confusion matrix for our model in action and let’s analyze its performance.

The confusion matrix for the feature engineered model can be interpreted as follows. In 32 cases the model correctly predicted that the disease would not recur. In 4 cases the model thought the disease would recur but it was wrong. There were no instances where the model predicted the disease would not recur when it actually did. There were 3 instances where the model correctly predicted that the disease would recur.

The accuracy of a model can be computed directly from the confusion matrix with this formula:

Accuracy = (True Positives + True Negatives) / (TP + TN + FP + FN)

For our model that works out to (3 + 32) / (3 + 32 + 4 + 0) = 35/39 ≈ 89.74%.
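In code, the same computation can be read straight off the matrix returned by scikit-learn (whose rows are the actual labels and columns the predicted labels):

tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / (tn + fp + fn + tp)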

The table below compares the performance of the model we built against two other basic models to highlight the importance of feature engineering:

#    Model Description                                                         Accuracy (in %)
1.   A baseline model                                                          84.61
2.   A model with a single attribute (Radius Mean) as the predictor            82.05
3.   A model with basic feature engineering (see the Linear Regression post)   89.74
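The post does not include code for models 1 and 2, so the following is an assumption about what they look like: the baseline is presumably the same logistic regression run on the raw, un-engineered attributes, and the single-predictor model keeps only the Radius Mean column (the exact column name depends on how column_names was defined):

# Baseline: raw attributes, no feature engineering (our assumption)
build_and_evaluate_model(dataset, y)

# Single-predictor model: Radius Mean only (column name assumed)
build_and_evaluate_model(dataset[['Radius Mean']], y)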

The code for this article was provided by Yash Mathur. Thanks for this great example Yash!

Logistic Regression, Gradient Descent, Regularization and more

In this post we saw how logistic regression works and how it can be used to predict whether or not cancer will recur in a patient. The code for the model is available here. For a deeper understanding of how logistic regression works and clear examples of this algorithm in machine learning (with technical topics like gradient descent and regularization), enroll in the free “Learn AI with an AI Course” with Audrey Durand.

Machine Learning in Healthcare Series

In this series of articles we explore the use of machine learning in the healthcare industry. Important concepts and algorithms are covered with real applications including open data sets, open code (available on Github) used to run the analyses and the final results and insights.

