Predicting Breast Cancer Recurrence Time
In this post we will build a model for predicting cancer recurrence time with Linear Regression in Python based on a real dataset. Following the previous post about the concept of linear regression, let’s build a machine learning model that takes real data on breast cancer patients (using features like the patient’s age, the size of the tumor, etc.) and tries to predict the time it takes for the cancer to come back. Because the answer to the problem will be a continuous number (an undefined number of months), we will use linear regression as the algorithm.
Follow this Example on Colab
You can follow the step-by-step code and the visualizations in the Colab notebook here. You can also try running this code with a few tweaks or a different data source on Google Colab by making a copy and editing it.
Step 1: Choosing the Data Set
The dataset that we will use to build the linear regression model is the Breast Cancer Wisconsin (Prognostic) Data Set from the UC Irvine Machine Learning Repository. The dataset contains 34 attributes, whose descriptions can be found along with the dataset. Our aim is to build a model that uses the attributes in the dataset to predict the time (in months) to recurrence of the cancer. Thus, for the model we are trying to build, the elements that we mentioned in the linear regression explanation are:
Linear Regression Model Elements
- Predictors or Independent Variables or Inputs – The attributes present in the dataset apart from (ID, Time, Outcome) like Radius Mean, Area Mean, Perimeter Mean etc.
- Response or Dependent Variable or Output – The time (in months) after which the disease recurred.
- Residual – The difference between the time after which the disease actually recurred (real value) and the time to recurrence predicted by the model (predicted value of the dependent variable).
- Weights or Coefficients – These are calculated to determine the line that best “fits” the data. The coefficients will be computed by minimizing the MSE.
- Intercept or Bias – The default value of the function, i.e. the prediction when all independent variables are zero. In this context, it helps offset the effect of relevant attributes for predicting time to recurrence that are missing from the data, and it ensures that the mean of the residuals is zero.
The Linear Regression hypothesis for such a task would look like:
Time (in months) = W0 + W1∙Radius Mean + W2∙Texture Mean + W3∙Perimeter Mean + …
where we determine the weights (coefficients) to generate the best fit line.
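To make the elements above concrete, here is a minimal sketch of fitting such a hypothesis by ordinary least squares with NumPy. The data is synthetic and the two predictors are made up for illustration; it is not the Wisconsin dataset and not the code used later in this post.

```python
import numpy as np

# Synthetic data for illustration only (not the Wisconsin dataset):
# two made-up predictors and a response with known weights plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 30.0 - 4.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=50)

# Prepend a column of ones so that W0 (the intercept) is learned as well.
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: the weights that minimize the mean squared error.
weights, *_ = np.linalg.lstsq(X_design, y, rcond=None)

predictions = X_design @ weights   # W0 + W1*x1 + W2*x2
residuals = y - predictions        # real value minus predicted value

print("W0, W1, W2:", np.round(weights, 2))
print("Mean of residuals:", round(float(residuals.mean()), 10))
```

Because the model includes an intercept, the residuals average out to zero (up to floating point error), which is the property described in the list above.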
We will use the Python programming language which has a plethora of libraries for data science. Let’s start by loading the dataset into a dataframe provided by the Pandas library, a data analysis and manipulation library.
import pandas as pd

# dataset_path and column_names must be defined beforehand
raw_dataset = pd.read_csv(dataset_path, names=column_names, na_values="?", sep=",")
dataset = raw_dataset.copy()
Step 2: Cleaning the Data
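Next, we drop the rows with NA values using dataset.dropna() and keep only the rows where the disease has recurred. We then drop the Outcome and ID columns, which are not used as predictors, with dataset.drop('column name', axis=1).
dataset = dataset.dropna()
# Keep only the patients whose cancer recurred (Outcome is not 'N')
dataset = dataset[dataset['Outcome'] != 'N']
# Outcome and ID are not predictors
dataset = dataset.drop('Outcome', axis=1)
dataset = dataset.drop('ID', axis=1)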
To view the descriptive stats for each variable, we can use dataset.describe():
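As a minimal sketch of what the cleaning steps and describe() do, here they are applied to a tiny made-up frame. Only the column names mirror the post; the rows are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; only the column names mirror the post.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Outcome": ["R", "N", "R", "R"],
    "Time": [12, 60, 27, 5],
    "Radius Mean": [18.0, 14.2, np.nan, 20.1],
})

cleaned = toy.dropna()                          # drops the row with a missing value
cleaned = cleaned[cleaned["Outcome"] != "N"]    # keeps recurrent cases only
cleaned = cleaned.drop(columns=["Outcome", "ID"])

print(cleaned)
print(cleaned.describe())   # count, mean, std, min, quartiles and max per column
```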
Before we use the attributes to build the model, let’s do some feature engineering!
Step 3: Feature Engineering
Feature engineering is the process of preparing the input data for the model (the predictors, in this case Radius Mean, Area Mean, Perimeter Mean, etc.) so that it is compatible with the assumptions of the model. There are different techniques for this (we will cover them in detail in another post), but in this example we will only remove highly correlated attributes and apply scaling to our data.
To start, let’s take a look at the correlation among attributes. Correlation indicates the strength of the linear relationship between two features. If the correlation is close to 1 or -1 there is a strong association between the variables, and if it is closer to 0 the relationship is weak.
A positive correlation means that the relationship is direct: if one variable increases, the other increases too, and if one decreases, so does the other. A negative correlation means that the relationship is inverse: if one variable increases the other decreases, and vice versa.
To find the correlations among all pairs of variables, we use a popular Python visualization library called seaborn to generate a heatmap of the correlations (i.e. how the attributes associate with each other) between the columns of the dataframe dataset.
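A minimal sketch of this step, under the assumption that the correlation matrix comes from pandas' DataFrame.corr() and is drawn with seaborn's heatmap function. The data is synthetic and the helper name plot_correlation_heatmap is hypothetical; the notebook's own plotting code may differ.

```python
import numpy as np
import pandas as pd

# Synthetic measurements for illustration only: area is derived from radius,
# so the two are strongly correlated, while texture is independent noise.
rng = np.random.default_rng(1)
radius = rng.uniform(10, 25, size=200)
toy = pd.DataFrame({
    "Radius Mean": radius,
    "Area Mean": np.pi * radius ** 2,
    "Texture Mean": rng.normal(20, 4, size=200),
})

# Pairwise Pearson correlation between every pair of columns.
corr = toy.corr()
print(corr.round(2))


def plot_correlation_heatmap(corr_matrix):
    """Draw a correlation matrix as a heatmap (requires seaborn and matplotlib)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.heatmap(corr_matrix, annot=True, fmt=".2f", vmin=-1, vmax=1)
    plt.show()
```

Calling plot_correlation_heatmap(corr) displays the matrix; on the real data the argument would be dataset.corr().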
The lightest colors show the strongest correlation (positive or negative), while the darker colors show a weak correlation. Inspecting the correlation visualization, we see that various attributes are strongly correlated with each other. Take for example Radius Mean and Area Mean, which have a correlation of 0.99 (almost 1). That makes sense, because the area is directly related to the radius: if the radius increases, so does the area.
What do we do with these highly correlated variables? Strongly correlated predictors introduce a problem called multicollinearity, which can have a negative impact on the accuracy of the model.
Multicollinearity refers to strong correlation among different attributes or inputs in a model. It makes the estimated coefficients unstable and hard to interpret, so it is important to address it before we fit the model. Therefore, for each group of strongly correlated attributes we keep one, chosen at random, and drop the rest. It is always good to check whether a high correlation between variables makes sense, as it does for Radius Mean and Area Mean in our case.
drop_list = [
    'Texture Mean', 'Perimeter Mean', 'Area Mean', 'Compactness Mean',
    'Smoothness Mean', 'Concavity Mean', 'Symmetry Mean',
    'Radius SE', 'Perimeter SE', 'Area SE', 'Texture SE', 'Compactness SE',
    'Smoothness SE', 'Concavity SE', 'Concave Points SE',
    'Fractal Dimension SE', 'Symmetry SE',
    'Area Worst', 'Perimeter Worst', 'Compactness Worst', 'Concavity Worst',
    'Fractal Dimension Worst', 'Symmetry Worst', 'Concave Points Worst',
    'Texture Worst', 'Lymph Node Status', 'Tumor Size',
]
feature_engineer_dataset = dataset.drop(drop_list, axis=1)
feature_engineer_dataset.head()
We now have 6 predictors (inputs) after dropping the highly correlated attributes above.
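The drop list above was chosen by inspecting the heatmap. As a hedged sketch, a list like it could also be derived programmatically. The helper correlated_columns_to_drop and the 0.9 threshold below are assumptions made for illustration, not part of the original notebook, and the helper keeps the first attribute of each correlated group rather than a random one.

```python
import numpy as np
import pandas as pd


def correlated_columns_to_drop(features, threshold=0.9):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = features.corr().abs()
    # Keep only the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Synthetic data for illustration: perimeter and area both follow from radius.
rng = np.random.default_rng(2)
radius = rng.uniform(10, 25, size=200)
toy = pd.DataFrame({
    "Radius Mean": radius,
    "Perimeter Mean": 2 * np.pi * radius,
    "Area Mean": np.pi * radius ** 2,
    "Texture Mean": rng.normal(20, 4, size=200),
})

to_drop = correlated_columns_to_drop(toy)
print(to_drop)
reduced = toy.drop(columns=to_drop)
```

The threshold is a judgment call: a lower value removes more attributes and leaves a smaller, less redundant set of predictors.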
The next step is standardizing the data using the sklearn-pandas library. Standardizing means transforming the independent variables (attributes) so that each one has a mean of 0 and a standard deviation of 1. This rescales the values without changing the shape of their distribution.
from sklearn.preprocessing import StandardScaler
from sklearn_pandas import DataFrameMapper

# Separate the response (Time) from the predictors
labels = feature_engineer_dataset.pop('Time')
# Apply StandardScaler to all predictor columns
mapper = DataFrameMapper([(feature_engineer_dataset.columns, StandardScaler())])
scaled_features = mapper.fit_transform(feature_engineer_dataset.copy())
scaled_features_df = pd.DataFrame(
    scaled_features,
    index=feature_engineer_dataset.index,
    columns=feature_engineer_dataset.columns,
)
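To make the effect of standardizing concrete, here is a minimal sketch that performs the same transformation by hand with pandas on made-up values. It is meant as an illustration of the arithmetic, not as a replacement for the scaler used above.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration; the two columns are on very different scales.
toy = pd.DataFrame({
    "Radius Mean": [12.0, 15.5, 18.2, 20.1, 14.3],
    "Fractal Dimension Mean": [0.058, 0.061, 0.070, 0.064, 0.059],
})

# Standardize: subtract each column's mean and divide by its standard deviation.
# ddof=0 (population standard deviation) matches what StandardScaler uses.
standardized = (toy - toy.mean()) / toy.std(ddof=0)

print(standardized.mean().round(6))        # approximately 0 for every column
print(standardized.std(ddof=0).round(6))   # 1 for every column
```

After this step both columns are on the same scale, so the size of a coefficient is no longer driven by the units of its attribute.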
Step 4: Split Data into Training and Testing
We now construct the training set and the test set, with 80% of the data for training and 20% for testing. To build the model we use the Linear Regression algorithm provided by scikit-learn, a machine learning library, and fit it to our training data.
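A minimal sketch of the 80/20 split with scikit-learn's train_test_split, using placeholder arrays in place of the scaled features and the labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for the scaled features and the labels.
x = np.arange(100).reshape(50, 2)
y = np.arange(50)

# test_size=0.2 holds out 20% of the rows; random_state makes the split repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

print(x_train.shape, x_test.shape)
```

Fixing random_state means the same rows end up in the test set on every run, which keeps the reported error comparable between experiments.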
Step 5: Running the Linear Regression Model in Python
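The following function splits the data, builds the Linear Regression model and evaluates it on the test set.
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def build_and_evaluate_model(x, y):
    # Hold out 20% of the data for testing
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
    # Fit the linear regression on the training set
    regr = linear_model.LinearRegression()
    regr.fit(x_train, y_train)
    # Predict on the unseen test set
    y_pred = regr.predict(x_test)
    print('Coefficients: \n', regr.coef_)
    print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
    print("The labels: \n", np.array(y_test).astype(int))
    print("The predicted values are: \n", y_pred.astype(int))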
Let’s now analyze it and see how it performs!
Step 6: Evaluating the Model
The coefficients, along with their signs, denote how the response varies with a change in the predictor. A positive coefficient means that there is a direct linear relationship between predictor and response (input and output): if the predictor increases, the response increases, and if the predictor decreases, the response also decreases. A negative coefficient indicates an inverse linear relationship between predictor and response: if the predictor increases the response decreases, and vice versa.
The magnitude of a coefficient, in turn, gives the size of the effect: it is the mean change in the response for each unit of change in the predictor, with the other predictors held constant.
To judge the performance of the model, we built two other models to compare with our final model. The process for building them is the same as the one described above. The results are shown below:
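The following sketch illustrates how sign and magnitude can be read off a fitted model. It uses synthetic data with known weights, and the predictor names are placeholders invented for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration: the response falls by 3 units for each
# unit increase in the first predictor and rises by 1.5 for the second.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = 40.0 - 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

for name, coef in zip(["predictor_a", "predictor_b"], model.coef_):
    direction = "direct" if coef > 0 else "inverse"
    print(f"{name}: {coef:+.2f} ({direction} relationship)")

# Raising predictor_a by one unit changes the prediction by its coefficient.
base = np.array([[0.0, 0.0]])
bumped = np.array([[1.0, 0.0]])
change = model.predict(bumped)[0] - model.predict(base)[0]
print("Change in prediction:", round(change, 2))
```

Because the predictors in this post are standardized, one unit of change there corresponds to one standard deviation of the original attribute.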
| # | Model Description | Mean Squared Error (MSE) |
|---|---|---|
| 1 | A baseline model | 4872.55 |
| 2 | A model with a single attribute as the predictor. The predictor chosen was Radius Mean, since it is highly correlated (negative correlation) with the response, Time. | 223.75 |
| 3 | A model with basic feature engineering involved. The process for building such a model has been described in this blog post. | 280.71 |
The importance of selecting features is evident from the results in the table above.
One way to compare linear regression models is the Mean Squared Error (MSE), the average of the squared differences between the predicted values and the real values. Intuitively, it tells us how far off the predictions are on average, so the smaller the MSE, the better the model's predictions.
As we can see, all the models have a high MSE, which indicates that none of them performs particularly well. This could be explained by the following reasons:
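As a minimal sketch of the metric, the snippet below computes the MSE by hand for made-up recurrence times. It also compares it against a hypothetical baseline that always predicts the mean of the labels; this is an assumption for illustration, since the post does not say how its own baseline model was built.

```python
import numpy as np

# Made-up recurrence times in months, for illustration only.
y_true = np.array([10.0, 24.0, 36.0, 5.0, 60.0])
y_pred = np.array([12.0, 20.0, 30.0, 9.0, 50.0])

# MSE: the average of the squared residuals.
mse = np.mean((y_true - y_pred) ** 2)

# A hypothetical baseline that always predicts the mean of the labels
# (the post does not say how its own baseline model was built).
baseline_pred = np.full_like(y_true, y_true.mean())
baseline_mse = np.mean((y_true - baseline_pred) ** 2)

print("Model MSE:", mse)
print("Mean-baseline MSE:", baseline_mse)
```

Since the MSE is expressed in squared units (months squared here), taking its square root gives a typical error in months, which is often easier to interpret.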
- The number of data points in the dataset is low.
- There is no strong linear relationship between the predictors and the response.
The code for this article was provided by Yash Mathur. Thanks for this great example Yash!
Machine Learning in Healthcare Series
In this series of articles we explore the use of machine learning in the healthcare industry. Important concepts and algorithms are covered with real applications including open datasets, open code (available on Github/Colab) used to run the analyses and the final results and insights.