Feature Importance in scikit-learn
Introduction
When we fit a supervised machine learning (ML) model, we often want to understand which features are most associated with our outcome of interest. Features that are highly associated with the outcome are considered more "important." In this article, we'll introduce you to the concept of feature importance through a discussion of:
- Tree-based feature importance
- Gini impurity
- Implementation in scikit-learn
- Other methods for estimating feature importance
Feature importance in an ML workflow
There are many reasons why we might be interested in calculating feature importances as part of our machine learning workflow. For example:
- Feature importance is often used for dimensionality reduction.
- We can use it as a filter method to remove irrelevant features from our model and only retain the ones that are most highly associated with our outcome of interest.
- Wrapper methods such as recursive feature elimination use feature importance to search the feature space for a model more efficiently (see the sketch after this list).
- Feature importance may also be used for model inspection and communication. For example, stakeholders may be interested in understanding which features are most important for prediction. Feature importance can help us answer this question.
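As an illustration of how feature importance drives feature selection, here is a minimal sketch using scikit-learn's RFE (recursive feature elimination) with a decision tree on the breast cancer dataset that appears later in this article; keeping 5 features is an arbitrary choice for demonstration purposes.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
dataset = load_breast_cancer()
X, y = dataset.data, dataset.target
# Recursively drop the feature with the lowest tree-based importance until
# only 5 features remain (5 is an arbitrary choice for this sketch).
selector = RFE(DecisionTreeClassifier(criterion='gini'), n_features_to_select=5)
selector.fit(X, y)
print(dataset.feature_names[selector.support_])  # names of the retained features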
Calculating feature importance
There are many different ways to calculate feature importance for different kinds of machine learning models. In this section, we'll investigate one tree-based method in a little more detail: Gini impurity.
Gini impurity
Imagine, for a moment, that you're interested in building a model to screen candidates for a particular job. In order to build this model, you've collected some data about candidates who you've hired and rejected in the past. For each of these candidates, suppose that you have data on years of experience and certification status. Consider the following two simple decision trees that use these features to predict whether the candidate was hired:
Which of these features seems to be more important for predicting whether a candidate will be hired? In the first tree, most candidates with more than 5 years of experience were hired and most candidates with less experience were rejected; in the second tree, however, all candidates with certifications were hired and all candidates without them were rejected.
Gini impurity is related to the extent to which observations are well separated based on the outcome variable at each node of the decision tree. For example, in the two trees above, the Gini impurity is higher in the node with all candidates (where there are an equal number of rejected and hired candidates) and lower in the nodes after the split (where most or all of the candidates in each grouping have the same outcome — either hired or rejected).
To estimate feature importance, we can calculate the Gini gain: the amount of Gini impurity that was eliminated at each branch of the decision tree. In this example, certification status has a higher Gini gain and is therefore considered to be more important based on this metric.
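To make the arithmetic concrete, here is a minimal Python sketch of the calculation; the candidate counts are hypothetical, chosen to mirror the certification split described above, where the parent node is evenly split and both child nodes are pure.
def gini_impurity(class_counts):
    # Gini impurity of a node: 1 minus the sum of squared class proportions.
    total = sum(class_counts)
    return 1 - sum((count / total) ** 2 for count in class_counts)
parent = [10, 10]                # 10 hired, 10 rejected -> impurity 0.5
left, right = [10, 0], [0, 10]   # pure child nodes -> impurity 0.0 each
n = sum(parent)
weighted_children = (sum(left) / n) * gini_impurity(left) \
                  + (sum(right) / n) * gini_impurity(right)
gini_gain = gini_impurity(parent) - weighted_children
print(gini_gain)  # 0.5: the split eliminates all of the parent node's impurity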
Gini importance in scikit-learn
To demonstrate how we can estimate feature importance using Gini impurity, we'll use the breast cancer dataset from sklearn. This dataset contains features related to breast tumors. The outcome variable is the diagnosis: either malignant or benign. To start, we'll load the dataset and split it into a training and test set:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import datasets
dataset = datasets.load_breast_cancer()
X = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y = dataset.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
Next, we'll fit a decision tree to predict the diagnosis using sklearn.tree.DecisionTreeClassifier(). Note that we're setting criterion='gini' (the default). This tells the function to build the decision tree by splitting each node on the feature that produces the highest Gini gain. By building the tree in this way, we'll be able to access the Gini importances later.
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(criterion='gini')
clf = clf.fit(X_train, y_train)
Next, we can access the feature importances based on Gini impurity as follows:
feature_importances = clf.feature_importances_
Finally, we'll visualize these values using a bar chart:
import seaborn as sns
sorted_indices = feature_importances.argsort()[::-1]
sorted_feature_names = dataset.feature_names[sorted_indices]
sorted_importances = feature_importances[sorted_indices]
sns.set(rc={'figure.figsize': (11.7, 8.27)})
sns.barplot(x=sorted_importances, y=sorted_feature_names)
Based on this output, we could conclude that the features mean concave points, worst area, and worst texture are most predictive of a malignant tumor. There are also many features with importances close to zero, which we may want to exclude from our model.
Pros and cons of using Gini importance
Because Gini impurity is used to train the decision tree itself, it is computationally inexpensive to calculate. However, Gini importance is somewhat biased toward features with many distinct values (such as continuous numerical features) over categorical features. It also does not take the correlation between features into account. For example, if two highly correlated features are both equally important for predicting the outcome variable, one of them may end up with a low Gini-based importance because all of its explanatory power was ascribed to the other. This issue can be mitigated by removing redundant features before fitting the decision tree.
Other measures of feature importance
There are many other methods for estimating feature importance beyond calculating Gini gain for a single decision tree. We'll explore a few of these methods below.
Aggregate methods
Random forests are ensemble-based machine learning models that combine many decision trees (each trained with a subset of the features) to predict the outcome variable. Just as we can calculate Gini importance for a single tree, we can average Gini importance across an entire random forest to get a more robust estimate.
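For example, here is a minimal sketch using scikit-learn's RandomForestClassifier; it reuses the X_train and y_train split from above, and n_estimators=100 (the library default) is written out only for clarity.
from sklearn.ensemble import RandomForestClassifier
# Each tree is fit on a bootstrap sample and considers a random subset of
# features at each split; feature_importances_ averages the Gini importance
# of each feature across all trees in the forest.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf = rf.fit(X_train, y_train)
rf_importances = rf.feature_importances_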
Permutation-based methods
Another way to test the importance of particular features is to essentially remove them from the model (one at a time) and see how much predictive accuracy suffers. One way to "remove" a feature is to randomly permute (shuffle) its values and measure how much the model's score drops on the shuffled data. This can be implemented with any machine learning model, including non-tree-based methods. However, one potential drawback is that it is computationally expensive, because the model must be re-evaluated many times (once per feature, typically with several repeats).
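scikit-learn implements this idea in sklearn.inspection.permutation_importance. The minimal sketch below reuses the fitted clf and the test split from above; n_repeats=10 is an arbitrary choice.
from sklearn.inspection import permutation_importance
# Shuffle each feature in the test set 10 times and record the average drop
# in the model's score; larger drops indicate more important features.
result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)
permutation_importances = result.importances_mean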
Coefficients
When we fit a general(ized) linear model (for example, a linear or logistic regression), we estimate coefficients for each predictor. If the original features were standardized, these coefficients can be used to estimate relative feature importance; larger absolute value coefficients are more important. This method is computationally inexpensive because coefficients are calculated when we fit the model. It is also useful for both classification and regression problems (i.e., categorical and continuous outcomes). However, similar to the other methods described above, these coefficients do not take highly correlated features into account.
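As an illustration, here is a minimal sketch that standardizes the features and fits a logistic regression on the training split from above; the absolute values of the resulting coefficients can then be compared as rough importance scores.
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Standardize the features so that coefficient magnitudes are comparable,
# then fit a logistic regression on the training data.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model = model.fit(X_train, y_train)
coef_importances = abs(model.named_steps['logisticregression'].coef_[0])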
Conclusion
In this article, we've covered a few different examples of feature importance metrics, including how to interpret and calculate them. We learned about:
- Gini impurity
- How to calculate Gini-based feature importance for a decision tree in sklearn
- Other methods for calculating feature importance, including:
- Aggregate methods
- Permutation-based methods
- Coefficients
Feature importance is an important part of the machine learning workflow and is useful for feature engineering and model explanation alike!
Source: https://www.codecademy.com/article/fe-feature-importance-final