DS #3 Data reduction using variance threshold, univariate feature selection, recursive feature elimination, PCA

Variance Threshold

Variance Threshold is a feature selector that removes all low-variance features. It looks only at the features (X), not the desired outputs (y), so it can also be used for unsupervised learning. Features whose training-set variance is lower than a chosen threshold are removed; a threshold of 0 drops only the features that have the same value in every sample.
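As a minimal sketch of the idea, the snippet below applies scikit-learn's `VarianceThreshold` to a tiny made-up array. The data and the threshold value of 0.01 are assumptions chosen purely for illustration, not values from this article.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data (made up for illustration): column 0 is constant,
# column 1 barely varies, column 2 varies a lot.
X = np.array([
    [1.0, 0.0, 10.0],
    [1.0, 0.1, 20.0],
    [1.0, 0.0, 30.0],
    [1.0, 0.1, 40.0],
])

# The threshold is an arbitrary illustrative value; y is never needed.
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)

print(selector.variances_)     # per-feature variance computed from X
print(selector.get_support())  # boolean mask of the features that were kept
print(X_reduced.shape)         # only the high-variance column remains
```

Because variance depends on the scale of each feature, a single threshold is easiest to interpret when the features share comparable units.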

Univariate Feature Selection

Univariate feature selection works by selecting the best features based on univariate statistical tests. We compare each feature to the target variable, one feature at a time, to see whether there is any statistically significant relationship between them. It is called ‘univariate’ because each feature is scored on its own, without regard to the other features. One common test for classification problems is the analysis of variance (ANOVA) F-test.

  1. f_classif: computes the ANOVA F-value between each feature and the class labels, for classification tasks.
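A short sketch of univariate selection with `f_classif`, assuming scikit-learn's `SelectKBest` wrapper and the iris dataset. Keeping `k=2` features is an arbitrary choice made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score every feature against y with the ANOVA F-test and keep the top k.
# k=2 is an illustrative assumption, not a recommended value.
selector = SelectKBest(score_func=f_classif, k=2)
X_best = selector.fit_transform(X, y)

print(selector.scores_)        # F-value of each feature
print(selector.pvalues_)       # corresponding p-values
print(selector.get_support())  # boolean mask of the selected features
print(X_best.shape)
```

Since each feature is scored independently, this approach is fast but cannot detect features that are only informative in combination.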

Recursive Feature Elimination

Recursive feature elimination (RFE) is a feature selection method that fits a model, removes the weakest feature (or features), and repeats the process on the remaining features until the specified number of features is reached. RFE requires the number of features to keep to be specified up front. However, it is often not known in advance how many features are actually useful.
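The loop described above can be sketched with scikit-learn's `RFE`. The choice of logistic regression as the underlying model, the iris dataset, and `n_features_to_select=2` are all assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Any estimator that exposes coef_ or feature_importances_ can be used;
# logistic regression is just an illustrative choice.
estimator = LogisticRegression(max_iter=1000)

# step=1 removes one feature per iteration until two features remain.
rfe = RFE(estimator=estimator, n_features_to_select=2, step=1)
rfe.fit(X, y)

print(rfe.support_)  # boolean mask of the features that were kept
print(rfe.ranking_)  # 1 = kept; larger numbers were eliminated earlier

X_rfe = rfe.transform(X)
print(X_rfe.shape)
```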

Differences Between Before and After Using Feature Selection

a. Before using feature selection:

Principal Component Analysis (PCA)

The principal components of a collection of points in real coordinate space are a sequence of p unit vectors, where the i-th vector is the direction of a line that best fits the data while being orthogonal to the first i-1 vectors.
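To make this definition concrete, here is a small NumPy sketch that computes principal directions with a singular value decomposition and checks that they are orthogonal unit vectors. The synthetic 3-D points are made up for illustration, and using the SVD of the centered data is one common way to obtain the components, not the only one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 3-D points (made up for illustration), stretched differently per axis.
points = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.2])

# PCA works on centered data.
centered = points - points.mean(axis=0)

# The rows of Vt are the principal directions, ordered from the direction
# of largest variance to the direction of smallest variance.
_, singular_values, Vt = np.linalg.svd(centered, full_matrices=False)

print(Vt)

# Each row has unit length and is orthogonal to the others.
print(np.allclose(Vt @ Vt.T, np.eye(3)))

# Variance of the data along each principal direction.
explained_variance = singular_values**2 / (len(points) - 1)
print(explained_variance)
```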

PCA Projection to 2D

The original data has four columns (sepal length, sepal width, petal length, and petal width). This section projects the original 4-dimensional data into 2 dimensions. The two new components are simply the two main directions of variation in the data; they usually have no direct meaning in terms of the original columns.
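A minimal sketch of such a projection, assuming scikit-learn's `PCA` and its bundled copy of the iris dataset. Standardizing the features first is an assumption made here because PCA is sensitive to feature scale; it is not something this article states.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)  # X has 4 columns

# Assumption: scale each feature to zero mean and unit variance before PCA.
X_scaled = StandardScaler().fit_transform(X)

# Project the 4-dimensional data onto its first two principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)
print(pca.explained_variance_ratio_)  # share of variance captured by each component
```

The two columns of `X_2d` can then be plotted against each other, colored by `y`, to see how well the species separate in the reduced space.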






