DS #6 Data Preprocessing with Orange Tool
This blog is about data preprocessing using the Orange tool. Data preprocessing refers to manipulating or dropping data before it is used, in order to ensure or enhance performance, and it is an important step in the data mining process. The phrase “garbage in, garbage out” is particularly applicable to data mining and machine learning projects.
We will use the Orange library in Python to perform various data preprocessing tasks such as Discretization, Continuization, Randomization, and Normalization, with the help of various Orange functions.
For basic information about the Orange tool, please refer to my previous blog.
In the Orange tool canvas, drag the Python Script widget from the left panel and double-click on it.
Discretization converts a continuous attribute into a discrete one by grouping its values into intervals (bins). Discretization helps handle outliers by placing these values into the lower or higher intervals together with the remaining inlier values of the distribution. Thus, these outlier observations no longer differ from the rest of the values at the tails of the distribution, as they are now all together in the same interval/bucket. In addition, by creating appropriate bins or intervals, discretization can help spread the values of a skewed variable across a set of bins with an equal number of observations.
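To make the idea concrete, here is a small plain-Python sketch of equal-frequency binning. It is only an illustration of the technique, not Orange's own implementation; in Orange itself the `Orange.preprocess.Discretize` preprocessor does this job for you. The helper names and the `ages` list are made up for this example, and the sketch assumes there are at least as many values as bins.

```python
from bisect import bisect_right

def equal_frequency_cut_points(values, n_bins):
    """Return n_bins - 1 thresholds that split the values into roughly equal-sized bins."""
    ordered = sorted(values)
    cuts = []
    for k in range(1, n_bins):
        idx = k * len(ordered) // n_bins
        # Put the threshold halfway between two neighbouring sorted values.
        cuts.append((ordered[idx - 1] + ordered[idx]) / 2)
    return cuts

def discretize(values, cuts):
    """Map each value to the index of the bin it falls into."""
    return [bisect_right(cuts, v) for v in values]

# Hypothetical data: 250 is an obvious outlier.
ages = [22, 25, 27, 31, 35, 38, 41, 45, 52, 250]
cuts = equal_frequency_cut_points(ages, 3)
bins = discretize(ages, cuts)

print(cuts)  # [29.0, 39.5]
print(bins)  # [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
```

Notice that the outlier 250 ends up in the same top bin as 41, 45 and 52, so after discretization it no longer stands apart from the other high values.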
Continuization does the opposite: given a data table, it returns a new table in which the discrete attributes are replaced with continuous ones or removed.
- binary variables are transformed into 0.0/1.0 or -1.0/1.0 indicator variables, depending upon the argument zero_based.
- multinomial variables are treated according to the argument multinomial_treatment.
- discrete attributes with only one possible value are removed.
With the default multinomial treatment, the variable is replaced by indicator variables, each corresponding to one value of the original variable. For each value of the original attribute, only the corresponding new attribute will have a value of one and the others will be zero.
For example, the dataset “titanic” has a feature “status” with values “crew”, “first”, “second” and “third”, in that order. Its value for the 10th row is “first”. Continuization replaces the variable with variables “status=crew”, “status=first”, “status=second” and “status=third”, so for that row “status=first” is one and the other three are zero.
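The sketch below mimics this behaviour in plain Python so you can see what the resulting values look like. It is not Orange code (there, the `Orange.preprocess.Continuize` preprocessor handles it); the function names are my own, and the “sex” values used for the binary case are just an illustrative assumption.

```python
def to_indicators(name, possible_values, value):
    """Replace one multinomial value with one indicator per possible value."""
    return {f"{name}={v}": (1.0 if v == value else 0.0) for v in possible_values}

def binary_indicator(possible_values, value, zero_based=True):
    """Encode a two-valued variable as 0.0/1.0, or as -1.0/1.0 when zero_based is False."""
    is_second = possible_values.index(value) == 1
    if zero_based:
        return 1.0 if is_second else 0.0
    return 1.0 if is_second else -1.0

status_values = ["crew", "first", "second", "third"]
row = to_indicators("status", status_values, "first")
print(row)
# {'status=crew': 0.0, 'status=first': 1.0, 'status=second': 0.0, 'status=third': 0.0}

sex_values = ["female", "male"]  # hypothetical binary feature
print(binary_indicator(sex_values, "female"))                    # 0.0
print(binary_indicator(sex_values, "female", zero_based=False))  # -1.0
```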
Normalization is used to scale the data of an attribute so that it falls in a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0. Normalization is generally required when we are dealing with attributes on different scales; otherwise, an equally important attribute on a smaller scale may lose its effect because another attribute has values on a larger scale. We use the Normalize preprocessor to perform normalization.
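As a rough illustration of what scaling to these two ranges means, here is a plain-Python sketch of min-max (span) normalization. The function name, the `zero_based` flag and the income figures are assumptions made for this example; it is not how Orange implements `Normalize` internally.

```python
def normalize_by_span(values, zero_based=True):
    """Rescale values to [0.0, 1.0], or to [-1.0, 1.0] when zero_based is False."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant attribute carries no information; map everything to 0.0.
        return [0.0 for _ in values]
    if zero_based:
        return [(v - lo) / span for v in values]
    return [2 * (v - lo) / span - 1 for v in values]

incomes = [20000, 50000, 80000]  # hypothetical attribute on a large scale
print(normalize_by_span(incomes))                    # [0.0, 0.5, 1.0]
print(normalize_by_span(incomes, zero_based=False))  # [-1.0, 0.0, 1.0]
```

After this step an income of 50000 and, say, an age of 45 out of a 20 to 70 range would both sit at 0.5, so neither attribute dominates just because of its units.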
With randomization, given a data table, the preprocessor returns a new table in which the data is shuffled. We use the Randomize preprocessor from the Orange library to perform randomization.
I hope you can now work with the Orange tool on your own. I tried to cover as many things as I could, and you can explore the rest by yourself.
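Here is a minimal plain-Python sketch of the idea. I am assuming that it is the class column that gets shuffled across rows while the features stay where they are; the helper name, the `seed` parameter and the tiny table are all made up for illustration and are not part of Orange.

```python
import random

def shuffle_column(rows, column, seed=None):
    """Return new rows in which the values of one column are permuted across rows."""
    rng = random.Random(seed)  # a fixed seed makes the shuffle repeatable
    shuffled = [row[column] for row in rows]
    rng.shuffle(shuffled)
    # Build new dicts so the original table is left untouched.
    return [{**row, column: value} for row, value in zip(rows, shuffled)]

table = [
    {"age": 22, "survived": "yes"},
    {"age": 35, "survived": "no"},
    {"age": 41, "survived": "no"},
    {"age": 52, "survived": "yes"},
]
randomized = shuffle_column(table, "survived", seed=42)

for row in randomized:
    print(row)
```

The ages stay in their original rows and the same labels are still present, but the link between a row and its label is broken. This is handy when you want to check whether a model is finding real patterns or just fitting noise.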
Do check out more features of the Orange tool here.