
Importance of data preprocessing in machine learning

Data Preprocessing

  • Data preprocessing is the process of converting raw data into a clean data set. Data gathered from different sources arrives in a raw format that is not suitable for analysis, so it has to be cleaned before a model can use it.

Need for Data Preprocessing

  • Inaccurate or missing data
    • Data can be missing for many reasons: it was not collected continuously, a mistake was made during data entry, biometric devices had technical problems, and so on.
  • Noisy data
    • Noisy data can come from a technical fault in the device that gathers the data, a human mistake during data entry, and so on.
  • Inconsistent data
    • Inconsistencies arise from duplicated records, errors in manual data entry, and mistakes in codes or names, that is, violations of data constraints.
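
The problems listed above (missing, noisy, and inconsistent data) can often be spotted with a few simple checks before any modelling starts. The following is a minimal sketch in plain Python, not a definitive method. The sample records, the column names, the 0 to 120 age range, and all four helper functions are assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical raw records; None marks a value that was never collected.
raw_rows = [
    {"id": 1, "city": "Delhi", "age": 34},
    {"id": 2, "city": "delhi ", "age": None},   # missing age, inconsistent label
    {"id": 3, "city": "Mumbai", "age": 290},    # implausible age: noisy value
    {"id": 1, "city": "Delhi", "age": 34},      # exact duplicate of the first row
]


def count_missing(rows):
    """Count None values per column."""
    missing = Counter()
    for row in rows:
        for key, value in row.items():
            if value is None:
                missing[key] += 1
    return dict(missing)


def find_duplicates(rows):
    """Return one copy of every row that appears more than once."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return [dict(items) for items, n in counts.items() if n > 1]


def find_out_of_range(rows, column, low, high):
    """Return rows whose value in `column` lies outside [low, high]."""
    return [
        row for row in rows
        if row[column] is not None and not (low <= row[column] <= high)
    ]


def inconsistent_labels(rows, column):
    """Group spellings that differ only by case or surrounding whitespace."""
    variants = {}
    for row in rows:
        raw = row[column]
        if raw is None:
            continue
        variants.setdefault(raw.strip().lower(), set()).add(raw)
    return {key: sorted(v) for key, v in variants.items() if len(v) > 1}


print(count_missing(raw_rows))
print(find_duplicates(raw_rows))
print(find_out_of_range(raw_rows, "age", 0, 120))
print(inconsistent_labels(raw_rows, "city"))
```

On real projects these checks are usually done with a data-frame library rather than by hand; the plain-Python version is used here only to keep the logic visible.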

Steps Involved in Data Preprocessing

  • Handling Null Values
  • Standardization
  • Handling Categorical Variables
  • One-Hot Encoding
  • Multicollinearity
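
The steps above can be sketched on a tiny example. This is only an illustration in plain Python under stated assumptions: mean imputation is just one of several ways to handle null values, the sample values are invented, and the helper names (`impute_mean`, `standardize`, `one_hot`, `pearson`) are hypothetical, not taken from any library.

```python
from math import sqrt


def impute_mean(values):
    """Handling null values: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def standardize(values):
    """Standardization: rescale to zero mean and unit (population) std. deviation.

    Assumes the column is not constant; a constant column would divide by zero.
    """
    mean = sum(values) / len(values)
    std = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def one_hot(labels):
    """One-hot encoding: one 0/1 position per distinct category."""
    categories = sorted(set(labels))
    encoded = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, encoded


def pearson(xs, ys):
    """Pearson correlation, used here as a simple multicollinearity check."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Invented sample columns.
ages = impute_mean([20, None, 40, 30])        # None becomes the mean, 30.0
ages_scaled = standardize(ages)

categories, colours = one_hot(["red", "blue", "red"])

height_cm = [150.0, 160.0, 170.0, 180.0]
height_in = [h / 2.54 for h in height_cm]     # same information in other units
r = pearson(height_cm, height_in)

print(ages, ages_scaled)
print(categories, colours)
print(r)  # a correlation close to 1 suggests one of the two columns is redundant
```

Two features whose correlation is close to 1 or -1 carry nearly the same information, so one of them is usually dropped; the exact cut-off is a judgment call that depends on the data set.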



