Balanced Data and Imbalanced Data
Imbalanced datasets are a common real-world scenario when building machine learning models. For example, cancer prediction data contain far more negative (-ve) cases than positive (+ve) ones, and in the internet industry many visitors come to a website but only a small number of them buy a product.
What are balanced and imbalanced datasets?
Take a dataset Dn with n1 points in one class and n2 points in the other, so that n1 + n2 is the size of Dn. The dataset is balanced when n1 and n2 are roughly similar, for example n1 = 580 and n2 = 420. If n1 >> n2 or n2 >> n1, then Dn is an imbalanced dataset, for example n1 = 100 and n2 = 900.
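To make the definition concrete, here is a minimal sketch that counts the classes and reports the ratio of the larger class to the smaller one. The function name `imbalance_ratio` and the label values 0 and 1 are assumptions for illustration, and there is no fixed cut-off at which a ratio counts as "imbalanced".

```python
from collections import Counter

def imbalance_ratio(labels):
    """Return (counts, ratio), where ratio = largest class size / smallest class size."""
    counts = Counter(labels)
    majority = max(counts.values())
    minority = min(counts.values())
    return counts, majority / minority

# The two examples from the text: n1 = 580, n2 = 420 and n1 = 100, n2 = 900.
balanced = [1] * 580 + [0] * 420
imbalanced = [1] * 100 + [0] * 900

_, balanced_ratio = imbalance_ratio(balanced)      # about 1.38
_, imbalanced_ratio = imbalance_ratio(imbalanced)  # 9.0
print(balanced_ratio, imbalanced_ratio)
```

A ratio close to 1 means the classes are roughly similar in size; a ratio like 9 means one class dominates.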
What is the effect of an imbalanced dataset?
If the data is heavily imbalanced, the model's predictions are often, though not always, biased toward the majority class.
As you can see in the figure, if Dn is imbalanced then Dtest is also imbalanced, so the accuracy of the model can be high even though the model is dumb: it can score well just by predicting the majority class every time.
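The "high accuracy but dumb model" effect can be checked with a little arithmetic. The sketch below assumes a test set with the same 100:900 split used earlier and a hypothetical model that always predicts the majority class (label 0); the helper names `accuracy` and `recall` are illustrative, not taken from any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive points that the model also predicts as positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)

# Assumed test set: 100 minority points (label 1) and 900 majority points (label 0).
y_test = [1] * 100 + [0] * 900
# A "dumb" model that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_test)

baseline_accuracy = accuracy(y_test, y_pred)  # 900 / 1000 = 0.9
baseline_recall = recall(y_test, y_pred)      # 0 / 100 = 0.0
print(baseline_accuracy, baseline_recall)
```

The model is 90% accurate yet never finds a single minority point, which is why accuracy alone is misleading on imbalanced data.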
Techniques to deal with an imbalanced dataset
Undersampling
As the name suggests, undersampling means removing points from the majority class until it is about the same size as the minority class.
The main problem with undersampling is that we remove a lot of data, and losing data means losing information. Throwing away data is not a good idea.
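A minimal sketch of random undersampling, assuming the data is held in two parallel Python lists. The function name `random_undersample` and the fixed seed are illustrative choices, not a reference implementation.

```python
import random
from collections import Counter

def random_undersample(points, labels, seed=0):
    """Keep a random subset of every class, sized to match the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    kept = []
    for cls in counts:
        indices = [i for i, y in enumerate(labels) if y == cls]
        # Sample without replacement; the minority class keeps all of its points.
        kept.extend(rng.sample(indices, target))
    kept.sort()  # preserve the original ordering of the surviving points
    return [points[i] for i in kept], [labels[i] for i in kept]

points = list(range(1000))            # stand-in for real feature vectors
labels = [1] * 100 + [0] * 900
small_points, small_labels = random_undersample(points, labels)
print(Counter(small_labels))          # both classes now have 100 points
```

Here 800 of the 900 majority points are discarded, which shows how much information this technique throws away.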
Oversampling
As the name suggests, oversampling means creating extra points of the minority class to fill the gap. One technique is repetition, where points that already exist in the minority class are duplicated. If there is one +ve point, create 9 overlapping copies of it.
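Oversampling by repetition can be sketched as follows, assuming the same parallel-list layout. The name `random_oversample` is hypothetical, and the copies are drawn at random with replacement rather than by repeating each point a fixed number of times.

```python
import random
from collections import Counter

def random_oversample(points, labels, seed=0):
    """Duplicate randomly chosen points of smaller classes until all classes are equal."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    new_points, new_labels = list(points), list(labels)
    for cls, n in counts.items():
        indices = [i for i, y in enumerate(labels) if y == cls]
        for _ in range(target - n):
            i = rng.choice(indices)       # pick an existing point of this class
            new_points.append(points[i])  # and append an exact copy of it
            new_labels.append(cls)
    return new_points, new_labels

points = list(range(1000))                # stand-in for real feature vectors
labels = [1] * 100 + [0] * 900
big_points, big_labels = random_oversample(points, labels)
print(Counter(big_labels))                # both classes now have 900 points
```

No new information is created: every added point is an exact copy of an existing minority point.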
As shown in the figure, red is the majority class, green is the minority class, and yellow marks the overlapping repeated points that fill the gap between them. As you can see, repetition is a simple oversampling technique but not an effective one.
Extrapolation is a more complex idea for oversampling. In simple language, find a region where minority points are concentrated among the majority points, and in that region create new artificial (synthetic) minority points to fill the gap between the minority and the majority.
As shown in the figure, red is the majority class, green is the minority class, and yellow marks the added points that fill the gap between them.
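One common way to create synthetic minority points is to interpolate between a minority point and one of its nearest minority neighbours, which is the idea behind SMOTE. The sketch below is a simplified stand-in for that idea, not the exact procedure described above: the function name `synthesize_minority`, the value `k=3`, and the sample coordinates are all assumptions for illustration.

```python
import math
import random

def synthesize_minority(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each on a segment between two nearby minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest other minority points, by Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic

# Made-up 2-D minority points, for illustration only.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.2), (2.5, 2.4)]
new_points = synthesize_minority(minority, n_new=6)
print(new_points)
```

Unlike repetition, the new points are not exact copies, so they fill in the region around the existing minority points instead of stacking on top of them.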
One more idea is class weighting, where we give more weight to the minority class. In this case, with a 1:9 ratio, each minority point is counted as 9 points. The effect is similar to repetition.
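The weighting idea can be sketched with a weighted accuracy score. The weights here are assumed to be the largest class size divided by each class size, which gives 9 for the minority class in the 100:900 example; the names `class_weights` and `weighted_accuracy` are illustrative.

```python
from collections import Counter

def class_weights(labels):
    """Weight each class by (largest class size / its own size)."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {cls: largest / n for cls, n in counts.items()}

def weighted_accuracy(y_true, y_pred, weights):
    """Accuracy in which every point counts as much as the weight of its true class."""
    total = sum(weights[t] for t in y_true)
    correct = sum(weights[t] for t, p in zip(y_true, y_pred) if t == p)
    return correct / total

labels = [1] * 100 + [0] * 900
weights = class_weights(labels)         # {1: 9.0, 0: 1.0}
always_majority = [0] * len(labels)     # the "dumb" majority-class model
score = weighted_accuracy(labels, always_majority, weights)  # 900 / 1800 = 0.5
print(weights, score)
```

With these weights the always-majority model drops from 90% plain accuracy to a weighted score of 0.5. Counting a point with weight 9 has the same effect on this score as repeating it 9 times, which is why weighting behaves like repetition.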