Train and Test set difference:

sagar maghade
3 min read · Jun 8, 2021


When implementing an ML model, we sometimes find that the train set and the test set are too different from each other. This mainly happens with time-based splitting. It generally does not occur with random splitting, because every point has an equal chance of landing in either set.
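To make the contrast between the two splitting strategies concrete, here is a minimal sketch in plain Python. The function names `time_based_split` and `random_split`, the 80/20 ratio, and the toy `reviews` list are assumptions made for illustration; they do not come from any particular library.

```python
import random
from datetime import date, timedelta


def time_based_split(rows, time_key, train_fraction=0.8):
    """Sort rows by time; the oldest go to train, the newest go to test."""
    ordered = sorted(rows, key=time_key)
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def random_split(rows, train_fraction=0.8, seed=0):
    """Shuffle rows so every row has the same chance of landing in either set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical toy data: one review per day for ten days.
start = date(2021, 1, 1)
reviews = [{"day": start + timedelta(days=i), "text": f"review {i}"} for i in range(10)]

tbs_train, tbs_test = time_based_split(reviews, time_key=lambda r: r["day"])
rnd_train, rnd_test = random_split(reviews)
```

With the time-based split, every test review is newer than every train review, which is exactly why the two sets can drift apart. The random split mixes old and new reviews in both sets.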

Time-Based Splitting

Problem with time-based splitting (TBS):

Sometimes the data changes or gets updated over time. For example, in Amazon's fine food review data, some products are removed and new products are added as time passes. When the test set contains data that has changed too much since the training period, the test error increases and the model no longer works well.
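One quick way to see this kind of change in a product dataset is to check how many products in the test period never appeared in the training period. The sketch below uses made-up product IDs (`P1` to `P6`) and a hypothetical helper name, so treat it as an illustration rather than a recipe tied to the real review data.

```python
def unseen_fraction(train_products, test_products):
    """Share of distinct test products that never appear in the train set."""
    train_set, test_set = set(train_products), set(test_products)
    if not test_set:
        return 0.0
    return len(test_set - train_set) / len(test_set)


# Hypothetical product IDs: P1 and P2 were removed over time, P5 and P6 are new.
older_products = ["P1", "P2", "P3", "P4", "P3"]
newer_products = ["P3", "P4", "P5", "P6"]

new_share = unseen_fraction(older_products, newer_products)
```

A large share of unseen products is a hint that the model will be asked about things it never saw during training.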

Impact:

Fig. 1: function learned on Dtrain

In Fig. 1, the function is trained using the older data.

Fig. 2

Fig. 2 shows both the train and the test functions. You can clearly see that the spread of the two is different, and this is what causes the test error to increase. The gap between the two shows up as test error. For instance, points in the test set that lie outside the region covered by Dtrain (labeled in red) will produce errors.

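How to know whether Dtrain and Dtest are different or similar:

It is a very simple hack.

The first step is to split the data into Dtrain and Dtest using time-based splitting. Then give the new label 0 to every train point and the new label 1 to every test point. Finally, train a binary classifier on this relabeled data, so that it learns to predict which set a point came from.

Newly labeled data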
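The labeling and training steps can be sketched as follows. This is a minimal illustration under several assumptions: the classifier is a small hand-written logistic regression (any binary classifier would do), the data is synthetic one-feature Gaussian noise, accuracy is measured on a held-out half of the relabeled data, and all function names are hypothetical.

```python
import math
import random


def label_train_vs_test(train_rows, test_rows):
    """Merge both sets; rows from Dtrain get label 0, rows from Dtest get label 1."""
    rows = list(train_rows) + list(test_rows)
    labels = [0] * len(train_rows) + [1] * len(test_rows)
    return rows, labels


def sigmoid(z):
    # Two branches so that math.exp never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(rows, labels, lr=0.5, epochs=200):
    """Batch gradient descent on the logistic loss."""
    n, d = len(rows), len(rows[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            grad_b += err
            for j in range(d):
                grad_w[j] += err * x[j]
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(x, weights, bias):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias >= 0 else 0


def train_vs_test_accuracy(train_rows, test_rows, seed=0):
    """Held-out accuracy of a classifier that guesses which set a row came from."""
    rows, labels = label_train_vs_test(train_rows, test_rows)
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    half = len(order) // 2
    fit_idx, eval_idx = order[:half], order[half:]

    # Center each feature using statistics from the fitting half only.
    n_features = len(rows[0])
    means = [sum(rows[i][j] for i in fit_idx) / len(fit_idx) for j in range(n_features)]

    def centered(x):
        return [xi - m for xi, m in zip(x, means)]

    weights, bias = fit_logistic([centered(rows[i]) for i in fit_idx],
                                 [labels[i] for i in fit_idx])
    hits = sum(predict(centered(rows[i]), weights, bias) == labels[i] for i in eval_idx)
    return hits / len(eval_idx)


# Synthetic data (an assumption for illustration): one numeric feature per row.
rng = random.Random(42)
old_data = [[rng.gauss(0.0, 1.0)] for _ in range(200)]
drifted_data = [[rng.gauss(4.0, 1.0)] for _ in range(200)]    # distribution moved
same_kind_data = [[rng.gauss(0.0, 1.0)] for _ in range(200)]  # distribution unchanged

acc_drifted = train_vs_test_accuracy(old_data, drifted_data)
acc_similar = train_vs_test_accuracy(old_data, same_kind_data)
```

On this toy data, `acc_drifted` should land far above chance because the two sets barely overlap, while `acc_similar` should stay near chance level because nothing distinguishes the two sets.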

Classification result and conclusion:

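Result 1

If the new model has a high accuracy, say 95%, the classifier can tell train points from test points almost every time. The two sets are well separated, which means Dtrain and Dtest are very different.

Result 2

If the new model has a low accuracy, close to what random guessing would give (about 50% when both sets are the same size), the classifier cannot tell train points from test points. The two sets overlap almost completely, which means Dtrain and Dtest are similar.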
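When reading the classifier's accuracy, it helps to compare it with the accuracy of always guessing the larger set, since a time-based split is often unbalanced (for example 80% train and 20% test). The helper names and the `margin` threshold below are assumptions chosen for illustration, not fixed rules.

```python
def majority_baseline(n_train, n_test):
    """Accuracy obtained by always guessing the larger of the two sets."""
    return max(n_train, n_test) / (n_train + n_test)


def describe_sets(accuracy, n_train, n_test, margin=0.05):
    """Call the sets 'different' only if accuracy clearly beats the baseline.

    `margin` is a hypothetical tolerance; pick it to suit your data.
    """
    if accuracy > majority_baseline(n_train, n_test) + margin:
        return "different"
    return "similar"


# With an 80/20 split, 82% accuracy is barely better than guessing "train" every time.
verdict_unbalanced = describe_sets(0.82, n_train=800, n_test=200)
# With a 50/50 split, 95% accuracy is far above the 50% baseline.
verdict_balanced = describe_sets(0.95, n_train=500, n_test=500)
```

The point of the comparison is that the same accuracy can mean very different things depending on how large each set is.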

result 3


