
Machine Learning Challenge: Irrelevant Data

npetrele
Cisco Employee

Many years ago, I worked for a company that used AI to search for a way to increase the production of a certain medicine. The woman in charge of that project fed the AI algorithm all kinds of relevant data. I don't recall which data she actually used, so I'll make up some for the purpose of this discussion:

  • Change in water temperature
  • Length of exposure to chemical X
  • Amount of chemical Y
  • Size of batch

She measured each of these at regular intervals over a period of weeks and fed that data, along with the amount of successful production, to a machine learning engine designed to identify and rank the factors most important to producing the medicine.

Here's the twist. She threw in irrelevant data, too. In this case, she added the number of sunspots recorded on each of those days. The idea was to give the AI something irrelevant to chew on, which should theoretically increase the weight of the relevant data.

It turned out that the AI identified sunspots as the most important factor affecting the production of the medicine. 

I recall suggesting to her that sunspots could, indeed, have an influence on the production of the medicine. She rejected that notion, probably rightly so. Regardless, I don't recall the final conclusions.
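
For anyone who wants to try this sanity check themselves, here's a minimal sketch using modern tools. Everything in it is an assumption for illustration: the data is synthetic, the factor names are the made-up ones from my list above, and I'm using scikit-learn's random forest importances as the ranking engine, which is certainly not what she used in the '80s.

# Hypothetical reconstruction: rank candidate factors, including a
# deliberately irrelevant "sunspots" column, by feature importance.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 200  # one row per measurement interval

water_temp_delta = rng.normal(0.0, 1.0, n)
exposure_hours   = rng.uniform(1.0, 12.0, n)
chemical_y_grams = rng.uniform(5.0, 50.0, n)
batch_size       = rng.integers(100, 1000, n).astype(float)
sunspots         = rng.integers(0, 150, n).astype(float)  # the irrelevant control

# Simulated production depends only on the "relevant" factors, plus noise.
production = (0.5 * exposure_hours + 0.1 * chemical_y_grams
              - 0.8 * water_temp_delta + rng.normal(0.0, 1.0, n))

X = np.column_stack([water_temp_delta, exposure_hours, chemical_y_grams,
                     batch_size, sunspots])
names = ["water_temp_delta", "exposure_hours", "chemical_y_grams",
         "batch_size", "sunspots"]

model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, production)
for name, score in sorted(zip(names, model.feature_importances_),
                          key=lambda pair: -pair[1]):
    print(f"{name:18s} {score:.3f}")

With enough data, sunspots should land at the bottom of the ranking; if it doesn't, that's a red flag about the data or the method rather than evidence about the sun.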

My question to you is, do you deliberately inject irrelevant data into your machine learning process to draw out the more important factors? If so, have you ever run into an experience like the above, where the algorithm identifies the supposedly irrelevant data as the most important?


2 Replies

davidn#
Cisco Employee

I don't think injecting irrelevant data into a machine learning process is common practice, because a fundamental principle of ML is to feed the algorithm relevant data that is representative of the problem at hand. The goal is to train the algorithm to recognize patterns and make accurate predictions based on the data it has been trained on.

That said, there are cases where you intentionally add noise to the training data. This is one form of "regularization," a family of techniques whose purpose is to prevent overfitting: the situation where an algorithm becomes too specialized in the patterns of the training data and fails to generalize to new, unseen data. Regularization methods work by adding constraints or modifications to the model or the training process, such as penalizing large weights or randomly dropping out nodes during training. The goal is not to identify the irrelevant data as important, but rather to help the algorithm focus on the more relevant features in the data.
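
To make that concrete, here's a minimal sketch of two of those tactics using scikit-learn: an L2 weight penalty (ridge regression) and noise injection via data augmentation. The data and parameter values are made up purely for illustration.

# Two common regularization tactics, sketched on a deliberately
# overfitting-prone problem: few samples, many features.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
y = X[:, 0] * 2.0 + rng.normal(scale=0.5, size=50)  # only feature 0 matters

# Tactic 1: penalize large weights. alpha controls the penalty strength.
ridge = Ridge(alpha=1.0).fit(X, y)

# Tactic 2: noise injection. Augment the training set with jittered copies
# so the model can't latch onto exact training points.
X_noisy = np.vstack([X, X + rng.normal(scale=0.1, size=X.shape)])
y_noisy = np.concatenate([y, y])
augmented = LinearRegression().fit(X_noisy, y_noisy)

print("ridge coef on the true feature:", round(ridge.coef_[0], 3))
print("noise-augmented coef on the true feature:", round(augmented.coef_[0], 3))

Dropout is the neural-network analogue of the same idea: randomly silencing nodes during training so that no single pathway can memorize the training set.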

npetrele
Cisco Employee

Yes, this took place in the '80s. The engine she was using wasn't designed to train an algorithm. It was designed to identify the data one should use to train an algorithm.

The company made some fun machines. We had a dual-Z80 box with a connected ultrasonic transducer. We'd use the transducer to scan 30 bad spot welds and 30 good ones, and after that the box could identify bad spot welds with 90%+ accuracy. It was fun stuff.