Saturday, February 18, 2023

not MNIST

I've been reading a little more about what other people do with MNIST recognition. Apparently, applying transformations to the training set is fairly common, so maybe simple linear stretching and shrinking to multiply the number of examples would be good enough.
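
To make the stretching and shrinking idea concrete, here is a rough sketch of how it might look, assuming the images are 2-D NumPy arrays with 0 as the background. The names `stretch` and `augment`, the nearest-neighbor sampling, and the scale factors are all my illustrative choices, not something taken from any particular MNIST recipe.

```python
import numpy as np

def stretch(image, sx, sy):
    """Scale a 2-D image about its center by sx (horizontal) and sy (vertical).

    The output keeps the input shape. Sampling is nearest-neighbor, and
    output pixels whose source falls outside the image become 0 (background).
    """
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: for each output pixel, find the source pixel it comes from.
    src_y = np.rint((ys - cy) / sy + cy).astype(int)
    src_x = np.rint((xs - cx) / sx + cx).astype(int)
    inside = (src_y >= 0) & (src_y < h) & (src_x >= 0) & (src_x < w)
    out = np.zeros_like(image)
    out[inside] = image[src_y[inside], src_x[inside]]
    return out

def augment(image, factors=(0.8, 1.0, 1.2)):
    """Return stretched and shrunk variants of one image (the identity is skipped)."""
    return [stretch(image, sx, sy)
            for sx in factors for sy in factors
            if (sx, sy) != (1.0, 1.0)]

# Toy example: an 8x8 image with a 4x4 block of ink in the middle.
img = np.zeros((8, 8), dtype=np.uint8)
img[2:6, 2:6] = 1
variants = augment(img)  # 8 extra training examples from a single image
```

With three factors per axis, each original image yields eight extra examples, and each of them keeps the label of the original.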

But the most interesting discovery has been that the dataset I've been playing with is not MNIST. Surprise, surprise. :-)

The MNIST data set is much larger, and also has a higher resolution. What I have might be the older NIST dataset (without the "M" that stands for "Modified"), or maybe not even that. For some reason I thought that the MNIST set came from ZIP code recognition, so that's what I was looking for, the ZIP codes dataset, and apparently I thought wrong. An interesting thing about the original NIST dataset is that its training set and test set were collected from different demographics (one from schoolchildren, the other from employees of a government agency), so I guess if this is what I've got, it would explain why the sets don't represent each other so well. That was apparently a known major complaint about the older dataset that got straightened out in the new modified one.
