Harvard University just did a study on the inclusion of toxicity in the training corpus of large language models. It is informative in that models trained on cleaned data underperform at detecting what is called toxicity.
In data science circles, data cleaning is about 95% of the work of forming a model of the observations. But I've been trying to tell people for at least 20 years that it has been known since the mid-60s that the right way to deal with dirty data is to apply the algorithmic information criterion for dynamical model selection.
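To make that concrete, here is a minimal sketch of two-part-code (MDL-style) model selection, which is one practical approximation of the algorithmic information criterion. The polynomial setup, the injected outliers, and all names are my own illustrative assumptions, not from the post: the point is only that the criterion charges models for their complexity, so the dirty data can stay in.

```python
# A minimal sketch of two-part-code (MDL-style) model selection.
# The data, outliers, and polynomial family are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# "Dirty" observations: a quadratic signal plus noise and a few outliers.
x = np.linspace(-1, 1, 100)
y = 3 * x**2 - x + rng.normal(0, 0.1, x.size)
y[::17] += 5  # inject gross outliers instead of scrubbing them away

def description_length(x, y, degree):
    """Two-part code length (in nats): model cost + data-given-model cost.

    Uses the standard asymptotic approximation: (k/2) * log(n) to encode
    the k fitted parameters, plus the Gaussian negative log-likelihood
    of the residuals to encode the data given the model.
    """
    n = x.size
    k = degree + 1
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    rss = float(residuals @ residuals)
    model_cost = 0.5 * k * np.log(n)
    data_cost = 0.5 * n * np.log(rss / n)
    return model_cost + data_cost

# Pick the degree minimizing total description length; the outliers stay
# in the data, and the model cost penalizes any fit that chases them.
best = min(range(1, 10), key=lambda d: description_length(x, y, d))
print("selected polynomial degree:", best)
```

Under these assumptions the selected degree stays near the true one, because a higher-degree fit that bends toward the outliers pays more in parameter-encoding cost than it saves in residual cost.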
The fact that this has not caught on evinces a fear of forensic epistemology. Now ask yourself this question: who would be afraid of unleashing the power of Moore's law to perform forensic epistemology?
Ourchan looks promising. Certain nations are range-banned, porn and CP are auto-deleted, and it allows blocking JavaScript.
i'll look it up
Fear not! For we have Reddit now.
There is nothing in that sentence I like.
https://youtube.com/watch?v=pfap4wLUjTc
Great article.