> It is an aggressive defense mechanism that tries its best to take the blunt of the assault, serve them garbage, and keep them off of upstream resources.
> It tries to poison them, so they’d go away forever in the long run.
All I can say is that this is not how any of it works. The people doing this think they're accomplishing something, but they have no idea how large-scale training, data cleaning, or synthetic data pipelines work. They won't "fool" anyone worth fooling, and the bots will keep scraping their sites regardless.
This seems to be a direct consequence of lots of people (even here on HN) constantly repeating a few memes like "regurgitating the training set", "AI bots crawl everything for training data", "we ran out of training data", etc. None of those are true, and none of them matter for SotA models. No one is training on raw scraped data anymore, and they haven't been for 2-3 years. All the recent (1-2 years) gains in models have come from synthetic data plus real-world generated data (i.e. RL environments).
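To make the data-cleaning point concrete, here is a toy sketch of the kind of heuristic quality filters that publicly documented corpora (C4, Gopher's MassiveText, and similar) describe. The function name, thresholds, and example are made up for illustration, not taken from any real pipeline; the point is that looping, low-entropy garbage of the sort a Markov-chain tarpit emits is exactly what such filters are built to drop before a model ever sees it:

```python
# Hypothetical sketch of heuristic quality filtering applied to scraped text.
# All names and thresholds here are illustrative, not from any real pipeline.
import math
from collections import Counter

def looks_like_generated_garbage(text: str) -> bool:
    """Flag pages with the statistical fingerprints of low-quality or
    machine-generated filler (e.g. Markov-chain tarpit output)."""
    words = text.split()
    if len(words) < 50:
        return True  # too short to be a useful training document

    # 1. Very low lexical diversity: the same few tokens repeated over and over.
    unique_ratio = len(set(words)) / len(words)
    if unique_ratio < 0.2:
        return True

    # 2. Heavy duplication of word trigrams, typical of templated or looped text.
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    top_count = Counter(trigrams).most_common(1)[0][1]
    if top_count / len(trigrams) > 0.05:
        return True

    # 3. Character-level entropy far outside the range of natural prose.
    counts = Counter(text)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    if entropy < 3.0 or entropy > 6.0:
        return True

    return False

# Example: a looping "poisoned" page gets dropped long before training.
garbage = "the walrus ate the moon because " * 200
print(looks_like_generated_garbage(garbage))  # True
```

Real pipelines typically layer deduplication, model-based quality scoring, and decontamination on top of simple rules like these, which is why serving generated junk mostly just wastes the site owner's CPU.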
This is a cute attempt that reminds me of the old tarpit concept from the 2000s, but it won't work; it will just consume resources for whoever runs it, with zero benefit downstream. If you want to do something about the crawlers, fix your serving: don't do work on GETs, serve as much cached content as you can, filter the crawlers, or even use Anubis or the like. Those things actually matter.
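For the "fix your serving" part, a minimal sketch of the idea, assuming a plain Python http.server app purely for illustration (the blocked-agent list and render_page are placeholders; in practice you'd do this at the reverse proxy or CDN): reject known crawler user agents before doing any work, and cache whatever a GET renders.

```python
# Toy sketch of "don't do work on GETs": cheap early rejection of crawlers
# plus caching of rendered pages. Not a production setup; the bot list and
# render_page() are invented for this example.
from functools import lru_cache
from http.server import BaseHTTPRequestHandler, HTTPServer

BLOCKED_AGENT_SUBSTRINGS = ("GPTBot", "CCBot", "Bytespider")  # illustrative list

@lru_cache(maxsize=1024)
def render_page(path: str) -> bytes:
    # Pretend this is the expensive part (DB queries, templating, etc.).
    return f"<html><body>Content for {path}</body></html>".encode()

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        if any(bot in agent for bot in BLOCKED_AGENT_SUBSTRINGS):
            self.send_response(403)  # cheap rejection, no real work done
            self.end_headers()
            return
        body = render_page(self.path)  # cached: repeated GETs cost ~nothing
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.send_header("Cache-Control", "public, max-age=300")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()
```

The same two moves (cheap early filtering and aggressive caching of GET responses) are what an nginx/Varnish/CDN layer gives you without writing any application code, which is usually the better place to do it.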
Title should be "...with Iocaine", and the project seems to be this one: https://iocaine.madhouse-project.org/