End of an era for me: no more self-hosted git

(kraxel.org)

102 points | by dzulp0d 10 hours ago ago

65 comments

  • data-ottawa 9 hours ago ago

    Does anyone know what's the deal with these scrapers, or why they're attributed to AI?

    I would assume any halfway competent LLM driven scraper would see a mass of 404s and stop. If they're just collecting data to train LLMs, these seem like exceptionally poorly written and abusive scrapers written the normal way, but by more bad actors.

    Are we seeing these scrapers using LLMs to bypass auth or run more sophisticated flows? I have not worked on bot detection the last few years, but it was very common for residential proxy based scrapers to hammer sites for years, so I'm wondering what's different.

    • simonw 9 hours ago ago

      I would love to understand this.

      Just a few years ago badly behaved scrapers were rare enough not to be worth worrying about. Today they are such a menace that hooking any dynamic site up to a pay-to-scale hosting platform like Vercel or Cloud Run can trigger terrifying bills on very short notice.

      "It's for AI" feels like lazy reasoning for me... but what IS it for?

      One guess: maybe there's enough of a market now for buying freshly updated scrapes of the web that it's worth a bunch of chancers running a scrape. But who are the customers?

      • devsda 8 hours ago ago

        For whatever reason, legislation is lax right now if you claim the purpose of scraping is AI training, even for copyrighted material.

        Maybe everyone is trying to take advantage of the situation before the law eventually catches up.

    • arnarbi 8 hours ago ago

      > why they're attributed to AI?

      I don’t think they mean scrapers necessarily driven by LLMs, but scrapers collecting data to train LLMs.

    • M95D 5 hours ago ago

      I stopped trying to understand. Encountering a 404 on my site leads directly to a 1 year ban.
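
      A minimal sketch of that kind of policy, assuming a combined-format nginx/Apache access log; the log path, the 404-only trigger, and the plain-text ban list are all placeholder choices, and the output would still need to be wired into a firewall or fail2ban-style tool:

        # Collect every client IP that triggered a 404 and append it to a ban list.
        import re
        from pathlib import Path

        LOG = Path("/var/log/nginx/access.log")   # assumed combined log format
        BANLIST = Path("banned_ips.txt")          # hypothetical output file

        line_re = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) ')

        banned = set(BANLIST.read_text().split()) if BANLIST.exists() else set()
        for line in LOG.read_text(errors="replace").splitlines():
            m = line_re.match(line)
            if m and m.group(2) == "404":
                banned.add(m.group(1))

        BANLIST.write_text("\n".join(sorted(banned)) + "\n")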

    • hsuduebc2 9 hours ago ago

      I'm guessing, but I think a big portion of AI requests now come from agents pulling data specifically to answer a user's question. I don't think that data is collected mainly for training now; it's mostly retrieved and fed into LLMs so they can generate the response. Hence so many repeated requests.

    • themafia 9 hours ago ago

      There's value to be had in ripping the copyright off your stuff so someone else can pass it off as their stuff. LLMs have seen no real technical improvements, so all their makers can do is throw more and more stolen data at them and hope that, somehow, they cross a nebulous "threshold" where they suddenly become actually profitable to use and sell.

      It's a race to the bottom. What's different is we're much closer to the bottom now.

    • danaris 5 hours ago ago

      > If they're just collecting data to train LLMs, these seem like exceptionally poorly written and abusive scrapers written the normal way, but by more bad actors.

      Right, this is exactly what they are.

      They're written by people who a) think they have a right to every piece of data out there, b) don't have time (or shouldn't have to bother spending time) to learn any kind of specifics of any given site and c) don't care what damage they do to anyone else as they get the data they crave.

      (a) means that if you have a robots.txt, they will deliberately ignore it, even if it's structured to allow their bots to scrape all the data more efficiently. Even if you have an API, following it would require them to pay attention to your site specifically, so by (b), they will ignore that too—but they also ignore it because they are essentially treating the entire process as an adversarial one, where the people who hold the data are actively trying to hide it from them.

      Now, of course, this is all purely based on my observations of their behavior. It is possible that they are, in fact, just dumb as a box of rocks...and also don't care what damage they do. (c) is clearly true regardless of other specific motives.

  • krick 8 hours ago ago

    So, what's up with these bots, and why am I hearing about them so often lately? I mean, DDoS attacks aren't a new thing, and, honestly, this is pretty much the reason Cloudflare even exists, but I'd expect OpenAI bots (or whatever this is now) to be a little bit easier to deal with, no? Like, simply having a reasonably aggressive fail2ban policy? Or do they really behave like a botnet, where each request comes from a different IP on a different network? How? Why? What is this thing?

    • recursivecaveat 7 hours ago ago

      I doubt it's OpenAI. Maaaybe somebody who sells to OpenAI, but probably not. I think they're big enough to do this mostly in-house and properly. Before AI, only big players wanted a scrape of the entire internet; they could write quality bots, cooperate, behave themselves, etc. Now every third-tier lab wants that data and a billion startups want to sell it, so it's a wild west of bad behavior and bad implementations. They do use residential IP sets as well.

    • esseph 6 hours ago ago

      The dirty secret is that a lot of them come through "residential proxies", aka backdoored home routers, IoT devices with shitty security, etc. Basically, the scrapers, who are often also third parties, go to these "companies" and buy access to these "residential proxies". Some are more... considerate than others.

      Why? Data. Every bit of it might be valuable. And not to sound tinfoil-hatty, but we are getting closer to a post-quantum time (if we aren't there already).

      • tigerlily 2 hours ago ago

        How can I detect if my router is backdoored, or being used as a residential proxy?

        • kimos 2 hours ago ago

          If it’s legit you can ask your ISP if they sell use of your hardware. Or just don’t use the provided hardware and instead BYO router or modem or media converter or whatever.

          But I think what OP is implying is insecure hardware being infected by malware and access to that hardware sold as a service to disreputable actors. For that buy a good quality router and keep it up to date.

  • devsda 8 hours ago ago

    At this point, I think we should look at implementing filters that send a different response when AI bots are detected or when clients are abusive. Not just a simple response code, but one that poisons their training data. Preferably text that elaborates on the anti-consumer practices of tech companies.

    If there is a common text pool used across sites, maybe that will get the attention of bot developers and force them to back down when they see such responses.
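
    As a rough illustration of that kind of filter, here is a minimal WSGI middleware sketch; the user-agent strings and the decoy body are placeholders rather than a vetted blocklist, and many crawlers fake their user agent anyway:

      # Serve decoy text to clients whose User-Agent matches known AI-crawler names.
      BOT_UAS = ("GPTBot", "ClaudeBot", "Bytespider", "CCBot")   # examples only
      DECOY = b"<html><body><p>Placeholder decoy text goes here.</p></body></html>"

      def poison_middleware(app):
          def wrapped(environ, start_response):
              ua = environ.get("HTTP_USER_AGENT", "")
              if any(bot in ua for bot in BOT_UAS):
                  start_response("200 OK", [("Content-Type", "text/html")])
                  return [DECOY]
              return app(environ, start_response)   # normal visitors get the real app
          return wrapped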

  • Lerc 9 hours ago ago

    I presume people have logs that indicate the source for them to place blame on AI scrapers. Is anybody making these available for analysis so we can see exactly who is doing this?

    • JohnTHaller 8 hours ago ago

      The big nasty AI bots use tens of thousands of IPs distributed all over China.

      • krick 7 hours ago ago

        So... just blacklist all Chinese IPs? I assume China isn't the primary market for most of the complaining site owners.

    • esseph 6 hours ago ago

      A lot of compromised home devices and cheap servers proxying traffic, from all over the world.

      • Lerc 5 hours ago ago

        If that is the case how can you determine the reason for the activity?

        • esseph 5 hours ago ago

          Some fake the user agent, some tell you who they are. Or... do they?

          Herein lies the problem. And if you block them, you risk blocking actual customers.

          • Lerc an hour ago ago

            If they are using appropriated hardware, what possible reason could there be for them to say who they are?

  • vachina 8 hours ago ago

    Scrapers are relentless, but not at DDoS levels in my experience.

    Make sure your caches are warm and responses take no more than 5ms to construct.
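
    One low-effort way to keep a cache warm is to walk the sitemap and request every URL before the crawlers do. A small sketch, assuming a standard sitemap.xml at a placeholder URL:

      # Request every URL listed in the sitemap so slow pages are already cached.
      import urllib.request
      import xml.etree.ElementTree as ET

      SITEMAP = "https://example.org/sitemap.xml"   # placeholder
      NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

      with urllib.request.urlopen(SITEMAP) as resp:
          tree = ET.parse(resp)

      for loc in tree.findall(".//sm:loc", NS):
          urllib.request.urlopen(loc.text).read()   # prime the cache entry for this page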

    • watermelon0 7 hours ago ago

      Great, now we need caching for something that's seldom (relatively speaking) used by people.

      Let's not forget that scrapers can be quite stupid. For example, if you have phpBB installed, which by default puts the session ID in a query parameter when cookies are disabled, many scrapers will fetch every URL numerous times, each with a different session ID. Caching also doesn't help you here, since the URLs are unique per visitor.
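
      One mitigation, sketched below under the assumption that the session parameter is phpBB's usual "sid", is to strip it before using the URL as a cache key, so every visitor's copy of the same page maps to one entry:

        # Normalize URLs by dropping the session-ID parameter before cache lookup.
        from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

        def cache_key(url: str) -> str:
            parts = urlsplit(url)
            query = [(k, v) for k, v in parse_qsl(parts.query) if k != "sid"]
            return urlunsplit(parts._replace(query=urlencode(query)))

        # Two visitors, two session IDs, one cache entry:
        assert cache_key("https://forum.example/viewtopic.php?t=42&sid=abc123") == \
               cache_key("https://forum.example/viewtopic.php?sid=def456&t=42")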

      • kimos 2 hours ago ago

        You're describing changing the base assumption for software reachable on the internet: "Assume all possible unauthenticated URLs will be hit basically constantly". Bots used to exist, but they were rare traffic spikes that usually behaved well and could mostly be ignored. No longer.

  • ptman 4 hours ago ago

    Maybe put the git repos on radicle?

  • JohnTHaller 8 hours ago ago

    The Chinese AI scrapers/bots are killing quite a bit of the regular web now. YisouSpider absolutely pummeled my open source project's hosting for weeks. Like all Chinese AI scrapers, it ignores robots.txt, so forget about it respecting a Crawl-delay. If you block the user agent, it calms down for a bit, then comes back using a generic browser user agent from the same IP addresses. It does this across tens of thousands of IPs.
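
    For contrast, this is roughly what a well-behaved crawler is supposed to do before fetching anything, using only Python's standard library (the site, path, and bot name are placeholders):

      # Check robots.txt and any Crawl-delay before fetching a URL.
      from urllib import robotparser

      rp = robotparser.RobotFileParser("https://example.org/robots.txt")
      rp.read()

      url = "https://example.org/some/page"
      agent = "ExampleBot/1.0"
      if rp.can_fetch(agent, url):
          delay = rp.crawl_delay(agent)   # None if robots.txt sets no Crawl-delay
          print("allowed, crawl delay:", delay)
      else:
          print("robots.txt disallows", url)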

    • mono442 4 hours ago ago

      Just block the whole of China, India, and similar countries.

    • kevin_thibedeau 8 hours ago ago

      Start blocking /16s.
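
      A quick sketch of that approach with Python's ipaddress module, collapsing abusive IPv4 addresses into the /16 blocks that cover them (the input file is a placeholder):

        # Turn a list of abusive IPv4 addresses into /16 networks for a deny list.
        import ipaddress
        from pathlib import Path

        ips = Path("abusive_ips.txt").read_text().split()   # one IPv4 address per line
        blocks = sorted({ipaddress.ip_network(f"{ip}/16", strict=False) for ip in ips})
        for net in blocks:
            print(net)   # e.g. 203.0.0.0/16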

  • CuriouslyC 9 hours ago ago

    Does this author have a big pre-established audience or something? Struggling to understand why this is front-page worthy.

    • jaunt7632 9 hours ago ago

      A healthy front page shouldn’t be a “famous people only” section. If only big names can show up there, it’s not discovery anymore, it’s just a popularity scoreboard.

    • fouc 9 hours ago ago

      Because he's unable to self-host git anymore, since AI bots are hammering it to submit PRs.

      Self-hosting was originally a "right" we had upon gaining access to the internet in the 90s; it was the main point of the hypertext transfer protocol.

      • geerlingguy 9 hours ago ago

        Also converting the blog from something dynamic to a static site generator. I made the same switch partly for ease of maintenance, but a side benefit is it's more resilient to this horrible modern era of scrapers far outnumbering legitimate traffic.

        It's painful to have your site offline because a scraper has channeled itself 17,000 layers deep through tag links (which are set to nofollow and disallowed in robots.txt, but the scraper doesn't care). And it's especially annoying when that happens on a daily basis.

        Not everyone wants to put their site behind Cloudflare.

      • tanduv 8 hours ago ago

        sorry if i missed it, but the original post doesn't say anything about PRs... the bots only seem to be scraping the content

        • fouc 5 hours ago ago

          oh you're right, I read "pointless requests" as "PRs", oops!

    • ares623 9 hours ago ago

      Well, the fact that this supposed nobody is overwhelmed by AI scrapers should say a lot about the issue, no?

    • bibimsz 9 hours ago ago

      the era of mourning has begun

  • Joel_Mckay 8 hours ago ago

    Some run git over ssh, with a domain login for https:// permission management, etc.

    Also, spider traps and 42TB zip-of-death pages work well on poorly written scrapers that ignore robots.txt =3
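
    A rough sketch of the "zip of death" part, just to show the mechanics: a gzip file of zeros is tiny on disk but enormous when a careless scraper decompresses it. The sizes and filename are illustrative, nowhere near the 42TB mentioned above, and the serving side (Content-Encoding: gzip, only for clients that ignore robots.txt) is left out:

      # Write a small gzip file that expands to many gigabytes of zero bytes.
      import gzip

      EXPANDED_GIB = 10              # decompressed size to aim for
      CHUNK = b"\0" * (1 << 20)      # 1 MiB of zeros compresses extremely well

      with gzip.open("bomb.gz", "wb", compresslevel=9) as f:
          for _ in range(EXPANDED_GIB * 1024):
              f.write(CHUNK)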

  • hattmall 8 hours ago ago

    Can we not charge for access? If I have a link that says "By clicking this link you agree to pay $10 for each access", can I then send the bill?

  • sdf2erf 9 hours ago ago

    [dead]

  • october8140 8 hours ago ago

    You could put it behind Cloudflare and block all AI.

  • Jaxkr 9 hours ago ago

    The author of this post could solve their problem with Cloudflare or any of its numerous competitors.

    Cloudflare will even do it for free.

    • denkmoon 9 hours ago ago

      Cool, I can take all my self hosted stuff and stick it behind centralised enterprise tech to solve a problem caused by enterprise tech. Why even bother?

      • FeteCommuniste 9 hours ago ago

        "Cause a problem and then sell the solution" proves a winning business strategy once more.

    • the_fall 9 hours ago ago

      They don't. I'm using Cloudflare and 90%+ of the traffic I'm getting is still broken scrapers, a lot of them coming through residential proxies. I don't know what they block, but they're not very good at it. Or, to be more fair: I think the scrapers have gotten really good at what they do because there's real money to be made.

      • esseph 6 hours ago ago

        Probably more money in scraping than protection...

    • rubiquity 9 hours ago ago

      The scrapers should use some discretion. There are some rather obvious optimizations: content that hasn't been changing is less likely to change in the future.

      • JohnTHaller 8 hours ago ago

        They don't care. It's the reason they ignore robots.txt and change up their user agents when you specifically block them.

    • simonw 8 hours ago ago

      Cloudflare won't save you from this - see my comment here: https://news.ycombinator.com/item?id=46969751#46970522

    • Shorel 3 hours ago ago

      Cloudflare seems to be taking over all of the last mile web traffic, and this extreme centralization sounds really bad to me.

      We should be able to achieve close to the same results with some configuration changes.

      AWS / Azure / Cloudflare total centralization means no one will be able to self host anything, which is exactly the point of this post.

    • Semaphor 8 hours ago ago

      For logging, statistics, etc., we have Cloudflare bot protection on the standard paid tier, ignore all IPs not from Europe (rough geolocation), and still see over twice the number of bots we had ~2 years ago.

    • overgard 9 hours ago ago

      I'm pretty sure scrapers aren't supposed to act as low-key DoS attacks.

    • isodev 9 hours ago ago

      I think the point of the post was how something useless (AI) and its poorly implemented scrapers are wreaking havoc in a way that's turning the internet into a digital desert.

      That Cloudflare is trying to monetise “protection from AI” is just another grift in the sense that they can’t help themselves as a corp.

    • fouc 9 hours ago ago

      You don't understand what self-hosting means. Self-hosting means the site is still up when AWS and Cloudflare go down.

  • oceanplexian 9 hours ago ago

    It's not that hard to serve static files at 10k RPS from something running on modest, 10-year-old hardware.

    My advice to the OP: if you're not experienced enough, maybe stop taking subtle digs at AI, fire up Claude Code, and ask it how to set up a LAMP stack or a simple Varnish cache. You might find it's a lot easier than writing a blog post.

    • QuiDortDine 9 hours ago ago

      Not sure why you're talking like the OP pissed in your cheerios. They are a victim of a broken system; it shouldn't be on them to spend more effort protecting their stuff from careless-to-malicious actors.

    • simonw 9 hours ago ago

      A Varnish cache won't help you if you're running something like a code forge where every commit has its own page - often more than one page: there's the page for the commit, then the page for "history from this commit", and a page for every one of the files that existed in the repo at the time of that commit...

      Then a poorly written crawler shows up and requests tens of thousands of pages that haven't been requested recently enough to be in your cache.

      I had to add a Cloudflare Captcha to the /search/ page of my blog because of my faceted search engine - which produces many thousands of unique URLs when you consider tags and dates and pagination and sort-by settings.

      And that's despite me serving every page on my site through a 15-minute Cloudflare cache!

      Static only works fine for sites that have a limited number of pages. It doesn't work for sites that truly take advantage of the dynamic nature of the web.

      • ninjin 8 hours ago ago

        Exactly. The problem is that by their very nature some content has to be dynamically generated.

        Just to add further emphasis as to how absurd the current situation is: I host my own repositories with gotd(8) and gotwebd(8) to share within a small circle of people. There is no link on the Internet to the HTTP site served by gotwebd(8), so they fished the subdomain out of the main TLS certificate. I have been getting hit once every few seconds for the last six or so months by crawlers ignoring robots.txt (of course) and wandering aimlessly around "high-value" pages like my OpenBSD repository forks, calling blame, diff, etc.

        I am still managing just fine to serve things to real people, despite at times having two to three cores running at full load to serve pointless requests. Maybe I will bother to address this at some point, since it is melting the ice caps and wearing my disks out, but for now I hope they will choke on the data and that it will make their models worse.

      • anonnon 6 hours ago ago

        Well, it's heartening to know that AI is making your life at least somewhat less enjoyable.

    • aguacaterojo 9 hours ago ago

      How would a LAMP stack help his git server?

    • anonnon 6 hours ago ago

      Your post is pure victim-blaming, as well as normalizing an exploitative state of affairs (being aggressively DDoSed by poorly behaved scrapers run by Big Tech that only take and never give back, unlike pre-AI search engines, which at least would send you traffic) that was unheard of until just a few years ago.