I had the same problem on our home server. I just stopped the git forge due to lack of time.
For what it's worth, most requests kept coming in for ~4 days after -everything- returned plain 404 errors. Millions. And there are still some now, weeks later...
Seems like you're cooking up a solid bot detection solution. I'd recommend adding JA3/JA4+ fingerprinting into the mix; I had good results against dumb scrapers.
Also, have you considered Captchas for first contact/rate-limit?
If you have smart scrapers, then good luck. I recall that bot farms use pre-paid SIM cards for their data connections so that their traffic comes from a good residential ASN. They also have a lot of IPs and overall well-made headless browsers with JS support. Then it's a battle of JS quirks where the official implementation differs from the headless one.
I wish there was a public database of corporate ASNs and IPs, so we wouldn't have to rely on Cloudflare or any third-party service to detect that an IP is not from a household.
"This is depressing. Profoundly depressing. i look at the statistics board for my reverse-proxy and i never see less than 96.7% of requests classified as bots at any given moment. The web is filled with crap, bots that pretend to be real people to flood you. All of that because i want to have my little corner of the internet where i put my silly little code for other people to see."
Could this be solved with an EULA and some language that non-human readers will be billed at $1 per page? Make all users agree to it. They either pay up or they are breaching contract.
Say you have identified a non-human reader, you have a (probably fake) user agent and an IP address. How do you imagine you'll extract a dollar from that?
Does anyone have an idea how to generate, say, insecure code, en masse? I think it should be the next frontier. Not feed them random bytestream, but toxic waste.
Create a few insecure implementations, parse them into an AST, then turn them back into code (basically compile/decompile) except rename the variables and reorder stuff where you can without affecting the result.
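Not battle-tested, just a minimal Python sketch of that parse/rename/unparse idea; the renaming scheme and the example source are made up for illustration:

```python
# Hedged sketch of the parse -> rename -> unparse idea using Python's ast
# module (ast.unparse needs Python 3.9+). Only variable names are rewritten;
# reordering independent statements would take a bit more bookkeeping.
import ast
import builtins
import itertools

class RenameVariables(ast.NodeTransformer):
    def __init__(self):
        self.mapping = {}
        self.counter = itertools.count()

    def visit_Name(self, node):
        if hasattr(builtins, node.id):      # leave print(), len(), ... alone
            return node
        if node.id not in self.mapping:     # assign a fresh synthetic identifier
            self.mapping[node.id] = f"var_{next(self.counter)}"
        node.id = self.mapping[node.id]
        return node

source = "total = price * quantity\nprint(total)"
tree = RenameVariables().visit(ast.parse(source))
print(ast.unparse(tree))  # prints "var_0 = var_1 * var_2" then "print(var_0)"
```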
I don’t think you actually read the article then. The first few paragraphs go into how many pages a git repo can actually add up to. In fact, you had to click on an entirely different link to read about them being a furry, meaning you purposefully went looking for something to complain about, rather than having an argument based on the merits of the post alone.
There are two unfortunate realities here. One, the blatant obviousness of this general complex of ideas and behaviors. Two, how I won't take the knee and just listen to a guy who's broadcasting his b*stiality-adjacent fetish to the world.
I’m no fan of furries either, but I also don’t see what bearing someone’s personal life has on how to configure a git forge. Maybe you should grow the fuck up?
Not very "personal" when this form of psychopathy oozes through on every step and happens to be on full public display. But that's the intent, for both of youse. And silent capitulation is the minimal subscription. But that's not happening. Eat sh*t.
Gitea has a builtin defense against this, `REQUIRE_SIGNIN_VIEW=expensive`, that completely stopped AI traffic issues for me and cut my VPS's bandwidth usage by 95%.
Are you the only user of your web-facing Gitea? If so, put it behind Wireguard VPN, and basically never worry about bandwidth and security again.
So much this. Wireguard is so easy to do and no, the whole world doesn't need access to my shit, just me and a couple of close friends.
This is the most assured best way to make sure your remain the only user of your stuff.
I highly encourage folks to put stuff out there! Put your stuff on the internet! Even if you don't need it even if you don't think you'll necessarily benefit: leave the door open to possibility!
I don't understand the purpose of this parameter value?
I have `REQUIRE_SIGNIN_VIEW=true` and I see nothing but my own traffic on Gitea's logs.
Is it because I'm using a subdomain that doesn't imply there's a Gitea instance behind?
Crawlers will find everything on the internet eventually regardless of subdomain (e.g. from crt.sh logs, or Google finds them from 8.8.8.8 queries).
REQUIRE_SIGNIN_VIEW=true means signin is required for all pages - that's great and definitely stops AI bots. The signin page is very cheap for Gitea to render. However, it is a barrier for the regular human visitors to your site.
'expensive' is a middle-ground that lets normal visitors browse and explore repos, view the README, and download release binaries. Signin is only required for "expensive" pageloads, such as viewing file content at specific commits or git history.
Thanks for the clarification!
From Gitea's docs I was under the impression that it went further than "true", so I didn't understand the point, since "true" was already enough to keep the bots from bothering me.
But in your case you want a middle ground, which is what "expensive" provides!
Neat https://docs.gitea.com/administration/config-cheat-sheet#ser...
> Enable this to force users to log in to view any page or to use API. It could be set to "expensive" to block anonymous users accessing some pages which consume a lot of resources, for example: block anonymous AI crawlers from accessing repo code pages. The "expensive" mode is experimental and subject to change.
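For reference, a minimal sketch of where that setting lives in Gitea's app.ini (the [service] section); paths and exact comments are from my own reading of the docs, so double-check against your install:

```ini
; app.ini (Gitea), commonly /etc/gitea/app.ini or custom/conf/app.ini
[service]
; "true" locks every page behind signin; "expensive" (experimental) only
; gates costly views such as blame, per-commit file content, and history.
REQUIRE_SIGNIN_VIEW = expensive
```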
Forgejo doesn't seem to have copied that feature yet
> Worryingly, VNPT and Bunny Communications are home/mobile ISPs
VNPT is a residential / mobile ISP, but they also run datacentres (e.g. [1]) and offer VPS, dedicated server rentals, etc. Most companies would use separate ASes for residential vs hosting use, but I guess they don't, which would make them very attractive to someone deploying crawlers.
And Bunny Communications (AS5065) is a pretty obvious 'residential' VPN / proxy provider trying to trick IP geolocation / reputation providers. Just look at the website [2], it's very low effort. They have a page literally called 'Sample page' up and the 'Blog' is all placeholder text, e.g. 'The Art of Drawing Readers In: Your attractive post title goes here'.
Another hint is that some of their upstreams are server-hosting companies rather than transit providers that a consumer ISP would use [3].
[1] https://vnpt.vn/doanh-nghiep/tu-van/vnpt-idc-data-center-gia... [2] https://bunnycommunications.com/ [3] https://bgp.tools/as/5065#upstreams
If you don't need global access, I have found that geoblocking is the best first step, especially if you are in a small country with a small footprint and can get away with blocking the rest of the world. But even if you live in the US, excluding Russia, India, Iran and a few others will cut your traffic by a double-digit percentage.
In the article, quite a few of the listed sources of traffic would simply be unable to access the server if the author could get away with a geoblock.
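As a sketch of one way to do an application-level geoblock (real setups usually do this in the firewall or reverse proxy instead): Python with MaxMind's free GeoLite2 country database via the geoip2 package. The allowlist and database path here are placeholders, not anything from the article.

```python
# Hedged sketch: allowlist-style geoblocking in a small WSGI middleware.
# Assumes a local GeoLite2-Country.mmdb file and the `geoip2` package.
import geoip2.database
import geoip2.errors

ALLOWED = {"DE", "AT", "CH"}  # example allowlist, adjust to your audience
reader = geoip2.database.Reader("GeoLite2-Country.mmdb")

def geoblock(app):
    def middleware(environ, start_response):
        ip = environ.get("REMOTE_ADDR", "")
        try:
            country = reader.country(ip).country.iso_code
        except geoip2.errors.AddressNotFoundError:
            country = None  # private/unknown addresses fall through to the block
        if country not in ALLOWED:
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Not available in your region\n"]
        return app(environ, start_response)
    return middleware
```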
This makes me a little sad. There's an ideal built into the Internet, that it has no borders, that individuals around the world can connect directly. Blocking an entire geographic region because of a few bad actors kills that. I see why it's done, but it's unfortunate.
You can't make the argument that it's a small group of bad actors. It's quite a massive group of unrelentingly malicious actors
I read it as small compared to total population affected by the block
Massive in terms of money and power, small in terms of souls
I know what you mean.
But the numbers don't lie. In my case, I locked down to a fairly small group of European countries and the server dropped from about 1500 bot scans per day to 0.
The tradeoff is just too big to ignore.
It's not because of a few bad actors, it's because of a hostile or incompetent government.
Every country has (at the very least) a few bad actors, it's a small handful of countries that actively protect their bad actors from any sort of accountability or identification.
To be fair most of my bad traffic is from the US.
I mean if that's the case, the conversation obviously changes.
Reminds me of when 4chan banned Russia entirely to stop DDoSes. I can't find it but there was a funny post from Hiro saying something like "couldn't figure out how to stop the DDoS. Banned Russia. DDoS ended. So Russia is banned. /Shrug"
Similarly, for my e-mail server, I manually add spammers into my exim local_sender_blacklist a single domain at a time. About a month into doing this, I just gave up and added `*@*.ru` and that instantly cut out around 80% of the spam e-mail.
It's funny observing their tactics though. On the whole, spammers have moved from the bare domain to various prefixes like @outreach.domain, @msg.domain, @chat.domain, @mail.domain, @contact.domain and most recently @email.domain.
It's also interesting watching the common parts before the @. Most recently I've seen a lot of marketing@, before that chat@, and about a month after I blocked that, chat1@. I mostly block `*@domain` though, so I'm less aware of these trends.
We've had a similar discussion at my work. E-commerce that only ships to North America. So blocking anyone outside of that is an option.
Or I might try and put up Anubis only for them.
Be slightly careful with commerce websites, because GeoIP databases are not perfect in my experience.
I got accidentally locked out of my server when I connected over Starlink, which IP-maps to the US, even though I was physically in Greece.
As practical advice, I would use a blocklist for commerce websites, and an allowlist for infra/personal.
There is a small OTC medical device that is about $60 in the US, quadruple the price in my country. I tried to order one to be sent to a US family member's house, who was coming the following month to visit. However I could not order because I was not in the US.
In the end I found another online store, paid $74, and got the device. So the better store lost the sale due to blocking non-US orders.
I don't know how much of a corner case this is.
That's a good point! I'll probably start with a blocklist.
Just keep in mind, that could block legit users who are outside the country. One case being someone traveling and wanting to buy something to deliver home. Another case being a non-resident wanting to buy something to send to family in the service zone.
I'm not saying don't block, just saying be aware of the unintended blocks and weigh them.
Great comment - thank you.
Also consider tourists outside of their home country. If, e.g., I'm in Indonesia when Black Friday hits and I'm trying to buy things back home and the site is blocked; shit. I mean, personally I can just use my house as a VPN exit node thanks to Tailscale, but most people aren't technical enough to do that.
Anubis cut the accesses on my little personal Forgejo instance with nothing particularly interesting on it from about 600K hits per day to about 1000.
That’s the kind of result that ensures we’ll be seeing anime girls all over the web in the near future.
> VNPT and Bunny Communications are home/mobile ISPs. i cannot ascertain for sure that their IPs are from domestic users, but it seems worrisome that these are among the top scraping sources once you remove the most obviously malicious actors.
This will be in part people on home connections tinkering with LLMs at home, blindly running some scraper instead of (or as well as) using the common pre-scraped data-sets and their own data. A chunk of it will be from people who have been compromised (perhaps by installing/updating a browser add-in or “free” VPN client that has become (or always was) nefarious) and their home connection is being farmed out by VPN providers selling “domestic IP” services that people running scrapers are buying.
I have trouble imagining any home LLM tinkerer who tries to run a naive scraper against the rest of the internet as part of their experiments.
Much more likely are those companies that pay people (or trick people) into running proxies on their home networks to help with giant scraping projects that want to rotate through thousands of "real" IPs.
Correct. These are called "residential proxies".
Disagree on the method:
> I recall that bot farms use pre-paid SIM cards for their data connections so that their traffic comes from a good residential ASN.
No client compromise is required; it's network abuse that gives you a good reputation if you use mobile data.
But yes, selling botnets made of compromised devices is also a thing.
SIM cards are one of the ways the big boys do it. They give you a nice CGNAT to hide behind and essentially can’t be blocked without blocking a nontrivial chunk of the country. Although more and more fixed-line ISPs are moving to CGNAT too, so you can get that advantage there as well.
Do git clients support HTTP/2.0 yet? Or could they use SSH? I ask because I block most of the bots by requiring HTTP/2.0 even on my silliest of throw-away sites. I agree their caching method is good and should be used when much of the content is cacheable. Blocking specific IPs is a never-ending game of whack-a-mole. I do block some data-center ASNs as I do not expect real people to come from them, even though they could. It's an acceptable trade-off for my junk. There are many things people can learn from capturing TCP SYN packets for a day and comparing them to access logs to sort out bots vs legit people. There are quite a few headers that a browser will send that most bots do not. Many bots also fail to send a valid TCP MSS and TCP window size.
Anyway, test some scrapers and bots here [1] and let me know if they get through. A successful response will show "Can your bot see this? If so you win 10 bot points." and a figlet banner. Read-only SFTP login is "mirror" and no pw.
[Edit] - I should add that I require clients to say they accept English, optionally in addition to other languages, while a couple of language combinations are blocked outright; e.g. en,de-DE,de is good, de-DE,de will fail, just because. Not suggesting anyone do this.
[1] - https://mirror.newsdump.org/bot_test.txt
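Not the parent's actual setup (theirs presumably sits at the web server and packet level), but a toy Python/ASGI sketch of the two checks described above: require HTTP/2 and require an Accept-Language that includes English.

```python
# Toy ASGI middleware sketch: reject clients that did not negotiate HTTP/2,
# and clients whose Accept-Language header omits English. Purely illustrative.
async def deny(send, status, body):
    await send({"type": "http.response.start", "status": status,
                "headers": [(b"content-type", b"text/plain")]})
    await send({"type": "http.response.body", "body": body})

def bot_filter(app):
    async def wrapper(scope, receive, send):
        if scope["type"] == "http":
            if scope.get("http_version") != "2":
                return await deny(send, 426, b"HTTP/2 required\n")
            headers = dict(scope["headers"])
            langs = headers.get(b"accept-language", b"").lower()
            if b"en" not in langs:
                return await deny(send, 403, b"Blocked\n")
        await app(scope, receive, send)
    return wrapper
```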
> I do block some data-center ASNs as I do not expect real people to come from them, even though they could.
My company runs our VPN from our datacenter (although we have our own IP block, which hopefully doesn’t get blocked)
It's of course up to everyone to block whatever they find appropriate for their use case. My hobby stuff is not revenue generating, so I have more options at my disposal.
Those with revenue-generating systems should capture TCP SYN traffic for a while, monitor access logs and give it the old college try to correlate bots vs legit users by traffic characteristics. Sometimes generalizations can be derived from the correlation, and some of those generalizations can be permitted or denied. There really isn't a one-size-fits-all solution, but hopefully my example can give ideas for additional directions to go. Git repos are probably the hardest to protect since I presume many of the git libraries and tools use older protocols and may look a lot like bots. If one could get people to clone/commit with SSH, there are additional protections that can be utilized at that layer.
[Edit] Other options lie outside of one's network, such as submitting pull requests or feature requests to the maintainers of git libraries so that their HTTP requests look a lot more like a real browser and stand out from 99% of the bots. The vast majority of bots use really old libraries.
I'm not 100% against AI, but I do cheer loudly when I see things like this!
I'm also left wondering what other things you could do. For example - I have several friends who built their own programming languages; I wonder what the impact would be if you translated lots of repositories to your own language and hosted them for bots to scrape? Could you introduce sufficient bias in a LLM to make an esoteric programming language popular?
Russia already does that - poisons the net for future LLM pretraining data.
it's called "LLM grooming"
https://thebulletin.org/2025/03/russian-networks-flood-the-i...
This article shows no evidence for anything it claims. None. All of that while claiming we can’t believe almost anything we read online… well you’re god damn right.
> undermining democracy around the globe is arguably Russia’s foremost foreign policy objective.
Right, because Russia is such a cartoonish villain it has no interest in pursuing its own development and good relations with any other country, all it cares about is annoying the democratic countries with propaganda about their own messed up politics.
When did it become acceptable for journalists to make bold, generalizing claims against whole nations without a single piece of direct, falsifiable evidence for what they claim and, worse, to make claims like this that can be easily dismissed as obviously false by quickly looking at those countries' policies and diplomatic interactions with other countries?!
They link multiple sources, including a Sunshine Foundation report summarizing other research into the area, and a NewsGuard report where they tested claims from the Pravda network directly against leading LLM chatbots: https://static1.squarespace.com/static/6612cbdfd9a9ce56ef931... https://www.newsguardtech.com/special-reports/generative-ai-...
Can you point me to any examples of Russia doing something good or helping anyone except billionaires? No? Then their reputation is well deserved.
As a Russian, I have to say that Putin is indeed way too focused on geopolitics instead of internal state of affairs.
> Right, because Russia is such a cartoonish villain it has no interest in pursuing its own development and good relations with any other country, all it cares about is annoying the democratic countries with propaganda about their own messed up politics.
That's actually pretty much spot on.
When you start believing that there are only good and bad, black and white, them vs us, you know for sure you’ve been brainwashed. That goes for both sides.
so between 0 (good) and 100 (bad), what would be your gray score "badness/evilness" value for the following: Russia, US, China, EU
yes, i know, it's not a linear axis, it's a multi-dimensional perspective thing. so do a PCA/projection and spit out one number, according to your values/beliefs
95,95,95,{depends on the country, from 30 to 100}
For someone who complains about unsupported claims, you seem to make a lot of them.
The fact that you think this is something to do with "both sides" instead of a simple question of facts really gives you away.
> Could you introduce sufficient bias in a LLM to make an esoteric programming language popular?
Wasn't there a study a while back showing that a small sample of data is good enough to poison an LLM? So I'd say it for sure is possible.
On my forge, I mirror some large repos that I use for CI jobs so I'm not putting unfair load on the upstream project's repos. Those are the only repos large enough to cause problems with the asshole AI scrapers. My solution was to put the web interface for those repos behind oauth2-proxy (while leaving the direct git access open to not impact my CI jobs). It made my CPU usage drop 80% instantly, while still leaving my (significantly smaller) personal projects fully open for anyone to browse unimpeded.
I do not understand why the scrapers do not do it in a smarter way: clone the repositories and fetch from there on a daily or so basis. I have witnessed one going through every single blame and log link across all branches and redoing it every few hours! It sounds like they did not even try to optimize their scrapers.
> I do not understand why the scrapers do not do it in a smarter way
If you mean scrapers in terms of the bots, it is because they are basically scraping web content via HTTP(S) generally, without specific optimisations using other protocols at all. Depending on the use case intended for the model being trained, your content might not matter at all, but it is easier just to collect it and let it be useless than to optimise it away⁰. For models where your code in git repos is going to be significant for the end use, the web scraping generally proves to be sufficient so any push to write specific optimisations for bots for git repos would come from academic interest rather than an actual need.
If you mean scrapers in terms of the people using them, they are largely akin to “script kiddies” just running someone else's scraper to populate their model.
If you mean scrapers in terms of the people writing them, then the fact that just web scraping is sufficient, as mentioned above, is likely the significant factor.
> why the scrapers do not do it in a smarter way
A lot of the behaviours seen are easier to reason about if you stop considering scrapers (the people using scraper bots) to be intelligent, respectful, caring people who might give a damn about the network as a whole, or who might care about doing things optimally. Things make more sense if you consider them to be in the same bucket as spammers, who are out for a quick lazy gain for themselves and don't care, or don't even have the foresight to realise, how much it might inconvenience¹ anyone else.
----
[0] the fact this load might be inconvenient to you is immaterial to the scraper
[1] The ones that do realise that they might cause an inconvenience usually take the view that it is only a small one: how can the inconvenience that little old them is imposing really be that significant? They don't take the extra step of considering how many people like them are out there thinking the same. Or they think that if other people are doing it, what is the harm in just one more? Or they just take the view “why should I care if getting what I want inconveniences anyone else?”.
Because that kind of optimization takes effort. And a lot of it.
Recognize that a website is a Git repo web interface. Invoke elaborate Git-specific logic. Get the repo link, git clone it, process cloned data, mark for re-indexing, and then keep re-indexing the site itself but only for things that aren't included in the repo itself - like issues and pull request messages.
The scrapers that are designed with effort usually aren't the ones webmasters end up complaining about. The ones that go for quantity over quality are the worst offenders. AI inference-time data intake with no caching whatsoever is the second worst offender.
The way most scrapers work (I've written plenty of them) is that you just basically get the page and all the links and just drill down.
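For illustration, roughly the shape of it: a deliberately naive sketch using requests and BeautifulSoup, with none of the politeness a well-behaved crawler would add (which is exactly the problem forges run into).

```python
# Bare-bones crawl loop of the "get the page, get the links, drill down"
# variety. Deliberately naive: no robots.txt, no rate limiting, no dedup
# beyond a visited set.
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=1000):
    seen, queue = set(), [start_url]
    host = urlparse(start_url).netloc
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            page = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(page.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == host and link not in seen:
                queue.append(link)   # every blame/log/commit link gets queued
        yield url, page.text
```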
So the easiest strategy to hamper them if you know you're serving a page to an AI bot is simply to take all the hyperlinks off the page...?
That doesn't even sound all that bad if you happen to catch a human. You could even tell them pretty explicitly with a banner that they were browsing the site in no-links mode for AI bots. Put one link to an FAQ page in the banner since that at least is easily cached
When I used to build these scrapers for people, I would usually pretend to be a browser. This normally meant changing the UA and making the headers look like a real browser. Obviously this would fail against more advanced bot detection techniques.
Failing that I would use Chrome / Phantom JS or similar to browse the page in a real headless browser.
I guess my point is since it's a subtle interference that leaves the explicitly requested code/content fully intact you could just do it as a blanket measure for all non-authenticated users. The real benefit is that you don't need to hide that you're doing it or why...
You could add a feature kind of like "unlocked article sharing" where you can generate a token that lives in a cache so that if I'm logged in and I want to send you a link to a public page and I want the links to display for you, then I'd send you a sharing link that included a token good for, say, 50 page views with full hyperlink rendering. After that it just degrades to a page without hyperlinks again and you need someone with an account to generate you a new token (or to make an account yourself).
Surely someone would write a scraper to get around this, but it couldn't be a completely-plain https scraper, which in theory should help a lot.
I would build a little stoplight status dot into the page header. Red if you're fully untrusted. Yellow if you're semi-trusted by a token, and it shows you the status of the token, e.g. the number of requests remaining on it. Green if you're logged in or on a trusted subnet or something. The status widget would link to all the relevant docs about the trust system. No attempt would be made to hide the workings of the trust system.
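A rough sketch of what the token bookkeeping could look like server-side; a plain dict stands in for the cache, and every name here is made up for illustration.

```python
# Rough sketch of the "50 page views with full hyperlinks" token idea.
# A dict stands in for the cache (memcached/redis in practice).
import secrets

TOKEN_BUDGET = 50
tokens = {}                      # token -> remaining full-render page views

def issue_share_token():
    """Called by a logged-in user who wants to share a fully rendered link."""
    token = secrets.token_urlsafe(16)
    tokens[token] = TOKEN_BUDGET
    return token

def render_links_for(request_token):
    """Decide whether this request gets hyperlinks, and decrement the budget."""
    remaining = tokens.get(request_token, 0)
    if remaining <= 0:
        return False, 0          # degrade to the no-links version of the page
    tokens[request_token] = remaining - 1
    return True, remaining - 1   # also feeds the yellow status dot in the header
```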
And obviously, you need things fast, so you parallelize a bunch!
I was collecting UK bank account sort code numbers (buying such a database at the time cost a huge amount of money). I had spent a bunch of time using asyncio to speed up scraping and wondered why it was going so slow; it turned out I had left Fiddler profiling in the background.
Because they don't have any reason to give any shits. 90% of their collected data is probably completely useless, but they don't have any incentive to stop collecting useless data, since their compute and bandwidth is completely free (someone else pays for it).
They don't even use the Wikipedia dumps. They're extremely stupid.
Actually there's not even any evidence they have anything to do with AI. They could be one of the many organisations trying to shut down the free exchange of knowledge, without collecting anything.
I wonder if this is going to push more and more services to be hidden from the public internet.
My personal services are only accessible from my own LAN or via a VPN. If I wanted to share it with a few friends I would use something like Tailscale and invite them to my tailnet. If the number of people grows I would put everything behind a login-wall.
This of course doesn't cover services I genuinely might want to be exposed to the public. In that case the fight with the bots is on, assuming I decide I want to bother at all
For private instances, you can get down to 0 scrapers by firewalling http/s ports from the Internet and using Wireguard. I knew it was time to batten down the hatches when fail2ban became the top process by bytes written in iotop (between ssh login attempts and nginx logs).
The cost of the open, artisanal web has shot up due to greed and incompetence; the crawlers are poorly written.
Why the hell don't these bots just do a git clone and analyse the source code locally? Much less impact on the server, and they would be able to perform the same analysis on all repositories, regardless of what a particular git forge offers.
what makes you think the webscrapers care what pages they request?
I, too, am self-hosting some projects on an old computer. And the fact that you can "hear the internet" (with the fans going) is really cool (unless you're trying to sleep while being scraped).
This is a great reason why letting websites have direct access to git is not a great idea. I started creating static versions of my projects with great success: https://git.erock.io
Do solutions like gitea not have prebuilt indexes of the git file contents? I know GitHub does this to some extent, especially for main repo pages. Seems wild that the default of a web forge would be to hit the actual git server on every http GET request.
The author discusses his efforts in trying caching; in most use cases, it makes no sense to pre-cache every possible piece of content (because real users don't need to load that much of the repository that fast), and in the case of bot scrapers it doesn't help to cache because they're only fetching each file once.
I'd argue every git-backed loadable page in a web forge should be "that fast", at least in this particular use-case.
Hitting the backing git implementation directly within the request/response loop seems like a good way to burn CPU cycles and create unnecessary disk reads from .git folders, possibly killing your drives prematurely. Just stick a memcache in front and call it a day, no?
In the age of cheap and reliable SSDs (approaching memory read speeds), you should just be batch rendering file pages from git commit hooks. Leverage external workers for rendering the largely static content. Web hosted git code is more often read than written in these scenarios, so why hit the underlying git implementation or DB directly at all? Do that for POSTs, sure but that's not what we're talking about (I think?)
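As a sketch of that batch-rendering idea (not how any particular forge does it): a post-receive hook that walks HEAD with git plumbing commands and writes static HTML, so a GET never touches .git. The output path and the markdown dependency are assumptions for illustration.

```python
#!/usr/bin/env python3
# Hedged sketch of batch rendering from a git hook: run as post-receive,
# list HEAD's files, and write static HTML that nginx can serve directly.
import subprocess
import pathlib
import html
import markdown

OUT = pathlib.Path("/var/www/forge-static/myrepo")  # assumed output path

def git(*args):
    return subprocess.run(["git", *args], capture_output=True,
                          text=True, check=True).stdout

for path in git("ls-tree", "-r", "--name-only", "HEAD").splitlines():
    blob = git("show", f"HEAD:{path}")               # file content at HEAD
    dest = OUT / (path + ".html")
    dest.parent.mkdir(parents=True, exist_ok=True)
    if path.endswith(".md"):
        body = markdown.markdown(blob)               # render markdown pages
    else:
        body = f"<pre>{html.escape(blob)}</pre>"     # plain file view
    dest.write_text(f"<html><body>{body}</body></html>")
```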
Why not render the markdown as HTML in this scenario?
"Self-hosting anything that is deemed "content" openly on the web in 2025 is a battle of attrition between you and forces who are able to buy tens of thousands of proxies to ruin your service for data they can resell."
I do wonder though. Content scrapers that truly value data would stand to benefit from deploying heuristics that value being as efficient as possible in terms of information per query. Wastefulness of the described type not only loads your servers, but also their whole processing pipeline on their end.
But there is a different class of player that gains more from nuisance maximization: dominant anti-bot/DDoS service providers, especially those with ambitions of becoming the ultimate internet middleman. Their cost for creating this nuisance is near 0, as they have 0 interest in doing anything with the responses. They just want to annoy you until you cave and install their "free" service; then they can turn around and sell paid access to your data to interested parties.
>Iocaine has served 38.16GB of garbage
And what is the effect?
I opened https://iocaine.madhouse-project.org/ and it gave the generated maze thinking I'm an AI :)
>If you are an AI scraper, and wish to not receive garbage when visiting my sites, I provide a very easy way to opt out: stop visiting.
I got the 418 I'm a teapot response.
The only disappointing aspect of the Iocaine maze is that it is not a literal maze. There should be a narrow, treacherous path through the interconnected web of content that lets you finally escape after many false starts.
Use stagit; static pages served with a simple nginx are blazing fast and should resist any scrapers.
Darcs by its nature can just be hosted by an HTTP server too, without needing a special tool. I use H2O with a small mruby script to throttle IPs.
It would be nice if there was a common crawler offering deltas on top of base checkpoints of the entire crawl; I am guessing most AI companies would prefer not having to mess with their own scrapers. Google could probably make a mint selling access.
commoncrawl.org
Our public web dataset goes back to 2008, and is widely used by academia and startups.
I always wanted to ask:
- How often is that updated?
- How current is it at any point in time?
- Does it have historical / temporal access i.e. be able to check the history of a page a la The Internet Archive?
I was setting up a small system to do web site serving. Mostly just experimental, to try out some code: like learning how to use nginx as a reverse proxy, and learning how to use dynamic DNS services since I am on dynamic DNS at home. Early on, I discovered lots of traffic, and lots of hard drive activity. The HD activity was from logging. It seemed I was under incessant polling from China. Strange: it's a new dynamic URL. I eventually got this down to almost nothing by setting up the firewall to reject traffic from China. That was, of course, before AI scrapers. I don't know what it would do now.
In general, the consensus on HN is that the web should be free, scraping public content should be allowed, and net neutrality is desired.
Do we want to change that? Do we want to require scrapers to pay for network usage, like the ISPs were demanding from Netflix? Is net neutrality a bad thing after all?
I think, for many, the web should be free for humans.
When scraping was mainly used to build things like search indexes which are ultimately mutually beneficial to both the website owner and the search engine, and the scrapers were not abusive, nobody really had a problem.
But for generative AI training and access, with scrapers that DDoS everything in sight, and which ultimately cause visits to the websites to fall significantly while merely returning a mangled copy of their content to the user, scraping is a bad thing. It also doesn't help that the generative AI companies haven't paid most people for their training data.
I'm completely happy for everything to be free. Free as in freedom, especially! Agpl3, creative commons, let's do it!
But for some reason corporations don't want that, I guess they want to be allowed to just take from the commons and give nothing in return :/
The general consensus here is also that a DDOS attack is bad. I haven't seen objections against respectful scraping. You can say many things about AI scrapers but I wouldn't call them respectful at all.
a) There are too damn many of them.
b) They have a complete lack of respect for robots.txt
I'm starting to think that aggressive scrapers are part of an ongoing business tactic against the decentralized web. Gmail makes self hosted mail servers jump through arduous and poorly documented hoops, and now self hosted services are being DDOSed by hordes of scrapers…
Do people truly dislike an organic DDoS?
So much real human traffic that it brings their site down?
I mean yes it's a problem, but it's a good problem.
If my website got hugged to death, I would be very happy. If my website got scraped to hell and back by people putting it into the plagiarism machine so that it can regurgitate my content without giving me any attribution, I would be very displeased
Yet HN does it when linking to poorly optimized sites. I doubt people running forges would complain about AI scrapers if their sites were optimized for serving the static content that is being requested.
If net neutrality is a Trojan horse for 'Sam Altman and the Anthropic guy own everything I do', then I voice my support for a different path.
Net neutrality has nothing to do with how content publishers treat visitors, it's about ISPs who try to interfere based on the content of the traffic instead of just providing "dumb pipes" (infrastructure) like they're supposed to.
I can't speak for everyone, but the web should be free and scraping should be allowed insofar that it promotes dissemination of knowledge and data in a sustainable way that benefits our society and generations to come. You're doing the thing where you're trying to pervert the original intent behind those beliefs.
I see this as a clear example of the paradox of tolerance.
Just as private businesses are allowed "no shirt, no shoes, no service" policies, my website should be allowed a "no heartbeat, no qualia, no HTTP 200".
I switched to rgit instead of running Gitea.
What about a copyright notice on websites stating that anyone using your site for training grants the site owner a perpetual, non-revocable license to the model and must provide a copy of the model upon request? At least then there would be SOME benefit.
Contract law doesn’t work that way.
I had the same problem on our home server.. I just stopped the git forge due to lack of time.
For what it's worth, most requests kept coming in for ~4 days after -everything- returned plain 404 errors. Millions. And there are still some now, weeks later...
Seems like you're cooking up a solid bot-detection solution. I'd recommend adding JA3/JA4+ into the mix; I had good results against dumb scrapers.
Also, have you considered Captchas for first contact/rate-limit?
If you have smart scrapers, then good luck. I recall that bot farms use pre-paid SIM cards for their data connections so that their traffic comes from a good residential ASN. They also have a lot of IPs and overall well-made headless browsers with JS support. Then it's a battle of JS quirks where the official implementation differs from the headless one.
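For reference, a JA3 fingerprint is just an MD5 over fields of the TLS ClientHello, so it stays the same no matter what User-Agent the client claims. The sketch below assumes you have already extracted those fields (from your proxy's logs or a packet capture), which is the actual hard part.

```python
# JA3-style fingerprint: an MD5 over ClientHello parameters, independent of
# the User-Agent header. Extracting these fields from the TLS handshake is
# the real work and is not shown here; GREASE values should be stripped
# before hashing.
import hashlib

def ja3_hash(tls_version: int, ciphers: list[int], extensions: list[int],
             curves: list[int], point_formats: list[int]) -> str:
    fields = [
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    return hashlib.md5(",".join(fields).encode()).hexdigest()
```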
> the difference in power usage caused by scraping costs us ~60 euros a year
I wish there was a public database of corporate ASNs and IPs, so we wouldn't have to rely on Cloudflare or any third-party service to detect that an IP is not from a household.
Scrapers use residential VPNs so such a database would help only up to a certain point
Just search for "residential proxies" and you'll see why this wouldn't help.
There is one. It's called the RIRs.
There is... It's literally available in every RIR database through WHOIS.
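As a rough illustration of using that data: a minimal sketch that asks Team Cymru's public IP-to-ASN whois service which AS an address belongs to, assuming the service still accepts its bulk begin/verbose/end query format. Cache the answers rather than querying per request.

```python
# Minimal IP-to-ASN lookup over plain WHOIS (RFC 3912), assuming Team Cymru's
# public whois.cymru.com service and its bulk begin/verbose/end query format.
import socket

def asn_info(ip: str, server: str = "whois.cymru.com", port: int = 43) -> str:
    query = f"begin\nverbose\n{ip}\nend\n"
    with socket.create_connection((server, port), timeout=10) as sock:
        sock.sendall(query.encode())
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    # Response lines look like: AS | IP | BGP prefix | CC | registry | ... | AS name
    return b"".join(chunks).decode(errors="replace")

if __name__ == "__main__":
    print(asn_info("8.8.8.8"))
```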
Thanks for putting that together. Not my daily cup but it seems like a good reference for server setup.
I host all my stuff behind a vpn. No one but authorized users can get access.
In case you didn't read to the end:
"This is depressing. Profoundly depressing. i look at the statistics board for my reverse-proxy and i never see less than 96.7% of requests classified as bots at any given moment. The web is filled with crap, bots that pretend to be real people to flood you. All of that because i want to have my little corner of the internet where i put my silly little code for other people to see."
Could this be solved with an EULA and some language saying that non-human readers will be billed at $1 per page? Make all users agree to it; they either pay up or they're in breach of contract.
Is this viable?
Say you have identified a non-human reader, you have a (probably fake) user agent and an IP address. How do you imagine you'll extract a dollar from that?
Most of my scraper traffic came from China and Brazil. How am I going to enforce that?
> Is this viable?
no
for many reasons
Does anyone have an idea how to generate, say, insecure code en masse? I think it should be the next frontier. Don't feed them a random bytestream, feed them toxic waste.
Ironically, probably the fastest way to create insecure code is by asking AI chatbots to code
Create a few insecure implementations, parse them into an AST, then turn them back into code (basically compile/decompile) except rename the variables and reorder stuff where you can without affecting the result.
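The parse/rename/unparse step described above is easy to sketch with Python's stdlib ast module (3.9+ for ast.unparse); a real version would also skip builtins and imported names, and statement reordering is left out here.

```python
# Sketch of the parse -> rename -> unparse step using the stdlib ast module.
# A real version would skip builtins and imported names; reordering
# statements safely is not attempted here.
import ast

class Renamer(ast.NodeTransformer):
    def __init__(self) -> None:
        self.mapping = {}

    def _new_name(self, old: str) -> str:
        return self.mapping.setdefault(old, f"var_{len(self.mapping)}")

    def visit_Name(self, node: ast.Name) -> ast.Name:
        node.id = self._new_name(node.id)
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        node.arg = self._new_name(node.arg)
        return node

def rewrite(source: str) -> str:
    tree = Renamer().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))

print(rewrite("def add(a, b):\n    total = a + b\n    return total"))
# def add(var_0, var_1):
#     var_2 = var_0 + var_1
#     return var_2
```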
>When I pretend to be human
>i am lux (it/they/she in English, ça/æl/elle in French)
This blog is written by an insane activist who's claiming to be an animal.
So? Does that somehow invalidate this article?
First 20 seconds of reading the article already indicated something like this. It's all the same.
I don’t think you actually read the article then. The first few paragraphs go into how many pages a git repo can actually be. In fact, you had to actually click on an entirely different link to read about them being a furry, meaning you purposefully went looking for something to complain about, rather than having an argument based on the merits of the post alone.
There are two unfortunate realities here. One, the blatant obviousness of this general complex of ideas and behaviors. Two, how I won't take the knee and just listen to a guy who's broadcasting his b*stiality-adjacent fetish to the world.
Sell it to someone else.
I’m no fan of furries either, but I also don’t see what bearing someone’s personal life has on how to configure a git forge. Maybe you should grow the fuck up?
>personal life
>Maybe you should grow the f*ck up?
Not very "personal" when this form of psychopathy oozes through on every step and happens to be on full public display. But that's the intent, for both of youse. And silent capitulation is the minimal subscription. But that's not happening. Eat sh*t.