Meta shouldn't be doing this, they need to be more careful, but...
I used to work on a site with basic caching, a big web framework, every page dynamic, and 3 frontend webservers plus a database primary and replica. Super basic infra, and a bill close to this user's.
We would never have noticed 3 to 4 requests per second like this. And we weren't being that smart about it: we were rendering every page rather than serving from cache (we mostly cached DB results). We were also conscious of not accidentally building SEO bot traps that would send crawlers around in loops, not because of the traffic generated, but because it was bad for SEO!
This just strikes me as bad engineering on both sides. Yes, Meta is the one with the big budget and they should sort this out, but you also can't pay 10-100x for your infra and then get annoyed when you have a big bill. On the web, people and bots are going to make requests, and you just have to design for that.
Obviously horrendous, but why isn't this person monitoring his site?
Also, why do people use vercel nowadays? I'm sure there are reasons, but I moved over to railway (insert your alternative provider here) and I no longer f* around trying to fix page load times due to cold starts, I have predictable pricing, and my sites on railway are so much faster. Plus, if cost is a factor, railway offers serverless. It's not as shiny as vercel, but nextjs works perfectly on it.
It astounds me that vercel has positioned itself as a sanctuary city for normies, and yet the city is littered with landmines and booby traps.
Don't underestimate the number of people who don't care how their company's money is spent.
Crazy to me that someone would run a website where you pay for every request you receive, instead of a fixed monthly rate. It’s an obvious recipe for disaster - crossing the wrong guy would cost you dearly. Or just a crawler running amok.
So sue Meta. Denial of service is a crime.
That's like 4 requests per second, hardly seems excessive at all. We're not on dial-up anymore.
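For scale, 4 requests per second works out to roughly 345,000 requests a day, or about 10 million a month: trivial for any server to process, but a very different proposition under per-request pricing.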
You’re not serious, right?
I am. Modern computers and network connections are so fast that this amounts to literally nothing. It's standard internet background noise and it's really not a problem.
> He probably thinks the internet works on static 5k html pages, while the norm is 100kb, dynamically generated pages.
I just work on web stuff that people actually use. It's 2026, thousands of requests per second is nothing. You'll probably be fine even with stock apache2 and some janky php scripts.
A single gbit line will serve a 100kB page a thousand times a second without issues.
Dynamically generated pages you can't easily serve at rates in excess of tens of thousands of requests per second from commodity hardware are extremely rare.
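Back-of-the-envelope on the gigabit figure: 1 Gbit/s is roughly 125 MB/s, and 125 MB/s divided by 100 kB per page comes out to about 1,250 pages per second before TLS and HTTP overhead, so "a thousand times a second" is in the right ballpark.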
For most web apps, bandwidth won't be the issue, it'll still be I/O bound, or maybe CPU bound.
Sure, but CPUs and I/O are so fast now that it's genuinely difficult to hit those bottlenecks unless you're doing something weird.
Also, hardware these days is good enough that a CRUD web app could very well be bandwidth limited.
I don't think you realize how fast modern CPUs are. If this stresses your server out, you probably have no business hosting things publicly on that server. This person is hosting stuff on Vercel using serverless, which is the root of their problem.
4 requests per second is just noise. It's like complaining about car noise after deciding to buy a house next to the freeway. Exposing things publicly on the internet means _anyone_ can try talking to your server. Real users, bots, hackers, whatever. You can't guarantee bots are bug-free!
Dynamic content is _typically_ served to logged in users. Content that is public facing is typically cached, for obvious reasons. Of course Meta should fix this…but using Vercel and serverless in this manner is a very poor choice.
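To make the caching point concrete, here's a minimal sketch of a public page that lets the CDN absorb repeat hits, assuming a Next.js App Router route handler (the file path and header values are arbitrary examples, and it presumes the CDN in front honours s-maxage):

  // app/public-page/route.ts - minimal sketch: let the edge cache absorb repeat hits
  export async function GET(): Promise<Response> {
    const html = "<!doctype html><title>public page</title><p>hello</p>";
    return new Response(html, {
      headers: {
        "Content-Type": "text/html; charset=utf-8",
        // cache at the CDN for an hour, serve stale for a day while revalidating,
        // so most repeat crawler hits are answered from cache, not a function
        "Cache-Control": "public, s-maxage=3600, stale-while-revalidate=86400",
      },
    });
  }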
Meta isn’t going to fix this because they have your mindset.
Meanwhile, my website with 48M pages over 8 domains is getting hammered with over 200 req/s 24/7 from AI bots, in addition to the regular search engine bots. It seems like every day new bots appear that all want to download every single one of my URLs.
To me it's not background noise. It's a problem. It simply eats a lot of CPU and bandwidth. I could do with 95% fewer resources and have faster response times for my actual users if these bots would just bugger off.
even 100 kB dynamically generated pages should be a piece of cake. if it's CRUD like (original op's site is), it should be downright trivial to transfer that much on like... shared hosting (although even a VPS would be much better).
(in original op's case, i clocked 197 requests using 20.60 MB while browsing their site for a little bit. most of it is static assets and i had caching disabled so each new pageload loaded stuff like the apple touch icons.)
honestly you could probably put it behind nginx for the statics and just use bog standard postgres or even prolly sqlite. nice bonus in that you don't have to worry about cold start times either!
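For what it's worth, a minimal sketch of that kind of setup, assuming Node with Express and better-sqlite3 (the table and route names are made up for illustration); nginx would sit in front serving the static assets and proxying the rest to this process:

  // minimal sketch: long-running Node process + SQLite file, no cold starts
  import express from "express";
  import Database from "better-sqlite3";

  const db = new Database("app.db"); // a single local file, no separate DB server
  db.exec("CREATE TABLE IF NOT EXISTS posts (id INTEGER PRIMARY KEY, title TEXT)");

  const app = express();

  // nginx serves /static/* straight from disk; only dynamic routes land here
  app.get("/posts/:id", (req, res) => {
    const post = db
      .prepare("SELECT id, title FROM posts WHERE id = ?")
      .get(Number(req.params.id)) as { id: number; title: string } | undefined;
    if (!post) return res.status(404).send("not found");
    res.send(`<!doctype html><h1>${post.title}</h1>`); // escaping omitted for brevity
  });

  app.listen(3000);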
I don't have a car. I don't need one because trains exist. My website can also handle 4 requests per second.
I have the same problem. I have 6M URLs per domain, across 8 different domains. 80% of search traffic is long tail.
If I don’t block, 95% of my resources will be spent on feeding bots.
I had to block all "official" AI user agents and entire countries like Singapore and China. But there are so many unofficial bots spreading their work over dozens of IP addresses that it seems impossible to block them at the reverse proxy level. How do you block those?
>If I don’t block, 95% of my resources will be spent on feeding bots.
Okay, but why should you care? Resource usage for a regular website that isn't doing some heavy dynamic stuff or video streaming tends to be rather negligible. You can easily serve 100k+ requests per second on servers that cost less than $100/mo.
It really shouldn't be worth your time to do anything about the bots unless you're doing something unusual.
Believe it or not, the website is not a static txt file.
Anything significantly more complicated than CRUD apps like HN is pretty rare on the web.
If the resource usage of a website is a concern, either your code is straight up broken or you're doing something rather unusual. When doing unusual things, it's normal to encounter unusual problems. However, when you hit an unusual problem, it's good to stop for a moment and consider whether your approach is wrong.
At some point the only good way to stop scraping becomes paywalls. You can't defeat sophisticated scrapers through any other means.
So you're blaming the destruction of the open internet on the technical prowess of indie developers like me, and not on the greedy big tech leeches with thousands of mindless developers who do everything in their power to make life worse for the little guys?
I don't think the open internet is being destroyed at all. This is just the usual complaining about internet background noise that's been happening for decades.
Is there more background noise than before? Yes, probably. Is it a big deal yet? Still no.
Block based on cookies (i.e., set a cookie on the browser and check on the server whether it exists).
This project implements a variety of similar JS-less checks, such as image loading: https://github.com/WeebDataHoarder/go-away
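A minimal sketch of the cookie gate, written as Express middleware (the cookie name and the meta-refresh trick are just illustrative, not what go-away actually does): the first request gets a Set-Cookie plus a cheap page that immediately reloads, and only requests presenting the cookie reach the real handlers.

  // minimal sketch: cookie gate as Express middleware
  import express from "express";

  const app = express();
  const COOKIE = "seen_before"; // hypothetical cookie name; the value barely matters

  app.use((req, res, next) => {
    if ((req.headers.cookie ?? "").includes(`${COOKIE}=1`)) return next();
    // no cookie yet: set one and send a cheap page that immediately reloads itself
    res.setHeader("Set-Cookie", `${COOKIE}=1; Path=/; Max-Age=86400`);
    res.send('<!doctype html><meta http-equiv="refresh" content="0">');
  });

  app.get("/", (_req, res) => res.send("expensive dynamic page goes here"));
  app.listen(3000);

Anything that doesn't keep a cookie jar never gets past the cheap page; sophisticated scrapers with full cookie and JS support will of course sail straight through, which is why the discussion above ends up at paywalls.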