Join devRant
Do all the things like
++ or -- rants, post your own rants, comment on others' rants and build your customized dev avatar
Sign Up
Pipeless API
From the creators of devRant, Pipeless lets you power real-time personalized recommendations and activity feeds using a simple API
Learn More
Search - "web crawler"
-
I wrote a web crawler to find a few problems on my company's blog. I accidentally DOS'd the peice of shit with a single thread. Fuck WordPress.3
-
We got DDoS attacked by some spam bot crawler thing.
Higher ups called a meeting so that one of our seniors could present ways to mitigate these attacks.
- If a custom, "obscure" header is missing (from api endpoints), send back a basic HTTP challenge. Deny all credentials.
- Some basic implementation of rate limiting on the web server
We can't implement DDoS protection at the network level because "we don't even have the new load balancer yet and we've been waiting on that for what... Two years now?" (See: spineless managers don't make the lazy network guys do anything)
So now we implement security through obscurity and DDoS protection... Using the very same machines that are supposed to be protected from DDoS attacks.17 -
!rant
2 days ago I made an image crawler with a web interface that scrapes whatever url you put in and returns a list of images that it found on that website, right now it's only limited to img elements
It was pretty fun to work on it during the weekend :39 -
Just finished my third year of my comp sci degree when a friend found me a position at a very small startup. I was asked to build a web crawler to take job postings off kijiji and craigslist and place them in our database for our clients to find. It didn't take long to build (even with limited experience). It was pretty shady. I didn't think i'd have to deal with the ethics of a task so soon in my new dev-life! Luckily it never made it to the live site. After that they got me to work on their android app (not so shady)
4 years later i still work for that company building apps. It's still a small team, and i love 'em 🤙1 -
I get an email about an hour before I get into work: Our website is 502'ing and our company email addresses are all spammed! I login to the server, test if static files (served separately from site) works (they do). This means that my upstream proxy'd PHP-FPM process was fucked. I killed the daemon, checked the web root for sanity, and ran it again. Then, I set up rate limiting. Who knew such a site would get hit?
Some fucking script kiddie set up a proxy, ran Scrapy behind it, and crawled our site for DDoS-able URLs - even out of forms. I say script kiddie because no real hacker would hit this site (it's minor tourism in New Jersey), and the crawler was too advanced for joe shmoe to write. You're no match for well-tuned rate-limiting, asshole!1 -
So I go into Google Search Console to try to determine why its saying pages are not compliant with mobile and whatnot.
3 hours later I come out realizing that what Google REALLY wants is for everyone to build every web page as static HTML with no script tags and never a call to an external website. Just dump all that javascript and css and HTML into one BIG FRIGGIN FILE so our crawler feels satisfied that it's loading everything all in one request.
No CMSes allowed unless GOOGLE built it.
Let's just all revert to HTML 1.0 and be done with it.1 -
I remember back when I was in pre calculus I decided to take a class online. So my teacher's website was made by him and run on go Daddy, he taught precalculus, calculus, algebra, algebra ii, and computer science. I decided to penetration test his website and use a web crawler. His directory that had the tests, test answers, exams, exam answers, and homework answer's as well as all the books he's written in PDFs, was unprotected, I could access and download them all. He also had a database directory that contained all the students' phone numbers, email addresses, home addresses, and their full names.
I alerted him to this and didn't get anything in turn :P2 -
Just posted this in another thread, but i think you'll all like it too:
I once had a dev who was allowing his site elements to be embedded everywhere in the world (intentional) and it was vulnerable to clickjacking (not intentional). I told him to restrict frame origin and then implement a whitelist.
My man comes back a month later with this issue of someone in google sites not being able to embed the element. GOOGLE FUCKING SITES!!!!! I didnt even know that shit existed! So natually i go through all the extremely in depth and nuanced explanations first: we start looking at web traffic logs and find out that its not the google site name thats trying to access the element, but one of google's web crawler-type things. Whatever. Whitelist that url. Nothing.
Another weird thing was the way that google sites referenced the iframe was a copy of it stored in a google subsite???? Something like "googleusercontent.com" instead of the actual site we were referencing. Whatever. Whitelisted it. Nothing.
We even looked at other solutions like opening the whitelist completely for a span of time to test to see if we could get it to work without the whitelist, as the dev was convinced that the whitelist was the issue. It STILL didnt work!
Because of this development i got more frustrated because this wasnt tested beforehand, and finally asked the question: do other web template sites have this issue like squarespace or wix?
Nope. Just google sites.
We concluded its not an issue with the whitelist, but merely an issue with either google sites or the way the webapp is designed, but considering it works on LITERALLY ANYTHING ELSE i am unsure that the latter is the answer.2 -
I wrote a site specific web crawler for my job(debt collection agency) spent well over 175 hours on it, database integration, remote spider deployment, easy GUI. I need a major raise if I'm going to give it to them though, I'm currently making 12/hr under the title of legal assistant, think I could get a raise to 23/hr. What would you do?9
-
Anyone know how to use a proxy for a web crawler written in native Java for Android. I have a bug in an app in production that only surfaced after being used for a couple of days and I urgently need to fix it.
HELP!!!