Comments
Jappe (8y): Ours has been running for over a month and has that too! The oddest sites you will ever see get crawled.
DataSecs (8y): @linuxer4fun Yes, I started off regexing everything myself, just using a BufferedReader and then running regexes over the output. I then moved on to JSoup because it offered everything I needed.
I have since added some features and am now working on a cluster-like engine. That means there is a master server, which is actually a bot that adds links to a queue and, for every 10 links, sends a packet with those links to a slave that processes them. You can have several slave instances connected to the master. The slaves are multi-threaded, with one thread per link (see the sketch below).
The communication is done with Netty.
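A rough sketch of that master-side pipeline, assuming JSoup for the parsing (which the comment confirms). The class name MasterBot, the queue, the discover method, and the sendToSlave placeholder standing in for the Netty transport are all illustrative, not the actual code:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    public class MasterBot {
        private static final int BATCH_SIZE = 10; // master dispatches every 10 links
        private final Queue<String> linkQueue = new ConcurrentLinkedQueue<>();

        // Fetch a page with JSoup and enqueue every absolute link found on it.
        public void discover(String url) throws IOException {
            Document doc = Jsoup.connect(url).get();
            for (Element a : doc.select("a[href]")) {
                linkQueue.add(a.attr("abs:href"));
            }
            drainBatches();
        }

        // Whenever 10 or more links are queued, pack a batch and hand it to a slave.
        private void drainBatches() {
            while (linkQueue.size() >= BATCH_SIZE) {
                List<String> batch = new ArrayList<>(BATCH_SIZE);
                for (int i = 0; i < BATCH_SIZE; i++) {
                    batch.add(linkQueue.poll());
                }
                sendToSlave(batch);
            }
        }

        // Placeholder for the Netty channel write in the real setup.
        private void sendToSlave(List<String> batch) {
            System.out.println("Dispatching batch: " + batch);
        }
    }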
DataSecs (8y): @Jappe Have you calculated yet how many links you can retrieve per minute, for example? I'm quite curious, because I'd like to know which approach is actually more efficient. To be honest, I could only guess.
Jappe (8y): We know it crawls around 100,000 links per hour.
It depends on how many crawlers are running, though; about 20 crawlers are needed for 100,000 links per hour.
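For scale, assuming the load is spread evenly across crawlers: 100,000 links/hour ÷ 20 crawlers = 5,000 links per crawler per hour, or roughly 1.4 links per second each.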
DataSecs (8y): @Jappe Oh, I expected more. What download rate do you have?
I have a semi-fixed thread count: each slave connected to the master uses a fixed thread pool whose size is derived from the number of available processors.
With a 100 Mbit download rate it fetches 100,000 links per minute and probably completely crawls and indexes 70,000 per minute, if not more.
That means with 1 Gbit you could fetch almost a million links per minute.
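A minimal sketch of the thread-pool sizing described above, using the standard java.util.concurrent API. The class name SlaveWorkerPool, the sample batch, and the crawl placeholder are hypothetical stand-ins for the real slave code:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class SlaveWorkerPool {
        public static void main(String[] args) {
            // Size the fixed pool from the number of available processors,
            // as each slave does when it connects to the master.
            int threads = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(threads);

            // Hypothetical batch of links received from the master.
            String[] batch = { "http://example.com", "http://example.org" };
            for (String link : batch) {
                pool.submit(() -> crawl(link)); // one task per link
            }
            pool.shutdown();
        }

        // Placeholder for the actual fetch/parse/index work.
        static void crawl(String link) {
            System.out.println("Crawling " + link);
        }
    }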
Jappe (8y): All right, that's pretty awesome, but what are the specs of your server/computer? We only have two regular PCs, each with only 2 GB of RAM.
Oh, and we run it within a crappy school network, so optimisation is everything we can do to make it faster...
DataSecs (8y): @Jappe Yeah, I was pretty impressed.
I totally forgot to mention that. I run it on a Windows computer, and until now it has only run in IntelliJ, not on a server. I have 8 GB of DDR4 and an i7-6700HQ @ 2.6 GHz, a quad-core, so it's decent hardware. On a server I would probably use a VPS with 2-4 GB of RAM and a decent CPU.
Even so, my internet download rate is actually the most limiting factor.
Tested it on the school network and it threw many exceptions.
Jappe (8y): @DataSec That's awesome!! We are going to upgrade both PCs from 2 GB to 4 GB each, so it's going to be a little faster than it is right now.