Comments
Jappe (2903, 9y): Ours has been running for over a month and does that too! You'll see the oddest sites ever get crawled.
DataSecs (927, 9y): @linuxer4fun Yes, I started off regexing everything myself, just using a buffered reader and then running regexes over the output. I then moved on to JSoup because, well, it offered everything I needed.
I added some features and am now working on a cluster-like engine. That means there is a master server, which is actually a bot that adds links to a queue, and every 10 links it sends a packet with those links to a slave, which processes them. You can have several slave instances connected to the master. The slaves are multi-threaded, with a thread per link.
The communication is done with netty.
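A minimal sketch of the batching logic just described, under stated assumptions: the class and method names (`LinkMaster`, `addLink`, `dispatch`) are hypothetical, and the netty transport is replaced by an in-memory callback so the queue-and-batch behavior stands alone.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch of the master bot's link queue: links are enqueued as
// they are discovered, and every 10 links a "packet" is handed off to a slave.
// The real project sends these packets over netty; dispatch() stands in here.
class LinkMaster {
    static final int BATCH_SIZE = 10;

    private final Queue<String> queue = new ArrayDeque<>();
    private final List<List<String>> dispatched = new ArrayList<>();

    // Called by the master whenever it discovers a new link.
    public void addLink(String url) {
        queue.add(url);
        if (queue.size() >= BATCH_SIZE) {
            List<String> batch = new ArrayList<>();
            for (int i = 0; i < BATCH_SIZE; i++) {
                batch.add(queue.poll());
            }
            dispatch(batch);
        }
    }

    // Stand-in for sending a packet of links to the next connected slave.
    protected void dispatch(List<String> batch) {
        dispatched.add(batch);
    }

    public List<List<String>> getDispatched() {
        return dispatched;
    }

    public int pending() {
        return queue.size();
    }
}
```

With 25 links added, this dispatches two batches of 10 and leaves 5 links waiting in the queue for the next packet.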
DataSecs (927, 9y): @Jappe Have you calculated yet how many links you can retrieve per minute, for example? I'm quite curious, because I'd like to know which approach is actually more efficient. To be honest, I could only guess.
Jappe (2903, 9y): We know it crawls around 100,000 links per hour.
But it depends on how many crawlers are running: for 100,000 links per hour, about 20 crawlers are needed.
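As a quick back-of-the-envelope check of those numbers: 100,000 links per hour spread over 20 crawlers is 5,000 links per crawler per hour, or roughly 83 per crawler per minute. A tiny helper (hypothetical names) makes the arithmetic explicit:

```java
// Back-of-the-envelope crawl-rate arithmetic for the figures quoted above.
class CrawlRate {
    // Links per hour handled by a single crawler.
    public static long perCrawlerPerHour(long linksPerHour, int crawlers) {
        return linksPerHour / crawlers;
    }

    // Links per minute handled by a single crawler (integer approximation).
    public static long perCrawlerPerMinute(long linksPerHour, int crawlers) {
        return perCrawlerPerHour(linksPerHour, crawlers) / 60;
    }
}
```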
DataSecs (927, 9y): @Jappe Oh, I expected more. What download rate have you got?
I have a semi-fixed thread count: each slave connected to the master uses a fixed thread pool whose size is calculated from the number of available processors.
With a 100 Mbit download rate it fetches 100,000 links per minute, and probably fully crawls and indexes 70,000 per minute, if not more.
That means with 1 Gbit you could fetch almost a million links per minute.
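The per-slave pool described above can be sketched like this: a fixed thread pool sized from `Runtime.getRuntime().availableProcessors()`, with one task submitted per link. The class name and the no-op "crawl" body are hypothetical; real code would fetch and parse each page (e.g. with JSoup) inside the task.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a slave's worker pool: fixed size derived from the number of
// available processors, one submitted task per link.
class SlaveWorkers {
    public static int poolSize() {
        return Runtime.getRuntime().availableProcessors();
    }

    // Processes every link on the pool and returns how many completed.
    public static int crawlAll(List<String> links) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize());
        AtomicInteger done = new AtomicInteger();
        for (String link : links) {
            pool.submit(() -> {
                // Real code would download and index the page here.
                done.incrementAndGet();
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done.get();
    }
}
```

A fixed pool keeps the thread count bounded regardless of how many links arrive in a packet, which matters on the low-RAM machines discussed below.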
Jappe (2903, 9y): All right, that's pretty awesome, but what are the specs of your server/computer? We only have two regular PCs, each with only 2 GB of RAM.
Oh, and we run it on a crappy school network, so optimisation is all we can do to make it faster...
DataSecs (927, 9y): @Jappe Yeah, I was pretty impressed.
I totally forgot to mention that: I run it on a Windows machine, and until now it has only run in IntelliJ, not on a server. I have 8 GB of DDR4 and an i7-6700HQ @ 2.6 GHz, a quad-core, so it's neat hardware. On a server I would probably use a VPS with 2-4 GB of RAM and a decent CPU.
Even so, my internet download rate is actually the limiting factor.
I tested it on the school network and it threw many exceptions.
Jappe (2903, 9y): @DataSec That's awesome! We are going to upgrade both PCs from 2 GB to 4 GB each, so it's going to be a little faster than it is right now.
When your crawler starts to find very weird pages on the internet...
Tags: cats, weird, crawler, search engine, java