web crawler

Ranter

AntaresStar

173

Comments

8

plusgut

5966

8y

* make a request to the website
* parse the HTML
* optional: execute the JS
* extract the information you want
* recursion: collect the links and start again from the top for each one
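
The steps above can be sketched with only the Python standard library. This is a minimal sketch under assumptions, not a full crawler: the names LinkParser, fetch, and crawl are made up for illustration, the crawl stays on the starting host, it uses a queue rather than literal recursion, and it ignores robots.txt, rate limiting, and JavaScript.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def fetch(url):
    """Make the request and return the response body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=50):
    """Return the set of same-host URLs discovered from start_url."""
    host = urlparse(start_url).netloc
    seen = {start_url}            # every URL discovered so far
    queue = deque([start_url])    # URLs still waiting to be fetched
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except (OSError, ValueError):
            continue              # unreachable or malformed URL: skip it
        fetched += 1
        parser = LinkParser()
        parser.feed(html)         # parse the HTML
        for href in parser.hrefs: # the wanted information here is the links
            link = urldefrag(urljoin(url, href)).url  # absolute, no #fragment
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)  # start again from the top for this link
    return seen
```

Passing a different fetch_page callable makes it possible to try the loop on canned pages before pointing it at a real site.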
3

AntaresStar

173

8y

@plusgut So is it pretty much HTML parsing?
6

plusgut

5966

8y

@AntaresStar Sure, but there are libraries for it. I wouldn't recommend writing one on your own...
2

coolq

4770

8y

@AntaresStar
There are different kinds of web crawling and scraping. What exactly are you trying to do?
3

AntaresStar

173

8y

@coolq I just want to extract all the links inside a website.
2

AntaresStar

173

8y

@Alice I'll do it! :)
6

Anaeijon

512

8y

@AntaresStar In my opinion you don't even need to parse everything.
It would be far more efficient and easier to build a regular expression that searches for strings starting with http or https.
Add every string you find to a list.
Make sure duplicates can't be added. A sorted list might be useful for that, or even a database if the whole thing might get bigger.
Mark the URLs you have analyzed and keep iterating over the unmarked elements in the list.
You could do this recursively, but that can get really memory-consuming.
It's better to also store the depth with each element in the list, in case you want to stop somewhere. Just set the depth of new items to the depth of the current item + 1.
There are a lot of opportunities to optimize, for example multiprocessing (even with multiple clients, if you use a database for storage).
Extend the regex if you want.
Good start:
^["']https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)["']
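
The regex approach described in this comment might look roughly like the following in Python. It is only a sketch under assumptions: URL_RE adapts the pattern above without a leading ^ anchor (which would restrict matches to the start of the text) and without \b, a dict plays the role of the duplicate-free list and also stores the depth, and fetch_page is a hypothetical callable that takes a URL and returns the page text.

```python
import re
from collections import deque

# Quoted absolute http(s) URLs; the capturing group excludes the quotes.
URL_RE = re.compile(
    r"""["'](https?://(?:www\.)?[-a-zA-Z0-9@:%._+~#=]{2,256}\.[a-z]{2,6}"""
    r"""[-a-zA-Z0-9@:%_+.~#?&/=]*)["']"""
)


def extract_urls(text):
    """Return the distinct URLs found in text, in order of first appearance."""
    return list(dict.fromkeys(URL_RE.findall(text)))


def crawl_to_depth(start_url, fetch_page, max_depth=2):
    """Walk the links iteratively and record the depth of each URL found."""
    depth_of = {start_url: 0}        # doubles as the duplicate check
    pending = deque([start_url])     # the unmarked (not yet analyzed) URLs
    while pending:
        url = pending.popleft()      # taking it off the queue marks it
        depth = depth_of[url]
        if depth >= max_depth:
            continue                 # recorded, but not expanded further
        try:
            text = fetch_page(url)
        except OSError:
            continue                 # skip pages that cannot be fetched
        for found in extract_urls(text):
            if found not in depth_of:
                depth_of[found] = depth + 1   # depth of current item + 1
                pending.append(found)
    return depth_of
```

One limitation worth knowing: a pattern like this only sees absolute URLs, so relative links such as href="/about" are missed unless the HTML is actually parsed.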
2

AntaresStar

173

8y

@Anaeijon thanks!
2

Anaeijon

512

8y

correction:
remove the \b in the regex
1

coolq

4770

8y

@jAsE
Ummm, you never told me this 😂
0

coolq

4770

8y

@jAsE
It's all good mate, no need to talk about it.

If you ever feel up to it, maybe a future rant? Let it all out, but that would be hard. I don't know what it's like, so I can only hope you're all right?
0

coolq

4770

8y

@jAsE
Thanks for elaborating.

Sounds to me like you've got this under control 👍
1

-Neo

48

8y

Take a look at PhantomJS, you can build a link scraper in maybe 10 lines of code with it
0

coolq

4770

8y

@-Neo
PhantomJS is good, try pairing it with Selenium.
0

plusgut

5966

8y

@-Neo PhantomJS is dead. The maintainer said that everyone should use headless Chrome.
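
For pages that build their links with JavaScript, one possible way to use headless Chrome from Python is to ask it for the rendered DOM with --dump-dom and then pull the links out of that. This is a rough sketch under assumptions: the binary name "google-chrome" varies by system (it may be chromium or a full path), and dump_dom_command, rendered_html, and links_in are names invented for this example.

```python
import subprocess
from html.parser import HTMLParser


def dump_dom_command(url, chrome_binary="google-chrome"):
    """Build the command line asking headless Chrome for the rendered DOM."""
    # chrome_binary is an assumption; adjust it to the local install.
    return [chrome_binary, "--headless", "--disable-gpu", "--dump-dom", url]


def rendered_html(url, chrome_binary="google-chrome"):
    """Run Chrome and return the DOM it prints after loading the page."""
    result = subprocess.run(
        dump_dom_command(url, chrome_binary),
        capture_output=True,
        text=True,
        check=True,     # raise CalledProcessError if Chrome exits non-zero
        timeout=60,
    )
    return result.stdout


class HrefCollector(HTMLParser):
    """Collects the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def links_in(html):
    """Return the href values of all <a> tags in html, in document order."""
    collector = HrefCollector()
    collector.feed(html)
    return collector.hrefs
```

With Chrome installed, links_in(rendered_html("https://example.com")) would return the hrefs present after the page's scripts have run.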
3

kargaroth

809

8y

Guys, so it's easier to write a post on devRant than to just Google the keyword and read any of the thousands of articles on web crawlers with examples, pointers and links to resources? o.O
0

AntaresStar

173

8y

@kargaroth That's what I am doing.

But now that I am part of this community, I think I can also learn a lot from your experience.

So thank you all for your comments :)