13

While scraping web sources to build datasets, has legality ever been a concern?

Is it standard practice to check whether a site prohibits scraping?

Comments
  • 5
    Is data publicly accessible without ANY logins ?
    If yes : No problems
    If not : it’s classified as hacking.
  • 4
    @NoToJavaScript glad to have stumbled on this thread.

    is it still hacking if you scrape information from your own account?
  • 3
    @NoToJavaScript the second part should be clarified:
    do you have a legitimate login? if yes, you can scrape.

    TOS and rate limiting are also a factor here.
  • 5
    The desire of the page to allow bots should be in robots.txt on the website.

    http://www.robotstxt.org/
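A minimal sketch of checking robots.txt rules with Python's standard `urllib.robotparser`. The robots.txt content, the `my-scraper` agent name and the example.com URLs are all made up for illustration; against a real site you would point the parser at its robots.txt with `set_url()` and `read()` instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "my-scraper" is an assumed user-agent name for the bot.
print(parser.can_fetch("my-scraper", "https://example.com/public/page"))
print(parser.can_fetch("my-scraper", "https://example.com/private/data"))
print(parser.crawl_delay("my-scraper"))
```

robots.txt only states what the site asks of bots; it says nothing about TOS or copyright, so it is one check among several.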
  • 0
    @Demolishun cool thing.
    just checked. devrant has one, however it's a placeholder.
    apparently it has been "moved permanently"
    are there alternative names under which i should look for those?
  • 0
    @bad-frog devrant has an api
  • 2
    @Demolishun :)))
    curl was easier to figure out
    im green as grass in networks
  • 0
    @Demolishun and i have the ambition of starting making a kinda trading bot in 2 months:)))

    well, first stage at least: scrape the webz for all relevant info, in forums and actual quotes, automate everything so that it spews out the essence.

    once i get decision making right then i will automate it all the way, but i have no date for that stage

    fun thing is that it kinda comes together all by itself.
  • 1
    @bad-frog I already have code which scrapes all TSX symbols in real time every 15 seconds haha

    Good luck with the bot tho. After 2 days the best I could do is "not losing money"

    I'm using https://fr.investing.com/equities/... as the source and scrape the table

    I then use this data to "play" with bot settings.
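A rough sketch of the "scrape the table" step using only Python's standard `html.parser`. This is not the code from the comment above (that one is C#); the HTML snippet, symbols and prices are invented, and a real page will have a different layout that can change without notice.

```python
from html.parser import HTMLParser

# Made-up markup standing in for a quotes table on a real page.
SAMPLE_HTML = """
<table>
  <tr><th>Symbol</th><th>Last</th></tr>
  <tr><td>AAA</td><td>12.34</td></tr>
  <tr><td>BBB</td><td>56.78</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html: str) -> list:
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = parse_table(SAMPLE_HTML)
header, quotes = rows[0], rows[1:]
print(header, quotes)
```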
  • 1
    @bad-frog it's very quick and dirty as i was mostly doing it for funzies, but if it can help :

    https://pastebin.com/3GzETLab

    Also : last time I tested was about 7 weeks ago, so maybe there are some layout / html changes
  • 1
    @NoToJavaScript 1000 thx

    but isnt trading equity expensive?
    tbh i thought more about crypto bc trading fees are basically nonexistent, and my plan for crypto was:

    scrape 4chan/biz and see the occurrence of crypto names.

    prolly build a sentiment analyzer, maybe tinker with it until i get a real tool

    cross-reference that with crypto quotes

    record statistics so as to see mooning

    i should also scrape reddit and tweets and the like

    extend that to names of companies so as to auto-find and verify if there is a squeeze going on. by then i should have enough monetary mass so as to ignore trading fees of playing with the big boys
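The first step of that plan (counting how often each name shows up in scraped posts) could look roughly like this. The posts and the ticker list are invented sample data, `count_mentions` is a hypothetical helper, and whole-word matching is just one assumption about what counts as a mention.

```python
import re
from collections import Counter

# Invented sample data standing in for scraped forum posts.
POSTS = [
    "BTC to the moon, btc!",
    "eth is dead, long BTC",
    "nothing here",
]
NAMES = ["BTC", "ETH", "DOGE"]


def count_mentions(posts, names):
    """Count case-insensitive whole-word mentions of each name."""
    counts = Counter({name: 0 for name in names})
    patterns = {
        name: re.compile(rf"\b{re.escape(name)}\b", re.IGNORECASE)
        for name in names
    }
    for post in posts:
        for name, pattern in patterns.items():
            counts[name] += len(pattern.findall(post))
    return counts


mentions = count_mentions(POSTS, NAMES)
print(mentions)
```

Storing one such count per time window would give the statistics to cross-reference with quotes.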
  • 1
    @bad-frog Well, if the provider (let's say reddit) has an API, just use the API. More efficient and easier. And it doesn't depend on html changes.

    For trading providers, there are some that allow free trades (no fees).

    Robinhood (US only I think)
    Wealthsimple Trade (Canada, this is the one I use)
    Questrade
    more.

    Wealthsimple doesn't have trading APIs, BUT they have a website. Sending an order should not be difficult to reverse-engineer with a couple of F12 in the browser.

    Fair warning, HTTP is not as fast as you think it is :)

    If you want to scrape all forums and blogs when your bot makes a decision, it's already too late. Look at aggregated data sources
  • 0
    @NoToJavaScript 1000 thanks bro, you advanced the whole project by a week at least

    i supposed i had to work with js (which i dont know yet) at a certain point, and now i have a working example
  • 0
    @bad-frog My example is in c# tho
  • 1
    @NoToJavaScript oh, it doesnt have to happen in an instant. also my internet wouldnt allow for a tradebot in the true sense:)

    i will be perfectly content if i get my analysis on a daily basis at first. then maybe increase the frequency to see where i can get, and with what i can get away...

    i doubt many servers would like being submerged by requests...

    but if i have a 10 second resolution, its good enough to even make statistics about the markets response to news, crossreferenced with forums etc...

    the idea is to have a tool to understand trends and follow them
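One way to avoid submerging a server is to enforce a minimum gap between requests. Below is a small sketch of that idea; `Throttle` is a hypothetical helper, the 10 second interval just mirrors the resolution mentioned above, and the clock and sleep functions are injectable only so the behaviour can be checked without actually waiting.

```python
import time


class Throttle:
    """Block until at least min_interval seconds have passed since the last call."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Demo with a fake clock so nothing really sleeps.
fake_time = [0.0]
slept = []


def fake_clock():
    return fake_time[0]


def fake_sleep(seconds):
    slept.append(seconds)
    fake_time[0] += seconds


throttle = Throttle(10, clock=fake_clock, sleep=fake_sleep)
throttle.wait()        # first request goes out immediately
fake_time[0] = 3.0
throttle.wait()        # only 3 s elapsed, so it waits for the rest
fake_time[0] = 25.0
throttle.wait()        # 15 s elapsed, no wait needed
print(slept)
```

In a real loop you would call `Throttle(10).wait()` before each request and let it use the real clock.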
  • 1
    @NoToJavaScript "but thats C#"
    thats exactly what im saying:p
    if its not C, C++ or python you got me lost

    honorary title: bash
  • 1
    @bad-frog /agree

    Anyway it's a fun project ! I don't have enough motivation to work on it daily, but every couple of months I add a brick :)

    It uses the lib https://html-agility-pack.net/ which I find very good for html parsing. It even handles "broken" html (to some degree)
  • 0
    @NoToJavaScript that was my intent too.

    the first step will be in two months for me because it ties in with my learning curriculum

    but otherwise i have a few ideas on the backburner too. also to tie in in time.
  • 0
    @NoToJavaScript niiiiice
    c# is also on my personal list so i might start right away

    even tho parsing isnt hard with C like.
    however i see that it builds requests for you and all

    but then i will have to learn how to build those myself soon...

    ill have to build a server in c++. only std maybe some other one or two, selected by the school

    they really want us to know networking in and out for sure...
  • 2
    @bad-frog And 2 rules for scraping data :
    1. Always provide user agent
    2. Always use cookies
    Some sites will reject requests without these 2.

    The most difficult one I ever did was LinkedIn. THAT SHIT Changes something in layout almost every 2 weeks.
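The two rules above can be sketched with Python's standard library as follows. The user-agent string is a made-up example and `make_opener` is a hypothetical helper; the block only builds the opener and does not contact any site.

```python
import urllib.request
from http.cookiejar import CookieJar

# Made-up identifier; pick something that honestly describes your bot.
USER_AGENT = "my-scraper/0.1 (contact: me@example.com)"


def make_opener(user_agent=USER_AGENT):
    """Build an opener that sends a User-Agent and keeps cookies between requests."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Replaces the default "Python-urllib/x.y" User-Agent header.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


opener, jar = make_opener()
print(opener.addheaders, len(jar))
```

Calling `opener.open(url)` afterwards would send the header and store any cookies the site sets in `jar`, so later requests through the same opener send them back.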
  • 0
    @Nanos I would think yes, but I don't see how to prove it.

    I would do it personally
  • 0
    Yes, you need to check whether that site allows scraping or not.
  • 0
    I would just build a "virtual marketplace" for "practicing trading".

    Like a game.

    And then x amount of fake dollars translate into y subpercentage of real dollars.

    So maybe 100k in the game market translates into $10.

    and then the traders that are good, we aggregate their trades and execute them for real.

    Of course the players don't need to know that and couldnt know that anyway.

    Why invent effective AI when you can just crowdsource from people? I figure some small percentage of users are gonna be super predictors or naturally good at what they do.

    Highly unethical of course if they're not informed.
  • 1
    Also, stuff you scrape from websites may be subject to copyright.
  • 1
    Have you heard about robots.txt?
    That file will tell you what the site wants you to grab and what they don't.

    You can choose to ignore it.

    Also depending on where you are there are copyright and privacy regulations that can get you in trouble.

    Talk to a lawyer.