11
We3D
349d

What's your stance on scraping publicly available info? Like, is it legal, ethical, etc.?

Comments
  • 13
    If it's public, go for it.

    Just don't be an ass and scrape the entire site in 10 seconds and ruin someone's day with a massive pile of unexpected hits.
  • 6
    Idk, if it's public then it's for everyone. As long as you don't cause any threat/damage/inconvenience to the system, scrape away
  • 3
    Mostly the above: if it's public and you won't turn it into a paid service, just scrape the data, but show some manners and make your scraper go easy on their site.
    If you are making an actual product around this, do be aware of the site's ToS, as they'll often have a bots/scrapers section depending on the type of data. That's when you DO get into the legal/illegal questions.
  • 3
    I would recommend checking whether they have some disclaimer; something might be publicly available but still have legal limits on how it can be used.

    Sure, you are most likely allowed to scrape it, BUT you might not be allowed to use the result any way you want, especially commercially.
  • 2
    public stuff is public. You can scrape whatever you want and use it however you want as far as I'm concerned, including for profit (unless it's copyrighted or protected in other ways, in which case it's up to you to decide if what you're doing is transformative enough to make it legal)

    But yeah, go easy on the poor site. For one, no sysadmin wants to wake up to their site being bombed with requests and think it's a DDoS, and secondly, we should nurture trust among each other. Any time someone is greedy with their scraping is a time when a website owner can decide to add scraping protections and limits, and that makes no one happy. Even a half-second sleep between requests is huge for web services (though it depends on the size of the site :) )
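
A minimal sketch of the "sleep between requests" idea from the comment above, in Python. The `PoliteFetcher` name, the 0.5 second default, and the user-agent string are all assumptions made for illustration; the right delay depends on the site.

```python
import time
import urllib.request


class PoliteFetcher:
    """Fetch URLs one at a time, leaving at least `min_delay` seconds between requests."""

    def __init__(self, min_delay=0.5, fetch=None, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._fetch = fetch or self._http_get
        self._clock = clock
        self._sleep = sleep
        self._last_request = None  # time of the previous request, if any

    @staticmethod
    def _http_get(url):
        # Plain standard-library GET; the User-Agent value is a made-up example.
        req = urllib.request.Request(url, headers={"User-Agent": "polite-scraper-example"})
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.read()

    def get(self, url):
        # If the previous request was too recent, wait out the remainder of the delay.
        if self._last_request is not None:
            elapsed = self._clock() - self._last_request
            if elapsed < self.min_delay:
                self._sleep(self.min_delay - elapsed)
        self._last_request = self._clock()
        return self._fetch(url)
```

Calling `PoliteFetcher().get(url)` in a loop then never hits the site more than about twice per second. The `fetch`, `clock`, and `sleep` parameters exist only so the throttling can be tried out without touching a real site.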
  • 2
    Definitely check ToS of target website and, basically what they say above, ensure legality by scraping sparingly.
  • 1
    good to see that we r all aligned the same way in this case ( at least from current replies ) =]
  • 2
    Any public text, sure.

    Art that people upload to public galleries/portfolios, eh not so much, I'd like to see an opt-in for it that scrapers have to adhere to (good luck with that I guess).

    And easy tools for web developers to add anti-ML-scraping measures like Glaze.
  • 2
    Until now every scraper I made was for personal usage, and I have exactly 0 concerns about that.
  • 0
    @bigmonsterlover b/c the flies...
  • 1
    So far the only time I had to scrape shit it was instagram stuff THAT WE OWNED because that fucking API is idiotic.
  • 2
    @Hazarth Everything is copyrighted by default unless explicitly made public domain, but much can still be freely used.

    But just look at the lawsuits Google has had with newspapers for examples of data that is publicly available but not free to reuse.
  • 1
    if it's public, it's public. end of discussion.
  • 0
    @tosensei yeah, it's kind of like the books copyright protection ( which I never understood btw ). there was this book that said : u can't use anything from this book w/o the permission of the author and to find him there is a url... and the book is nothing special in terms of used word or anything and I'm pretty sure the stories there are mostly made up... so whyyy
  • 0
    @We3D I mean if u use words in your book that u coined then ok, but most books don't qualify so I really don't know how u can claim that no one can quote your shitty or brilliant phrases and words from your book... I'd really appreciate it if someone explains that to me :)
  • 1
    @We3D Words are not copyrightable, but paragraphs of text can be, though fair use can apply.

    It depends on how much you use: if it's a few paragraphs you're most likely fine, as long as they are either generic or you note where you got them.

    But if you scrape the full book and make it available, then you violate copyright; the same goes if you scrape a full site and make the content available.

    If it's a smaller text that is unique, then you can use less unless you have asked permission.

    And I only say this as a heads-up so you do not end up on the wrong side of an expensive lawsuit.

    If you scrape for personal use or aggregate enough, you can also mostly get away with it. That's why ChatGPT has so far avoided the courtroom: the answers are mostly too different, even when they come from a copyrighted source.
  • 0
    10x @Voxera, that does make sense, but the problem I see is this: say I wanna write a book. I have read many, so I have some stock of words and phrases. The question in my mind is how can I be sure that I won't (by accident) use a very popular phrase from one of the books I've read and get sued over it... or I could even quote a fragment from a book I never read, by pure fluke. I know u said small fragments are ok, but what if it's so unique that the author can be sure (not sure how that would happen, even with the help of google.. no matter how many books it knows, they r still not all) that only he used that phrase/paragraph? Also, does the law say what the min word length or char length is that you can sue someone for? And it's still a bit stupid in my eyes. One can't be sure that he is really the first one to use these words in exactly the same order, to have pretensions about their further usage by others...
  • 1
    @We3D It's very much a matter of how much. If it's a common expression it's safe; if it's just a paragraph in a big book, mostly the same, unless it's some iconic paragraph.

    And unless it's a word-for-word copy, you actually get away with borrowing/stealing quite a lot as long as you have enough of your own content around it.

    But if your content mostly consists of others' creations, like a search engine for quotes that by themselves are literal copies, then you're more likely to have problems, for the simple reason that there's too little of your own unique content.

    Similar with music: even if some short sequence is identical, as long as it's just a small part of your creation and the rest is different enough, you usually get away with it, but if your music in big parts matches up with someone else's, it's considered a rip-off.

    And that's why, when scraping, what you do with it is important. If you just republish the content you scrape without enriching it or transforming it in a meaningful way, you can get sued, but if what you publish is sufficiently your own, then you can often be safe.

    Like if you scrape a site and then list all found words alphabetically with the number of times they were used, it's still their content, but no one can use your site instead of theirs to really read the content ;)
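
The alphabetical word-count idea from the comment above could look roughly like this in Python. It is only a sketch: the `word_counts` name and the regex-based definition of a "word" (lowercase ASCII letters with an optional internal apostrophe) are assumptions made for illustration.

```python
import re
from collections import Counter


def word_counts(text):
    """Return (word, count) pairs for `text`, sorted alphabetically by word."""
    # Lowercase everything so "The" and "the" count as the same word.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    # Counter tallies occurrences; sorting the items orders them by word.
    return sorted(Counter(words).items())


if __name__ == "__main__":
    for word, count in word_counts("The cat and the hat. The end!"):
        print(word, count)
```

The output keeps the vocabulary and the counts but throws away word order, so the original text cannot be read back from it.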
  • 0
    @Voxera so it's true then when they say u need to change it as low as 20-30% and claim it as ur own...so little creativity these days... ;}
  • 1
    @We3D It probably depends on how you count but it could be true, most words are fillers anyway and do not really contribute to the meaning :)
  • 3
    @We3D forget what everybody said about legality here. The scraping itself is both ethical and legal. Now what you do with the data is where it gets interesting. Internal usage is always fine.
    If you republish the data that is likely copyright infringement. You can publish derived data of course (if you scraped Amazon you can publish how much 1 star crap they have for example).
    There are also special cases legalising the publishing of anything from snippets to whole websites. This is what search engines operate under, and the Internet Archive is an example of full republication without legal takedown issues. This part is however a grey area, and even for Google it's ethically questionable at times, as they prevent traffic to the source with some features.
    As with any legal stuff location and local law applies. See https://en.m.wikipedia.org/wiki/... for quite some info on the legality.

    As @C0D4 said please don't be an ass about it. You won't be blocked instantly after ruining somebody's day either.