There are sites that list different datasets, which makes them datasets of datasets.
Since there are several such sites and Google indexes all of them, that makes Google a dataset of dataset datasets.

Comments
    That's datasetception. By this logic, only Google has to be crawled by LLMs. To be honest, Google does have the biggest collection of resources to build the biggest LLM; it can fetch full content instead of small descriptions, so it has a lot of potential. But I don't know whether the amount of data is really what limits an LLM. I think there is such a thing as too much data, or data of such poor quality that you don't need it. The current LLMs have probably already consumed all the high-quality data there is, plus some of the low-quality data; the rest probably won't matter for learning purposes. To give an idea of their depth: I recently found out they can generate devrant API clients for dotnet. That's quite impressive, since I've made an API client myself and the information is not in one place; the unofficial devrant API documentation is incomplete and even misses the API URL. Since the model knows even that, I consider that it has enough knowledge. It's probably more about context analysis and how well it interprets the data it already has.
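    For illustration, here is a minimal sketch of what such a client might look like, written in Python rather than dotnet for brevity. The base URL, the app=3 query parameter, and the response fields are taken from community write-ups of the unofficial devrant API and are assumptions, not an official spec:

    ```python
    # Minimal sketch of a client for the unofficial devrant API.
    # Base URL, the app=3 parameter, and the "rants" response field are
    # assumptions based on community documentation, not an official spec.
    import json
    import urllib.request

    API_BASE = "https://devrant.com/api"  # assumed, undocumented officially

    def fetch_rants(sort="algo", limit=10, skip=0):
        """Fetch a page of rants from the public feed."""
        url = (f"{API_BASE}/devrant/rants"
               f"?app=3&sort={sort}&limit={limit}&skip={skip}")
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        return data.get("rants", [])

    if __name__ == "__main__":
        for rant in fetch_rants(limit=5):
            print(rant["id"], rant["text"][:60])
    ```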