
Since Google is failing me...

Given a user input (a string query) and a list of larger strings (like email bodies), what's the best way to search and rank the list of strings against the user input?

So far I have implemented Levenshtein distance, but it doesn't perform very well: short strings rank well against each other, whereas long strings **containing** exact matches fall lower in the list.

Should I be splitting the input and the list by word and then averaging the distances?

The only mitigation I have tried is removing complete non-matches from the list, by excluding them when the distance equals the length of the longer string.
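
For concreteness, here's roughly what I mean by splitting and averaging (a rough sketch, not tested; `levenshtein`, `score`, and `rank` are names I made up):

```js
// Sketch: tokenize query and document, score each query word by its
// best (lowest) Levenshtein distance to any document word, then average.
function levenshtein(a, b) {
  // Classic dynamic-programming edit distance.
  const dp = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
  for (let i = 0; i <= a.length; i++) dp[i][0] = i;
  for (let j = 0; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                  // deletion
        dp[i][j - 1] + 1,                                  // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

function score(query, doc) {
  const qWords = query.toLowerCase().split(/\W+/).filter(Boolean);
  const dWords = doc.toLowerCase().split(/\W+/).filter(Boolean);
  if (!qWords.length || !dWords.length) return Infinity;
  const total = qWords
    .map(q => Math.min(...dWords.map(d => levenshtein(q, d))))
    .reduce((a, b) => a + b, 0);
  return total / qWords.length; // lower = better match
}

// Rank documents ascending by average per-word distance.
const rank = (query, docs) => [...docs].sort((a, b) => score(query, a) - score(query, b));
```

This way a long email containing an exact word match scores 0 for that word regardless of the email's length, which addresses the problem above.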

Comments
  • 2
I've actually just found a rather complicated answer after finally using the search terms "document search algorithms", so maybe this will help.
  • 2
    @AlgoRythm I had an entire subject about this, would you like the slides?
  • 1
    They're not well structured for rapid reading, but you can extract keywords from them and then puzzle it together yourself, which is exactly what I did immediately before the exam.
  • 1
    @lbfalvy yeah, that would be cool. I'm not actually in any real need, I just had a curiosity and was surprised at how the solution isn't exactly as straightforward as I had thought, especially considering how common of an issue it must be.

Right now I'm looking at using Damerau-Levenshtein and tf-idf plus some custom weighting options, maybe
  • 1
    @AlgoRythm That looks like a good start, I'll collect the slides for you but if you've gotten this far there might not be as much novelty for you.
  • 1
    @lbfalvy seeing someone put together everything will help me formulate my master plan.

    If I end up with a product, I might do a simple write-up and share it so we can compare.

    Let's see what a curious (and likely somewhat drunk, considering the three day weekend) casual can do against academia ;)
  • 1
    @AlgoRythm Oh it's not a product, I wrote an exam and the slides are literally just a pile of info on indexing and search engines in general, though the emphasis is on long form text.

    Academia is a scam, I'm finishing my course because of sunk cost and in the hopes that it'll help me get settled in the UK but I'd never sign up knowing what I do now.

If there ever was a minimum level of service, it's gone with the pandemic, and we're basically receiving a package of YouTube tutorials and accommodation for £11k per year, except the lecturers have the shittiest microphones I've ever had to listen to.
  • 0
    @lbfalvy my use case is client-side searching of small to medium sized data sets, so I'm really just looking for good results and not incredible performance and scaling
  • 0
    @lbfalvy and furthermore by product I just mean deliverable, so anything that works. Demo, prototype, and bad ideas included. Not a literal production ready product.
  • 0
    @AlgoRythm Your best bet is probably to index sequences of words that occur in multiple documents and individual words according to tf-idf, though in the specific case of email you definitely want to also capture words that only occur very rarely in the dataset, even if they aren't repeated within letters. The problem with tf-idf is that it targets long form documents which repeat their subject matter many times, and emails are typically shorter than that.
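
    Roughly, the tf-idf part looks like this (a simplified sketch with made-up names; real implementations add smoothing and normalisation):

    ```js
    // Simplified tf-idf: tf = term count / doc length, idf = log-scaled rarity.
    function buildIndex(docs) {
      const tokenized = docs.map(d => d.toLowerCase().split(/\W+/).filter(Boolean));
      const df = new Map(); // how many documents each term appears in
      for (const words of tokenized) {
        for (const t of new Set(words)) df.set(t, (df.get(t) || 0) + 1);
      }
      return { tokenized, df, n: docs.length };
    }

    function tfidfScore(query, index, docIdx) {
      const words = index.tokenized[docIdx];
      if (!words.length) return 0;
      const terms = query.toLowerCase().split(/\W+/).filter(Boolean);
      return terms.reduce((sum, t) => {
        const tf = words.filter(w => w === t).length / words.length;
        const idf = Math.log(1 + index.n / ((index.df.get(t) || 0) + 1)); // smoothed to stay positive
        return sum + tf * idf;
      }, 0);
    }
    ```

    Note how rare terms automatically get a large idf, which is exactly the property you want for email.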
  • 2
@lbfalvy yeah, from what my research is gathering so far, it really seems to come down to using the available methods (and combinations thereof) to tune your algorithm to the data set you're expecting.

I just didn't want to call it a day at something like `data.filter(item => item.toLowerCase().includes(searchString.toLowerCase()))`

    And now I'm stuck in a rabbit hole
  • 0
Information retrieval, ranking algorithms, tf-idf.

    PostgreSQL has a built-in solution for this (full-text search).
  • 0
@AlgoRythm At a minimum, index every individual word that's not a pronoun, then build your algo on top of this dictionary. This should yield an index that's about ½-⅔ the size of the dataset, considering the average length of an English word, and your algorithms will be high powers of log(N) rather than linear in N (and possibly exponential in the query length).
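
    As a sketch (the pronoun list is just an illustrative stub):

    ```js
    // Inverted index: every non-pronoun word -> set of documents containing it.
    const PRONOUNS = new Set(['i', 'you', 'he', 'she', 'it', 'we', 'they']); // stub list
    function invertedIndex(docs) {
      const index = new Map();
      docs.forEach((doc, id) => {
        for (const w of new Set(doc.toLowerCase().split(/\W+/).filter(Boolean))) {
          if (PRONOUNS.has(w)) continue;
          if (!index.has(w)) index.set(w, new Set());
          index.get(w).add(id);
        }
      });
      return index;
    }
    // A query word is then one Map lookup instead of a scan over every document.
    ```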
  • 0
This is the stupidest approach to the problem I can think of that stays within the constraints of usability. I'm not saying you should do it, but establishing the worst contender that could actually succeed in the absence of better options gives you some basis to compare serious attempts against.

Better storage efficiency is achieved most simply by replacing documents with arrays of indices into the word table; I think that could keep the storage increase insubstantial compared to the raw data while also simplifying algorithm design (great for CPUs that are constantly trying to guess your next move). Something like the sketch below.
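
    Rough sketch with made-up names:

    ```js
    // Store each document as an array of indices into a shared word table.
    function encode(docs) {
      const table = [];      // index -> word
      const ids = new Map(); // word -> index
      const encoded = docs.map(doc =>
        doc.toLowerCase().split(/\W+/).filter(Boolean).map(w => {
          if (!ids.has(w)) { ids.set(w, table.length); table.push(w); }
          return ids.get(w);
        })
      );
      return { table, encoded }; // encoded docs are flat integer arrays
    }
    ```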
  • 0
    I'll get you the slides tomorrow, today was hectic
  • 1
If exact matches are enough, look at tries (aka prefix trees). You can index the dataset and then look up words (or their prefixes) with insanely fast lookup times. They're just not really useful if you also want to find similar but different words.
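
    A minimal sketch (made-up names):

    ```js
    // Prefix tree: insert words, then list every indexed word under a prefix.
    const makeNode = () => ({ children: new Map(), word: null });

    function insert(root, word) {
      let node = root;
      for (const ch of word) {
        if (!node.children.has(ch)) node.children.set(ch, makeNode());
        node = node.children.get(ch);
      }
      node.word = word; // marks a complete word
    }

    function withPrefix(root, prefix) {
      let node = root;
      for (const ch of prefix) {
        node = node.children.get(ch);
        if (!node) return []; // no word starts with this prefix
      }
      const out = [];
      (function collect(n) {
        if (n.word !== null) out.push(n.word);
        for (const child of n.children.values()) collect(child);
      })(node);
      return out;
    }
    ```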
  • 0
    Here are the slides I promised

    https://mega.nz/folder/8W5FRAzZ/...