
Since Google is failing me...

Given a user input (a string query) and a list of larger strings (like email bodies), what's the best way to search and rank the list of strings against the user input?

So far I have implemented Levenshtein distance, but it doesn't perform very well: short strings rank well against each other, whereas long strings **containing** exact matches fall lower in the list.

Should I be splitting the input and the list by word and then averaging the distances?

The only mitigation I have tried is removing complete non-matches from the list, by excluding them when the distance equals the length of the longer string.
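
For concreteness, here's roughly what I mean by splitting and averaging (a rough sketch, not tested; `levenshtein`, `score`, and `rank` are names I made up):

```js
// Sketch: tokenize query and document, score each query word by its
// best (lowest) Levenshtein distance to any document word, then average.
function levenshtein(a, b) {
  // Classic dynamic-programming edit distance.
  const dp = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
  for (let i = 0; i <= a.length; i++) dp[i][0] = i;
  for (let j = 0; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                  // deletion
        dp[i][j - 1] + 1,                                  // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

function score(query, doc) {
  const qWords = query.toLowerCase().split(/\W+/).filter(Boolean);
  const dWords = doc.toLowerCase().split(/\W+/).filter(Boolean);
  if (!qWords.length || !dWords.length) return Infinity;
  const total = qWords
    .map(q => Math.min(...dWords.map(d => levenshtein(q, d))))
    .reduce((a, b) => a + b, 0);
  return total / qWords.length; // lower = better match
}

// Rank documents ascending by average per-word distance.
const rank = (query, docs) => [...docs].sort((a, b) => score(query, a) - score(query, b));
```

This way a long email containing an exact word match scores 0 for that word regardless of the email's length, which addresses the problem above.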

Comments
  • 2
I've actually just found a rather complicated answer after finally using the search terms "document search algorithms", so maybe this will help.
  • 2
    @AlgoRythm I had an entire subject about this, would you like the slides?
  • 1
    They're not well structured for rapid reading, but you can extract keywords from them and then puzzle it together yourself, which is exactly what I did immediately before the exam.
  • 1
    @lbfalvy yeah, that would be cool. I'm not actually in any real need, I just had a curiosity and was surprised at how the solution isn't exactly as straightforward as I had thought, especially considering how common of an issue it must be.

Right now I'm looking at using Damerau-Levenshtein and tf-idf plus some custom weighting options, maybe
  • 1
    @AlgoRythm That looks like a good start, I'll collect the slides for you but if you've gotten this far there might not be as much novelty for you.
  • 1
    @lbfalvy seeing someone put together everything will help me formulate my master plan.

    If I end up with a product, I might do a simple write-up and share it so we can compare.

    Let's see what a curious (and likely somewhat drunk, considering the three day weekend) casual can do against academia ;)
  • 1
    @AlgoRythm Oh it's not a product, I wrote an exam and the slides are literally just a pile of info on indexing and search engines in general, though the emphasis is on long form text.

    Academia is a scam, I'm finishing my course because of sunk cost and in the hopes that it'll help me get settled in the UK but I'd never sign up knowing what I do now.

If there ever was a minimum level of service, it's gone with the pandemic, and we're basically receiving a package of YouTube tutorials and accommodation for £11k per year, except the lecturers have the shittiest microphones I've ever had to listen to.
  • 0
    @lbfalvy my use case is client-side searching of small to medium sized data sets, so I'm really just looking for good results and not incredible performance and scaling
  • 0
    @lbfalvy and furthermore by product I just mean deliverable, so anything that works. Demo, prototype, and bad ideas included. Not a literal production ready product.
  • 0
    @AlgoRythm Your best bet is probably to index sequences of words that occur in multiple documents and individual words according to tf-idf, though in the specific case of email you definitely want to also capture words that only occur very rarely in the dataset, even if they aren't repeated within letters. The problem with tf-idf is that it targets long form documents which repeat their subject matter many times, and emails are typically shorter than that.
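
    Roughly, the tf-idf part looks like this (a simplified sketch with made-up names; real implementations add smoothing and normalisation):

    ```js
    // Simplified tf-idf: tf = term count / doc length, idf = log-scaled rarity.
    function buildIndex(docs) {
      const tokenized = docs.map(d => d.toLowerCase().split(/\W+/).filter(Boolean));
      const df = new Map(); // how many documents each term appears in
      for (const words of tokenized) {
        for (const t of new Set(words)) df.set(t, (df.get(t) || 0) + 1);
      }
      return { tokenized, df, n: docs.length };
    }

    function tfidfScore(query, index, docIdx) {
      const words = index.tokenized[docIdx];
      if (!words.length) return 0;
      const terms = query.toLowerCase().split(/\W+/).filter(Boolean);
      return terms.reduce((sum, t) => {
        const tf = words.filter(w => w === t).length / words.length;
        const idf = Math.log(1 + index.n / ((index.df.get(t) || 0) + 1)); // smoothed to stay positive
        return sum + tf * idf;
      }, 0);
    }
    ```

    Note how rare terms automatically get a large idf, which is exactly the property you want for email.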
  • 2
@lbfalvy yeah, from what my research is gathering so far, it really seems to come down to using the available methods (and combinations thereof) to tune your algorithm to the data set you're expecting.

I just didn't want to call it a day at something like `data.filter(item => item.toLowerCase().includes(searchString.toLowerCase()))`

    And now I'm stuck in a rabbit hole
  • 0
Information retrieval, ranking algorithms, tf-idf.

    PostgreSQL has a built-in solution for this (full-text search).
  • 0
@AlgoRythm At a minimum, index every individual word that's not a pronoun, then build your algo on top of this dictionary. This should yield an index that's about ½-⅔ the size of the dataset, considering the average length of an English word, and your algorithms will be high powers of log(N) rather than linear in N (and possibly exponential in the query length).
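
    As a sketch (the pronoun list is just an illustrative stub):

    ```js
    // Inverted index: every non-pronoun word -> set of documents containing it.
    const PRONOUNS = new Set(['i', 'you', 'he', 'she', 'it', 'we', 'they']); // stub list
    function invertedIndex(docs) {
      const index = new Map();
      docs.forEach((doc, id) => {
        for (const w of new Set(doc.toLowerCase().split(/\W+/).filter(Boolean))) {
          if (PRONOUNS.has(w)) continue;
          if (!index.has(w)) index.set(w, new Set());
          index.get(w).add(id);
        }
      });
      return index;
    }
    // A query word is then one Map lookup instead of a scan over every document.
    ```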
  • 0
This is the stupidest approach to the problem I can think of that stays within the constraints of usability. I'm not saying you should do it, but establishing the worst contender that could actually succeed in the absence of better options gives you some basis to compare serious attempts against.

Better storage efficiency is achieved most simply by replacing documents with arrays of indices into the word table; I think that could keep the storage increase insubstantial compared to the raw data while also simplifying algorithm design (great for CPUs that are constantly trying to guess your next move). Something like the sketch below.
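
    Rough sketch with made-up names:

    ```js
    // Store each document as an array of indices into a shared word table.
    function encode(docs) {
      const table = [];      // index -> word
      const ids = new Map(); // word -> index
      const encoded = docs.map(doc =>
        doc.toLowerCase().split(/\W+/).filter(Boolean).map(w => {
          if (!ids.has(w)) { ids.set(w, table.length); table.push(w); }
          return ids.get(w);
        })
      );
      return { table, encoded }; // encoded docs are flat integer arrays
    }
    ```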
  • 0
    I'll get you the slides tomorrow, today was hectic
  • 1
If exact matches are enough, look at tries (aka prefix trees). You can index the dataset and then look up words (or their prefixes) with insanely fast lookup times. They're just not really useful if you also want to find similar but different words.
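
    A minimal sketch (made-up names):

    ```js
    // Prefix tree: insert words, then list every indexed word under a prefix.
    const makeNode = () => ({ children: new Map(), word: null });

    function insert(root, word) {
      let node = root;
      for (const ch of word) {
        if (!node.children.has(ch)) node.children.set(ch, makeNode());
        node = node.children.get(ch);
      }
      node.word = word; // marks a complete word
    }

    function withPrefix(root, prefix) {
      let node = root;
      for (const ch of prefix) {
        node = node.children.get(ch);
        if (!node) return []; // no word starts with this prefix
      }
      const out = [];
      (function collect(n) {
        if (n.word !== null) out.push(n.word);
        for (const child of n.children.values()) collect(child);
      })(node);
      return out;
    }
    ```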
  • 0
    Here are the slides I promised

    https://mega.nz/folder/8W5FRAzZ/...