PittQuantumRepository

Autocomplete JSON File Suggestions

Open JoshuaRogan opened this issue 10 years ago • 8 comments

I have autocomplete finished, and it works pretty well. You can search by any part of the name or any part of the formula (e.g., typing C or O2 will find CO2). I'm just about to add synonyms, but those are handled essentially the same way as names, so that won't be an issue.
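One way to get "any part of the name or formula" matching out of a prefix-based index is to index every suffix of each word. The sketch below is only an illustration of that idea, not the code from the auto-complete branch; the helper names and the `{ name, formula }` record shape are assumptions.

```javascript
// Hypothetical helper: emit every suffix of each whitespace-separated word,
// lowercased, so a prefix match on a token is a substring match on the word.
function suffixTokens(text) {
  const tokens = new Set();
  for (const word of text.toLowerCase().split(/\s+/)) {
    for (let i = 0; i < word.length; i++) {
      tokens.add(word.slice(i));
    }
  }
  return Array.from(tokens);
}

// Hypothetical record shape: { name, formula }.
function moleculeTokens(molecule) {
  return suffixTokens(molecule.name).concat(suffixTokens(molecule.formula));
}

// A query matches when at least one token starts with it.
function matches(molecule, query) {
  const q = query.toLowerCase();
  return moleculeTokens(molecule).some((token) => token.startsWith(q));
}

const co2 = { name: "carbon dioxide", formula: "CO2" };
matches(co2, "O2"); // true, via the "o2" suffix of the formula
matches(co2, "C");  // true, via "carbon" and "co2"
```

The trade-off is size: suffix tokens multiply the number of indexed strings, which matters if the tokens are precomputed and shipped inside the page.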

The question I'd like feedback on is how I am building the autocomplete dictionary. It is going to be a two-step process that doesn't involve querying the database. I determined that in about 450KB I can store 1000 molecules embedded in the HTML. I think I can still squeeze more into that size, but I don't want to add too much weight to every page. These molecules should be the most common ones, but I don't have an efficient way to determine that yet. After the initial page load, the page will make a request to get more molecules, maybe all of them, to further improve the search.

TL;DR: Does anyone have an idea for determining what the most common molecules will be, based on how we are storing our files? One idea is to start logging searches and their results, which would also let us optimize the autocomplete even further.
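If search logging were added, ranking by popularity could be as simple as counting which molecules users pick. This is a rough sketch under assumed names: the `{ query, selectedId }` log entry shape and both functions are hypothetical, since no logging exists yet.

```javascript
// Hypothetical log entry shape: { query, selectedId }, where selectedId is
// the molecule the user picked, or null when nothing was picked.
function countSelections(logEntries) {
  const counts = new Map();
  for (const entry of logEntries) {
    if (entry.selectedId == null) continue; // skip searches with no pick
    counts.set(entry.selectedId, (counts.get(entry.selectedId) || 0) + 1);
  }
  return counts;
}

// Return the ids of the `limit` most frequently selected molecules.
function mostSelected(logEntries, limit) {
  return Array.from(countSelections(logEntries))
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([id]) => id);
}
```

The resulting id list could then decide which molecules go into the embedded set.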

JoshuaRogan avatar Oct 11 '15 00:10 JoshuaRogan

How are you storing the molecules? Using a data structure like a prefix tree/DLB trie should be really efficient for this kind of thing.
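For reference, a prefix tree lookup can be sketched in a few lines. This is a minimal illustration with made-up class names; it uses a `Map` per node rather than the linked sibling lists of a DLB trie, and it is not how any particular suggestion engine stores its data.

```javascript
class TrieNode {
  constructor() {
    this.children = new Map(); // next character -> child node
    this.items = [];           // entries whose key ends at this node
  }
}

class PrefixTrie {
  constructor() {
    this.root = new TrieNode();
  }

  // Store `item` under the lowercased `key`.
  insert(key, item) {
    let node = this.root;
    for (const ch of key.toLowerCase()) {
      if (!node.children.has(ch)) node.children.set(ch, new TrieNode());
      node = node.children.get(ch);
    }
    node.items.push(item);
  }

  // Return up to `limit` items whose key starts with `prefix`.
  search(prefix, limit = 10) {
    let node = this.root;
    for (const ch of prefix.toLowerCase()) {
      node = node.children.get(ch);
      if (!node) return []; // no key has this prefix
    }
    const results = [];
    const stack = [node];
    while (stack.length > 0 && results.length < limit) {
      const current = stack.pop();
      for (const item of current.items) {
        if (results.length < limit) results.push(item);
      }
      for (const child of current.children.values()) stack.push(child);
    }
    return results;
  }
}
```

Lookup cost depends on the length of the prefix and the number of results collected, not on the total number of stored molecules.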

RitwikGupta avatar Oct 11 '15 00:10 RitwikGupta

Everything is in a trie in JavaScript. I am using a suggestion engine called Bloodhound (https://github.com/twitter/typeahead.js/blob/master/doc/bloodhound.md). It's very powerful and lightweight. The issue is just getting all of the data to the client. I don't want to query back and forth with the server, as that is very slow for something like autocomplete, where you need results instantly.

JoshuaRogan avatar Oct 11 '15 00:10 JoshuaRogan

Is the data 450KB compressed?

RitwikGupta avatar Oct 11 '15 00:10 RitwikGupta

No, it isn't. I can certainly fit more into 500KB after some optimization, but not 100,000+ molecules.

JoshuaRogan avatar Oct 11 '15 00:10 JoshuaRogan

You can check out the file in the auto-complete branch under scripts.

JoshuaRogan avatar Oct 11 '15 00:10 JoshuaRogan

You definitely don't want to match the whole string. People expect that an autocomplete should match a prefix. Shouldn't we be able to transfer ~500KB compressed? With gzip encoding, that should be a good-sized database.

ghutchis avatar Oct 11 '15 01:10 ghutchis

As far as "the most important..." goes, it's hard to judge without data (e.g., search logs), but I'd probably prioritize based on:

  • Is it already in Wikipedia (e.g., we have a wiki link in the record)
  • What is the molecular weight and number of atoms (i.e., we should prioritize based on small, less complex molecules)
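The two criteria above could be turned into a sort order along these lines. This is a sketch only: the field names `hasWikiLink`, `atomCount`, and `molecularWeight` are assumptions about the record format, and the tie-breaking order (Wikipedia link first, then fewer atoms, then lower weight) is one possible choice.

```javascript
// Hypothetical fields: hasWikiLink (boolean), atomCount, molecularWeight.
// Negative result means `a` should come before `b`.
function comparePriority(a, b) {
  const aWiki = a.hasWikiLink ? 0 : 1;
  const bWiki = b.hasWikiLink ? 0 : 1;
  if (aWiki !== bWiki) return aWiki - bWiki;                      // linked records first
  if (a.atomCount !== b.atomCount) return a.atomCount - b.atomCount; // then fewer atoms
  return a.molecularWeight - b.molecularWeight;                   // then lighter molecules
}

// Return the `count` highest-priority molecules without mutating the input.
function topMolecules(molecules, count) {
  return molecules.slice().sort(comparePriority).slice(0, count);
}
```

Calling `topMolecules(allMolecules, 1000)` would then produce the candidate list for the embedded set, where `allMolecules` stands for whatever array the records are loaded into.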

ghutchis avatar Oct 11 '15 01:10 ghutchis

The 500KB will be embedded in essentially every page, so you always get some autocomplete functionality even if the server is slow to respond to the remaining requests. That is only for 1000 molecules. All of the other molecules will be downloaded asynchronously after page load.
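The embedded-first, load-the-rest-later flow might look roughly like the following. It is a library-free sketch: `SuggestionIndex`, `loadRemaining`, `fetchAll`, and the `{ id, name }` record shape are all hypothetical. Bloodhound's documentation describes `local` and `prefetch` options and an `add` method that serve a similar purpose, but they are not used here.

```javascript
// Hypothetical record shape: { id, name }.
class SuggestionIndex {
  constructor(embedded) {
    this.byId = new Map();
    this.add(embedded); // usable immediately with the embedded subset
  }

  // Merge records, skipping ids that are already present.
  add(molecules) {
    for (const m of molecules) {
      if (!this.byId.has(m.id)) this.byId.set(m.id, m);
    }
  }

  // Linear prefix scan for brevity; a real index would use a trie
  // or a suggestion engine.
  search(prefix, limit = 10) {
    const q = prefix.toLowerCase();
    const hits = [];
    for (const m of this.byId.values()) {
      if (hits.length >= limit) break;
      if (m.name.toLowerCase().startsWith(q)) hits.push(m);
    }
    return hits;
  }
}

// `fetchAll` is a hypothetical function returning a Promise of records,
// for example a wrapper around fetch() for the full-list JSON file.
function loadRemaining(index, fetchAll) {
  return fetchAll()
    .then((molecules) => index.add(molecules))
    .catch(() => {
      // Keep serving the embedded subset if the request fails.
    });
}
```

Because `add` skips ids it already holds, the full download can safely include the molecules that were embedded in the page.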

Those are great suggestions and should be easy to generate files for.

JoshuaRogan avatar Oct 11 '15 02:10 JoshuaRogan