Tell us how you use Pilosa!

Open alanbernstein opened this issue 7 years ago • 8 comments

If you've used Pilosa for anything, we would love to hear about it! Big or small, work or side project, successful or not (especially if not!), we want to know how it went.

alanbernstein avatar Jan 29 '18 18:01 alanbernstein

Looking at Pilosa for segmentation queries. It would be nice to have a simple SQL-to-PQL mapping so analysts could use it without having to learn another proprietary query language. Pilosa's row vs. column nomenclature is a bit confusing at first coming from an RDBMS, so this would help. Something like this: https://calcite.apache.org/docs/tutorial.html
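For illustration, here is roughly what such a mapping might look like for a simple segmentation count. The index and field names (users, city, device) are hypothetical, and the PQL is a hand-written sketch posted to Pilosa's HTTP /index/<index>/query endpoint, not output from an actual SQL translation layer.

```sh
# Hypothetical SQL an analyst might write:
#   SELECT COUNT(*) FROM users WHERE city = 'austin' AND device = 'mobile';
#
# Rough PQL equivalent, posted to Pilosa's HTTP query endpoint
# (index "users" and keyed fields "city"/"device" are made up for this sketch):
curl localhost:10101/index/users/query \
     -X POST \
     -d 'Count(Intersect(Row(city="austin"), Row(device="mobile")))'
```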

threedliteguy avatar Oct 06 '20 03:10 threedliteguy

@threedliteguy I agree the terminology can be confusing. We do have support for a small but useful subset of SQL in Molecula. Making this more robust and getting it into Pilosa is on our roadmap.

alanbernstein avatar Oct 07 '20 14:10 alanbernstein

I did have some trouble importing data into the Pilosa docker image from behind a firewall when following the instructions. I was able to solve it by setting all of the http_proxy/https_proxy/HTTP_PROXY/HTTPS_PROXY variables inside the container to localhost:10101 (using docker run --env) and disabling the remote metrics collection.
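A minimal sketch of that workaround, assuming the stock pilosa/pilosa image and the default port; the diagnostics setting name is an assumption and worth checking against pilosa server --help for your version:

```sh
# Point the proxy variables at the local Pilosa port so tooling inside the
# container does not try to route requests through the corporate proxy.
docker run -d --name pilosa -p 10101:10101 \
    --env http_proxy=localhost:10101  --env HTTP_PROXY=localhost:10101 \
    --env https_proxy=localhost:10101 --env HTTPS_PROXY=localhost:10101 \
    --env PILOSA_METRIC_DIAGNOSTICS=false \
    pilosa/pilosa:latest
# PILOSA_METRIC_DIAGNOSTICS is an assumed name for the remote metrics/diagnostics
# toggle; verify the exact option for your Pilosa version.
```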

threedliteguy avatar Oct 07 '20 21:10 threedliteguy

Thought I would mention something I ran into while importing data, in case someone else hits the same issue. My test import was a CSV of 1 billion records with timestamps. Partway through the import I got a 'too many open files' error (I'm on Debian 10 using the pilosa docker image). I tried to raise the limit to 10 million by setting fs.file-max=10000000 in /etc/sysctl.conf, modifying /etc/security/limits.conf on the host, and passing --ulimit nofile=10000000:10000000 to docker. But checking ulimit -n inside the container (docker exec -it pilosa /bin/sh) showed only about 1 million. What finally got the container limit to 10 million was additionally setting fs.nr_open=10000000 in /etc/sysctl.conf on the host (then rebooting or running sysctl -p). While importing, I ran both watch free -m and watch cat /proc/sys/fs/file-nr to see how many files were being opened.
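Collected in one place, the settings described above (values as in the comment; fs.nr_open is the kernel ceiling that --ulimit nofile cannot exceed, which is why raising fs.file-max alone was not enough):

```sh
# On the Debian host: raise both the system-wide cap and the per-process ceiling.
cat >> /etc/sysctl.conf <<'EOF'
fs.file-max=10000000
fs.nr_open=10000000
EOF
sysctl -p    # or reboot

# Start the container with a matching per-process limit.
docker run -d --name pilosa -p 10101:10101 \
    --ulimit nofile=10000000:10000000 \
    pilosa/pilosa:latest

# Verify inside the container, and watch usage during the import.
docker exec -it pilosa /bin/sh -c 'ulimit -n'
watch cat /proc/sys/fs/file-nr
```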

When importing time series events with timestamps, the number of open files can be large. For example, with time quantum YMDH, that is roughly 24 hours * 365 days * number of years = number of view folders per field. Multiply by the number of event types/keys to get the number of fragment files. So if the number of event types is 1000, and the date values in the import file range from 1900 to 2020, that is 1000 * 120 * 365 * 24 = 1,051,200,000 open files for that one field. My kernel limit on fs.nr_open is 2,147,483,647.
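A quick shell check of that estimate against the kernel ceiling:

```sh
# Rough fragment-file count for one YMDH field: event keys * years * days * hours.
echo $((1000 * 120 * 365 * 24))   # 1051200000
# Per-process open-file ceiling on this host (2147483647 in the comment above):
cat /proc/sys/fs/nr_open
```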

threedliteguy avatar Nov 06 '20 04:11 threedliteguy

Importing 200M CSV records with long column keys (e.g. MD5 hashes) for multiple keyed fields caused:

500 Internal Server Error: 'PANIC: runtime error: slice bounds out of range [xxx:xxx]' goroutine 13300 [running] ... lookupKey ... insertIDByOffset ...

The solution for me was to import numeric column IDs instead of string column keys.
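A sketch of that workaround, assuming the standard pilosa import CLI and hypothetical index/field names; each CSV line is a rowID,columnID pair with a plain numeric column ID instead of a long string key:

```sh
# data.csv: "rowID,columnID" pairs using numeric column IDs rather than string keys.
cat > data.csv <<'EOF'
1,1001
1,1002
2,1001
EOF

# Command-line bulk import into an existing index/field
# ("myindex"/"myfield" are placeholders).
pilosa import -i myindex -f myfield data.csv
```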

threedliteguy avatar Nov 19 '20 16:11 threedliteguy

@threedliteguy Thanks! It might be more helpful to create new GitHub issues for specific technical problems; this issue was intended to collect general usage feedback.

alanbernstein avatar Nov 20 '20 22:11 alanbernstein

One other suggestion regarding bulk import. I'm converting from Elastic to Pilosa and using curl or an HTTP client in most places. It would be convenient to have an HTTP bulk-import API, similar to Elastic's _bulk, that takes the same CSV text format as the pilosa command-line import. I can't use the SDK in this case, and the protobuf API is not convenient for simple clients. Sending Set() calls is also limited in the number of rows it takes. What I ended up doing is writing my own HTTP endpoint that serves an /index/<index>/field/<field>/bulk API with a CSV POST body, as a proxy to the command-line import.
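Roughly how that proxy is used; the /bulk endpoint and its port belong to the commenter's own service, not to Pilosa's HTTP API, and the index/field names are placeholders:

```sh
# Post a CSV body to the custom bulk endpoint (the proxy service itself is
# hypothetical; only the pilosa import command it wraps is stock Pilosa).
curl -X POST "http://localhost:8080/index/myindex/field/myfield/bulk" \
     -H 'Content-Type: text/csv' \
     --data-binary @data.csv

# On the server side, the proxy writes the body to a temp file and shells out to:
pilosa import -i myindex -f myfield /tmp/body.csv
```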

threedliteguy avatar Nov 24 '20 00:11 threedliteguy

Also, by way of feedback regarding memory usage: I am stress testing to find stability ranges, since once Pilosa runs out of memory and starts swapping, performance of course drops by an order of magnitude or two. To prevent this I am trying to work out the maximum memory usage for a particular data set. Once data is loaded I assume it sits in memory-mapped files, but once it is touched by a query it is held in main memory and doesn't seem to be released. So I wrote a script to get all the key values for all keyed fields and do an individual count on each key value. This appears to force everything into main memory, and gives me an idea of what to provision for RAM. (On top of that there is workspace for any simultaneous large query result sets and other overhead, which I can observe in test runs.)
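A sketch of that warm-up script, assuming keyed fields queried through the HTTP /query endpoint with Rows() and Count(Row()); the index and field names are placeholders, and the JSON path for the returned keys (results[0].keys) is an assumption to verify against your Pilosa version:

```sh
# Touch every row of every keyed field so the memory-mapped fragments are
# pulled into resident memory, then size RAM from the observed high-water mark.
# Assumes jq is installed and keys contain no whitespace.
INDEX=myindex
for FIELD in city device segment; do
    KEYS=$(curl -s "localhost:10101/index/$INDEX/query" \
               -X POST -d "Rows($FIELD)" | jq -r '.results[0].keys[]')
    for KEY in $KEYS; do
        curl -s "localhost:10101/index/$INDEX/query" \
             -X POST -d "Count(Row($FIELD=\"$KEY\"))" > /dev/null
    done
done
# In another terminal, observe memory while this runs:
#   watch free -m
```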

threedliteguy avatar Nov 30 '20 16:11 threedliteguy