Build a String class which is backed by a Numpy vectors under the hood
Feature Description
We are building a numpy subclass which will support SMPC and ADP. However, we also need to be able to manipulate private strings. The solution is to build a string type which uses vectors of integers under the hood to represent strings, and various types of slicing and algebra to facilitate string manipulations. Using this technique, we get to maintain only one encrypted/DP backend while having higher level data structures like strings.
In this project, you won't be working with encryption or DP, you'll be building the string type that uses numerical methods under the hood. DON'T BE FOOLED, this is a very hard project. The devil's in the details, mostly because you are ONLY ALLOWED to manipulate strings by manipulating their numbers. NO BRANCHING OPERATIONS!!! Circuits only. If you're not sure of the difference between a branching operation and a circuit, either do some reading or skip this github issue for now.
There are many ways to store a string as an integer. We can store at three different levels:
- a number for the entire string
- a number for each word (and a list of words is a list of character-integers)
- a number for each character (and a list of words is a list of character-integers
We can assign a string a number using one of several methods:
- hash the string
- pick a fixed length vocabulary of all strings and assign strings to different numbers in that vocabulary (a lookup table), encoding each string as a one-hot vector indexing at that lookup table
- pick a fixed length vocabulary of all strings and assign strings to different numbers in that vocabulary (a lookup table), encoding each string as the integer location in the vocabulary (the 3rd word in the vocabulary would be represented by a "3")
- pick a fixed length vocabulary of all strings and assign strings to different numbers in that vocabulary (a lookup table), encoding each string as a series of 1/0 integers corresponding to that string.
We don't yet know which ones will be fastest / best for SMPC/DP. There are complex tradeoffs. Feel free to think about those tradeoffs now, but don't let it bother your work too much. In this project, you are to build a string type which can represent strings and their many operations using any permutation of the two lists above.
There are some string operations which are vitally important, and some which are less so. Vital string operations include:
- comparing whether two strings are equal (exact match)
- slicing a string
- concatenating a string
- reversing a string
- hashing a string
- using a string to index into a vector (such as a word embedding matrix)
Some string operations are less important, but could be nice:
- capitalizing/lowercasing a string
- finding characters in substrings
- string search / full text index
Note that some representations mentioned above CANNOT support every vital string operation (for example, if your numerical representation of a string is to simply hash the whole thing, you can do VERY fast comparisons but slicing is impossible). This is acceptable functionality. Just make sure the error messages are clear.
This string object should obviously work remotely and be serializable. Note that we have a new class which makes serialization a lot easier: https://github.com/OpenMined/PySyft/blob/demo_strike_team_branch_4/packages/syft/src/syft/core/common/serde/recursive.py I recommend using it.
Prioritize string comparisons (exact match comparison) as the top priority. Do that as your first unit test because it's blocking another project.
As always - make LOTS OF SMALL COMMITS and PUSH OFTEN. If the project catches up with you I will jump in and help you code. This issue is difficult and mission critical.
@iamtrask is this issue still in need of help? It's more than 3 months old at this point but I'd be happy to work on this if necessary.
The method I'm thinking of is a hybrid data structure consisting of:
- a hash of the string (supports fast string comparison, instant string hash)
- an array of characters (supports reasonably fast slicing, concatenation, reversal)
I assume it will be implemented in Numpy, which I am very comfortable with. In this context, I assume branching = if statements such as np.where and circuits = loops or their vectorized equivalent. Do let me know if I'm missing any details.
Additional ideas for string representation:
- Suffix trees where instead of storing a character we store an array. Linear time and space to store but support a lot of fast string operations. This would be great but I have no idea how you would do it without if statements