mdanalysis icon indicating copy to clipboard operation
mdanalysis copied to clipboard

Setting attributes should have dtype checks

Open richardjgowers opened this issue 5 years ago • 20 comments

Our Topology system is statically typed, each TopologyAttr (found here) should only accept one data type, e.g. atom names (here) should always be strings, resids should always be ints.

Currently setting an atom name to an integer gives "TypeError: 'int' object is not iterable"

import MDAnalysis as mda

u = mda.fetch_mmft('181l')

u.atoms[0].name = 1

This should instead fail and inform that atom names need to be strings.

Setting a Residue's resid to a float seems to work...

import MDAnalysis as mda

u = mda.fetch_mmft('181l')

u.residues[0].resid = 2.4

This probably is using the inbuilt rounding in numpy int arrays, but it should probably raise an error too.

To fix this, we could have a check_dtype decorator around TopologyAttr.set_atoms (and residue/segment) to check that the supplied values are the correct type.

richardjgowers avatar Jun 16 '20 10:06 richardjgowers

@richardjgowers i would like to work on this issue

darmis007 avatar Jun 16 '20 16:06 darmis007

I would like to work on this issue

b-thebest avatar Jun 27 '20 07:06 b-thebest

@richardjgowers sir, can you please review my Pull Request #2788 regarding this issue

darmis007 avatar Sep 03 '20 10:09 darmis007

@richardjgowers is this issue still relevant? Is there anything where you'd want to describe more clearly what you'd want to see so that the issue becomes suitable for the GSOC-starter label?

orbeckst avatar Apr 06 '22 08:04 orbeckst

is this issue still open to work on

olivia632 avatar Apr 16 '22 08:04 olivia632

I don't see an open PR referencing this issue so please feel free to open a PR. Make sure that you fill in the PR comment template properly and reference this issue.

orbeckst avatar Apr 18 '22 22:04 orbeckst

Hello @orbeckst. I'm Fortune from the Outreachy program. I just opened a pull request for this issue.

fortune-max avatar Apr 22 '22 14:04 fortune-max

@richardjgowers I'm not sure you can blindly enforce dtypes without also giving a preferred actual type; the dtype of string arrays is set to object, which an integer also is. Other people have also wanted to set TopologyAttrs of arbitrary type as well, so we can't assume every object is a string. The fix for strings might need to edit _StringInternerMixin._set_X to just warn or error if a non-string is given.

The TypeError raised in the initial message isn't just because the dtype is wrong, it's because the value setter assumes the given value is either a string or a list. Then no further type checking is done. Setting multiple names works fine:

>>> u.atoms[:3].names = [1, 2, 3]
>>> u.atoms[:3].names
array([1, 2, 3], dtype=object)

Tagging @fortune-max here as well as you've opened a PR for the issue :)

lilyminium avatar Apr 23 '22 08:04 lilyminium

Ok I see, its possible there are object topologyattrs that aren’t string, so hacking the internet mixin seems good yeah.

On Sat, Apr 23, 2022 at 09:19, Lily Wang @.***> wrote:

@richardjgowers https://github.com/richardjgowers I'm not sure you can blindly enforce dtypes without also giving a preferred actual type; the dtype of string arrays is set to object, which an integer also is. Other people have also wanted to set TopologyAttrs of arbitrary type as well, so we can't assume every object is a string. The fix for strings might need to edit _StringInternerMixin._set_X to just warn or error if a non-string is given.

The TypeError raised in the initial message isn't just because the dtype is wrong, it's because the value setter assumes the given value is either a string or a list. Setting multiple names works fine:

u.atoms[:3].names = [1, 2, 3] u.atoms[:3].names array([1, 2, 3], dtype=object)

Tagging @fortune-max https://github.com/fortune-max here as well as you've opened a PR for the issue :)

— Reply to this email directly, view it on GitHub https://github.com/MDAnalysis/mdanalysis/issues/2764#issuecomment-1107428175, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACGSGBYL4DO2ZGUG322OVC3VGOXBHANCNFSM4N7PQFRA . You are receiving this because you were mentioned.Message ID: @.***>

richardjgowers avatar Apr 23 '22 09:04 richardjgowers

@richardjgowers I'm not sure you can blindly enforce dtypes without also giving a preferred actual type; the dtype of string arrays is set to object, which an integer also is. Other people have also wanted to set TopologyAttrs of arbitrary type as well, so we can't assume every object is a string. The fix for strings might need to edit _StringInternerMixin._set_X to just warn or error if a non-string is given.

The TypeError raised in the initial message isn't just because the dtype is wrong, it's because the value setter assumes the given value is either a string or a list. Then no further type checking is done. Setting multiple names works fine:

>>> u.atoms[:3].names = [1, 2, 3]
>>> u.atoms[:3].names
array([1, 2, 3], dtype=object)

Tagging @fortune-max here as well as you've opened a PR for the issue :)

Okay, nice. A warning would definitely be better. What about the issue of assigning floats to ids as stated above, should that also throw a warning? @richardjgowers

fortune-max avatar Apr 23 '22 12:04 fortune-max

  • @richardjgowers @lilyminium I have gone through the issue description, raised PRs, conversation and from that i understand this issue as to update the _set_X (located here) such that before setting the values we first make sure that the values are of specific type (as requirement).

  • _set_X is called from this classes:

    1. AtomStringAttr (located here)
    2. ResidueStringAttr (located here)
    3. SegmentStringAttr (located here)

My approach would be:

  • Update the _set_X function such that it will check for the required dtype (dynamically with passed dtype to the function).
def _set_X(self, ag, values, required_dtype):
        if not isinstance(values, required_dtype) and not all(isinstance(v, required_dtype) for v in values):
            raise TypeError(f"Values must be of type {required_dtype.__name__}")
  • In the above provided code we check for both multi values(i.e. list) and single values.
  • And when we call _set_X from above three classes then we pass the required dtype there like shown below.
class AtomStringAttr(_StringInternerMixin, AtomAttr):

    @_check_length
    def set_atoms(self, ag, values):
        return self._set_X(ag, values, str)
  • I understand that it is difficult for the maintainer to understand the solution from the message, but i don't want to make PR before clearly understanding the issue and guess for the solution.
  • Did i correctly understand the issue or any thing more i need to.
  • Is my solution is correct or partially correct ?
  • If you find the solution relevant can i make PR ?
  • Also do i need to write any test cases for this functionality if we proceed with methodology.

HeetVekariya avatar Jan 01 '24 10:01 HeetVekariya

@HeetVekariya you've got the string part of it very close! The overall issue here is that TopologyAttrs can have different types (which is set by the dtype property of each subclass, e.g. https://github.com/MDAnalysis/mdanalysis/blob/develop/package/MDAnalysis/core/topologyattrs.py#L540 )

When setting the attribute, we should check that the values match this dtype and raise an error if not. Right now for example, as mentioned in the top comment, setting resid to a float silently converts it to an integer, whereas we probably just want to raise an error instead. So a fix to this issue should work for all cases, i.e. in both the integer/float/etc dtypes, and the string dtypes (more on that below).

My comment (https://github.com/MDAnalysis/mdanalysis/issues/2764#issuecomment-1107428175) is referencing that some TopologyAttrs (e.g. Resnames https://github.com/MDAnalysis/mdanalysis/blob/develop/package/MDAnalysis/core/topologyattrs.py#L2738 ) have dtype object, but are supposed to be strings. Checking if basically anything is an object in Python will return True, so using an isinstance check would be quite ineffective with these string types. Luckily, these string TopologyAttrs are all subclasses of _StringInternerMixin, which uses the ._set_X, so my comment was suggesting that the .set_X method be modified. Since we know that _StringInternerMixin should have strings, therefore we already know that required_dtype argument you've suggested in your solution should be a string, and it's not necessary to pass that in.

Even in the other non-string classes, required_dtype should not be necessary since each class should have a dtype property containing this information already.

Also do i need to write any test cases for this functionality if we proceed with methodology.

Yes, this change should be tested thoroughly with separate cases for different types!

lilyminium avatar Jan 02 '24 22:01 lilyminium

@lilyminium Thank you for guiding me.

  • This is the updated approach,
def _set_X(self, ag, values):
    
    # Type checking logic
    if self.dtype is object:  # Special handling for string attributes
        if not (isinstance(values, str) or all(isinstance(v, str) for v in values)):
            raise TypeError("All values must be strings for string attributes")
    else:
        # Handling for non-string attributes
        if not (isinstance(values, self.dtype) or all(isinstance(v, expected_dtype) for v in values)):
            raise TypeError(f"All values must be of type {self.dtype.__name__}")

# Remaining code of _set_X function
  • Uses class variable dtype, rather than passing argument as required_dtype
  • Object type is handled explicitly to check for string
  • Sorry for asking question like this without creating PR 🙏

Question:

  1. For all three classes mentioned above, does not contains any dtype variables in it, so how it will work ? Does dtype for that classes is defined in the inherited classes of them ?
  2. How can i test my updated code in local for these changes ?

HeetVekariya avatar Jan 03 '24 12:01 HeetVekariya

Sorry for asking question like this without creating PR

No worries @HeetVekariya, it's better to have a clear approach and know what you want to do before creating PRs!

  1. Yes you're right, looks like we've never actually defined dtype in those classes, although subclasses do. I think any solution here can just assume that anything subclassing _StringInternerMixin should be a str. Do keep in mind that _set_X is only in the _StringInternerMixin class and all other non-string TopologyAttr classes would probably have a different solution, likely something to do with set_atoms, set_residues, set_segments etc.
  2. The user guide has a section on building a local development environment you can use to make changes, write tests, and run tests. (although looking over that now, I strongly recommend using mamba instead of conda, which can be very slow).

lilyminium avatar Jan 03 '24 13:01 lilyminium

No worries @HeetVekariya, it's better to have a clear approach and know what you want to do before creating PRs!

Thank you 🙏 :smile:

  • I think we should proceed with below mentioned solution as all the function call for _set_X is from the class which inherits _StringInternerMixin class.
def _set_X(self, ag, values):
        if not isinstance(values, str) and not all(isinstance(v, str) for v in values):
            raise TypeError("All values must be a string")

Question:

  1. You have mentioned set_atoms, set_residues and set_segments, i want to ask if you are talking about methods defined in the above mentioned classes ( set_atoms, set_residues, set_segments ) which makes function call to _set_X or which sets the values like set_atoms, set_residues, set_segments ?
  2. What do you think about my suggestion ?


  • If you talking about methods which makes function call _set_X then,
    • All of them are written for the classes which inherits _StringInternerMixin, so all of them would be having str dtype
    • If not str then we can pass required/expected dtype to _set_X as argument, which will handle it.
  • If i am missing something, can you please provide me information on it.

HeetVekariya avatar Jan 03 '24 14:01 HeetVekariya

@HeetVekariya the reason I bring up set_atoms, set_residues, etc. -- this issue is about all TopologyAttrs defined in topologyattrs.py. Not all of them subclass _StringInternerMixin and not all of them have string dtypes. Therefore, while your solution where you modify _StringInternerMixin._set_X will work for the string TopologyAttrs (Atomnames, etc) and is part of the solution, it won't work for all TopologyAttrs (Resids, etc).

Perhaps the easiest way for you to see what I mean is to write some tests that currently fail, but will pass with the right solution. For example, try the examples listed in the top comment https://github.com/MDAnalysis/mdanalysis/issues/2764#issue-639557421. In both those cases, the code @richardjgowers has given, should produce an error. Perhaps you could use pytest.raises to write a test that expects a particular error to be raised in those scenarios. There are examples of similar tests in the testsuite, e.g. here and here.

Writing tests should pass with your solution, but currently fail, is a great first step in opening a PR as it allows you to check that you're actually fixing the right behaviours -- it's basically test driven development.

lilyminium avatar Jan 03 '24 23:01 lilyminium

  • @lilyminium So i am thinking to make [WIP] PR for this issue as it will require more work, and a PR for 2604 as it does not require much work.
  • Will you allow me for this ?
  • By doing this i might get my hands dirty in writing test cases which will be good.

HeetVekariya avatar Jan 04 '24 04:01 HeetVekariya

@HeetVekariya yes sure, go ahead.

lilyminium avatar Jan 04 '24 22:01 lilyminium

@lilyminium Hi there ! Hope you are doing well.

  • I want your help in understanding the flow of working of the code for this issue.
  • I am trying to, but not able to understand how to tackle it, so can you guide me through a small example, it would be very helpful.

Thank you for your guidance till now.

HeetVekariya avatar Feb 11 '24 08:02 HeetVekariya

@HeetVekariya as @lilyminium said above: start a PR where you write a test that fails. Then go from there. The PR can be small and doesn’t have to solve the problem. But it’s a lot easier for others to comment on existing code than to talk about solutions in theory.

You already have a number of code contributions and are more experienced. The kind of comments that you got above are typically what we would consider sufficient for a developer to get started on a problem.

orbeckst avatar Feb 11 '24 15:02 orbeckst