examples-in-python sample code errors

@thoughtfulml, this book looks very promising. As far as I can tell, the sample code for chapters 3 and 4 does not appear to be the final working version; running it produces errors. This is true for the code in the book and on GitHub. I did not look at the rest of the chapters.

It would be great if you would make sure that the correct version of the code is on GitHub.

Apr 15 '17 15:04 saun4app

Hey there,

so I've been working up some errata changes for the second pressing of the book. Is there anything specific that you can identify or is it a general comment?

Thanks for sending me a note, really appreciate it!

-Matt

Apr 19 '17 14:04 hexgnu

@hexgnu,

Thanks for your reply. I was not very specific because running the unit tests (as the book advises) should identify the problems.

Chapter 3: the re-assignment of the variable df (dataframe) seems to be problematic (or confusing):

df = (df - df.mean()) / (df.max() - df.min()) # is the result (df) a dataframe?
followed by self.df = df # dataframe


class Regression:
  def __init__(self, csv_file = None, data = None, values = None):
    if (data is None and csv_file is not None):
      df = pd.read_csv(csv_file)
      self.values = df['AppraisedValue']
      df = df.drop('AppraisedValue', 1)
      df = (df - df.mean()) / (df.max() - df.min())
      self.df = df
      self.df = self.df[['lat', 'long', 'SqFtLot']]

    elif (data is not None and values is not None):
      self.df = data
      self.values = values
    else:
      raise ValueError("Must have either csv_file or data set")

    self.n = len(self.df)
    self.kdtree = KDTree(self.df)
    self.metric = np.mean
    self.k = 5

Chapter 4: the variable part is undefined.

import email
from BeautifulSoup import BeautifulSoup

class EmailObject:
  def __init__(self, filepath, category = None):
    self.filepath = filepath
    self.category = category
    self.mail = email.message_from_file(self.filepath)

  def subject(self):
    return self.mail.get('Subject')

  def body(self):
    content_type = part.get_content_type()
    body = part.get_payload(decode=True)

    if content_type == 'text/html':
     return BeautifulSoup(body).text
    elif content_type == 'text/plain':
     return body
    else:
     return ''

Apr 19 '17 14:04 saun4app

Hey @saun4app, what version of Python are you using? I've only tested this on 2.7 and have been planning on updating things to work on 3.5 for the reprint.

Also yea, I know about some of these errors; they have been submitted to the errata, which I'm batching together to fix all at once.

Thanks!

Apr 25 '17 14:04 hexgnu

df = (df - df.mean()) / (df.max() - df.min()) # this is a pandas dataframe
self.df = df # dataframe # this is also a dataframe

df.mean(), df.max(), and df.min() each return one value per column (a pandas Series, not a single scalar), so this ends up being a per-column normalization and the result is still a dataframe. Does that make sense? I can point to it more clearly in the book for the next reprint.
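
A quick way to check what that normalization line returns is to run it on a tiny dataframe. This is only a sketch: the numbers are made up, and only the column names (lat, long, SqFtLot) come from the thread. In pandas, mean(), max(), and min() on a dataframe return one value per column, and the arithmetic aligns on column labels.

```python
import pandas as pd

# Made-up stand-in data; only the column names come from the thread.
df = pd.DataFrame({
    "lat": [47.5, 47.6, 47.7],
    "long": [-122.3, -122.2, -122.1],
    "SqFtLot": [5000.0, 7500.0, 10000.0],
})

col_means = df.mean()             # Series: one mean per column
col_ranges = df.max() - df.min()  # Series: one range per column

# Column labels align, so each column is centered and scaled independently.
normalized = (df - col_means) / col_ranges

print(type(col_means).__name__)
print(type(normalized).__name__)
print(normalized)
```

After this, every column is centered on zero and spans a range of one, and the result can be assigned to self.df as a dataframe.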

Apr 25 '17 14:04 hexgnu

Also where did you get the code for the naive bayesian part?

In this repo I see:

  def body(self):
    payload = self.mail.get_payload()
    parts = []
    if self.mail.is_multipart():
      parts = [self.single_body(part) for part in list(payload)]
    else:
      parts = [self.single_body(self.mail)]
    return self.CLRF.join(parts)
      
  def single_body(self, part):
    content_type = part.get_content_type()
    body = part.get_payload(decode=True)

    if content_type == 'text/html':
      return BeautifulSoup(body).text 
    elif content_type == 'text/plain':
      return body
    else:
      return ''

Are you cutting and pasting directly out of the book?
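
For anyone on Python 3.x, here is a rough sketch of the same multipart walk using only the standard library. It is not the repo's code: the functions are free-standing rather than methods, the HTML stripping uses html.parser as a crude stand-in for BeautifulSoup, the separator is a guess (the repo joins with self.CLRF), and the sample message is made up.

```python
import email
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes; a crude stand-in for BeautifulSoup(...).text."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def single_body(part):
    """Return the text of one message part, or '' for other content types."""
    content_type = part.get_content_type()
    payload = part.get_payload(decode=True)  # bytes, or None for multipart
    if payload is None:
        return ""
    charset = part.get_content_charset() or "utf-8"
    text = payload.decode(charset, errors="replace")
    if content_type == "text/html":
        extractor = _TextExtractor()
        extractor.feed(text)
        extractor.close()
        return "".join(extractor.chunks)
    if content_type == "text/plain":
        return text
    return ""


def body(mail):
    """Join the text of every top-level part (separator is an assumption)."""
    if mail.is_multipart():
        parts = [single_body(p) for p in mail.get_payload()]
    else:
        parts = [single_body(mail)]
    return "\n\n".join(parts)


# Made-up two-part message for illustration.
RAW = (
    "Subject: hello\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "plain text\n"
    "--XX\n"
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    "<p>html <b>text</b></p>\n"
    "--XX--\n"
)

mail = email.message_from_string(RAW)
print(mail.get("Subject"))
print(body(mail))
```

Note that under Python 3, get_payload(decode=True) returns bytes, so each part has to be decoded before it is compared or joined as text.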

Apr 25 '17 14:04 hexgnu

@hexgnu, I am using Python 3.x. The code is from the book (between Table 4-3 and Figure 4-3).

Apr 25 '17 16:04 saun4app

yea I've been planning on updating so that everything works under 3.x.

Do you want me to ping you when everything is updated?

Might take me a month, but I've gotten enough feedback about everybody using 3.x that it seems warranted to get it working under both. I'll probably add Travis CI to this to test the different repos as well.

Apr 25 '17 16:04 hexgnu

@hexgnu, I appreciate that you are invested in keeping the book correct and current. Please keep me posted on your progress. Using Travis CI for the sample code is a great idea. Please let me know if/when you switch to a different repo. I hope to be able to access the updated, working examples before the entire book is updated. Thank you very much.

Apr 25 '17 16:04 saun4app

Yea, the GitHub repo will be kept up to date. As I said in the book, look to the GitHub repo. Unfortunately print doesn't have a way of accepting git changes yet ;)

What did you think about the book otherwise? Did you like it? I'm looking for feedback to roll into the next reprint. Also feel free to drop me an e-mail if you want ([email protected]).

Apr 25 '17 16:04 hexgnu

Matt, your book looks very promising. You appear to be a thoughtful, principled, and knowledgeable programmer. At the moment, I am disappointed by obstacles that prevent simple examples from running (or compiling), which is inconsistent with the "Writing Software Right" principle. I have not spent any time past Chapter 4. Now that you have pointed out that the repo has a working version of Chapter 4 (Naive Bayesian Classification), I will take another look at Chapter 4 and get back to you.

Apr 25 '17 16:04 saun4app

@hexgnu Regarding Chapter 3, there are two issues preventing this from working with Python 3.x:

Several lines in the data set have a string like "Geocoder::Result::Bing" where the latitude/longitude numbers should be (if you run df.info() the lat/long fields show up as objects instead of floats!). It's easy enough to fix this using a different Python script to write to a new data set file omitting the bogus lines. I'm guessing Pandas in Python 3.x isn't skipping the non-numeric data in these lines even though the documentation suggests that it should be.
Line 63 in the sample code should instead read: test_rows = random.sample(self.df.index.tolist(), int(round(len(self.df) * holdout)))
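
Both fixes can be sketched roughly as follows. This is not the script I actually used: the inline CSV is a made-up miniature of the data set, load_clean and holdout_rows are hypothetical helper names, and coercing with pd.to_numeric is just one way to drop the bogus lines.

```python
import io
import random

import pandas as pd

# Hypothetical miniature of the data set; the real file has more rows.
RAW_CSV = """lat,long,SqFtLot,AppraisedValue
47.61,-122.33,5000,400000
Geocoder::Result::Bing,Geocoder::Result::Bing,6000,450000
47.55,-122.29,7200,380000
47.70,-122.35,4100,520000
"""


def load_clean(source):
    """Read the CSV and drop rows whose lat/long are not numeric."""
    df = pd.read_csv(source)
    for col in ("lat", "long"):
        # Non-numeric strings become NaN instead of forcing an object column.
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df.dropna(subset=["lat", "long"]).reset_index(drop=True)


def holdout_rows(df, holdout):
    """Pick a random subset of index labels for the test split."""
    n_test = int(round(len(df) * holdout))  # sample size must be an int
    return random.sample(df.index.tolist(), n_test)


df = load_clean(io.StringIO(RAW_CSV))
print(df.dtypes)

test_rows = holdout_rows(df, 0.34)
print(len(df), len(test_rows))
```

Passing a plain list and an integer count to random.sample avoids depending on how a given Python version treats a pandas index or a float sample size.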

After making the changes above, I get a nice shiny graph that slightly varies from the one in the book. Happy to send you a PR if you like. Also, it might be worth mentioning how to run this code. I added the following to the end of my script file:

if __name__ == '__main__':
    r = Regression("king_county_data_geocoded_fixed.csv")
    r.plot_error_rates()

May 14 '17 08:05 nwautomator

@nwautomator a PR would definitely be welcome!

I'm going to actually be going through all these examples soon to cut a reissue of the book.

thanks!

May 14 '17 15:05 hexgnu