Complexity of David

Data Science, Machine Learning, Artificial Intelligence, Visualization, and Complex Systems.

Working With Data and Python

Working with data, in any form, is the 21st-century toolkit for any computation-based job. Although I'm a fan of R for much of my data analysis work, Python comes in a close second. If you favor Python, there are some aspects to consider:

  • Using pandas: see "A Gentle Visual Intro to Data Analysis in Python Using Pandas".
  • numpy: a workhorse of data manipulation. If you have data that needs linear algebra, you'll use numpy, even if only as a dependency of another package (e.g. pandas uses numpy). Check the numpy quick start if you're starting out. Being able to interact with C, C++, or Fortran routines is also a must.
  • If you need to use TensorFlow, Python has you covered.
  • Working with Jupyter Notebooks: everybody seems to love them. I use them for EDA (exploratory data analysis) and for prototyping small ideas, but not for long, complicated sessions. Keeping track of the state of the computation becomes a burden in long sessions, partly because you can re-run code cells in an order different from the one in which they appear on the page. You therefore have to consider whether the output you are getting is affected by code that was run in another cell. But if you can't code in a code editor, coding in a browser might work for you. The most interesting thing in Jupyter is the range of kernels. They let you experiment with other languages in a unified interface.
  • scikit-learn: another ML library that runs on top of numpy. See a trend here?
  • All this data is nice, but you need to visualize it or it won't be of much help. For this, Matplotlib is king. Almost every package will ask you to import matplotlib.pyplot as plt so you can plot your data with Python. It has come a long way and is a mature plotting solution.
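
To make the numpy and pandas points a bit more concrete, here is a minimal sketch. The matrix, the vector, and the small table are made-up values chosen for illustration only; they do not come from any real dataset.

```python
import numpy as np
import pandas as pd

# Linear algebra with numpy: solve the system A @ x = b.
# A and b are arbitrary illustrative values.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)

# A tiny, hypothetical table handled with pandas.
df = pd.DataFrame({"group": ["a", "a", "b"],
                   "value": [1.0, 3.0, 10.0]})
means = df.groupby("group")["value"].mean()

# pandas hands the column back as a numpy array,
# which is the "runs on top of numpy" trend in practice.
values = df["value"].to_numpy()
```

Here x holds the solution of the system, means holds the per-group average, and values is a plain numpy array you can pass to any numpy routine.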
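
The notebook state problem mentioned above can be imitated in plain Python. This is only a rough sketch under stated assumptions: the three "cells" and their names are hypothetical, and a dictionary stands in for the kernel's namespace. It is not how Jupyter is implemented.

```python
# Each string stands in for one notebook cell (hypothetical contents).
cells = {
    "load": "data = [1, 2, 3]",
    "scale": "data = [x * 10 for x in data]",
    "report": "total = sum(data)",
}

def run(order):
    """Execute the named cells in the given order against one shared namespace."""
    state = {}  # plays the role of the kernel's memory
    for name in order:
        exec(cells[name], state)
    return state["total"]

# Running the cells top to bottom, once each.
in_order = run(["load", "scale", "report"])

# Re-running the "scale" cell by accident before reporting.
rerun = run(["load", "scale", "scale", "report"])
```

The page looks identical in both cases, yet in_order and rerun differ because the second run scaled the data twice. That is the kind of bookkeeping that gets tiring in a long session.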