Python
We have a data set dat with multiple observations per subject. We want to create a subset of this data such that each subject (with ID giving the unique identifier for the subject) contributes the observation where the variable X takes its maximum value for that subject.
R solutions
Hadleyverse solutions
Using the excellent R package dplyr, we can do this with its windowing functions. The following solution, posted on StackOverflow by junkka, gets around the real possibility that multiple observations might share the same maximum value of X by choosing one of them.
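The post's own solutions are in R. As a rough Python counterpart, the same per-subject selection can be sketched with only the standard library. The sample records, the `visit` field, and the function name `max_row_per_subject` are made up for illustration. Ties are resolved by keeping the first row seen, which is one way of "choosing one of them".

```python
def max_row_per_subject(rows, id_key="ID", value_key="X"):
    """Return one row per subject: the row where value_key is largest.

    If several rows tie for the maximum, the first one encountered is kept.
    """
    best = {}
    for row in rows:
        subject = row[id_key]
        # Strict '>' keeps the earliest row when values tie.
        if subject not in best or row[value_key] > best[subject][value_key]:
            best[subject] = row
    return list(best.values())


# Hypothetical data: two subjects, several observations each.
dat = [
    {"ID": 1, "X": 3.0, "visit": "a"},
    {"ID": 1, "X": 7.5, "visit": "b"},
    {"ID": 2, "X": 4.2, "visit": "a"},
    {"ID": 2, "X": 4.2, "visit": "b"},  # ties with the previous row
    {"ID": 1, "X": 7.5, "visit": "c"},  # ties with visit "b"
]

subset = max_row_per_subject(dat)
```

Each subject ends up with exactly one row in `subset`, even when the maximum of X is not unique.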
Practical Data Science Cookbook
My friends Sean Murphy, Ben Bengfort, Tony Ojeda and I recently published a book, Practical Data Science Cookbook. All of us are heavily involved in developing the data community in the Washington DC metro area, serving on the Board of Directors of Data Community DC. Sean and Ben co-organize the meetup Data Innovation DC and I co-organize the meetup Statistical Programming DC.
Our intention in writing this book is to provide the data practitioner some guidance about how to navigate the data science pipeline, from data acquisition to final reports and data applications.
I had a recent request to convert an entire folder of JPEG images into EPS or similar vector graphics formats. The client was on a Mac, and didn’t have ImageMagick. I found the Python Imaging Library (PIL) enormously useful for this: it let me implement the conversion in around 10 lines of Python code!
import Image
from glob import glob
jpgfiles = glob('*.jpg')
for u in jpgfiles:
    out = u.
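Since the excerpt above is cut off, here is a minimal sketch of how such a conversion loop might look. It is not the original script: it assumes Pillow, the maintained fork of PIL (imported as `from PIL import Image` rather than the older `import Image`), and the helper names `eps_name` and `convert_folder` are hypothetical.

```python
import os
from glob import glob


def eps_name(jpg_path):
    """Swap the file extension for .eps, keeping the directory and base name."""
    root, _ext = os.path.splitext(jpg_path)
    return root + ".eps"


def convert_folder(folder="."):
    """Convert every *.jpg file in folder to EPS and return the output paths."""
    # Imported here so the naming helper above works without Pillow installed.
    from PIL import Image

    written = []
    for jpg_path in sorted(glob(os.path.join(folder, "*.jpg"))):
        out_path = eps_name(jpg_path)
        with Image.open(jpg_path) as img:
            # The EPS writer supports only some image modes, so normalise to RGB.
            img.convert("RGB").save(out_path, "EPS")
        written.append(out_path)
    return written
```

Calling `convert_folder("some/folder")` would write one `.eps` file next to each `.jpg`. Note that the resulting EPS wraps the original pixels rather than tracing them into vector shapes.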
Excel is unfortunately the lingua franca of data delivery (at least in small amounts) from my collaborators. Often I have to merge several disparate bits of information from several Excel files together. I used to do this using R, since that’s what I’ve known for many years.
Now, of course, I’ve discovered Python!!! Fortunately, I found the excellent xlrd and xlwt packages by John Machin, and the later xlutils package.
I’ve been fighting for some time to get Genz-Bretz’s method for calculating orthant probabilities in multivariate normal distributions imported into Python. I downloaded the Fortran code from Alan Genz’s site and was unsuccessful in using f2py to link it with Python. However, I discovered the usefulness of the Python ctypes module in linking with shared libraries. So, I compiled the Fortran code using
gfortran mvtdstpack.f -shared -o libmvt.
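The general ctypes pattern for calling into a gfortran-built shared library can be sketched as follows. This is an illustration under stated assumptions, not the post's actual code: it assumes gfortran's default symbol naming (lower case with a trailing underscore), that Fortran arguments are passed by reference, and that default INTEGER and DOUBLE PRECISION map to `c_int` and `c_double`. The helper names are hypothetical, and the real routine's argument list has to be taken from the Fortran source.

```python
import ctypes


def load_fortran_routine(lib_path, name):
    """Load a shared library and look up a Fortran routine by name.

    Assumes gfortran's default name mangling: lower case plus a trailing
    underscore.
    """
    lib = ctypes.CDLL(lib_path)
    return getattr(lib, name.lower() + "_")


def fortran_int(value):
    """Wrap a Python int so it can be passed by reference via ctypes.byref."""
    return ctypes.c_int(value)


def fortran_double_array(values):
    """Copy a sequence of floats into a contiguous C double array."""
    return (ctypes.c_double * len(values))(*values)


# Example argument preparation (runs without the compiled library).
n = fortran_int(3)
lower = fortran_double_array([-1.0, -1.0, -1.0])
upper = fortran_double_array([1.0, 2.0, 3.0])
```

With the library compiled, a call would pass scalars as `ctypes.byref(n)` and arrays such as `lower` and `upper` directly, since ctypes hands arrays over as pointers. Output values come back through the scalars and arrays that were passed in.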
I seem to be doing more and more with Python for work, over and above using it as a generic scripting language. R has been my workhorse for analysis for a long time (15+ years in various incarnations of S+ and R), but it still has some deficiencies. I’m finding Python easier and faster to work with for large data sets. I’m also a bit happier with Python’s graphical capabilities via matplotlib, which allows dynamic updating of graphs à la Matlab, something R lacks despite its great graphical capabilities.