Computation

Some thoughts on the downsides of current Data Science practice

Bert Huang has a nice blog talking about poor results of ML/AI algorithms in “wild” data, which echos some of my experience and thoughts. His conclusions are worth thinking about, IMO. 1. Big data is complex data. As we go out and collect more data from a finite world, we’re necessarily going to start collecting more and more interdependent data. Back when we had hundreds of people in our databases, it was plausible that none of our data examples were socially connected.

The need for documenting functions

My current work usually requires me to work on a project until we can submit a research paper, and then move on to a new project. However, 3-6 months down the road, when the reviews for the paper return, it is quite common to have to do some new analyses or re-analyses of the data. At that time, I have to re-visit my code! One of the common problems I (and I’m sure many of us) have is that we tend to hack code and functions with the end in mind, just getting the job done.

Converting images in Python

I had a recent request to convert an entire folder of JPEG images into EPS or similar vector graphics formats. The client was on a Mac, and didn’t have ImageMagick. I discovered the Python Image Library to be enormously useful in this, and allowed me to implement the conversion in around 10 lines of Python code!!! import Image from glob import glob jpgfiles = glob(’*.jpg’) for u in jpgfiles: out = u.

SAS, R and categorical variables

One of the disappointing problems in SAS (as I need PROC MIXED for some analysis) is to recode categorical variables to have a particular reference category. In R, my usual tool, this is rather easy both to set and to modify using the relevel command available in base R (in the stats package). My understanding is that this is actually easy in SAS for GLM, PHREG and some others, but not in PROC MIXED.

Forest plots using R and ggplot2

Forest plots are most commonly used in reporting meta-analyses, but can be profitably used to summarise the results of a fitted model. They essentially display the estimates for model parameters and their corresponding confidence intervals. Matt Shotwell just posted a message to the R-help mailing list with his lattice-based solution to the problem of creating forest plots in R. I just figured out how to create a forest plot for a consulting report using ggplot2.

A small customization of ESS

JD Long (at Cerebral Mastication) posted a question on Twitter about an artifact in ESS, where typing “” gets you “<-“. This is because in the early days of S+, “” was an allowed assignment operator, and ESS was developed in that era. Later, it was disallowed in favor of “<-” and “=”, so ESS was modified to map “_” to “<-“. Now I like the typing convenience of this map, and I don’t use underscores in my variable names, so I was fine.

Quick and dirty parallel processing in R

R has some powerful tools for parallel processing, which I discovered while searching for ways to fully utilize my 8-core computer at work. What surprised me is how easy it is…about 6 lines of code, if that. Given that I wasn’t allowed to install heavy duty parallel-processing systems like MPICH on the computer, I found that the library SNOW fit the bill nicely through its use of sockets. I also discovered the libraries foreach and iterators, which were released to the community by the development team at Revolution R.

Floating point pitfalls

Easy (?) way to tack Fortran onto Python

Workflow with Python and R

I seem to be doing more and more with Python for work over and above using it as a generic scripting language. R has been my workhorse for analysis for a long time (15+ years in various incarnations of S+ and R), but it still has some deficiencies. I’m finding Python easier and faster to work with for large data sets. I’m also a bit happier with Python’s graphical capabilities via matplotlib, which allows dynamic updating of graphs _a la _Matlab, another drawback that R has despite great graphical capabilities.