Data Science
Quirks about running Rcpp on Windows through RStudio
This is a quick note about some tribulations I had running Rcpp (v. 0.12.12) code through RStudio (v. 1.0.143) on a Windows 7 box running R (v. 3.3.2). I also have RTools v. 3.4 installed. I fully admit that this may very well be specific to my box, but I suspect not.
I kept running into problems with Rcpp complaining that (a) RTools wasn’t installed, and (b) the C++ compiler couldn’t find Rcpp.
Bert Huang has a nice blog post about the poor results of ML/AI algorithms on “wild” data, which echoes some of my experience and thoughts. His conclusions are worth thinking about, IMO.
1. Big data is complex data. As we go out and collect more data from a finite world, we’re necessarily going to start collecting more and more interdependent data. Back when we had hundreds of people in our databases, it was plausible that none of our data examples were socially connected.
I was recently asked to do a panel of grouped boxplots of a continuous variable, with each panel representing a categorical grouping variable. This seems easy enough with ggplot2 and the facet_wrap function, but then my collaborator wanted p-values on the graphs! This post is my approach to the problem.
First of all, one caveat. I’m a huge fan of Hadley Wickham’s tidyverse and so most of my code will reflect this ethos, including packages and pipes.
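A minimal sketch of the general idea, under several assumptions that are mine rather than taken from the original analysis: the built-in ToothGrowth data stands in for the real data, a Welch t-test stands in for whatever test the collaborator wanted, and the p-values are drawn with geom_text from a small per-panel summary table.

```r
library(dplyr)
library(ggplot2)

# Illustrative data only: tooth length (continuous) by supplement (two groups),
# with one panel per dose level.
pvals <- ToothGrowth %>%
  group_by(dose) %>%
  summarise(p = t.test(len[supp == "OJ"], len[supp == "VC"])$p.value) %>%
  mutate(label = paste0("p = ", signif(p, 2)))

# One row of `pvals` per panel; facet_wrap matches each label to its panel
# through the shared `dose` column.
p <- ggplot(ToothGrowth, aes(x = supp, y = len)) +
  geom_boxplot() +
  facet_wrap(~ dose) +
  geom_text(
    data = pvals,
    aes(x = 1.5, y = Inf, label = label),  # centred between the two boxes, at the top
    vjust = 1.5,
    inherit.aes = FALSE
  )

print(p)
```

The key design choice is computing the p-values in a separate data frame that carries the faceting variable, so ggplot2 places each label in the right panel without any manual positioning per facet.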
Last month I published some thoughts on crowdsourcing research, inspired by Anthony Goldbloom’s talk at Statistical Programming DC on the Kaggle experience. Today, I found a rather similar discussion (in the online version of the magazine Good) of crowdsourcing research as a potential way to increase the accuracy of scientific research and reduce bias. I think academia, funding agencies, journals, and consumers of scientific and technological research all need to give more consideration to breaking silos and making progress accurate and reproducible, and to finding new ways of preserving the profit imperative in technological progress that still allow for the sharing and crowdsourcing of knowledge and research progress.
Last evening, Anthony Goldbloom, the founder of Kaggle.com, gave a very nice talk at a joint Statistical Programming DC/Data Science DC event about the Kaggle experience and what can be learned from the results of their competitions. One of the takeaway messages was that crowdsourcing data problems to a diligent and motivated group of entrepreneurial data scientists can get you to the threshold of extracting signal and patterns from data far more quickly than if a closed and siloed group of analysts worked on the problem.
This is an update to a previous post on reading fixed width formats in R.
A new addition to the Hadleyverse is the package readr, which includes a function read_fwf to read fixed width format files. I’ll compare the LaF approach to the readr approach using the same dataset as before. The variable wt is generated from parsing the Stata load file as before.
I want to read all the data in two columns: DRG and HOSPID.
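As a rough sketch of what selecting just two fields looks like with readr, here is a self-contained toy example. The file contents and the column positions are made up for illustration; in the real dataset the start and end positions for DRG and HOSPID would come from the parsed Stata load file.

```r
library(readr)

# Toy fixed-width file: columns 1-3 hold DRG, column 4 is ignored,
# columns 5-8 hold HOSPID. These positions are hypothetical.
tmp <- tempfile(fileext = ".txt")
writeLines(c("001A1234", "002B5678"), tmp)

# fwf_positions() names only the fields we want; everything else is skipped.
dat <- read_fwf(
  tmp,
  col_positions = fwf_positions(
    start = c(1, 5),
    end = c(3, 8),
    col_names = c("DRG", "HOSPID")
  ),
  col_types = "cc"  # read both as character so leading zeros are kept
)

print(dat)
```

Specifying positions rather than widths means there is no need to describe the columns that are being skipped.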