Personal expressiveness, or how data is stored in a spreadsheet When you get data from a broad research community, the variability in how that data is formatted and stored is truly astonishing. Of course there are the standardized formats that are output from machines, like Next Generation Sequencing and other automated systems. That is a saving grace!
But for smaller data, or data collected in the lab, the possibilities are truly endless!
Quirks about running Rcpp on Windows through RStudio This is a quick note about some tribulations I had running Rcpp (v. 0.12.12) code through RStudio (v. 1.0.143) on a Windows 7 box running R (v. 3.3.2). I also have RTools v. 3.4 installed. I fully admit that this may very well be specific to my box, but I suspect not.
I kept running into problems with Rcpp complaining that (a) RTools wasn’t installed, and (b) the C++ compiler couldn’t find Rcpp.
Chris Moffit has a nice blog on how to use the transform function in pandas. He provides some (fake) data on sales and asks the question of what fraction of each order is from each SKU.
Being a R nut and a tidyverse fan, I thought to compare and contrast the code for the pandas version with an implementation using the tidyverse.
First the pandas code:
import pandas as pd dat = pd.
The package ReporteRs has been getting some play on the interwebs this week, though it’s actually been around for a while. The nice thing about this package is that it allows writing Word and PowerPoint documents in an OS-independent fashion unlike some earlier packages. It also allows the editing of documents by using bookmarks within the documents.
This quick note is just to remind me that the structure of ReporteRs works beautifully with the piping conventions of magrittr.
We have a data set dat with multiple observations per subject. We want to create a subset of this data such that each subject (with ID giving the unique identifier for the subject) contributes the observation where the variable X takes it’s maximum value for that subject.
R solutions Hadleyverse solutions Using the excellent R package dplyr, we can do this using windowing functions included in dplyr. The following solution is available on StackOverflow, by junkka, and gets around the real possibility that multiple observations might have the same maximum value of X by choosing one of them.
If you have ever worked with US government data or other large datasets, it is likely you have faced fixed-width format data. This format has no delimiters in it; the data look like strings of characters. A separate format file defines which columns of data represent which variables. It seems as if the format is from the punch-card era, but it is quite an efficient format to store large data in (see this StackOverflow discussion).
Practical Data Science Cookbook My friends Sean Murphy, Ben Bengfort, Tony Ojeda and I recently published a book, Practical Data Science Cookbook. All of us are heavily involved in developing the data community in the Washington DC metro area, serving on the Board of Directors of Data Community DC. Sean and Ben co-organize the meetup Data Innovation DC and I co-organize the meetup Statistical Programming DC.
Our intention in writing this book is to provide the data practitioner some guidance about how to navigate the data science pipeline, from data acquisition to final reports and data applications.
My current work usually requires me to work on a project until we can submit a research paper, and then move on to a new project. However, 3-6 months down the road, when the reviews for the paper return, it is quite common to have to do some new analyses or re-analyses of the data. At that time, I have to re-visit my code!
One of the common problems I (and I’m sure many of us) have is that we tend to hack code and functions with the end in mind, just getting the job done.
About 3 years ago I published some code on this blog to draw a Kaplan-Meier plot using ggplot2. Since then, ggplot2 has been updated (from 0.8.9 to 0.9.3.1) and has changed syntactically. Since that post, I have also become comfortable with Git and Github. I have updated the code, edited it for a small error, and published it in a Gist. This gist has two functions, ggkm (basic Kaplan-Meier plot) and ggkmTable (enhanced Kaplan-Meier plot with table showing numbers at risk at various times).
I have always been provided SAS as part of my job, so I never really realized how much it cost. I’ve bought Stata before, and of course R :). I recently found out how much a reasonable bundle of SAS modules along with base SAS costs per year per seat, at least under the GSA. I tried finding out how much IBM SPSS is for a comparable bundle, but their web page was “not available”.