Initial Stages
We received an email early one morning:
I’m looking for a consultant to help design a Shiny program to allow users to create simple flexible tables from household survey data, perhaps mimicking some of the functionality of Statwing.
This particular client was the research department of a global banking group, which was a pretty impressive sign of the strides Shiny had made in attracting users with big footprints.
Over the next few emails, we learned more about the requirements. The application needed to allow the user to:
- Define the universe through filters (based on a variable equaling a value, falling in a range, or containing or not containing a set of categories – the default filter variables would be country and year, but users could impose other conditions like employment status = working or geographic region = Bihar)
- Define a row variable, including constructing new categorical variables from the underlying data
- Define a column variable
- Optionally, define superrows and perhaps supercolumns.
- Define the contents of cells – mean, median, standard deviation, Gini coefficient, or share of the row or column total.
- Select pre-defined tables from a list of presets; choosing a preset would automatically fill in the parameters needed to build that table.
- Copy tables easily into Excel, where users could make graphs (or, if time allowed, have R draw nice graphs directly).
That’s a pretty good start! We got down to business drawing up a prototype, and the results were a solid starting point. A stripped-down sketch of the core idea is below.
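To give a flavor of the approach, here is a minimal, hypothetical sketch of that core idea in Shiny: filter the universe, pick row and column variables, and choose a cell statistic. The data frame `svy` and its columns are stand-ins for illustration, not the client’s actual data.

```r
library(shiny)

# Stand-in for the household survey data
svy <- data.frame(
  country = sample(c("IN", "BR", "KE"), 1000, replace = TRUE),
  year    = sample(2010:2015, 1000, replace = TRUE),
  region  = sample(c("Urban", "Rural"), 1000, replace = TRUE),
  income  = rlnorm(1000, meanlog = 8)
)

ui <- fluidPage(
  selectInput("country", "Country (universe filter)",
              choices = sort(unique(svy$country))),
  selectInput("row_var", "Row variable", choices = c("region", "year")),
  selectInput("col_var", "Column variable", choices = c("year", "region")),
  selectInput("stat", "Cell statistic", choices = c("mean", "median")),
  tableOutput("tab")
)

server <- function(input, output, session) {
  # Universe: the filtered subset of the survey data
  universe <- reactive(subset(svy, country == input$country))

  output$tab <- renderTable({
    u   <- universe()
    fun <- switch(input$stat, mean = mean, median = median)
    # Cross-tabulate the chosen statistic of income by the row/column variables
    tapply(u$income, list(u[[input$row_var]], u[[input$col_var]]), fun)
  }, rownames = TRUE)
}

shinyApp(ui, server)
```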
Moving Online
With a working local version, it was time to move the application onto the bank’s development environment servers.
After a quick and productive meeting with the IT staff, we agreed to run Shiny Server on a CentOS system. They installed Shiny Server along with the required packages, and once we had access, deployment was a breeze.
Further Development
The client needed to make this tool clearly advantageous over other methods for summarizing data – if it wasn’t, why bother? We talked through several really cool additional features:
- Defining custom variables within the application. For example, it can be difficult to analyze by age since there are so many unique values. But if you can quickly divide age into quartiles or deciles on the same screen, that becomes a much simpler task (a small sketch of this follows the list).
- Saving and restoring tables. This wasn’t designed as a persistence layer so much as a way to rapidly share table specifications. We decided to output all table inputs (including custom variable definitions) as a JSON string, so a user could quickly send it to a colleague to replicate the table. No screen sharing or tedious input-checking required. (A sketch of this also follows the list.)
- Pre-defined table library. Several tables in particular would be commonly used by the research group, so we decided to define them formally such that they could be generated with a single click. This dramatically sped up their workflow.
- But we didn’t want every user to be able to edit the library, so we attached an administrator authentication system. It’s nothing fancy, but it keeps everyday users from breaking things.
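For the custom-variable feature mentioned above, the core operation is bucketing a continuous variable into quantile groups. A minimal sketch of that idea, with a made-up helper name:

```r
# Bucket a continuous variable into n quantile groups (e.g. quartiles, deciles).
# unique() guards against duplicated break points in lumpy distributions.
make_quantile_groups <- function(x, n_groups = 4) {
  breaks <- quantile(x, probs = seq(0, 1, length.out = n_groups + 1), na.rm = TRUE)
  cut(x, breaks = unique(breaks), include.lowest = TRUE)
}

age <- sample(18:90, 500, replace = TRUE)
table(make_quantile_groups(age, 4))   # quartiles
table(make_quantile_groups(age, 10))  # deciles
```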
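And for saving and restoring tables, the idea is simply to round-trip the table inputs through JSON. The field names below are invented for illustration, and the sketch assumes the jsonlite package:

```r
library(jsonlite)

# A table specification: everything needed to rebuild the table from scratch
spec <- list(
  filters     = list(country = "IN", year = 2015),
  row_var     = "age_quartile",
  col_var     = "employment_status",
  statistic   = "median",
  custom_vars = list(age_quartile = list(source = "age", n_groups = 4))
)

json_spec <- toJSON(spec, auto_unbox = TRUE)  # compact string to paste into an email
restored  <- fromJSON(json_spec)              # colleague rebuilds the same inputs
```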
The client also wanted to focus on speed, which was an issue given the size of the data sets to be analyzed, so we made two main improvements.
First, we pre-loaded the data into memory on application startup and changed the application idle timeout to the maximum in the Shiny Server configuration file.
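A sketch of the preloading pattern, assuming the data is stored as an .rds file (the path and object name are placeholders): anything in global.R runs once per R process and is then shared by every session that process serves.

```r
# global.R (runs once per R process, before ui/server are evaluated)
svy <- readRDS("data/survey.rds")  # placeholder path

# The idle timeout itself is not R code: raising app_idle_timeout in
# /etc/shiny-server/shiny-server.conf keeps the R process, and the data it
# holds in memory, alive between uses instead of being torn down.
```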
Second, we translated all data.frame operations to use the data.table package. Wow. The hype was real. Summary stat calculations that took 15-20 seconds using base R data.frame operations were now running in less than 1 second.
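To illustrate the kind of translation involved (column names and data are made up, and the exact speedup will depend on the data):

```r
library(data.table)

# Fake survey-sized data
df <- data.frame(
  region = sample(c("Urban", "Rural"), 1e6, replace = TRUE),
  year   = sample(2010:2015, 1e6, replace = TRUE),
  income = rlnorm(1e6, meanlog = 8)
)

# Base R: grouped summary with aggregate()
agg_base <- aggregate(income ~ region + year, data = df, FUN = mean)

# data.table: same summary, with grouping and aggregation done in optimized C
dt <- as.data.table(df)
agg_dt <- dt[, .(mean_income   = mean(income),
                 median_income = median(income)),
             by = .(region, year)]
```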
Lessons Learned
As of now, the project is on hold as the client’s management decides whether to roll this out to bigger and publicly-accessible data sets. We’ll see! We’ve been talking about the possibility of converting it to openCPU to simplify and reduce the cost of scaling – could be pretty cool.
But still, several lessons came out of this:
- Collaboration matters. Researchers rarely work alone, and even if they do, they want to show other people their work. It’s much easier to confer about a given analysis by sending a short text string to a colleague rather than going through inputs one by one to make sure you’re accurately replicating the work.
- Keep working on the UI, even when you know it well. The client never gave up on making the UI flow naturally and efficiently. We spent a lot of time discussing proper labeling, ordering, and what should be hidden until explicitly required by the user, and those decisions really improved the final product.
- Focus on speed before you need it. We tested with a tiny subset of the full data, which made things easy…until we needed to convert everything to data.table syntax on a relatively short schedule. Again, data.table is great – but it’s even better when you have time to get it right the first time.