Shiny, DataTables, and replaceData

Some of the Shiny apps we’ve developed for a financial trading firm have pretty consistent requirements:

  1. Show a lot of data, and highlight the important stuff.
  2. Update that data frequently.

The DataTables library makes a lot of sense for those requirements. It’s easy to use – just pop in a data.frame and by default you get a sortable, searchable, pageable table with a nice default style.

But then we noticed a client-side (browser) memory leak!

Over the course of 20 minutes, we watched the JavaScript heap grow from 10-15 MB to over 100 MB, and keep on going.

Eventually the browser tab would crash, and the app had to be restarted.

After investigating, we found that the server-side rendering function DT::renderDataTable actually creates a NEW instance of the DataTable every time it updates, rather than just updating the data in the existing table.

For most apps, this isn’t a big deal – a table might update 2, 3, even 10 times based on user requests, but never enough to crash a tab.

But for these apps, updating 12 DataTables every 3 seconds means 240 new DataTable instances per minute! These apps are meant to be open and running throughout an 8-hour trading day, so that was unacceptable.

It wasn’t obvious to us that creating new table instances was the default behavior, but thankfully there is a clear solution in RStudio’s documentation of the DT package. See section 2.3 of this page, or their example Shiny app here.

We just needed to create the DataTable once on initialization with DT::renderDataTable, and then create a dataTableProxy to continue to interact with that same table instance.

After that, we just needed to put an observe in place to listen for reactive data updates on the server side and call replaceData to insert the new data.
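Here’s a minimal sketch of that pattern – the output name, the 3-second polling, and the placeholder data are illustrative, not the production app:

library(shiny)
library(DT)

ui <- fluidPage(
  DT::dataTableOutput('prices')
)

server <- function(input, output, session) {

  # placeholder reactive data source that refreshes every 3 seconds
  table_data <- reactivePoll(3000, session,
    checkFunc = function() Sys.time(),
    valueFunc = function() data.frame(symbol = LETTERS[1:5],
                                      px     = round(rnorm(5, 100), 2))
  )

  # render the table ONCE, with isolate() so it doesn't re-render on updates
  output$prices <- DT::renderDataTable({
    isolate(table_data())
  })

  # the proxy points at the existing table instance in the browser
  proxy <- DT::dataTableProxy('prices')

  # on each data update, replace the data in place instead of building a new table
  observe({
    DT::replaceData(proxy, table_data(), resetPaging = FALSE)
  })
}

shinyApp(ui, server)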

Simple! And easy on the memory!

There is one catch: this did not work well with data.frames that require future() and the %...>% promise pipe to move processing into another R session. That’s because the initial table gets rendered with NULL data, and the subsequent updates (real data.frames) then have a different set of columns. It may be possible to combine the two approaches, but it wasn’t immediately obvious how.

Old Tools for New Problems – Using Awk to Transform Log Files into Analysis Data

We’re doing time series analysis on text log files that contain about 20 million lines per day, with some complexities:

  1. Timestamps are separate observations from the rest of the data – each timestamp applies to all the observations until the next timestamp appears. So already we have a challenge that R isn’t built for, since those timestamps need to be rolled forward in a loop-type structure.
  2. We don’t need all the data. The raw files contain output from several different samples, and our series of interest is only about 20% of the overall data. But those series are defined by values in the data that we don’t know beforehand, so we have to read all the data first, then keep the appropriate lines.

In a situation like this, it’s best to test code on a smaller sample of observations first.

But even for just 10,000 lines, reading and processing the data into a well-organized data.frame took several minutes, even using the readr package. Scaling that up to the 20 million records in a full day’s file made no sense – we’d spend weeks just assembling a data set. There had to be a better way.

We looked at Julia and Nim, but weren’t sure it was worth bringing in an entire general-purpose programming language just to reshape a data file, although there was potential to do some of the time series analysis in Julia.

Then we came back to awk, a lightweight text-processing programming language and utility that’s been around since the 1970s and is available on almost every UNIX-based system (meaning there would be no need to worry about Julia/Nim availability and version consistency).

Getting started was easy, and thanks to a few short tutorials on advanced techniques, we wrote a 32-line script that:

  1. Rolled timestamps forward to all observations.
  2. Decoded the series identifiers and dropped observations from series that we don’t need.
  3. Extended all observations to have the same set of variables, so the output file could be read as CSV.

The awk script needs only 15 seconds to process a file with 20 million records, and the subsequent conversion from CSV to RDS creates a file that’s about 3.5% the size of the original.
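As a rough sketch of that workflow – the awk snippet and file names below are illustrative, not the actual 32-line script – the awk step and the CSV-to-RDS conversion can be driven from R:

# example awk idiom for rolling timestamps forward: lines starting with 'TS'
# update the current timestamp, and every other line is emitted with it prepended
awk_prog <- '
/^TS/ { ts = $2; next }
      { print ts "," $0 }
'

# run awk on the raw log and write its output to a CSV file
system2('awk', args = c(shQuote(awk_prog), 'raw_log.txt'), stdout = 'clean.csv')

# read the cleaned CSV and save it as a much smaller RDS file
clean <- readr::read_csv('clean.csv', col_names = FALSE)
saveRDS(clean, 'clean.rds', compress = 'xz')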

This is a good example of using the right tool for the job, and we’ll definitely rely on awk more in the future for reshaping and processing text files.

Speeding Up A Shiny App: Future/Promise and Caching

We build a lot of Shiny apps, and most work fine without a lot of customization.

But special cases require some fine-tuning to get everything working correctly, especially with a lot of simultaneous users.

One recent app was built for a financial trading firm, and needed to be open and responsive for a large set of traders all day long.

But there was a major barrier: every five seconds, data needed to be pulled from a database on another continent back to the US, transformed with some relatively processor-intensive steps, and only then were results displayed on the screen…maybe only a second or two before the next update cycle would begin.

For one user, no problem – but for two, three, four, …, users? When using the open source version of Shiny Server, all sessions are working in a single R process, which means each session has to wait for other sessions to update first.

As soon as the processing for all sessions takes longer than the five-second update interval, you can think of it like a line (for movie tickets, stadium entrance, grocery check-out, etc.): for every person finishing up at the front of the line, more than one person is joining the back of the line.

There’s no way for the server to catch up!
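To put some illustrative numbers on it (these are not measurements from the app), suppose each session’s update takes about two seconds in the single R process:

cycle_secs    <- 5       # update interval
work_per_user <- 2       # hypothetical processing time per session
users         <- 1:6

# seconds of unfinished work added every cycle, by number of concurrent users
backlog_per_cycle <- pmax(0, users * work_per_user - cycle_secs)
data.frame(users, backlog_per_cycle)

Past two concurrent users in this toy example, the work added each cycle exceeds the five-second window, so the backlog grows every cycle and never clears.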

Our first approach was to move the long-running steps into separate R processes, which is relatively easy to do with the async functionality added to recent versions of Shiny, along with the promises and future packages.

There’s a helpful blog post here from RStudio about how to implement async processing in a Shiny app with a basic example. We found this really helpful as a starting point.
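Here’s a minimal sketch of that pattern – query_remote_db() and transform_orders() are placeholders standing in for the real database pull and transformation, not code from the actual app:

library(shiny)
library(promises)
library(future)
plan(multisession)   # futures run in separate R processes

# placeholder versions of the slow steps
query_remote_db  <- function() { Sys.sleep(3); data.frame(order_id = 1:100, qty = rpois(100, 5)) }
transform_orders <- function(d) { d$notional <- d$qty * runif(nrow(d), 10, 100); d }

ui <- fluidPage(tableOutput('orders'))

server <- function(input, output, session) {
  output$orders <- renderTable({
    invalidateLater(5000)    # refresh every five seconds
    future({
      transform_orders(query_remote_db())   # heavy work happens off the main process
    }) %...>% head(20)
  })
}

shinyApp(ui, server)

This moves the database pull and the transformation off the main R process, so one session’s slow update no longer blocks the others.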

So that helped the app quite a bit – instead of hitting timing limits at two or three concurrent sessions, we could support roughly as many sessions as there were processors on the server. But then we hit another wall.

That’s when we decided to combine the async approach with another nice feature of the maturing Shiny ecosystem: the in-memory cache.

In this case, the data coming from the database, and the transformed results of that data, are the same for every session. Only minor things differ across sessions, like filter parameters (e.g. only show orders greater than $X).

That means all the heavy lifting can be done once, by whichever session started first, and saved in the cache; all the other sessions can simply read from the cache.

How do we keep track of live sessions? We created a global object called session_tokens, initialized as character(0).
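The post doesn’t show the setup code, but a sketch of the global objects might look like this – one option for the in-memory cache is Shiny’s built-in memoryCache(); the actual app’s cache object may differ:

# in global.R: one shared in-memory cache plus the vector of live session tokens
library(shiny)

cache          <- memoryCache(max_size = 200 * 1024^2)   # ~200 MB in-memory cache
session_tokens <- character(0)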

Then within the server function:

if (!(session$token %in% session_tokens)) {
  session_tokens <<- c(session_tokens, session$token)
}

After that, every step that should be processed only once looks like this:

if (session$token == session_tokens[1]) {
  message(' -- setting cache: ', session$token)

  # heavy lifting

  cache$set('key', value)
} else {
  message(' -- reading cache: ', session$token)
  value <- cache$get('key')
}

This does cause a slight delay for the sessions reading from the cache, since they have to wait for the first session to finish processing and populate the cache.

However, that slight delay is a small price to pay to avoid an ever-growing session queue, or maxing out all processors on the server for two out of every five seconds.

There is one more thing to do, which is to remove session tokens when each session ends – this way if the current first session ends, the next one can become the session doing the work:

onSessionEnded(function() {
  message(' -- ending session ', session$token)
  session_tokens <<- setdiff(session_tokens, session$token)
})

This was a fun challenge, and yet another reminder of how powerful Shiny apps can be these days, a great improvement over the somewhat limited world of three or four years ago.

Shiny, Google Maps, and Voronoi Diagrams

There was a comment on a recent Hacker News thread about a world airport Voronoi map that said “if only there was a webpage/software where someone could click/select points on a map…and a user Voronoi diagram would be created ;-).”

I knew such a tool already existed, but I thought I might as well try to implement one myself, so I put together the pieces:

  1. The deldir R package can quickly create Voronoi lines for a given set of two-dimensional points (a minimal sketch of that calculation follows this list).
  2. Google Maps, since I knew how to handle events like clicking to add a point, dragging the points, and double-clicking to remove. Plus it’s easy to draw the Voronoi lines.
  3. Shiny can link that quick calculation of Voronoi lines with the front-end maps library API, so that user events and the server-side data stay in sync.
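The server-side piece is a quick call to deldir() – here’s a sketch with a few hypothetical points, not the app’s actual code:

library(deldir)

# three points as longitude/latitude (treated as plain x/y here; the app's
# geodesic option handles the curvature separately)
pts <- data.frame(lon = c(-80.24, -78.64, -77.34),
                  lat = c( 36.10,  35.78,  38.95))

tess  <- deldir(pts$lon, pts$lat)
edges <- tess$dirsgs    # one row per Voronoi edge: x1, y1, x2, y2, ...
head(edges)

Shiny then just has to recompute those edges whenever the points change and hand them to the Google Maps side to draw.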

A live demo version is available on our Shiny demo server: https://shiny.modernresearchconsulting.com/create-voronoi

[Screenshot: the create-voronoi app]

The user interface is pretty simple – click to add points, drag to move them, and double-click to remove them – and the Voronoi diagram updates automatically.

There’s also an option to change the lines from straight to geodesics (following the curve of the earth).

The code is on Github and MIT-licensed.

I’d love to add a way to load sets of points at once (US state capitals, 100 tallest mountains, sports arenas, etc.) when I have more time to work on this.

Using NextJournal for World Cup Data Analysis

I have a couple of contacts working on the new-and-improved version of NextJournal.

I tried it out about a year ago when they first launched, and thought it was pretty cool, but never got a real project going.

So when I got an email about their re-launch, I decided to actually put some time into learning how it works.

So far I’ve published two articles of data analysis and visualization with World Cup historical data. Here are links to check it out:

https://nextjournal.com/mdec/world-cup-host-advantage-and-curse-of-the-champion

https://nextjournal.com/mdec/world-cup-scoring-analysis

Full disclosure: I was paid for some of the time spent working on these NextJournal articles, but I was not asked to write this blog post, nor to express any opinions, positive or negative.

Winston-Salem Police Response Data

The previous post was all about an open data / open source strategy.

There’s plenty of data available from public sites that can be turned into useful tools, and one of the sources we’ve focused on recently is police response data.

Winston-Salem Police Department publishes a simple text file daily, containing information about all the responses from the previous day: report number, address, time of day, and the type of issue (for example, vandalism, motor vehicle theft, arson, etc.).

But the interface isn’t very useful: no aggregation, no filtering, no visualization, nothing but daily text files.

WSPD does contract with crimemapping.com to display individual responses on a map, which can be filtered going back several months, and users can even receive email alerts for activity within a specified radius of any address.

[Screenshot: the crimemapping.com map of WSPD responses]

That’s great! But of course we wondered if an open-source tool would be possible.

So we’ve released a minimal version of a Shiny dashboard as an example:

https://shiny.modernresearchconsulting.com/wspd-response-dashboard/

And we’ve also released the source code here:

https://github.com/modernresearchconsulting/wspd-response-dashboard

Open Data and Cool Maps

We’ve started offering clients open data and open source strategy consulting, and hope to have some case studies in the coming months.

Releasing data and software (or even both) to the public seems counterproductive to good business – how could it do anything but feed the competition, or draw attention to mistakes?

It’s worth thinking one step deeper:

  1. Releasing code and data can be a conspicuous demonstration of competence, a great way to draw incoming leads and even recruit to your organization.
  2. It also signals that your value is greater than the contents of files – you’re not worried about giving away the data, because where you really shine is customization and customer service, which is much harder to copy.
  3. In a world where a million companies are selling software and services for a monthly subscription, there’s a sharp discontinuity in user acquisition at $0. Start them for free, then offer upgrades in the form of additional features or support.
  4. Competitors may learn from your data and/or your code, but that’s a short-term effect – if the overall market grows, that’s good for you, even if your competitor gets a chunk of that market.
  5. If you have mistakes in your code or your data, you want to know as soon as possible, and what better way to find out than to increase usage by offering it for free? It’s almost like outsourced testing – and in our experience, fixing a publicly known bug or inaccuracy gracefully, with polite thanks to the finder, does no harm to your reputation.

We talk about #4 a lot – if we’re unable to fit a project into our schedule, we don’t hesitate to refer the client to a competitor. If there are more groups delivering successful projects in our field, that’s great!
