We’re doing time series analysis on text log files that contain about 20 million lines per day, with some complexities:
- Timestamps are separate observations from the rest of the data – each timestamp applies to all the observations until the next timestamp appears. So already we have a challenge that R isn’t built for, since those timestamps need to be rolled forward in a loop-type structure.
- We don’t need all the data. The raw files contain output from several different samples, and our series of interest is only about 20% of the overall data. But those series are defined by values in the data that we don’t know beforehand, so we have to read all the data first, then keep the appropriate lines.
In a situation like this, it’s best to test code on a smaller sample of observations first.
But even for just 10,000 lines, it was taking several minutes to read and process into a well-organized data.frame, even using the
readr package. Scaling that up 20x just to clean a single day of records made no sense – we’d spend weeks just assembling a data set. There had to be a better way.
We looked at Julia and Nim, but weren’t sure it was worth bringing in an entire general purpose programming language for reshaping a data file, although there was potential to do some of the time series analysis in Julia.
Then we came back to awk, a lightweight text-processing programming language and utility that’s been around since the 1970s and is available on almost every UNIX-based system (meaning there would be no need to worry about Julia/Nim availability and version consistency).