We’re doing time series analysis on text log files that contain about 20 million lines per day, with some complexities:
- Timestamps are recorded as separate observations from the rest of the data – each timestamp applies to all the observations until the next timestamp appears. That is already a challenge R isn’t well suited to, since the timestamps have to be rolled forward in a loop-type structure.
- We don’t need all the data. The raw files contain output from several different samples, and our series of interest is only about 20% of the overall data. But those series are defined by values in the data that we don’t know beforehand, so we have to read all the data first, then keep the appropriate lines.
In a situation like this, it’s best to test code on a smaller sample of observations first.
But even for just 10,000 lines, it was taking several minutes to read and process into a well-organized data.frame, even using the readr package. Scaling that up to a full day of records (roughly 2,000 times as many lines) made no sense – we’d spend weeks just assembling a data set. There had to be a better way.
We looked at Julia and Nim, but weren’t sure it was worth bringing in an entire general-purpose programming language just to reshape a data file, although there was potential to do some of the time series analysis in Julia.
Then we came back to awk, a lightweight text-processing programming language and utility that’s been around since the 1970s and is available on almost every UNIX-based system (meaning there would be no need to worry about Julia/Nim availability and version consistency).
Getting started was easy, and thanks to a few short tutorials on advanced techniques, we wrote a 32-line script that:
- Rolled timestamps forward to all observations.
- Decoded the series identifiers and dropped observations from series that we don’t need.
- Extended all observations to have the same set of variables, so the output file could be read as CSV.
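To make those three steps concrete, here is a minimal sketch of the same idea. It is not our actual script. The input layout (a `TS` marker line followed by observation lines of the form `series key=value ...`), the function name `reshape_log`, and the `keep` and `vars` settings are all assumptions made up for illustration. It also assumes the series of interest is already known, which sidesteps the identifier decoding our real data needed.

```shell
# Hypothetical input format (an assumption, not the real log layout):
#   TS 2021-03-01T00:00:01    <- timestamp line, applies to the lines below it
#   A x=1 y=2                 <- observation: series id, then key=value pairs
reshape_log() {
  awk -v keep="A" -v vars="x,y" '
    BEGIN {
      n = split(vars, names, ",")          # fixed set of output columns
      printf "timestamp,series"
      for (i = 1; i <= n; i++) printf ",%s", names[i]
      printf "\n"
    }
    $1 == "TS" { ts = $2; next }           # remember the latest timestamp
    $1 != keep { next }                    # drop series we do not need
    {
      split("", val)                       # clear values from the previous line
      for (i = 2; i <= NF; i++) {
        eq = index($i, "=")
        if (eq > 0) val[substr($i, 1, eq - 1)] = substr($i, eq + 1)
      }
      printf "%s,%s", ts, $1               # roll the timestamp forward
      for (i = 1; i <= n; i++) printf ",%s", val[names[i]]   # empty if missing
      printf "\n"
    }
  ' "$@"
}

# Example run on a tiny made-up sample
printf 'TS 2021-03-01T00:00:01\nA x=1 y=2\nB x=9\nA y=5\nTS 2021-03-01T00:00:02\nA x=3\n' | reshape_log
```

On that sample, the output is a header line `timestamp,series,x,y` followed by three rows for series `A`, with empty fields where a variable was missing (for example `2021-03-01T00:00:01,A,,5`). Because awk streams the file one line at a time and only keeps the current timestamp in memory, this kind of pass stays cheap no matter how long the file is.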
The awk script needs only 15 seconds to process a file with 20 million records, and the subsequent conversion from CSV to RDS creates a file that’s about 3.5% of the size of the original.
This is a good example of using the right tool for the job, and we’ll definitely rely on awk more in the future for reshaping and processing text files.