1 Intro: dplyr::group_walk()

Earlier today I stumbled upon a Tweet by mikefc/@coolbutuseless which led to two serendipitous findings:

  1. It is now proven that Hadley Wickham does use the controversial reverse assignment ->.

Just see for yourself in Hadley’s Slides and maybe archive the slide as a screenshot… From now on, no one can ever blame me for using ->!

  2. But far more importantly: the freshly released dplyr 0.8 includes a whole bunch of new group_verb() functions such as group_split(), group_map(), and group_walk(). See the full changelog for all the new goodies and fixes.

At that moment I was already soooo tempted to write a script which splits a huge df into several CSV files (a legit use case at work; JSON would be a nice extra). I had already used purrr::map_dfr() once to read in a whole bunch of .rds files, so there had to be a method for doing the reverse. But then, rather by accident, I discovered the group_map() vignette linked at the bottom of the group_split() vignette, which also happened to include group_walk() AND the basic recipe for a CSV output function… Serendipity indeed.

All it took me from there was to “tidify” the syntax a little and add a fix for the dropped grouping variable. I posted the result of my excitement on Twitter, and added a jsonlite solution for JSON output. The post went micro-viral, and @dreznik suggested further improvements:

So here we go:

library(tidyverse)
library(gapminder) # data for the demo

Here’s what the gapminder data looks like. It’s a regular df with 1704 observations for 142 countries.

gapminder %>% head() %>% knitr::kable("html")
country      continent   year  lifeExp       pop  gdpPercap
Afghanistan  Asia        1952   28.801   8425333   779.4453
Afghanistan  Asia        1957   30.332   9240934   820.8530
Afghanistan  Asia        1962   31.997  10267083   853.1007
Afghanistan  Asia        1967   34.020  11537966   836.1971
Afghanistan  Asia        1972   36.088  13079460   739.9811
Afghanistan  Asia        1977   38.438  14880372   786.1134
gapminder %>% 
  skimr::skim_to_wide() %>%
  select(-top_counts, -(p0:p100)) %>% 
  knitr::kable("html", 1)
type     variable   missing  complete     n  n_unique  ordered     mean       sd  hist
factor   continent        0      1704  1704         5    FALSE       NA       NA  NA
factor   country          0      1704  1704       142    FALSE       NA       NA  NA
integer  pop              0      1704  1704        NA       NA    3e+07  1.1e+08  ▇▁▁▁▁▁▁▁
integer  year             0      1704  1704        NA       NA   1979.5    17.27  ▇▃▇▃▃▇▃▇
numeric  gdpPercap        0      1704  1704        NA       NA  7215.33  9857.45  ▇▁▁▁▁▁▁▁
numeric  lifeExp          0      1704  1704        NA       NA    59.47    12.92  ▁▂▅▅▅▅▇▃

2 Vectorized CSV Output

path for the output folder

First, we check if the desired output folder exists, and create it if not. Note that I’m using here::here() to create a relative path for my blogdown website’s data folder. You could stick to the commented-out "output_csv" line below, without creating a csv_path object, if you wish.

csv_path <- here::here("data", "output_csv")
# if(!dir.exists("output_csv")) {dir.create("output_csv")}
if (!dir.exists(csv_path)) {dir.create(csv_path)}

We could also use the tidier fs methods for file-system access (again, per @dreznik), but I guess it’s also ok to stick to some comfy Base R habits.
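
For reference, here is a minimal sketch of that fs variant (the demo_path location is hypothetical, chosen so it doesn’t clash with the csv_path used in this post):

```r
library(fs)

# dir_create() is idempotent: it creates the directory (recursively if
# needed) and silently does nothing when it already exists, so no
# dir.exists() guard is necessary.
demo_path <- path(tempdir(), "output_csv_demo")  # hypothetical demo location
dir_create(demo_path)
dir_exists(demo_path)
```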

CSVs

Now we group the gapminder data by the country variable (remember: 142 unique countries) and add an extra country_name column to preserve the country names, which are otherwise dropped by group_walk() (keep=TRUE does not work for this verb). Then we move country_name to the front and write the groups as CSVs to our csv_path. Again, kudos to @dreznik for the more robust .y[["country"]] evaluation hint!

gapminder %>%
  group_by(country) %>%
  mutate(country_name = country) %>% # to preserve country name in csv object...
  select(country_name, everything()) %>% # ... which is somehow dropped by group_walk()
  group_walk(~ write_csv(.x, path = fs::path(csv_path, .y[["country"]], ext = "csv")))
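
As an aside, the .x/.y pair that group_walk() hands to the lambda can be spelled out with group_split(), group_keys(), and purrr::walk2(). This is only a sketch writing to a throwaway folder, not the post’s csv_path; note that group_split() keeps the grouping column by default, so the country_name workaround isn’t needed here:

```r
library(tidyverse)
library(gapminder)

demo_dir <- file.path(tempdir(), "output_csv_walk2")  # throwaway demo folder
dir.create(demo_dir, showWarnings = FALSE)

groups <- gapminder %>% group_split(country)  # list of 142 tibbles; country column kept
keys   <- gapminder %>% group_keys(country)   # 142 x 1 tibble of grouping keys

walk2(groups, keys$country,
      ~ write_csv(.x, file.path(demo_dir, paste0(.y, ".csv"))))
```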

Output

Let’s inspect the output folder:

list.files(csv_path, pattern = "csv$") %>% head(20)
##  [1] "Afghanistan.csv"            "Albania.csv"               
##  [3] "Algeria.csv"                "Angola.csv"                
##  [5] "Argentina.csv"              "Australia.csv"             
##  [7] "Austria.csv"                "Bahrain.csv"               
##  [9] "Bangladesh.csv"             "Belgium.csv"               
## [11] "Benin.csv"                  "Bolivia.csv"               
## [13] "Bosnia and Herzegovina.csv" "BosniaandHerzegovina.csv"  
## [15] "Botswana.csv"               "Brazil.csv"                
## [17] "Bulgaria.csv"               "Burkina Faso.csv"          
## [19] "BurkinaFaso.csv"            "Burundi.csv"

A shocker for later: reading all of them back in is a one-liner, fs::dir_map(csv_path, read.csv)

And now a single CSV file:

read_csv(fs::path(csv_path, "Afghanistan.csv")) %>% head() %>% knitr::kable("html")
country_name  continent  year  lifeExp       pop  gdpPercap
Afghanistan   Asia       1952   28.801   8425333   779.4453
Afghanistan   Asia       1957   30.332   9240934   820.8530
Afghanistan   Asia       1962   31.997  10267083   853.1007
Afghanistan   Asia       1967   34.020  11537966   836.1971
Afghanistan   Asia       1972   36.088  13079460   739.9811
Afghanistan   Asia       1977   38.438  14880372   786.1134

frie/@ameisen_strasse rightly pointed out that these file names contain white space, so we might want to run .y[["country"]] through stringr::str_replace_all(). That way we can either replace all the white space with _ or CamelCase the file names. Also, have a look at Congo, Rep. and Congo, Dem. Rep. below; we might want to account for that naming pattern too.

gapminder::gapminder %>% filter(str_detect(country, "Congo")) %>% distinct(country)
## # A tibble: 2 x 1
##   country         
##   <fct>           
## 1 Congo, Dem. Rep.
## 2 Congo, Rep.

Here’s the space/period/comma-to-CamelCase workaround: we can use the explicit " |\\.|," pattern, but of course we might want something more general such as [:punct:]|[:blank:] or even the super-strict [^a-zA-Z]:
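
As a quick base-R sanity check (using gsub(); note that base regexes want the POSIX classes double-bracketed, i.e. [[:punct:]]), all three patterns agree on one of the tricky names:

```r
nm <- "Congo, Dem. Rep."

gsub(" |\\.|,", "", nm)                  # explicit space/period/comma list
gsub("[[:punct:]]|[[:blank:]]", "", nm)  # any punctuation or blank
gsub("[^a-zA-Z]", "", nm)                # super strict: keep letters only
# all three yield "CongoDemRep"
```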

gapminder %>%
  group_by(country) %>%
  mutate(country_name = country) %>% # to preserve country name in csv object...
  select(country_name, everything()) %>% # ... which is somehow dropped by group_walk()
  group_walk(~ write_csv(.x, path = fs::path(csv_path, str_replace_all(.y[["country"]], "[^a-zA-Z]", ""), ext = "csv")))

And have you heard of the new stringr::str_squish(), which collapses multiple white spaces into a single one? Combined with str_replace_all(., "[^a-zA-Z]", "_"), it gets us to any_pretty_file_name.csv.
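
In base R, str_squish() corresponds roughly to trimming plus collapsing internal runs of white space; a sketch of that squish-then-replace idea:

```r
messy <- "  Congo,   Dem.  Rep.  "

squished <- trimws(gsub("\\s+", " ", messy))  # roughly stringr::str_squish()
squished                                      # "Congo, Dem. Rep."

# collapse each run of non-letters into a single underscore,
# then drop a trailing one
sub("_+$", "", gsub("[^a-zA-Z]+", "_", squished))  # "Congo_Dem_Rep"
```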

Nice! But what if we want … JSON files!?

3 Vectorized JSON Output

jsonlite ships with the tidyverse installation (though it isn’t attached by library(tidyverse)), so let’s see how that works:

library(jsonlite)

same procedure, different path

json_path <- here::here("data", "output_json")
# if(!dir.exists("output_json")) {dir.create("output_json")}
if (!dir.exists(json_path)) {dir.create(json_path)}

Seriously, all we need to do is replace readr::write_csv() with jsonlite::write_json() and add pretty = TRUE for nicely indented JSON output.

gapminder %>%
  group_by(country) %>%
  mutate(country_name = country) %>% # to preserve country name in json object...
  select(country_name, everything()) %>% # ... which is somehow dropped by group_walk()
  group_walk(~ write_json(.x, pretty = TRUE, path = fs::path(json_path, str_replace_all(.y[["country"]], "[^a-zA-Z]", ""), ext = "json")))

inspect the folder

list.files(json_path, pattern = "json$") %>% head(20)
##  [1] "Afghanistan.json"          "Albania.json"             
##  [3] "Algeria.json"              "Angola.json"              
##  [5] "Argentina.json"            "Australia.json"           
##  [7] "Austria.json"              "Bahrain.json"             
##  [9] "Bangladesh.json"           "Belgium.json"             
## [11] "Benin.json"                "Bolivia.json"             
## [13] "BosniaandHerzegovina.json" "Botswana.json"            
## [15] "Brazil.json"               "Bulgaria.json"            
## [17] "BurkinaFaso.json"          "Burundi.json"             
## [19] "Cambodia.json"             "Cameroon.json"

inspect the JSON output read in as df

AFG_JSON <- read_json(fs::path(json_path, "Afghanistan.json"), simplifyVector = TRUE)
AFG_JSON %>% head() %>% knitr::kable("html")
country_name  continent  year  lifeExp       pop  gdpPercap
Afghanistan   Asia       1952   28.801   8425333   779.4453
Afghanistan   Asia       1957   30.332   9240934   820.8530
Afghanistan   Asia       1962   31.997  10267083   853.1007
Afghanistan   Asia       1967   34.020  11537966   836.1971
Afghanistan   Asia       1972   36.088  13079460   739.9811
Afghanistan   Asia       1977   38.438  14880372   786.1134

And finally, the first observation printed as pretty JSON:

toJSON(AFG_JSON[1,], pretty = TRUE) # first observation
## [
##   {
##     "country_name": "Afghanistan",
##     "continent": "Asia",
##     "year": 1952,
##     "lifeExp": 28.801,
##     "pop": 8425333,
##     "gdpPercap": 779.4453
##   }
## ]

That’s it. These two methods are going to be very helpful to me, and I hope they will be as useful to others, if only for didactic reasons: as in many other cases, I only ended up playing with the new dplyr verbs after seeing an interesting example on Twitter.

4 Last but not least: the data.table way

Shortly after my initial Tweet, @michael_chirico suggested I take a look at his data.table solution. I prefer the tidy grammar over raw performance (at least for smaller data sets), but this looks very pragmatic:
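
I haven’t copied his exact code, but the general data.table idiom looks roughly like this sketch (hypothetical output folder; fwrite() is data.table’s fast CSV writer and .SD is the current group’s subset):

```r
library(data.table)

dt_dir <- file.path(tempdir(), "output_csv_dt")  # hypothetical demo folder
dir.create(dt_dir, showWarnings = FALSE)

DT <- as.data.table(gapminder::gapminder)

# With by = country, `country` inside j is the single group value and .SD
# holds the remaining columns; cbind() the name back in to mirror the
# country_name trick from above.
DT[, fwrite(cbind(country_name = country, .SD),
            file.path(dt_dir, paste0(gsub("[^a-zA-Z]", "", country), ".csv"))),
   by = country]
```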