Collin K. Berke, Ph.D.
  • Home
  • About
  • Now
  • Blog
  • Today I Learned

On this page

  • Background
  • separate_longer_delim()
  • separate_wider_position()
  • separate_wider_delim()
  • Too much or Too little data
    • too_many
    • too_few
  • separate_wider_regex()
  • Wrap up

Other Links

  • Chapter 14: Strings R4DS
  • separate_wider_* docs
  • separate_longer_* docs
  • tidyr 1.3.0 release notes

Separate character strings into rows and columns using tidyr functions

til
tidyr
data wrangling
Need to separate strings? Use the separate_* family of functions
Author

Collin K. Berke, Ph.D.

Published

December 27, 2024

Photo by Johnny Briggs

Background

TIL, as of tidyr 1.3.0, there’s a new family of string separation functions:

  • separate_wider_delim()
  • separate_wider_position()
  • separate_wider_regex()
  • separate_longer_delim()
  • separate_longer_position()

These functions generally do two things:

  1. Separate strings into individual rows or columns.
  2. Separate strings by some character delimiter, position, or regular expression.

If you’ve used tidyr in the past, you’re likely familiar with the separate() function. This function was useful in cases where character strings needed to be separated into different columns based on a pattern. While re-reading the 2nd edition of the R for Data Science book for the Online Data Science Learning Community (check us out here), I was reminded separate() was superseded by this family of functions. For myself, I decided writing a post was needed to better understand how to use these functions. In this post I’ll describe each function and build on what’s discussed in the book by sharing some examples I might use these functions.

library(tidyverse)

separate_longer_delim()

Let’s start with separate_longer_delim(). This function takes a character string and turns it into additional rows using a delimiter. We’ll need some example data.

I’m a college sports fan, so let’s create a dataset containing Big Ten and Big 12 conference teams. Here’s the code to create some example data:

big_ten <- str_c(
  "Illinois",
  "Indiana",
  "Iowa",
  "Maryland",
  "Michigan",
  "Michigan State",
  "Minnesota",
  "Nebraska",
  "Northwestern",
  "Ohio State",
  "Oregon",
  "Penn State",
  "Purdue",
  "Rutgers",
  "UCLA",
  "USC",
  "Washington",
  "Wisconsin",
  sep = ", "
)

big_12 <- str_c(
  "Arizona State",
  "Iowa State",
  "BYU",
  "Colorado",
  "Baylor",
  "TCU",
  "Texas Tech",
  "Kansas State",
  "West Virginia",
  "Kansas",
  "Cincinnati",
  "Houston",
  "Utah",
  "Arizona",
  "UCF",
  "Oklahoma State",
  sep = ", "
)

data_teams <- tibble(
  conferene = c("Big Ten", "Big 12"),
  teams = c(big_ten, big_12)
)

data_teams
# A tibble: 2 × 2
  conferene teams                                                                                   
  <chr>     <chr>                                                                                   
1 Big Ten   Illinois, Indiana, Iowa, Maryland, Michigan, Michigan State, Minnesota, Nebraska, North…
2 Big 12    Arizona State, Iowa State, BYU, Colorado, Baylor, TCU, Texas Tech, Kansas State, West V…

In practice, this dataset isn’t really useful: having a list of teams in a character string doesn’t afford us any ability to apply other operations to the data. It would be useful if the data were in long format, where each team is on its own row, and the conference is repeated. It’s also important to recognize each team in the character string are delimited by a comma. Let’s use the separate_longer_delim() function to make it more useful. Here’s what the code looks like:

data_teams |>
  separate_longer_delim(
    cols = teams,
    delim = ", "
  )
# A tibble: 34 × 2
   conferene teams         
   <chr>     <chr>         
 1 Big Ten   Illinois      
 2 Big Ten   Indiana       
 3 Big Ten   Iowa          
 4 Big Ten   Maryland      
 5 Big Ten   Michigan      
 6 Big Ten   Michigan State
 7 Big Ten   Minnesota     
 8 Big Ten   Nebraska      
 9 Big Ten   Northwestern  
10 Big Ten   Ohio State    
# ℹ 24 more rows

separate_wider_position()

separate_longer_position() separates character strings into additional rows using a position index, rather than some type of character delimiter. The book mentions the use of separate_longer_position() is quite rare. In fact, I struggled thinking of a practical example of where this might be applied. As such, this example is a little contrived. How about days it rained in Lincoln, NE for the months of September and October, 2024?

rain_sept <- str_c(
  "N", "N", "N", "N", "N",
  "Y", "N", "N", "N", "Y",
  "N", "N", "N", "N", "N",
  "N", "Y", "Y", "Y", "N",
  "Y", "Y", "Y", "Y", "N",
  "N", "N", "N", "N", "N"
)

rain_oct <- str_c(
  "N", "N", "Y", "N", "N",
  "N", "N", "N", "N", "N",
  "N", "N", "N", "N", "N",
  "N", "N", "N", "N", "N",
  "Y", "N", "Y", "N", "N",
  "N", "N", "N", "N", "Y",
  "N"
)

data_lnk_rain <- tibble(
  year = 2024,
  month = c("September", "October"),
  rain = c(rain_sept, rain_oct)
)

data_lnk_rain
# A tibble: 2 × 3
   year month     rain                           
  <dbl> <chr>     <chr>                          
1  2024 September NNNNNYNNNYNNNNNNYYYNYYYYNNNNNN 
2  2024 October   NNYNNNNNNNNNNNNNNNNNYNYNNNNNNYN

Again, what if we want longer data, where each day’s observation gets its own row? The character strings in the rain variable don’t really have a point of reference where to separate the string. The only point of reference we have is positionally using some type of index. In this case, we have a character string of length 1 representing whether precipitation was observed that day: “Y” = rain; “N” = no rain. While we’re at it, let’s also mutate a day column for each observation. The following code chunk demonstrates how to do this:

data_lnk_rain |>
  separate_longer_position(rain, width = 1) |>
  group_by(year, month) |>
  mutate(day = 1:n(), .before = rain)
# A tibble: 61 × 4
# Groups:   year, month [2]
    year month       day rain 
   <dbl> <chr>     <int> <chr>
 1  2024 September     1 N    
 2  2024 September     2 N    
 3  2024 September     3 N    
 4  2024 September     4 N    
 5  2024 September     5 N    
 6  2024 September     6 Y    
 7  2024 September     7 N    
 8  2024 September     8 N    
 9  2024 September     9 N    
10  2024 September    10 Y    
# ℹ 51 more rows

separate_wider_delim()

Separating strings into columns is a common operation. At times, extra variables are crammed into one variable, where it would be useful if the data was separated into additional variables. Let’s go back to our college teams example data.

This data will combine team names with conferences. However, it’s often useful to have a separate variable only containing the conference (i.e., we may want to filter out specific conferences for an analysis). At times, our data will look like this:

data_team_conference <- tibble(
  team = c(
    "Nebraska (Big Ten)",
    "Texas (Big 12)",
    "Pittsburgh (ACC)",
    "Creighton (Big East)",
    "Texas A&M (SEC)"
  )
)

data_team_conference
# A tibble: 5 × 1
  team                
  <chr>               
1 Nebraska (Big Ten)  
2 Texas (Big 12)      
3 Pittsburgh (ACC)    
4 Creighton (Big East)
5 Texas A&M (SEC)     

We can now separate this column of character strings into multiple columns using separate_wider_delim(). In the end, we want two columns: team and conference. Column names are passed as a character string of names to the names argument. The ( character will be used as the delimiter to separate into the two columns, passed to the delim argument. The last step is to use mutate() and str_remove() to remove the extra ) at the end of the character string. Here is the code to perform all these steps:

data_team_conference |>
  separate_wider_delim(
    team,
    delim = " (",
    names = c("team", "conference")
  ) |>
  mutate(
    conference = str_remove(conference, "\\)")
  )
# A tibble: 5 × 2
  team       conference
  <chr>      <chr>     
1 Nebraska   Big Ten   
2 Texas      Big 12    
3 Pittsburgh ACC       
4 Creighton  Big East  
5 Texas A&M  SEC       

Too much or Too little data

One common problem arises when you separate strings, though: sometimes you end up with too much or too little data. Moreover, at times, you need more information to debug problems with the separation step. separate_wider_delim() provides several arguments when confronted with these issues. These arguments provide additional functionality, including creating additional columns, dropping extra data, merging, or outputting debugging information.

The too_few and too_many arguments are available to help in these situations. I highlight their utility with a couple of examples to follow. However, I suggest reading and reviewing the examples in the docs to gain a complete understanding of the utility of these arguments.

Take for example file path parsing. Let’s start with some example data.

data_file_paths <- tibble(
  path = c(
    "2024_10_01-1925-f-monthly_users",
    "2024_11_01-1925-f-monthly_users",
    "2024_12_01-1925-f-monthly_users",
    "2024_10_01-1925-m-monthly_users",
    "2024_11_01-1925-m-monthly_users",
    "2024_12_01-1925-m-monthly_users",
    "2024_10_01-2635-f-monthly_users",
    "2024_11_01-2635-f-monthly_users",
    "2024_12_01-2635-f-monthly_users",
    "2024_10_01-2635-m-monthly_users",
    "2024_11_01-2635-m-monthly_users",
    "2024_12_01-2635-m-monthly_users"
  )
)

data_file_paths
# A tibble: 12 × 1
   path                           
   <chr>                          
 1 2024_10_01-1925-f-monthly_users
 2 2024_11_01-1925-f-monthly_users
 3 2024_12_01-1925-f-monthly_users
 4 2024_10_01-1925-m-monthly_users
 5 2024_11_01-1925-m-monthly_users
 6 2024_12_01-1925-m-monthly_users
 7 2024_10_01-2635-f-monthly_users
 8 2024_11_01-2635-f-monthly_users
 9 2024_12_01-2635-f-monthly_users
10 2024_10_01-2635-m-monthly_users
11 2024_11_01-2635-m-monthly_users
12 2024_12_01-2635-m-monthly_users

Column separation can be performed either by specifying a delimiter or by position. Let’s start off by separating by delimiter by using the separate_wider_delim() function. We specify the following arguments:

  • data - The data containing the column(s) we want to separate.
  • cols - The column(s) we want to separate.
  • delim - The character delimiter we want to use to separate our information into newly mutated columns.
  • names - The names of the columns the separated column will be separated into.

In this example we’ll separate path into four columns: date, age_grp, gender, and report. Here’s what the code looks like to separate by a character delimiter:

separate_wider_delim(
  data = data_file_paths,
  cols = path,
  delim = "-",
  names = c("date", "age_grp", "gender", "report")
)
# A tibble: 12 × 4
   date       age_grp gender report       
   <chr>      <chr>   <chr>  <chr>        
 1 2024_10_01 1925    f      monthly_users
 2 2024_11_01 1925    f      monthly_users
 3 2024_12_01 1925    f      monthly_users
 4 2024_10_01 1925    m      monthly_users
 5 2024_11_01 1925    m      monthly_users
 6 2024_12_01 1925    m      monthly_users
 7 2024_10_01 2635    f      monthly_users
 8 2024_11_01 2635    f      monthly_users
 9 2024_12_01 2635    f      monthly_users
10 2024_10_01 2635    m      monthly_users
11 2024_11_01 2635    m      monthly_users
12 2024_12_01 2635    m      monthly_users

The same outcome can be achieved by separating positionally as well. Here we just use some index values passed via a semi-named vector to the widths argument. The unnamed portions are the empty, unneeded portions of the string we want to separate by. The named numeric values are the character lengths of each piece of information we want to capture into our newly created columns.

separate_wider_position(
  data = data_file_paths,
  cols = path,
  widths = c(date = 10, 1, age_grp = 4, 1, gender = 1, 1, report = 13)
)
# A tibble: 12 × 4
   date       age_grp gender report       
   <chr>      <chr>   <chr>  <chr>        
 1 2024_10_01 1925    f      monthly_users
 2 2024_11_01 1925    f      monthly_users
 3 2024_12_01 1925    f      monthly_users
 4 2024_10_01 1925    m      monthly_users
 5 2024_11_01 1925    m      monthly_users
 6 2024_12_01 1925    m      monthly_users
 7 2024_10_01 2635    f      monthly_users
 8 2024_11_01 2635    f      monthly_users
 9 2024_12_01 2635    f      monthly_users
10 2024_10_01 2635    m      monthly_users
11 2024_11_01 2635    m      monthly_users
12 2024_12_01 2635    m      monthly_users

too_many

Say we have a case where we don’t need the end of the path, specifically the -monthly_users portion. For example:

separate_wider_delim(
  data = data_file_paths,
  cols = path,
  delim = "-",
  names = c("date", "age_grp", "gender")
)
Error in `separate_wider_delim()`:
! Expected 3 pieces in each element of `path`.
! 12 values were too long.
ℹ Use `too_many = "debug"` to diagnose the problem.
ℹ Use `too_many = "drop"/"merge"` to silence this message.

We get an error. The error is the result of having more data then there are columns to separate into. To address this, we need to pass different options to the too_many argument of the function. Let’s use too_many = "debug" to receive additional information on what needs to be fixed. Although the problem is pretty straightforward here, I wanted to show this option in case you’re confronted with a situation with a more complex separation.

separate_wider_delim(
  data = data_file_paths,
  cols = path,
  delim = "-",
  names = c("date", "age_grp", "gender"),
  too_many = "debug"
)
Warning: Debug mode activated: adding variables `path_ok`, `path_pieces`, and `path_remainder`.
# A tibble: 12 × 7
   date       age_grp gender path                            path_ok path_pieces path_remainder
   <chr>      <chr>   <chr>  <chr>                           <lgl>         <int> <chr>         
 1 2024_10_01 1925    f      2024_10_01-1925-f-monthly_users FALSE             4 -monthly_users
 2 2024_11_01 1925    f      2024_11_01-1925-f-monthly_users FALSE             4 -monthly_users
 3 2024_12_01 1925    f      2024_12_01-1925-f-monthly_users FALSE             4 -monthly_users
 4 2024_10_01 1925    m      2024_10_01-1925-m-monthly_users FALSE             4 -monthly_users
 5 2024_11_01 1925    m      2024_11_01-1925-m-monthly_users FALSE             4 -monthly_users
 6 2024_12_01 1925    m      2024_12_01-1925-m-monthly_users FALSE             4 -monthly_users
 7 2024_10_01 2635    f      2024_10_01-2635-f-monthly_users FALSE             4 -monthly_users
 8 2024_11_01 2635    f      2024_11_01-2635-f-monthly_users FALSE             4 -monthly_users
 9 2024_12_01 2635    f      2024_12_01-2635-f-monthly_users FALSE             4 -monthly_users
10 2024_10_01 2635    m      2024_10_01-2635-m-monthly_users FALSE             4 -monthly_users
11 2024_11_01 2635    m      2024_11_01-2635-m-monthly_users FALSE             4 -monthly_users
12 2024_12_01 2635    m      2024_12_01-2635-m-monthly_users FALSE             4 -monthly_users

The too_many = "debug" outputs a tibble with some additional columns ( *_ok, *_pieces, *_remainder) of information. The * being the name of the column to be separated. These columns contain information to help us quickly diagnose the problem. Using our example data, we get the following:

This column is useful for identifying the presence of any variable length strings.

  • path_ok provides a boolean to quickly identify cases where the separation failed.

  • path_pieces represents the number of pieces resulting from separating the string.

  • path_remainder shows what’s left after the separation is performed. This is useful for identifying if there’s any additional information you want to retain in additional columns.

Note

Although not applicable here, a neat trick to quickly identify the columns that didn’t separate as we expected is to use the following:

debug_path |> filter(!x_ok)

Now that we have additional information to help us figure out what’s going on, we can choose another option for the too_many argument to handle our specific case. We have two additional options beyond error and debug:

  • drop will drop the additional information that doesn’t fit into our newly specified columns.
  • merge will keep the additional information, but it will merge it with the data in the final column.

Let’s observe both options:

separate_wider_delim(
  data = data_file_paths,
  cols = path,
  delim = "-",
  names = c("date", "age_grp", "gender"),
  too_many = "drop"
)
# A tibble: 12 × 3
   date       age_grp gender
   <chr>      <chr>   <chr> 
 1 2024_10_01 1925    f     
 2 2024_11_01 1925    f     
 3 2024_12_01 1925    f     
 4 2024_10_01 1925    m     
 5 2024_11_01 1925    m     
 6 2024_12_01 1925    m     
 7 2024_10_01 2635    f     
 8 2024_11_01 2635    f     
 9 2024_12_01 2635    f     
10 2024_10_01 2635    m     
11 2024_11_01 2635    m     
12 2024_12_01 2635    m     
separate_wider_delim(
  data = data_file_paths,
  cols = path,
  delim = "-",
  names = c("date", "age_grp", "gender"),
  too_many = "merge"
)
# A tibble: 12 × 3
   date       age_grp gender         
   <chr>      <chr>   <chr>          
 1 2024_10_01 1925    f-monthly_users
 2 2024_11_01 1925    f-monthly_users
 3 2024_12_01 1925    f-monthly_users
 4 2024_10_01 1925    m-monthly_users
 5 2024_11_01 1925    m-monthly_users
 6 2024_12_01 1925    m-monthly_users
 7 2024_10_01 2635    f-monthly_users
 8 2024_11_01 2635    f-monthly_users
 9 2024_12_01 2635    f-monthly_users
10 2024_10_01 2635    m-monthly_users
11 2024_11_01 2635    m-monthly_users
12 2024_12_01 2635    m-monthly_users

Either operation is pretty straightforward: drop the additional information or merge what’s left in the newly created column. Nonetheless, it’s likely best to debug first. Knowing what’s going on with your separation before applying a fix can be useful, and it will help you avoid parsing mistakes.

too_few

Next, I want to highlight options for when you have too few data. Let’s go back to some college sports examples, specifically college basketball game log data. Such data might look like this. Take note of the W/L column. Not only are wins and losses denoted, but the variable may also contain info if the win occurred during an overtime period. Let’s mimic this structure in some example data.

data_bball_wl <- tibble(
  game = c(1:6),
  team = c(
    "Nebraska",
    "Nebraska",
    "Nebraska",
    "Nebraska",
    "Nebraska",
    "Nebraska"
  ),
  w_l = c(
    "W",
    "W (1 OT)",
    "L",
    "W (3 OT)",
    "L (2 OT)",
    "W"
  )
)

data_bball_wl
# A tibble: 6 × 3
   game team     w_l     
  <int> <chr>    <chr>   
1     1 Nebraska W       
2     2 Nebraska W (1 OT)
3     3 Nebraska L       
4     4 Nebraska W (3 OT)
5     5 Nebraska L (2 OT)
6     6 Nebraska W       
separate_wider_delim(
  data = data_bball_wl,
  cols = "w_l",
  delim = " (",
  names = c("w_l", "ots")
)
Error in `separate_wider_delim()`:
! Expected 2 pieces in each element of `w_l`.
! 3 values were too short.
ℹ Use `too_few = "debug"` to diagnose the problem.
ℹ Use `too_few = "align_start"/"align_end"` to silence this message.

Error, so let’s debug what’s happening with our separation.

separate_wider_delim(
  data = data_bball_wl,
  cols = "w_l",
  delim = " (",
  names = c("w_l", "ots"),
  too_few = "debug"
)
Warning: Debug mode activated: adding variables `w_l_ok`, `w_l_pieces`, and `w_l_remainder`.
# A tibble: 6 × 7
   game team     w_l      ots   w_l_ok w_l_pieces w_l_remainder
  <int> <chr>    <chr>    <chr> <lgl>       <int> <chr>        
1     1 Nebraska W        <NA>  FALSE           1 ""           
2     2 Nebraska W (1 OT) 1 OT) TRUE            2 ""           
3     3 Nebraska L        <NA>  FALSE           1 ""           
4     4 Nebraska W (3 OT) 3 OT) TRUE            2 ""           
5     5 Nebraska L (2 OT) 2 OT) TRUE            2 ""           
6     6 Nebraska W        <NA>  FALSE           1 ""           

Just what we thought, rows 1, 3, 5, and 6 don’t contain enough information to complete our operation of filling the ots variable. The separate_wider_delim() function has two options for the too_many argument to address this issue:

  • align_end adds NA at the start of short matches to pad to the correct length.
  • align_start adds NA at the end of the short matches to pad to the correct length.

I’ll start with align_end first, just to demonstrate what it does, though this operation isn’t what we’re looking to do here. Then, I’ll show you align_start, the operation needed to complete our separation successfully.

separate_wider_delim(
  data = data_bball_wl,
  cols = "w_l",
  delim = " (",
  names = c("w_l", "ot"),
  too_few = "align_end"
)
# A tibble: 6 × 4
   game team     w_l   ot   
  <int> <chr>    <chr> <chr>
1     1 Nebraska <NA>  W    
2     2 Nebraska W     1 OT)
3     3 Nebraska <NA>  L    
4     4 Nebraska W     3 OT)
5     5 Nebraska L     2 OT)
6     6 Nebraska <NA>  W    
separate_wider_delim(
  data = data_bball_wl,
  cols = "w_l",
  delim = " (",
  names = c("w_l", "ot"),
  too_few = "align_start"
)
# A tibble: 6 × 4
   game team     w_l   ot   
  <int> <chr>    <chr> <chr>
1     1 Nebraska W     <NA> 
2     2 Nebraska W     1 OT)
3     3 Nebraska L     <NA> 
4     4 Nebraska W     3 OT)
5     5 Nebraska L     2 OT)
6     6 Nebraska W     <NA> 

In short, all the align selection does is modify where the NA will be placed, essentially modifying the padding to create a correct length for the separation to be valid.

Great, now that we’ve identified where to separate the columns, we just need to do some additional string manipulation to finish the wrangling of this data. Below are the wrangling steps I applied:

data_bball_wl |>
  separate_wider_delim(
    cols = "w_l",
    delim = " (",
    names = c("w_l", "ot"),
    too_few = "align_start"
  ) |>
  mutate(
    n_ot = str_remove(ot, " OT\\)"),
    ot = ifelse(is.na(ot), FALSE, TRUE),
  )
# A tibble: 6 × 5
   game team     w_l   ot    n_ot 
  <int> <chr>    <chr> <lgl> <chr>
1     1 Nebraska W     FALSE <NA> 
2     2 Nebraska W     TRUE  1    
3     3 Nebraska L     FALSE <NA> 
4     4 Nebraska W     TRUE  3    
5     5 Nebraska L     TRUE  2    
6     6 Nebraska W     FALSE <NA> 

separate_wider_regex()

In this last section, we’ll cover a slightly more advanced topic: using regular expressions to separate into different columns. The separate_wider_regex() performs this operation. To highlight how to do this, let’s go back to our file path example data, data_file_paths.

Say, for some reason, after we separated the different fields embedded in the file path, we want to separate the data further into year and month columns, dropping the day information, since it’s redundant in this case (i.e., these files are always the first of the month). Here’s the additional code we would need:

data_file_paths |>
  separate_wider_delim(
    cols = path,
    delim = "-",
    names = c(
      "date",
      "age_grp",
      "gender",
      "report"
    )
  ) |>
  separate_wider_regex(
    cols = date,
    patterns = c(
      year = "\\d{4}",
      "_",
      month = "\\d{2}"
    ),
    too_few = "align_start"
  )
# A tibble: 12 × 5
   year  month age_grp gender report       
   <chr> <chr> <chr>   <chr>  <chr>        
 1 2024  10    1925    f      monthly_users
 2 2024  11    1925    f      monthly_users
 3 2024  12    1925    f      monthly_users
 4 2024  10    1925    m      monthly_users
 5 2024  11    1925    m      monthly_users
 6 2024  12    1925    m      monthly_users
 7 2024  10    2635    f      monthly_users
 8 2024  11    2635    f      monthly_users
 9 2024  12    2635    f      monthly_users
10 2024  10    2635    m      monthly_users
11 2024  11    2635    m      monthly_users
12 2024  12    2635    m      monthly_users

Indeed, it’s not lost on me that this operation could be successfully completed using the following approach as well:

data_file_paths |>
  separate_wider_delim(
    cols = path,
    delim = "-",
    names = c(
      "date",
      "age_grp",
      "gender",
      "report"
    )
  ) |>
  separate_wider_delim(
    cols = date,
    delim = "_",
    names = c("year", "month"),
    too_many = "drop"
  )
# A tibble: 12 × 5
   year  month age_grp gender report       
   <chr> <chr> <chr>   <chr>  <chr>        
 1 2024  10    1925    f      monthly_users
 2 2024  11    1925    f      monthly_users
 3 2024  12    1925    f      monthly_users
 4 2024  10    1925    m      monthly_users
 5 2024  11    1925    m      monthly_users
 6 2024  12    1925    m      monthly_users
 7 2024  10    2635    f      monthly_users
 8 2024  11    2635    f      monthly_users
 9 2024  12    2635    f      monthly_users
10 2024  10    2635    m      monthly_users
11 2024  11    2635    m      monthly_users
12 2024  12    2635    m      monthly_users

However, using a regular expression affords a little more accuracy. We’re able to specifically specify the pattern we’re looking to capture for our column.

Wrap up

The takeaway from this TIL post is simple: use tidyr’s family of separate_wider_* and separate_longer_* functions when you need to separate a string into additional columns or rows.

This post wasn’t aiming to be a fully comprehensive guide on how to use this family of functions. As such, I highly suggest reviewing the following to learn more and view additional example use cases:

  • Chapter 14: Strings from the R for Data Science (2e) book.

  • tidyr’s separate_wider_* and separate_longer_* function documentation (or run ?separate_wider_delim / ?separate_longer_delim in your console).

  • tidyr’s 1.3.0 release notes.

Until next time, keep having fun with your data wrangling 📈.

Reuse

CC BY 4.0

Citation

BibTeX citation:
@misc{berke2024,
  author = {Berke, Collin K},
  title = {Separate Character Strings into Rows and Columns Using
    `Tidyr` Functions},
  date = {2024-12-27},
  langid = {en}
}
For attribution, please cite this work as:
Berke, Collin K. 2024. “Separate Character Strings into Rows and Columns Using `Tidyr` Functions.” December 27, 2024.