Tangling with Location Names, part 1

pleiades_from_json_cleaned_tidied #Cleaned data from Pleiades JSON data
locations_5km  #Locations within 5km of the original coordinates.
locations_master #My original locations list.
no_matches #Locations that did not find a match during the fuzzyjoin.
# Start script

# Block 1
# Start big cleanup of tris data
tris_data_long_clean <- trismegistos_all_export_geo %>% 
  as_tibble() %>% 
  select(id,
         provincia,
         country,
         name_latin,
         name_standard,
         full_name) %>% 
 separate_wider_delim(
    full_name,
    delim = " - ",
    names = c("country_region_ext","city_ext"),
    too_many = "merge",
    too_few = "align_start") %>% 
  • Note: trismegistos_all_export_geo is the object loaded from the TM .csv file referenced above. From this, use select() to pick only the necessary variables/columns.
  • Start the first separation. It looks like there is only one initial break marked with space-underscore-space ( – ), which will be the delimiter in delim = . The next steps are done inside the separation.
    • Specify the full_name column, and the delimiter.
    • Separate column into 2 columns: "country_region_ext" and "city_ext".
    • In the event there is more or less data than can fit in 2 columns, what to do with those values . If there are more than 2 (too_many), the remaining values are merged into the 2nd column with separators in tact. If there is 1 value (too_few), it will be placed in country_ext, the start of the created columns.
# Block 2
  separate_wider_delim(
    country_region_ext,
    delim = ",",
    names = c("country_ext","region_ext"),
    too_many = "merge",
    too_few = "align_start") %>%
  • Do another separation, just like above, this time breaking country_region_ext at the commas to get the country and regions alone.
#Block 3
mutate(city_ext = str_replace_all(city_ext, " \\[", "(")) %>% 
  • Mutate the city_ext column using str_replace_all(). The inconsistent bracket vs. parenthesis make the next separation difficult. So, I want to replace left-brackets with left-parenthesis. The \\ with the left-brackets are escape characters, and need to be used when working with certain characters inside R.*
# Block 4
  separate_wider_delim(
    city_ext,
    delim = "(",
    names = c("city_ext_1","city_ext_2"),
    too_many = "merge",
    too_few = "align_start") %>%
  • Yet more separations. Just like 2 and 3 above, this time breaking city_ext at the parenthesis. This allows me to get at the appended information I discussed above.
# Block 5
  separate_wider_delim(
    name_latin,
    delim = " - ",
    names = paste0("tris_latin_", 1:25),
    too_many = "merge",
    too_few = "align_start") %>% 
  select(where(~ !all(is.na(.))),
         -country_ext) %>% 
  • Some of these names have, like, 20+ delimiters (all consistent, thank God!), so I separate them into a large number of columns – more than I probably need. To do this, I use paste0() to affix a “tris_” to the columns, then a sequence of 1-to-whatever I want. Here, 25 – seemed like plenty. This results in 25 columns named tris_1 through tris_25, plenty of room for separation.
  • Because not every place will have 25 different names, those extra columns will be filled entirely with NA values. Here, the select() function looks across all of my columns, and filters out those where all of the values in the column are NA. At the same time, I remove the country_ext, as it is truly redundant.
# Block 6
  pivot_longer(!c(id, provincia, region_ext, country),
               names_to = "source_seq",
               values_to = "name") %>% 
  • Pivot this (now) very wide list into one long list using pivot_longer(). Pivot everything except id, provincia, region_ext, and country. Save the column names into a new column called "source_seq" and the city names into "name".
# Block 7
  mutate(name = str_trim(str_to_lower(name)),
         region_ext = str_trim(region_ext),
         name = if_else(name == "", NA, name),
         country = str_to_lower(country),
         name = str_replace_all(name, "\\)", "")) %>% 
  filter(!is.na(name)) %>% 
  select(-source_seq) %>% 
  distinct() %>% 
  relocate(name, .after = id)

#End script
  • Apply several mutations to clean up some of the columns:
    • Trim the whitespace with str_trim() in name and region_ext (the separation process can generate a lot of it). I also converted the names to lowercase with str_to_lower() for standardization.
    • Apply an if_else() – if the name is an empty field (represented by “”), then replace it with an NA, so it can easily be filtered out later. Else, leave whatever is in place.
    • Convert country to lower case. This should make the country names more standardized – important for joining later.
    • Replace the random right-parenthesis artifacts from the separation with blanks.
  • Filter out any NA values in the name column.
  • Remove the source_seq column. It is not needed anymore.
  • distinct() to condense all identical records into one, so each record is distinct.
  • Finally, move the name column so it is right after the id – easier for me to read.

TM data before:

R screenshot

TM data after:

So much better.

The final step to all of this was using this list to join the no_matches records to see if I could sweep in some of the TM sources where I did not have Pleiades. I also wanted to pull the TM ID’s from the Pleiades data and put them into my set. Since this script was so long, I stored the results in an object (tris_data_long_clean) and saved it as an .rds file for safe keeping.

# Save
saveRDS(tris_data_long_clean, paste0(objects_directory,"tris_data_long_clean.rds"))

Leave a Reply

Your email address will not be published. Required fields are marked *