Tangling with Location Names, part 2

Once I had my long list of names, joining my no_matches set was very easy. I used an inner_join() to pull in any record from tris_data_long_clean that matched my locations on both name and country, which let me snag several more records.

tris_matches_2_no_matches <- no_matches %>% 
  mutate(ancientcity = str_to_lower(ancientcity),
         country = str_to_lower(country)) %>%
  inner_join(
    tris_data_long_clean,
    by = join_by(ancientcity == name,
                 country == country)) 
  • Apply str_to_lower() to both ancientcity and country. This standardizes the names in case any random capitalization exists, which helps the inner_join() find matches.
  • inner_join() to match (exactly) on the name, and also the country, to ensure we have the right place. Any record without an exact match on both name and country is dropped. These locations will be re-joined later.
41 more records – not bad!
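
To make the matching behaviour concrete, here is a minimal sketch of the same lowercase-then-join pattern on made-up data. The toy tibbles and their values are hypothetical stand-ins for no_matches and tris_data_long_clean, and join_by() assumes dplyr 1.1.0 or later.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for no_matches: mixed capitalization on purpose.
no_matches_toy <- tibble(
  locationID  = c(1, 2, 3),
  ancientcity = c("Alexandria", "ROMA", "Atlantis"),
  country     = c("Egypt", "Italy", "Greece")
)

# Hypothetical stand-in for tris_data_long_clean: names already lowercase.
tris_toy <- tibble(
  id      = c(100, 200, 300),
  name    = c("alexandria", "roma", "alexandria"),
  country = c("egypt", "italy", "turkey")
)

matched_toy <- no_matches_toy %>%
  # Standardize case so "ROMA" can match "roma".
  mutate(ancientcity = str_to_lower(ancientcity),
         country = str_to_lower(country)) %>%
  # Keep only rows that match on BOTH name and country.
  inner_join(tris_toy,
             by = join_by(ancientcity == name, country == country))

matched_toy
```

In this toy case "Atlantis" is dropped because it has no match at all, and the "alexandria" record in "turkey" is ignored because the country differs. That second case is why I join on country as well as name.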

After this, I had 2 sets I wanted to bring together: locations_5km, the set joined by coordinates with Pleiades IDs, and the new set generated from no_matches with TM IDs. I wanted to pull all of them together into one tidy data frame. I remembered that Pleiades uses Trismegistos as one of its standard sources, so to keep things consistent, I decided to pull over the available TM IDs from the Pleiades data. I also pulled over the Roman provincia, if it was available.

The following code, as in the post before, is broken up, but is actually one long script, meant to be run all at once.

# Start Script

# Block 1
# Bind the 2 data sets together by rows.
locations_plei_and_tris <- locations_5km %>% 
  rename("plei_id" = id) %>% 
  select(-long,
         -lat) %>% 
  bind_rows(
    tris_matches_2_no_matches %>% 
      rename("orig_lat" = lat,
             "orig_long" = longi,
             "tris_id" = id)) %>% 
  • With locations_5km, I rename the ID column to plei_id to specify that it is the Pleiades ID.
  • Remove the long and lat columns with select() – I only want to use my original coordinate values. I will keep the distance_from_original value for reference.
  • Join the 2 sets, locations_5km and tris_matches_2_no_matches, by row with bind_rows(). This “stacks” the rows, combining columns that share a name and keeping any columns found in only one of the sets, filled with NA values where there is no data. Within the bind, I clean up some column names in tris_matches_2_no_matches.
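
As a small illustration of how bind_rows() treats mismatched columns, here is a sketch on made-up data. The tibbles, IDs, and values below are hypothetical and only mimic the shape of the two real sets.

```r
library(dplyr)

# Hypothetical stand-in for locations_5km after the rename/select steps.
plei_toy <- tibble(
  locationID = c(1, 2),
  plei_id = c("p1", "p2"),
  distance_from_original = c(120.5, 3400)
)

# Hypothetical stand-in for tris_matches_2_no_matches after its rename.
tris_toy <- tibble(
  locationID = 3,
  tris_id = 100,
  provincia = "Aegyptus"
)

# Stack the rows; columns are matched by name, not by position.
stacked_toy <- bind_rows(plei_toy, tris_toy)

stacked_toy
```

The result has three rows and five columns. The Pleiades rows get NA in tris_id and provincia, and the Trismegistos row gets NA in plei_id and distance_from_original.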

Next was a series of left_join() calls to pull in the TM IDs for the Pleiades IDs I already had, along with the provincia information for those IDs.

  # Block 2
  # Pull the TM IDs out of the Pleiades references and join them on the Pleiades ID.
  left_join(
    pleiades_from_json_cleaned_tidied$pleiades_references %>% 
      filter(grepl("trismegistos", accessURI, ignore.case = TRUE)) %>% 
      mutate(tris_id_extract = str_extract(accessURI, "[0-9]+")) %>% 
      select(id, tris_id_extract),
    by = join_by(plei_id == id)) %>% 
  # Block 3
  # Look up the provincia and country for each extracted TM ID.
  left_join(
    tris_data_long_clean %>% 
      select(id, 
             provincia, 
             country) %>% 
      mutate(id = as.character(id)) %>% 
      distinct(),
    by = join_by(tris_id_extract == id)
  ) %>% 
  # Block 4
  # Merge the two ID and provincia sources into single columns, then drop the leftovers.
  mutate(trismegistos_id = coalesce(as.character(tris_id), tris_id_extract),
         trismegistos_provincia = coalesce(provincia.x, provincia.y)) %>% 
  rename("country" = country.x) %>% 
  select(-provincia.x,
         -region_ext,
         -tris_id_extract,
         -provincia.y,
         -country.y,
         -tris_id) 

# End script.
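
Blocks 2 and 4 rely on two small tricks: str_extract() to pull the numeric TM ID out of a reference URL, and coalesce() to take the first non-missing value from two columns. Here is a minimal sketch of both on made-up data. The URLs, IDs, and tibble names are assumptions for illustration, not the real Pleiades references.

```r
library(dplyr)
library(stringr)

# Hypothetical reference table: one row per (Pleiades ID, reference URL).
refs_toy <- tibble(
  id = c("p1", "p1", "p2"),
  accessURI = c("https://www.trismegistos.org/place/100",
                "https://example.org/other/55",
                "https://example.org/other/77")
)

# Keep only Trismegistos references and extract the first run of digits.
tm_lookup <- refs_toy %>%
  filter(grepl("trismegistos", accessURI, ignore.case = TRUE)) %>%
  mutate(tris_id_extract = str_extract(accessURI, "[0-9]+")) %>%
  select(id, tris_id_extract)

# Hypothetical combined set: some rows have a Pleiades ID, one has a TM ID.
places_toy <- tibble(
  plei_id = c("p1", "p2", NA),
  tris_id = c(NA, NA, 300)
)

combined_toy <- places_toy %>%
  left_join(tm_lookup, by = join_by(plei_id == id)) %>%
  # Prefer the TM ID from the name match, else fall back to the extracted one.
  mutate(trismegistos_id = coalesce(as.character(tris_id), tris_id_extract)) %>%
  select(-tris_id, -tris_id_extract)

combined_toy
```

One caveat with this approach: "[0-9]+" grabs the first run of digits anywhere in the URL, so it only works if the ID is the first number that appears. The as.character() call matters too, because coalesce() needs both inputs to be the same type.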

With the combined set in hand, I joined it back onto my original locations_master:

locations_cited <- locations_master %>% 
  rename("orig_region" = ancientregion,
         "orig_country" = country) %>% 
  left_join(
    locations_plei_and_tris %>% 
    select(-ancientcity,
           -ancientregion,
           -country,
           -orig_lat,
           -orig_long),
    by = join_by(locationID == locationID))  
  • In the original locations_master, apply better names to reflect which source the region and country came from – assign with rename().
  • left_join() to the locations_plei_and_tris object that was just created. Inside the join, remove what would be redundant data. Any locationIDs that do not have a match will have NA values in both the plei_id and trismegistos_id fields.
  • Join the 2 sets by locationID. Store in the object locations_cited.
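
Because left_join() keeps every row from the left side, the unmatched locations are easy to count afterwards. Here is a minimal sketch on made-up data, with hypothetical tibbles standing in for locations_master and locations_plei_and_tris.

```r
library(dplyr)

# Hypothetical stand-in for locations_master.
master_toy <- tibble(
  locationID = 1:4,
  ancientcity = c("a", "b", "c", "d")
)

# Hypothetical stand-in for locations_plei_and_tris (only 2 locations resolved).
ids_toy <- tibble(
  locationID = c(1L, 2L),
  plei_id = c("p1", NA),
  trismegistos_id = c("100", "200")
)

# Every master row survives; unmatched rows get NA in the ID columns.
cited_toy <- master_toy %>%
  left_join(ids_toy, by = join_by(locationID))

# "Orphans" are rows with neither a Pleiades ID nor a TM ID.
orphans_toy <- cited_toy %>%
  filter(is.na(plei_id), is.na(trismegistos_id))

orphans_toy
```

Here locations 3 and 4 are the orphans. Location 2 is not, since it still has a TM ID even though its Pleiades ID is missing.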

Quick look at some of the records that didn’t come over. These will require some more research or fuzzy matching to sweep them up. Fortunately, there were only 109 orphaned records out of my original set of ~1,300.

I also noticed that some of my locations independently pulled in a Pleiades ID and a TM ID, creating a duplicate row in some cases. I will have to figure out how to suppress that in future iterations.
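
A first step toward fixing that would be finding which locationIDs are affected. This is only a sketch of one way to do it, on made-up data; cited_toy is a hypothetical stand-in for locations_cited.

```r
library(dplyr)

# Hypothetical data: locationID 1 appears twice, once per ID source.
cited_toy <- tibble(
  locationID = c(1, 1, 2, 3),
  plei_id = c("p1", NA, "p2", NA),
  trismegistos_id = c(NA, "100", "200", NA)
)

# Count rows per locationID and keep the ones that appear more than once.
dupes_toy <- cited_toy %>%
  count(locationID, name = "n_rows") %>%
  filter(n_rows > 1)

dupes_toy
```

Anything that shows up here has more than one row in the joined set and would need to be collapsed or de-duplicated before plotting.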

Despite that, I decided that this is a really, really good locations data set, and that it would be OK to pause here and move on to something else. I was interested in taking one of my data sets, linking it to my locations data, and plotting it on a world map. Fortunately, I had a lot of other records tied to these locationIDs that I could use for that.
