Category: data cleaning and tidying

  • Tangling with Location Names, part 2

    Once I had my long list of names, joining my no_matches set  was very easy. I used an inner_join() and pulled in any record from tris_data_long_clean that matches my locations, both in name and in country. I was able to snag several more records.

    tris_matches_2_no_matches <- no_matches %>% 
      mutate(ancientcity = str_to_lower(ancientcity),
             country = str_to_lower(country)) %>%
      inner_join(
        tris_data_long_clean,
        by = join_by(ancientcity == name,
                     country == country)) 
    • Apply str_to_lower() on both ancientcity and country. This helps standardize the name in case any random capitalizations exist, and assists in the inner_join().
    • inner_join() to match (exactly) on the name, and also the country, to ensure we have the right place. If there is no 1:1 match on name/country, then the record will be dropped. These locations will be re-joined later.
    41 more records – not bad!

    After this, I had 2 sets I wanted to bring together: locations_5km – the set joined by coordinates with Pleiades IDs; and then the new set generated from no_matches with TM IDs. I wanted to pull all of them together into one, tidy data frame. I remembered that Pleiades uses Trismegistos as one of their standard sources, and to keep things consistent, I decided to pull over the available TM ID’s from the Pleiades data. I also pulled over the Roman provincia, if it was available.

    The following code, as in the post before, is broken up, but is actually one long script, meant to be run all at once.

    # Start Script
    
    # Block 1
    # Bind the 2 data sets together by rows.
    locations_plei_and_tris <- locations_5km %>% 
      rename("plei_id" = id) %>% 
      select(-long,
             -lat) %>% 
      bind_rows(
        tris_matches_2_no_matches %>% 
          rename("orig_lat" = lat,
                 "orig_long" = longi,
                 "tris_id" = id)) %>% 
    • With locations_5km, I rename the ID column to plei_id to specify that it is the Pleiades ID.
    • Remove the long and lat column with select() – I only want to use my original coordinate values. I will keep the distance_from_original value for reference.
    • Join the 2 sets: locations_5km and tris_matches_2_no_matches, by row with bind_rows(). This will “stack” the rows, combining like columns together, and pulling over whatever columns that are not in common between the 2 sets, and filling them with NA values. Within the bind, I clean up some column names within tris_matches_2_no_matches/

    Next was a series of left_joins() to pull in the TM ID’s for the existing Pleiades ID’s I had and the provincia information for those ID’s.

      # Block 2
      left_join(
        pleiades_from_json_cleaned_tidied$pleiades_references %>% 
          filter(grepl("trismegistos", accessURI, ignore.case = T)) %>% 
          mutate(tris_id_extract = str_extract(accessURI, "[0-9]+")) %>% 
          select(id, tris_id_extract),
        by = join_by(plei_id == id)) %>% 
      # Block 3
      left_join(
        tris_data_long_clean %>% 
          select(id, 
                 provincia, 
                 country) %>% 
          mutate(id = as.character(id)) %>% 
          distinct(),
        by = join_by(tris_id_extract == id)
      ) %>% 
      # Block 4
      mutate(trismegistos_id = coalesce(as.character(tris_id), tris_id_extract),
             trismegistos_provincia = coalesce(provincia.x, provincia.y)) %>% 
      rename("country" = country.x) %>% 
      select(-provincia.x,
             -region_ext,
             -tris_id_extract,
             -provincia.y,
             -country.y,
             -tris_id) 
    
    # End script.
    locations_cited <- locations_master %>% 
      rename("orig_region" = ancientregion,
             "orig_country" = country) %>% 
      left_join(
        locations_plei_and_tris %>% 
        select(-ancientcity,
               -ancientregion,
               -country,
               -orig_lat,
               -orig_long),
        by = join_by(locationID == locationID))  
    • In the original locations_master, apply better names to reflect which source the region and country came from – assign with rename().
    • left_join() to the locations_plei_and_tris object that was just created. Inside the join, remove what would be redundant data. Any locationIDs that do not have a match will have NA values in both the plei_id and trismegistos_id fields.
    • Join the 2 sets by locationID. Store in the object locations_cited.

    Quick look at some of the records that didn’t come over. These will require some more research or fuzzy matching to sweep them up. Fortunately, there were only 109 orphaned records out of my original set of ~1,300.

    I also noticed that some of my locations independently pulled in a Pleiades ID and a TM ID, creating a duplicate row in some cases. I will have to figure out how suppress that in future iterations.

    Despite that, I decided that this is really, really good locations data set, and that it would be OK to pause here and move onto something else. I was interested in taking one of my data sets, linking it to my locations data, and plotting it on a world map. Fortunately, I had a lot of other records tied to these locationID’s where I could do that.

  • Tangling with Location Names, part 1

    pleiades_from_json_cleaned_tidied #Cleaned data from Pleiades JSON data
    locations_5km  #Locations within 5km of the original coordinates.
    locations_master #My original locations list.
    no_matches #Locations that did not find a match during the fuzzyjoin.
    # Start script
    
    # Block 1
    # Start big cleanup of tris data
    tris_data_long_clean <- trismegistos_all_export_geo %>% 
      as_tibble() %>% 
      select(id,
             provincia,
             country,
             name_latin,
             name_standard,
             full_name) %>% 
     separate_wider_delim(
        full_name,
        delim = " - ",
        names = c("country_region_ext","city_ext"),
        too_many = "merge",
        too_few = "align_start") %>% 
    • Note: trismegistos_all_export_geo is the object loaded from the TM .csv file referenced above. From this, use select() to pick only the necessary variables/columns.
    • Start the first separation. It looks like there is only one initial break marked with space-underscore-space ( – ), which will be the delimiter in delim = . The next steps are done inside the separation.
      • Specify the full_name column, and the delimiter.
      • Separate column into 2 columns: "country_region_ext" and "city_ext".
      • In the event there is more or less data than can fit in 2 columns, what to do with those values . If there are more than 2 (too_many), the remaining values are merged into the 2nd column with separators in tact. If there is 1 value (too_few), it will be placed in country_ext, the start of the created columns.
    # Block 2
      separate_wider_delim(
        country_region_ext,
        delim = ",",
        names = c("country_ext","region_ext"),
        too_many = "merge",
        too_few = "align_start") %>%
    • Do another separation, just like above, this time breaking country_region_ext at the commas to get the country and regions alone.
    #Block 3
    mutate(city_ext = str_replace_all(city_ext, " \\[", "(")) %>% 
    • Mutate the city_ext column using str_replace_all(). The inconsistent bracket vs. parenthesis make the next separation difficult. So, I want to replace left-brackets with left-parenthesis. The \\ with the left-brackets are escape characters, and need to be used when working with certain characters inside R.*
    # Block 4
      separate_wider_delim(
        city_ext,
        delim = "(",
        names = c("city_ext_1","city_ext_2"),
        too_many = "merge",
        too_few = "align_start") %>%
    • Yet more separations. Just like 2 and 3 above, this time breaking city_ext at the parenthesis. This allows me to get at the appended information I discussed above.
    # Block 5
      separate_wider_delim(
        name_latin,
        delim = " - ",
        names = paste0("tris_latin_", 1:25),
        too_many = "merge",
        too_few = "align_start") %>% 
      select(where(~ !all(is.na(.))),
             -country_ext) %>% 
    • Some of these names have, like, 20+ delimiters (all consistent, thank God!), so I separate them into a large number of columns – more than I probably need. To do this, I use paste0() to affix a “tris_” to the columns, then a sequence of 1-to-whatever I want. Here, 25 – seemed like plenty. This results in 25 columns named tris_1 through tris_25, plenty of room for separation.
    • Because not every place will have 25 different names, those extra columns will be filled entirely with NA values. Here, the select() function looks across all of my columns, and filters out those where all of the values in the column are NA. At the same time, I remove the country_ext, as it is truly redundant.
    # Block 6
      pivot_longer(!c(id, provincia, region_ext, country),
                   names_to = "source_seq",
                   values_to = "name") %>% 
    • Pivot this (now) very wide list into one long list using pivot_longer(). Pivot everything except id, provincia, region_ext, and country. Save the column names into a new column called "source_seq" and the city names into "name".
    # Block 7
      mutate(name = str_trim(str_to_lower(name)),
             region_ext = str_trim(region_ext),
             name = if_else(name == "", NA, name),
             country = str_to_lower(country),
             name = str_replace_all(name, "\\)", "")) %>% 
      filter(!is.na(name)) %>% 
      select(-source_seq) %>% 
      distinct() %>% 
      relocate(name, .after = id)
    
    #End script
    • Apply several mutations to clean up some of the columns:
      • Trim the whitespace with str_trim() in name and region_ext (the separation process can generate a lot of it). I also converted the names to lowercase with str_to_lower() for standardization.
      • Apply an if_else() – if the name is an empty field (represented by “”), then replace it with an NA, so it can easily be filtered out later. Else, leave whatever is in place.
      • Convert country to lower case. This should make the country names more standardized – important for joining later.
      • Replace the random right-parenthesis artifacts from the separation with blanks.
    • Filter out any NA values in the name column.
    • Remove the source_seq column. It is not needed anymore.
    • distinct() to condense all identical records into one, so each record is distinct.
    • Finally, move the name column so it is right after the id – easier for me to read.

    TM data before:

    R screenshot

    TM data after:

    So much better.

    The final step to all of this was using this list to join the no_matches records to see if I could sweep in some of the TM sources where I did not have Pleiades. I also wanted to pull the TM ID’s from the Pleiades data and put them into my set. Since this script was so long, I stored the results in an object (tris_data_long_clean) and saved it as an .rds file for safe keeping.

    # Save
    saveRDS(tris_data_long_clean, paste0(objects_directory,"tris_data_long_clean.rds"))
  • Wrangling JSON Data

    Once again, I return to Pleiades, and my quest for excellent source data. When I first downloaded and explored their JSON data, it was confusing to me. Even with the flattening applied, the result was one, massive list, divided into 2 main groups. Those lists had more lists, several layers deep in some sets. But even then, it was easy to see how the data lined up. 

    I clicked on the list to bring up the R Studios view tab. This allowed me to collapse and expand the lists to get a sense of what was in there. If you want to keep it all in-console, use glimpse(), then use the $ to select a list from the level below. You can use $ for as many levels of data there are in the list.

    R Studio – View(pleiades_full_json)– view of the file with the @graph list expanded.
    glimpse(pleiades_full_json$@graph) – view of the same list. <list> types shown, then individual structures are shown with dimensions.

    What I really wanted was the “id”, or the unique record ID that identifies every place in Pleiades. That should allow me to pair those ID’s up to any of these data sets and maintain the fidelity of the information. I have always wanted those references, so I went for that, first.

    • Pull out both the id list and the references list from the full list, then convert them into their own objects called plei_ids and plei_refs, respectively.
    • Pull the 2 objects together in one tibble at the same time. The id will pair up to the respective records in the references table. Because one id may refer to multiple references, a nested value is created in the new tibble for the references column. unnest(references) will expand these into their own rows.
    • Note: places with many references will therefore generate many new rows. If there are more lists, they will show up as they have previously, nested inside the columns. It looks like that is not the case here, since my URL field has been converted to <chr>:
    The data set with the above steps applied.

    Did a little spot checking.

    • select() only the id column and use distinct() just to make sure I am getting 100% unique id’s. I then sample 3 of them at random using slice_sample(n = 3).
    • filter() to only those id’s chosen and see what URL’s are tied to them.
    • Check each of those id’s on https://pleiades.stoa.org/ and make sure the same references are being identified under the References section of the place’s page. I did this a few times to make sure.

    I thought this looked good! At some point, I want to use these references to do some more exploration. There are a ton of other database systems linked in here (like Trismegistos!) that would be really fun to poke at.

    Next, I wanted to check out the Locations data. At first glance, the data looked bonkers – tons of nested tables. However, it’s really not that bad once you understand what’s in the set. Many of these nested tables were available in the top-level of the list, which would allow you to link the id directly to the value, rather than trying to unnest the columns in the larger set – exactly as I did with the references set above. I was interested in the representative coordinates (latitude, longitude), which are stored in pleiades_full_json$@graph$reprPoint.

    • Store an object called plei_rep_locs with the data from pleiades_full_json$`@graph`$reprPoint.
    • The next steps flow together to make the final location set:
      • Create a tibble called pleiades_locations_id_match that combines plei_ids and location_data (as we did before).
      • The location_data comes over as a double-valued list. If you unnest(), you will create a new record for each coordinate (latitude will be on one line, longitude under it). That is not helpful. Instead, we can use unnest_wider() to coerce those values into 2 columns.
        • names_sep = "_" tells unnest to use the column name as the base name for the new columns, and apply an underscore ( _ ) to separate the name and the column position of the set. Since I only have 2 coordinates, this will create location_data_1 and location_data_2.
      • Finally, rename the columns so you know for sure which coordinate is which.
    • Create an empty list(). Here mine is called pleiades_from_json_cleaned_tidied.
    • Add your sets to the list. The name inside the brackets [[ ... ]] will be the name inside your list. Name it something meaningful.
    • Save as an .rds. Here, I have a predefined value for my objects directory, but this can be any place you want to save the file. Again, name the .rds something meaningful.