Month: February 2025

  • Finding Location Information with fuzzyjoin

Now that I had some good location data, I wanted to use my coordinates to join the Pleiades data and see what is in the immediate area of my locations. To do this, I used a package called fuzzyjoin, which has a function called geo_inner_join (along with geo_right_join, geo_left_join, etc.)* which allows you to join 2 sets of coordinates based on distance – either kilometers or miles. I decided that 2km (1.24 mi) would be a good place to start. The join took several seconds and produced a long data set with each matched Pleiades ID on its own line.

# Libraries used throughout (dplyr for the pipe and verbs, fuzzyjoin for the join).
library(dplyr)
library(fuzzyjoin)

# Read in object.
locations_master <- readRDS(paste0(objects_directory, "locations_reversed_joined.rds"))
    
    
    #attestations based on lat/long - this will take several seconds.
    locations_join <- locations_master %>% 
      filter(!is.na(lat)) %>% 
      rename("orig_lat" = lat,
             "orig_long" = longi) %>% 
      geo_inner_join(
        pleiades_from_json$pleiades_locations %>% 
          filter(!is.na(long)),
        by = c("orig_long" = "long",
               "orig_lat" = "lat"),
        unit = c("km"),
        max_dist = 2,
        distance_col = "distance_from_original"
        )
    • Filter out the NA values in lat (which should remove any incomplete coordinates).
    • Rename my existing lat/long columns so that I can easily identify which frame they belong to after I perform my join.
    • Perform the geo_inner_join.
      • This will look at the list pleiades_from_json$pleiades_locations and filter out any incomplete coordinates.
      • Join the 2 sets: my orig_long to Pleiades’ long, and my orig_lat to Pleiades’ lat.
      • unit = c("km") – use kilometers (km) for the units.
      • max_dist = 2 – include all matches within 2km of my coordinates.
• Put the distance value in a column called “distance_from_original”.
    • Save it all in an object called locations_join.
• * – the inner_join indicates that only records satisfying the join conditions will be kept.

I forgot that I didn’t have names paired with the coordinates I pulled from the JSON file. Good thing I have that object at the ready. I decided to use the title value to give me the basic name of the record associated with each Pleiades ID. The pleiades_full_json$`@graph`$names list has every single variation on a name in Pleiades, which is a little beyond what I needed (for the moment!). I made a set of IDs-to-titles and saved it to my pleiades_from_json_cleaned_tidied object to keep it handy. Then I joined it to locations_join.

    # As before:
    plei_ids <- pleiades_full_json$`@graph`$id
    
    # Grab titles.
    plei_titles <- pleiades_full_json$`@graph`$title
    
    # Pull id's and titles together.
    pleiades_titles <- tibble(
      id = plei_ids,
      plei_title = plei_titles
    ) 
    
    # Save these to the big list real quick.
    pleiades_from_json_cleaned_tidied[["pleiades_titles"]] <- pleiades_titles
    
    # Save the list.
    saveRDS(pleiades_from_json_cleaned_tidied, paste0(objects_directory, "pleiades_from_json_cleaned_tidied.rds"))
    
    # Join the titles to the joined set so we can see the names of the matched records.
    locations_full_join_w_titles <- locations_join %>% 
      left_join(
        pleiades_titles,
        by = join_by(id == id))
• Extract the titles list and store it in plei_titles.
    • Pull both sets together into one tibble called pleiades_titles.
    • Save this new set to the existing pleiades_from_json_cleaned_tidied list.
    • Join the pleiades_titles to locations_join using the id column in both data sets.

    Cool. Next I wanted to see which locations had the most attestations based on these coordinates.

    locations_full_join_w_titles %>% 
      group_by(
        ancientcity,
        country,
        orig_lat,
        orig_long
      ) %>% 
      summarize(n_plei_records = n_distinct(id)) %>% 
      arrange(desc(n_plei_records))
• Aggregate the data with summarize(), counting the distinct ids in each group with n_distinct(id). Sort with arrange(desc()) so the groups with the most IDs appear at the top.

There were some famous, well-attested sites that came up, but where was Rome? Or Constantinople? I decided to widen my geo_inner_join to 5km (3.11 miles), then aggregate.

    locations_5km <- locations_master %>% 
      filter(!is.na(lat)) %>% 
      rename("orig_lat" = lat,
             "orig_long" = longi) %>% 
      geo_inner_join(
        pleiades_from_json$pleiades_locations %>% 
          filter(!is.na(long)),
        by = c("orig_long" = "long",
               "orig_lat" = "lat"),
        unit = c("km"),
        max_dist = 5,
        distance_col = "distance_from_original"
      )
    
locations_5km %>% 
  group_by(
    ancientcity,
    country,
    orig_lat,
    orig_long
  ) %>% 
  summarize(n_plei_records = n_distinct(id)) %>% 
  arrange(desc(n_plei_records))

    How about 10km? (Note: this code is all run together and goes right to the summary.)

    locations_master %>% 
      filter(!is.na(lat)) %>% 
      rename("orig_lat" = lat,
             "orig_long" = longi) %>% 
      geo_inner_join(
        pleiades_from_json$pleiades_locations %>% 
          filter(!is.na(long)),
        by = c("orig_long" = "long",
               "orig_lat" = "lat"),
        unit = c("km"),
        max_dist = 10,
        distance_col = "distance_from_original"
      ) %>% 
      group_by(
        ancientcity,
        country,
        orig_lat,
        orig_long
      ) %>% 
      summarize(n_plei_records = n_distinct(id)) %>% 
      arrange(desc(n_plei_records))

More obscure cities with tons of records. It also took a little longer to perform the join. The wider the search, the more location overlap I was likely to have. I decided that 5km might be the highest tolerance to apply in my search. And, at some point, it might be interesting to check out which IDs my records had in common.

    Portion of the main Pleiades record for Dura with all associated sites plotted.
    All these references!

    Then, I wanted to check on the locations that did not produce any matches.

# Who did not have matches?
no_matches <- locations_master %>% 
  left_join(
    locations_5km %>% 
      group_by(locationID) %>% 
      summarize(n_plei_records = n_distinct(id),
                .groups = "drop"),
    by = join_by(locationID == locationID)) %>% 
  filter(is.na(n_plei_records))
• Take my original locations_master and left_join() it to the 5km data set. Since we just want a quick check of the number of records, inside that left_join(), group the records by locationID and then get the total number of distinct Pleiades IDs per locationID.
• Join by the locationID in each data set.
• filter(is.na(n_plei_records)) to focus on the records that did not have any IDs found for their coordinates.

    Finally, save the locations_5km and the locations_master objects for the next round of matches.

    saveRDS(locations_5km, paste0(objects_directory,"locations_5km.rds"))
    saveRDS(locations_master, paste0(objects_directory,"locations_master.rds"))
    saveRDS(no_matches, paste0(objects_directory,"no_matches.rds"))
    

    Robinson D (2025). fuzzyjoin: Join Tables Together on Inexact Matching. R package version 0.1.6, https://github.com/dgrtwo/fuzzyjoin.

  • Wrangling JSON Data

Once again, I return to Pleiades and my quest for excellent source data. When I first downloaded and explored their JSON data, it was confusing to me. Even with the flattening applied, the result was one massive list, divided into 2 main groups. Those lists had more lists, several layers deep in some sets. Even then, though, it was easy to see how the data lined up.

I clicked on the list to bring up RStudio’s View tab. This allowed me to collapse and expand the lists to get a sense of what was in there. If you want to keep it all in-console, use glimpse(), then use $ to select a list from the level below. You can chain $ for as many levels of data as there are in the list.
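A minimal sketch of both approaches, assuming the list names shown in the captions below (glimpse() comes from dplyr):

library(dplyr)

# Interactive viewer in RStudio - collapse/expand lists by clicking.
View(pleiades_full_json)

# Console alternative: glimpse one level, then drill down with $.
glimpse(pleiades_full_json$`@graph`)
glimpse(pleiades_full_json$`@graph`$names)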

RStudio – View(pleiades_full_json) – view of the file with the @graph list expanded.
glimpse(pleiades_full_json$`@graph`) – view of the same list. <list> types are shown, then individual structures with their dimensions.

What I really wanted was the “id”, the unique record ID that identifies every place in Pleiades. That should allow me to pair those IDs up to any of these data sets and maintain the fidelity of the information. I have always wanted the references, so I went for those first.

• Pull out both the id list and the references list from the full list, then convert them into their own objects called plei_ids and plei_refs, respectively.
• Pull the 2 objects together in one tibble. The id will pair up to the respective records in the references column. Because one id may refer to multiple references, a nested value is created in the new tibble for the references column; unnest(references) will expand these into their own rows (see the sketch below).
• Note: places with many references will therefore generate many new rows. If there are more lists, they will show up as they have previously, nested inside the columns. It looks like that is not the case here, since my URL field has been converted to <chr>:
    The data set with the above steps applied.
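A sketch of those steps, assuming the references live at pleiades_full_json$`@graph`$references (the object and column names here are mine):

library(dplyr)
library(tibble)
library(tidyr)

# Extract the parallel id and references lists.
plei_ids  <- pleiades_full_json$`@graph`$id
plei_refs <- pleiades_full_json$`@graph`$references

# Pair them up; unnest() expands each id's references onto their own rows.
pleiades_references <- tibble(
  id         = plei_ids,
  references = plei_refs
) %>% 
  unnest(references)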

    Did a little spot checking.

• select() only the id column and use distinct() just to make sure I am getting 100% unique ids. I then sample 3 of them at random using slice_sample(n = 3).
• filter() to only those chosen ids and see what URLs are tied to them.
• Check each of those ids on https://pleiades.stoa.org/ and make sure the same references are identified under the References section of the place’s page. I did this a few times to make sure. (A sketch follows below.)
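A hypothetical version of that spot check, reusing the pleiades_references object from the sketch above:

# Sample 3 unique ids at random.
sampled_ids <- pleiades_references %>% 
  select(id) %>% 
  distinct() %>% 
  slice_sample(n = 3)

# Pull the references tied to those ids, then compare them against each
# place's page on https://pleiades.stoa.org/.
pleiades_references %>% 
  filter(id %in% sampled_ids$id)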

    I thought this looked good! At some point, I want to use these references to do some more exploration. There are a ton of other database systems linked in here (like Trismegistos!) that would be really fun to poke at.

Next, I wanted to check out the Locations data. At first glance, the data looked bonkers – tons of nested tables. However, it’s really not that bad once you understand what’s in the set. Many of these nested tables were also available at the top level of the list, which lets you link the id directly to the value rather than trying to unnest the columns in the larger set – exactly as I did with the references set above. I was interested in the representative coordinates (latitude, longitude), which are stored in pleiades_full_json$`@graph`$reprPoint.

    • Store an object called plei_rep_locs with the data from pleiades_full_json$`@graph`$reprPoint.
    • The next steps flow together to make the final location set:
• Create a tibble called pleiades_locations_id_match that combines plei_ids and plei_rep_locs, the latter in a column called location_data (as we did before).
      • The location_data comes over as a double-valued list. If you unnest(), you will create a new record for each coordinate (latitude will be on one line, longitude under it). That is not helpful. Instead, we can use unnest_wider() to coerce those values into 2 columns.
        • names_sep = "_" tells unnest to use the column name as the base name for the new columns, and apply an underscore ( _ ) to separate the name and the column position of the set. Since I only have 2 coordinates, this will create location_data_1 and location_data_2.
      • Finally, rename the columns so you know for sure which coordinate is which.
    • Create an empty list(). Here mine is called pleiades_from_json_cleaned_tidied.
    • Add your sets to the list. The name inside the brackets [[ ... ]] will be the name inside your list. Name it something meaningful.
• Save as an .rds. Here, I have a predefined value for my objects directory, but this can be any place you want to save the file. Again, name the .rds something meaningful. (The whole sequence is sketched below.)
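A sketch of the whole sequence. One assumption worth verifying: reprPoint appears to follow the GeoJSON [longitude, latitude] ordering, which is what the rename below assumes.

library(dplyr)
library(tibble)
library(tidyr)

# Representative coordinates, one [long, lat] pair per record.
plei_rep_locs <- pleiades_full_json$`@graph`$reprPoint

pleiades_locations_id_match <- tibble(
  id            = plei_ids,
  location_data = plei_rep_locs
) %>% 
  # Spread each 2-value pair into location_data_1 and location_data_2.
  unnest_wider(location_data, names_sep = "_") %>% 
  rename(long = location_data_1,   # assumed ordering: longitude first
         lat  = location_data_2)

# Stash the set in a named list and save it.
pleiades_from_json_cleaned_tidied <- list()
pleiades_from_json_cleaned_tidied[["pleiades_locations"]] <- pleiades_locations_id_match
saveRDS(pleiades_from_json_cleaned_tidied,
        paste0(objects_directory, "pleiades_from_json_cleaned_tidied.rds"))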

  • Reading in JSON data with jsonlite

• Use read.csv() to read in the Trismegistos .csv file and save it as an object called trismegistos_all_export_geo.
• Use read_json() from the jsonlite package to read in the .json file. This was a huge file and took 10-15 minutes to finish loading. Save this as an object called pleiades_full_json (see the sketch below).
  • simplifyDataFrame = T puts lists containing only records (JSON objects) into a single data frame;
  • flatten = T flattens nested data frames into a single data frame where possible.
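A minimal sketch of those reads; the file names here are placeholders for wherever you saved the downloads:

library(jsonlite)

# Trismegistos geo export (.csv).
trismegistos_all_export_geo <- read.csv("trismegistos_all_export_geo.csv")

# Pleiades places dump (.json) - large, so expect a long load.
pleiades_full_json <- read_json(
  "pleiades-places-latest.json",
  simplifyDataFrame = TRUE,
  flatten = TRUE
)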
    • Roger Bagnall, et al. (eds.), Pleiades: A Gazetteer of Past Places, 2016, <http://pleiades.stoa.org/> [Accessed: February 17, 2016].
    • H. Verreth, A survey of toponyms in Egypt in the Graeco-Roman period (Trismegistos Online Publications, 2), Leuven: Trismegistos Online Publications, 1253 pp.
• Ooms J (2014). “The jsonlite Package: A Practical and Consistent Mapping Between JSON Data and R Objects.” arXiv:1403.2805 [stat.CO], https://arxiv.org/abs/1403.2805.

  • Using tidygeocoder’s reverse_geocode()

Way back when I was actively collecting my data, I was obsessed with making sure that the places associated with my entries both existed and were locatable. I spent a lot of time tracking down place names, alternative names, modern names, whatever was out there. Eventually, the list became very long, with tons of duplicate entries (like, did you know that a lot of early Christian events took place in Rome? Who knew!). I realized I had to organize this nonsense, and decided to build a MySQL database to store it all. I took all my city/country locations and consolidated them to one record per location, assigned a unique ID to each particular place, then created a table for it in the database. This database is now where all of my location data lives, and it holds the most up-to-date version.

I wanted to grab my latitudes (lat) and longitudes (longi) and feed them into the reverse_geocode() function, then apply some other useful arguments. I stored it all in an object called “rev_geo”. The whole thing looked like this:
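Here is a reconstruction from the description above (the original snippet was an image); my_locations stands in for the consolidated location table, and the method and full_results settings are my assumptions:

library(dplyr)
library(tidygeocoder)

rev_geo <- my_locations %>% 
  filter(!is.na(lat), !is.na(longi)) %>% 
  reverse_geocode(
    lat = lat,                # latitude column
    long = longi,             # longitude column
    method = "osm",           # assumed: the Nominatim (OSM) service
    address = found_address,  # name for the returned address column
    full_results = TRUE       # assumed: keep all returned fields for checking
  )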

When reverse_geocode() finished, I wanted to check the matches. I created a quick script to check my work and see which countries matched up.
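A hypothetical version of that check, assuming my table’s country column is country and the geocoder’s returned country landed in a column called geo_country:

# Count how many records agree on the country.
rev_geo %>% 
  mutate(country_match = (country == geo_country)) %>% 
  count(country_match)

# Eyeball the disagreements.
rev_geo %>% 
  filter(country != geo_country) %>% 
  select(city, country, geo_country)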

All of my mismatched entries were instances where the names were different but correct (e.g., Britain vs. United Kingdom), regions that used to stretch through many countries (e.g., Roman Mauretania), and cities that rest on borders. I thought it looked good and decided to pause. Since the geocode took a while, I wanted to save it so that I had it on hand for the next steps. I stashed it in my “objects” directory where I save all my .rds files.
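The save step would look like the others in these posts (the .rds name is my assumption):

saveRDS(rev_geo, paste0(objects_directory, "rev_geo.rds"))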

Cambon J, Hernangómez D, Belanger C, Possenriede D (2021). “tidygeocoder: An R package for geocoding.” Journal of Open Source Software, 6(65), 3544. https://doi.org/10.21105/joss.03544 (R package version 1.0.5).