Month: March 2025

  • Map Plotting Basics

    Finally, I had a set of neat, tidy locations. It was time to do something with them. Mapping seemed like a logical next step – but where to start? The foundation of churches and dioceses was a data point that popped up often in my research, so I thought that would be a good place to jump in.

    In simple terms, a diocese, or see, is the governing body of the Catholic churches in a region, and is overseen by a bishop. As Christianity grew, so did the establishment of churches, then dioceses to oversee them, and finally archdioceses and patriarchal sees to oversee everything under them. Many dioceses were erected very early on, and some indeed count apostles of Christ as their founders. An obvious example is Rome: the diocese is traditionally said to have been founded and led by the first Pope, the apostle St. Peter, before his death in ~64 AD.

    In this post, I am not going to share the code or the data that I used to produce my locations list – mainly because it involves operations I have covered extensively (join and bind functions). As I said at the start, if you want to follow along, you just need a file or object that has individually identified longitude and latitude columns, or you can use one of the sets I link above.
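
    For reference, here is a minimal sketch of the shape the rest of the post assumes – the values below are made-up placeholders, not my actual data:

    # Hypothetical stand-in for the locations object – any data frame with
    # identified longitude/latitude columns will work the same way.
    diocese_1 <- tibble::tibble(
      locationID = c(1, 2),
      longi      = c(12.48, 35.21),    # longitude (placeholder values)
      lat        = c(41.89, 31.77),    # latitude (placeholder values)
      status     = c("Established", "Erected"))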

    # This needs to be run to prevent errors when rendering the world map below. It turns off
    # the s2 spherical geometry package.
    sf_use_s2(FALSE)
    
    # Load packages.
    library(tidyverse)
    library(sf)
    
    # Read in your world shapefile. 
    # These were downloaded from https://www.naturalearthdata.com/
    world <- st_read("... directory where code resides... /ne_10m_admin_0_countries/ne_10m_admin_0_countries.shp")  
    • Load packages tidyverse and sf.
    • Use st_read() from the sf package to read in the shapefile (.shp).
    • Important: Keep the file structure from the .zip file intact. The file folders need to reside in the same directory as your code for it to function correctly.

    Next, I had to do a little prepping of my data so it could be read by the plotting function I was going to apply. Again, the sf package provided the functions to get things in working order.

    # Check coordinates that the world map uses.
    st_crs(world)
    
    # Make sf object for plotting.
    diocese_sf <- diocese_1 %>% 
      dplyr::filter(status %in% c("Established","in_description","Erected")) %>% 
      rename("lon" = longi) %>% 
      dplyr::filter(!is.na(lat)) %>% 
      dplyr::select(locationID, lon, lat) %>% 
      st_as_sf(coords = c("lon", "lat")) %>% 
      st_set_crs(st_crs(world))
    • Filter my records to only the diocese dates that reflect the start or establishment of a diocese ("Established", "in_description", "Erected").
    • Other quick cleaning: fix the longitude column name so it can be read by the sf functions, filter out the NA values, and select only the necessary columns: locationID, lon, lat.
    • With st_as_sf(), I specify that I am using longitude and latitude coordinates.
    • Use st_set_crs() to match the coordinate system of world.
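
    As a quick sanity check, the two objects should now report the same coordinate reference system:

    # Optional check – sf compares CRS objects directly with ==.
    st_crs(diocese_sf) == st_crs(world)  # should be TRUE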

    I passed the new object into ggplot and used the geom_sf() function to render both the shapefile and my location data. The resulting plot was not helpful.

    Not helpful!

    I needed to crop the map. Finding my boundaries was easy, and the sf package had a cropping function that would constrain the map to a certain area. I just needed to find the min/max values of my coordinates, then use st_crop() to apply the boundaries.

    # Find min/max of my location data for cropping to the area that holds all of my points.
    diocese_1 %>% 
      filter(!is.na(lat)) %>% 
      summarize(min_lon = min(longi),
                max_lon = max(longi),
                min_lat = min(lat),
                max_lat = max(lat))
    
    # Create the cropped polygon object
    world_cropped <- st_crop(world,
            xmin = -10, xmax = 48, 
            ymin = 23, ymax = 57)
    
    • Filter the incomplete coordinates out of diocese_1 by removing NA values with !is.na(). Create a basic summary with the min and max of the lat and longi values.
    • Using st_crop(), create a cropped polygon object using the coordinates found in the previous step. I slightly decreased/increased the min/max values to give some breathing room to the points on the edges of the map (a programmatic alternative is sketched below).
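
    If you would rather not hard-code the bounds, one alternative – a sketch, not what I used – is to take the bounding box of the points and pad it:

    # Derive the crop window from the data itself, with ~2 degrees of padding.
    bb <- st_bbox(diocese_sf)
    world_cropped <- st_crop(world,
            xmin = as.numeric(bb["xmin"]) - 2, xmax = as.numeric(bb["xmax"]) + 2,
            ymin = as.numeric(bb["ymin"]) - 2, ymax = as.numeric(bb["ymax"]) + 2)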

    Now plot!

    ggplot() +
      geom_sf(data = world_cropped,
              fill = "navajowhite") +
      geom_sf(data = diocese_sf,
              alpha = 0.5)
    • Call ggplot(). Here, I do not need to define aesthetics – the x and y values will come from the geometry of diocese_sf.
    • With geom_sf(), add the world_cropped object as the first layer. I used navajowhite to color the land mass.
    • In the second geom_sf(), add the diocese_sf object. I adjusted the opacity to 50% with alpha = 0.5 to make the clustered points more visible.

    Much better! I immediately saw some interesting patterns in my points – something to explore, next.

    The follow-along script is on GitHub.

    • Cheney, David M. (2025). "Catholic-Hierarchy: Its Bishops and Dioceses, Current and Past." https://www.catholic-hierarchy.org/.
    • "Catholic Encyclopedia: Home." (2025). https://www.newadvent.org/cathen/.
    • Pebesma, E., & Bivand, R. (2023). Spatial Data Science: With Applications in R. Chapman and Hall/CRC. https://doi.org/10.1201/9780429459016.
    • "Natural Earth – Free Vector and Raster Map Data at 1:10m, 1:50m, and 1:110m Scales." (2025). https://www.naturalearthdata.com/.

  • Tangling with Location Names, part 2

    Once I had my long list of names, joining my no_matches set was very easy. I used an inner_join() and pulled in any record from tris_data_long_clean that matched my locations, both in name and in country. I was able to snag several more records.

    tris_matches_2_no_matches <- no_matches %>% 
      mutate(ancientcity = str_to_lower(ancientcity),
             country = str_to_lower(country)) %>%
      inner_join(
        tris_data_long_clean,
        by = join_by(ancientcity == name,
                     country == country)) 
    • Apply str_to_lower() on both ancientcity and country. This helps standardize the name in case any random capitalizations exist, and assists in the inner_join().
    • inner_join() to match (exactly) on the name, and also the country, to ensure we have the right place. If there is no 1:1 match on name/country, then the record will be dropped. These locations will be re-joined later.
    41 more records – not bad!
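
    To see which locations are still unaccounted for, the complement of that join works – a sketch using anti_join(), which keeps only the rows without a match (still_no_match is a name of my own choosing here):

    # Sketch: rows of no_matches that still lack a name/country match.
    still_no_match <- no_matches %>% 
      mutate(ancientcity = str_to_lower(ancientcity),
             country = str_to_lower(country)) %>%
      anti_join(
        tris_data_long_clean,
        by = join_by(ancientcity == name,
                     country == country))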

    After this, I had 2 sets I wanted to bring together: locations_5km – the set joined by coordinates, with Pleiades IDs – and the new set generated from no_matches, with TM IDs. I wanted to pull all of them together into one tidy data frame. I remembered that Pleiades uses Trismegistos as one of their standard sources, so to keep things consistent, I decided to pull over the available TM IDs from the Pleiades data. I also pulled over the Roman provincia, where it was available.

    The following code, as in the previous post, is broken up into blocks, but is actually one long script meant to be run all at once.

    # Start Script
    
    # Block 1
    # Bind the 2 data sets together by rows.
    locations_plei_and_tris <- locations_5km %>% 
      rename("plei_id" = id) %>% 
      select(-long,
             -lat) %>% 
      bind_rows(
        tris_matches_2_no_matches %>% 
          rename("orig_lat" = lat,
                 "orig_long" = longi,
                 "tris_id" = id)) %>% 
    • With locations_5km, I rename the ID column to plei_id to specify that it is the Pleiades ID.
    • Remove the long and lat columns with select() – I only want to use my original coordinate values. I will keep the distance_from_original value for reference.
    • Combine the 2 sets, locations_5km and tris_matches_2_no_matches, by row with bind_rows(). This "stacks" the rows, combining like columns and pulling over whatever columns are not in common between the 2 sets, filling them with NA values (a toy example follows this list). Within the bind, I clean up some column names in tris_matches_2_no_matches.
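
    A toy illustration of that stacking behavior, independent of my data:

    # bind_rows() matches columns by name and fills the gaps with NA.
    bind_rows(
      tibble(locationID = 1, plei_id = "p1"),
      tibble(locationID = 2, tris_id = "t9"))
    # -> 2 rows, 3 columns: tris_id is NA in row 1, plei_id is NA in row 2.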

    Next was a series of left_join()s to pull in the TM IDs for the existing Pleiades IDs I had, and the provincia information for those IDs.

      # Block 2
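      # Pull over the TM IDs embedded in the Pleiades reference URLs (accessURI).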
      left_join(
        pleiades_from_json_cleaned_tidied$pleiades_references %>% 
          filter(grepl("trismegistos", accessURI, ignore.case = T)) %>% 
          mutate(tris_id_extract = str_extract(accessURI, "[0-9]+")) %>% 
          select(id, tris_id_extract),
        by = join_by(plei_id == id)) %>% 
      # Block 3
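      # Attach provincia and country info from the TM data, matched on the extracted ID.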
      left_join(
        tris_data_long_clean %>% 
          select(id, 
                 provincia, 
                 country) %>% 
          mutate(id = as.character(id)) %>% 
          distinct(),
        by = join_by(tris_id_extract == id)
      ) %>% 
      # Block 4
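      # Consolidate the duplicate columns the joins created, then drop the leftovers.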
      mutate(trismegistos_id = coalesce(as.character(tris_id), tris_id_extract),
             trismegistos_provincia = coalesce(provincia.x, provincia.y)) %>% 
      rename("country" = country.x) %>% 
      select(-provincia.x,
             -region_ext,
             -tris_id_extract,
             -provincia.y,
             -country.y,
             -tris_id) 
    
    # End script.
    locations_cited <- locations_master %>% 
      rename("orig_region" = ancientregion,
             "orig_country" = country) %>% 
      left_join(
        locations_plei_and_tris %>% 
        select(-ancientcity,
               -ancientregion,
               -country,
               -orig_lat,
               -orig_long),
        by = join_by(locationID == locationID))  
    • In the original locations_master, apply better names to reflect which source the region and country came from – assign with rename().
    • left_join() to the locations_plei_and_tris object that was just created. Inside the join, remove what would be redundant data. Any locationIDs that do not have a match will have NA values in both the plei_id and trismegistos_id fields.
    • Join the 2 sets by locationID. Store in the object locations_cited.

    A quick look at some of the records that didn't come over showed they will require some more research, or fuzzy matching, to sweep them up. Fortunately, there were only 109 orphaned records out of my original set of ~1,300.

    I also noticed that some of my locations independently pulled in both a Pleiades ID and a TM ID, creating duplicate rows in some cases. I will have to figure out how to suppress that in future iterations.

    Despite that, I decided that this is a really, really good locations data set, and that it would be OK to pause here and move on to something else. I was interested in taking one of my data sets, linking it to my locations data, and plotting it on a world map. Fortunately, I had a lot of other records tied to these locationIDs where I could do that.

  • Tangling with Location Names, part 1

    pleiades_from_json_cleaned_tidied #Cleaned data from Pleiades JSON data
    locations_5km  #Locations within 5km of the original coordinates.
    locations_master #My original locations list.
    no_matches #Locations that did not find a match during the fuzzyjoin.
    # Start script
    
    # Block 1
    # Start big cleanup of tris data
    tris_data_long_clean <- trismegistos_all_export_geo %>% 
      as_tibble() %>% 
      select(id,
             provincia,
             country,
             name_latin,
             name_standard,
             full_name) %>% 
      separate_wider_delim(
        full_name,
        delim = " - ",
        names = c("country_region_ext","city_ext"),
        too_many = "merge",
        too_few = "align_start") %>% 
    • Note: trismegistos_all_export_geo is the object loaded from the TM .csv file referenced above. From this, use select() to pick only the necessary variables/columns.
    • Start the first separation. It looks like there is only one initial break, marked with space-hyphen-space (" - "), which will be the delimiter in delim =. The next steps are done inside the separation.
      • Specify the full_name column, and the delimiter.
      • Separate column into 2 columns: "country_region_ext" and "city_ext".
      • In the event there is more or less data than can fit in 2 columns, these arguments say what to do with the values (see the toy example below). If there are more than 2 (too_many), the remaining values are merged into the 2nd column with separators intact. If there is only 1 value (too_few), it is placed in country_region_ext, the start of the created columns.
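
    A toy illustration of the too_many/too_few behavior, independent of the main script (assuming tidyr 1.3+):

    # Row 1 has too many pieces (extras merged); row 2 has too few (NA at the end).
    tibble(x = c("a - b - c", "a")) %>% 
      separate_wider_delim(x, delim = " - ",
                           names = c("first", "second"),
                           too_many = "merge",
                           too_few = "align_start")
    # -> row 1: first = "a", second = "b - c"
    # -> row 2: first = "a", second = NA
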
    # Block 2
      separate_wider_delim(
        country_region_ext,
        delim = ",",
        names = c("country_ext","region_ext"),
        too_many = "merge",
        too_few = "align_start") %>%
    • Do another separation, just like above, this time breaking country_region_ext at the commas to get the country and regions alone.
    # Block 3
      mutate(city_ext = str_replace_all(city_ext, " \\[", "(")) %>% 
    • Mutate the city_ext column using str_replace_all(). The inconsistent brackets vs. parentheses make the next separation difficult, so I want to replace left brackets with left parentheses. The \\ before the left bracket is an escape sequence, needed when working with certain special characters in R regular expressions (demonstrated below).
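
    A one-line demonstration of the escaping – an unescaped "[" would throw a regex error:

    # "[" is a regex metacharacter, so the pattern needs "\\[" to mean a literal bracket.
    str_replace_all("roma [lazio]", " \\[", "(")
    # -> "roma(lazio]"
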
    # Block 4
      separate_wider_delim(
        city_ext,
        delim = "(",
        names = c("city_ext_1","city_ext_2"),
        too_many = "merge",
        too_few = "align_start") %>%
    • Yet more separations. Just like 2 and 3 above, this time breaking city_ext at the parenthesis. This allows me to get at the appended information I discussed above.
    # Block 5
      separate_wider_delim(
        name_latin,
        delim = " - ",
        names = paste0("tris_latin_", 1:25),
        too_many = "merge",
        too_few = "align_start") %>% 
      select(where(~ !all(is.na(.))),
             -country_ext) %>% 
    • Some of these names have, like, 20+ delimiters (all consistent, thank God!), so I separate them into a large number of columns – more than I probably need. To do this, I use paste0() to affix "tris_latin_" to a sequence of 1-to-whatever I want – here, 25 seemed like plenty. This results in 25 columns named tris_latin_1 through tris_latin_25, plenty of room for separation.
    • Because not every place will have 25 different names, those extra columns will be filled entirely with NA values. Here, the select() function looks across all of my columns, and filters out those where all of the values in the column are NA. At the same time, I remove the country_ext, as it is truly redundant.
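
    The naming trick on its own, for clarity:

    # paste0() vectorizes, building all of the column names up front.
    paste0("tris_latin_", 1:3)
    # -> "tris_latin_1" "tris_latin_2" "tris_latin_3"
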
    # Block 6
      pivot_longer(!c(id, provincia, region_ext, country),
                   names_to = "source_seq",
                   values_to = "name") %>% 
    • Pivot this (now) very wide list into one long list using pivot_longer(). Pivot everything except id, provincia, region_ext, and country. Save the column names into a new column called "source_seq" and the city names into "name".
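
    A toy sketch of that reshape, independent of the main script:

    # Each wide name column becomes one row per id in the long output.
    tibble(id = 1, tris_latin_1 = "roma", tris_latin_2 = "rome") %>% 
      pivot_longer(!id, names_to = "source_seq", values_to = "name")
    # -> two rows: (1, "tris_latin_1", "roma") and (1, "tris_latin_2", "rome")
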
    # Block 7
      mutate(name = str_trim(str_to_lower(name)),
             region_ext = str_trim(region_ext),
             name = if_else(name == "", NA, name),
             country = str_to_lower(country),
             name = str_replace_all(name, "\\)", "")) %>% 
      filter(!is.na(name)) %>% 
      select(-source_seq) %>% 
      distinct() %>% 
      relocate(name, .after = id)
    
    # End script
    • Apply several mutations to clean up some of the columns:
      • Trim the whitespace with str_trim() in name and region_ext (the separation process can generate a lot of it). I also converted the names to lowercase with str_to_lower() for standardization.
      • Apply an if_else() – if the name is an empty field (represented by “”), then replace it with an NA, so it can easily be filtered out later. Else, leave whatever is in place.
      • Convert country to lower case. This should make the country names more standardized – important for joining later.
      • Replace the stray right-parenthesis artifacts from the separation with empty strings, removing them.
    • Filter out any NA values in the name column.
    • Remove the source_seq column. It is not needed anymore.
    • Use distinct() to condense identical records into one, so each row is unique.
    • Finally, move the name column so it is right after the id – easier for me to read.

    TM data before: [screenshot]

    TM data after: [screenshot]

    So much better.

    The final step in all of this was using this list in a join against the no_matches records, to see if I could sweep in some of the TM sources where I did not have Pleiades. I also wanted to pull the TM IDs from the Pleiades data and put them into my set. Since this script was so long, I stored the results in an object (tris_data_long_clean) and saved it as an .rds file for safekeeping.

    # Save
    saveRDS(tris_data_long_clean, paste0(objects_directory,"tris_data_long_clean.rds"))
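
    To pull it back into a later session, readRDS() is the counterpart:

    # Reload the saved object later.
    tris_data_long_clean <- readRDS(paste0(objects_directory, "tris_data_long_clean.rds"))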