Month: March 2025

  • Map Plotting Basics

    Finally, I had a set of neat, tidy locations. It was time to do something with them. Mapping seemed like a logical next step – but where to start? The foundation of churches and dioceses was a data point that popped up often in my research, so I thought that would be a good place to jump in.

    In simple terms, a diocese, or see, is the governing body of the Catholic churches in a region, and is overseen by a bishop. As Christianity grew, so did the establishment of churches, then dioceses to oversee them, and finally archdioceses and patriarchal sees to oversee everything under them. Many dioceses were erected very early on, and some indeed count apostles of Christ as their founders. An obvious example is Rome: the diocese is traditionally said to have been founded and led by the first Pope, the apostle St. Peter, before his death in ~64 AD.

    In this post, I am not going to share the code or the data that I used to produce my locations list – mainly because it involves operations I have covered extensively (join and bind functions). As I said at the start, if you want to follow along, you just need a file or object that has individually identified longitude and latitude columns, or you can use one of the sets I link above.
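
    For reference, here is a minimal sketch of the shape the rest of the post assumes – the values below are made-up placeholders, not my actual data:

    # Hypothetical stand-in for the locations object – any data frame with
    # identified longitude/latitude columns will work the same way.
    diocese_1 <- tibble::tibble(
      locationID = c(1, 2),
      longi      = c(12.48, 35.21),    # longitude (placeholder values)
      lat        = c(41.89, 31.77),    # latitude (placeholder values)
      status     = c("Established", "Erected"))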

    # This needs to be run to prevent errors when rendering the world map below. It turns off
    # the s2 spherical geometry package.
    sf_use_s2(FALSE)
    
    # Load packages.
    library(tidyverse)
    library(sf)
    
    # Read in your world shapefile. 
    # These were downloaded from https://www.naturalearthdata.com/
    world <- st_read("... directory where code resides... /ne_10m_admin_0_countries/ne_10m_admin_0_countries.shp")  
    • Load packages tidyverse and sf.
    • Use st_read() from the sf package to read in the shapefile (.shp).
    • Important: Keep the file structure from the .zip file intact. The file folders need to reside in the same directory as your code for it to function correctly.

    Next, I had to do a little prepping of my data so it could be read by the plotting function I was going to apply. Again, the sf package provided the functions to get things in working order.

    # Check coordinates that the world map uses.
    st_crs(world)
    
    # Make sf object for plotting.
    diocese_sf <- diocese_1 %>% 
      dplyr::filter(status %in% c("Established","in_description","Erected")) %>% 
      rename("lon" = longi) %>% 
      dplyr::filter(!is.na(lat)) %>% 
      dplyr::select(locationID, lon, lat) %>% 
      st_as_sf(coords = c("lon", "lat")) %>% 
      st_set_crs(st_crs(world))
    • Filter my records to only the diocese dates that reflect the start or establishment of a diocese ("Established", "in_description", "Erected").
    • Other quick cleaning: fix the longitude column name so it can be read by the sf functions, filter out the NA values, and select only the necessary columns: locationID, lon, lat.
    • With st_as_sf(), I specify that I am using longitude and latitude coordinates.
    • Use st_set_crs() to match the coordinate system of world.
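
    As a quick sanity check, the two objects should now report the same coordinate reference system:

    # Optional check – sf compares CRS objects directly with ==.
    st_crs(diocese_sf) == st_crs(world)  # should be TRUE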

    I passed the new object into ggplot and used the geom_sf() function to render both the shapefile and my location data. The resulting plot was not helpful.

    Not helpful!

    I needed to crop the map. Finding my boundaries was easy, and the sf package had a cropping function that would constrain the map to a certain area. I just needed to find the min/max values of my coordinates, then use st_crop() to apply the boundaries.

    # Find min/max of my location data for cropping to the area that holds all of my points.
    diocese_1 %>% 
      filter(!is.na(lat)) %>% 
      summarize(min_lon = min(longi),
                max_lon = max(longi),
                min_lat = min(lat),
                max_lat = max(lat))
    
    # Create the cropped polygon object
    world_cropped <- st_crop(world,
            xmin = -10, xmax = 48, 
            ymin = 23, ymax = 57)
    
    • Filter the incomplete coordinates out of diocese_1 by removing NA values with !is.na(). Create a basic summary with the min and max of the lat and longi values.
    • Using st_crop(), create a cropped polygon object using the coordinates found in the previous step. I slightly decreased/increased the min/max values to give some breathing room to the points on the edges of the map (a programmatic alternative is sketched below).
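
    If you would rather not hard-code the bounds, one alternative – a sketch, not what I used – is to take the bounding box of the points and pad it:

    # Derive the crop window from the data itself, with ~2 degrees of padding.
    bb <- st_bbox(diocese_sf)
    world_cropped <- st_crop(world,
            xmin = as.numeric(bb["xmin"]) - 2, xmax = as.numeric(bb["xmax"]) + 2,
            ymin = as.numeric(bb["ymin"]) - 2, ymax = as.numeric(bb["ymax"]) + 2)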

    Now plot!

    ggplot() +
      geom_sf(data = world_cropped,
              fill = "navajowhite") +
      geom_sf(data = diocese_sf,
              alpha = 0.5)
    • Call ggplot(). Here, I do not need to define aesthetics – the x and y values will come from the geometry of diocese_sf.
    • With geom_sf(), add the world_cropped object as the first layer. I used navajowhite to color the land mass.
    • In the second geom_sf(), add the diocese_sf object. I adjusted the opacity to 50% with alpha = 0.5 to make the clustered points more visible.

    Much better! I immediately saw some interesting patterns in my points – something to explore, next.

    The follow-along script is on GitHub.

    • Cheney, David M. (2025). "Catholic-Hierarchy: Its Bishops and Dioceses, Current and Past." https://www.catholic-hierarchy.org/.
    • "Catholic Encyclopedia: Home." (2025). https://www.newadvent.org/cathen/.
    • Pebesma, E., & Bivand, R. (2023). Spatial Data Science: With Applications in R. Chapman and Hall/CRC. https://doi.org/10.1201/9780429459016.
    • "Natural Earth – Free Vector and Raster Map Data at 1:10m, 1:50m, and 1:110m Scales." (2025). https://www.naturalearthdata.com/.

  • Tangling with Location Names, part 2

    Once I had my long list of names, joining my no_matches set was very easy. I used an inner_join() and pulled in any record from tris_data_long_clean that matched my locations, both in name and in country. I was able to snag several more records.

    tris_matches_2_no_matches <- no_matches %>% 
      mutate(ancientcity = str_to_lower(ancientcity),
             country = str_to_lower(country)) %>%
      inner_join(
        tris_data_long_clean,
        by = join_by(ancientcity == name,
                     country == country)) 
    • Apply str_to_lower() on both ancientcity and country. This helps standardize the name in case any random capitalizations exist, and assists in the inner_join().
    • inner_join() to match (exactly) on the name, and also the country, to ensure we have the right place. If there is no 1:1 match on name/country, then the record will be dropped. These locations will be re-joined later.
    41 more records – not bad!
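
    To see which locations are still unaccounted for, the complement of that join works – a sketch using anti_join(), which keeps only the rows without a match (still_no_match is a name of my own choosing here):

    # Sketch: rows of no_matches that still lack a name/country match.
    still_no_match <- no_matches %>% 
      mutate(ancientcity = str_to_lower(ancientcity),
             country = str_to_lower(country)) %>%
      anti_join(
        tris_data_long_clean,
        by = join_by(ancientcity == name,
                     country == country))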

    After this, I had 2 sets I wanted to bring together: locations_5km – the set joined by coordinates, with Pleiades IDs – and the new set generated from no_matches, with TM IDs. I wanted to pull all of them together into one tidy data frame. I remembered that Pleiades uses Trismegistos as one of their standard sources, so to keep things consistent, I decided to pull over the available TM IDs from the Pleiades data. I also pulled over the Roman provincia, where it was available.

    The following code, as in the previous post, is broken up into blocks, but is actually one long script meant to be run all at once.

    # Start Script
    
    # Block 1
    # Bind the 2 data sets together by rows.
    locations_plei_and_tris <- locations_5km %>% 
      rename("plei_id" = id) %>% 
      select(-long,
             -lat) %>% 
      bind_rows(
        tris_matches_2_no_matches %>% 
          rename("orig_lat" = lat,
                 "orig_long" = longi,
                 "tris_id" = id)) %>% 
    • With locations_5km, I rename the ID column to plei_id to specify that it is the Pleiades ID.
    • Remove the long and lat columns with select() – I only want to use my original coordinate values. I will keep the distance_from_original value for reference.
    • Combine the 2 sets, locations_5km and tris_matches_2_no_matches, by row with bind_rows(). This "stacks" the rows, combining like columns and pulling over whatever columns are not in common between the 2 sets, filling them with NA values (a toy example follows this list). Within the bind, I clean up some column names in tris_matches_2_no_matches.
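
    A toy illustration of that stacking behavior, independent of my data:

    # bind_rows() matches columns by name and fills the gaps with NA.
    bind_rows(
      tibble(locationID = 1, plei_id = "p1"),
      tibble(locationID = 2, tris_id = "t9"))
    # -> 2 rows, 3 columns: tris_id is NA in row 1, plei_id is NA in row 2.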

    Next was a series of left_join()s to pull in the TM IDs for the existing Pleiades IDs I had, and the provincia information for those IDs.

      # Block 2
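      # Pull over the TM IDs embedded in the Pleiades reference URLs (accessURI).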
      left_join(
        pleiades_from_json_cleaned_tidied$pleiades_references %>% 
          filter(grepl("trismegistos", accessURI, ignore.case = T)) %>% 
          mutate(tris_id_extract = str_extract(accessURI, "[0-9]+")) %>% 
          select(id, tris_id_extract),
        by = join_by(plei_id == id)) %>% 
      # Block 3
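      # Attach provincia and country info from the TM data, matched on the extracted ID.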
      left_join(
        tris_data_long_clean %>% 
          select(id, 
                 provincia, 
                 country) %>% 
          mutate(id = as.character(id)) %>% 
          distinct(),
        by = join_by(tris_id_extract == id)
      ) %>% 
      # Block 4
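      # Consolidate the duplicate columns the joins created, then drop the leftovers.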
      mutate(trismegistos_id = coalesce(as.character(tris_id), tris_id_extract),
             trismegistos_provincia = coalesce(provincia.x, provincia.y)) %>% 
      rename("country" = country.x) %>% 
      select(-provincia.x,
             -region_ext,
             -tris_id_extract,
             -provincia.y,
             -country.y,
             -tris_id) 
    
    # End script.
    locations_cited <- locations_master %>% 
      rename("orig_region" = ancientregion,
             "orig_country" = country) %>% 
      left_join(
        locations_plei_and_tris %>% 
        select(-ancientcity,
               -ancientregion,
               -country,
               -orig_lat,
               -orig_long),
        by = join_by(locationID == locationID))  
    • In the original locations_master, apply better names to reflect which source the region and country came from – assign with rename().
    • left_join() to the locations_plei_and_tris object that was just created. Inside the join, remove what would be redundant data. Any locationIDs that do not have a match will have NA values in both the plei_id and trismegistos_id fields.
    • Join the 2 sets by locationID. Store in the object locations_cited.

    A quick look at some of the records that didn't come over showed they will require some more research, or fuzzy matching, to sweep them up. Fortunately, there were only 109 orphaned records out of my original set of ~1,300.

    I also noticed that some of my locations independently pulled in both a Pleiades ID and a TM ID, creating duplicate rows in some cases. I will have to figure out how to suppress that in future iterations.

    Despite that, I decided that this is a really, really good locations data set, and that it would be OK to pause here and move on to something else. I was interested in taking one of my data sets, linking it to my locations data, and plotting it on a world map. Fortunately, I had a lot of other records tied to these locationIDs where I could do that.

  • Tangling with Location Names, part 1

    pleiades_from_json_cleaned_tidied #Cleaned data from Pleiades JSON data
    locations_5km  #Locations within 5km of the original coordinates.
    locations_master #My original locations list.
    no_matches #Locations that did not find a match during the fuzzyjoin.
    # Start script
    
    # Block 1
    # Start big cleanup of tris data
    tris_data_long_clean <- trismegistos_all_export_geo %>% 
      as_tibble() %>% 
      select(id,
             provincia,
             country,
             name_latin,
             name_standard,
             full_name) %>% 
      separate_wider_delim(
        full_name,
        delim = " - ",
        names = c("country_region_ext","city_ext"),
        too_many = "merge",
        too_few = "align_start") %>% 
    • Note: trismegistos_all_export_geo is the object loaded from the TM .csv file referenced above. From this, use select() to pick only the necessary variables/columns.
    • Start the first separation. It looks like there is only one initial break, marked with space-hyphen-space (" - "), which will be the delimiter in delim =. The next steps are done inside the separation.
      • Specify the full_name column, and the delimiter.
      • Separate column into 2 columns: "country_region_ext" and "city_ext".
      • In the event there is more or less data than can fit in 2 columns, these arguments say what to do with the values (see the toy example below). If there are more than 2 (too_many), the remaining values are merged into the 2nd column with separators intact. If there is only 1 value (too_few), it is placed in country_region_ext, the start of the created columns.
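
    A toy illustration of the too_many/too_few behavior, independent of the main script (assuming tidyr 1.3+):

    # Row 1 has too many pieces (extras merged); row 2 has too few (NA at the end).
    tibble(x = c("a - b - c", "a")) %>% 
      separate_wider_delim(x, delim = " - ",
                           names = c("first", "second"),
                           too_many = "merge",
                           too_few = "align_start")
    # -> row 1: first = "a", second = "b - c"
    # -> row 2: first = "a", second = NA
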
    # Block 2
      separate_wider_delim(
        country_region_ext,
        delim = ",",
        names = c("country_ext","region_ext"),
        too_many = "merge",
        too_few = "align_start") %>%
    • Do another separation, just like above, this time breaking country_region_ext at the commas to get the country and regions alone.
    # Block 3
      mutate(city_ext = str_replace_all(city_ext, " \\[", "(")) %>% 
    • Mutate the city_ext column using str_replace_all(). The inconsistent brackets vs. parentheses make the next separation difficult, so I want to replace left brackets with left parentheses. The \\ before the left bracket is an escape sequence, needed when working with certain special characters in R regular expressions (demonstrated below).
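
    A one-line demonstration of the escaping – an unescaped "[" would throw a regex error:

    # "[" is a regex metacharacter, so the pattern needs "\\[" to mean a literal bracket.
    str_replace_all("roma [lazio]", " \\[", "(")
    # -> "roma(lazio]"
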
    # Block 4
      separate_wider_delim(
        city_ext,
        delim = "(",
        names = c("city_ext_1","city_ext_2"),
        too_many = "merge",
        too_few = "align_start") %>%
    • Yet more separations. Just like 2 and 3 above, this time breaking city_ext at the parenthesis. This allows me to get at the appended information I discussed above.
    # Block 5
      separate_wider_delim(
        name_latin,
        delim = " - ",
        names = paste0("tris_latin_", 1:25),
        too_many = "merge",
        too_few = "align_start") %>% 
      select(where(~ !all(is.na(.))),
             -country_ext) %>% 
    • Some of these names have, like, 20+ delimiters (all consistent, thank God!), so I separate them into a large number of columns – more than I probably need. To do this, I use paste0() to affix "tris_latin_" to a sequence of 1-to-whatever I want – here, 25 seemed like plenty. This results in 25 columns named tris_latin_1 through tris_latin_25, plenty of room for separation.
    • Because not every place will have 25 different names, those extra columns will be filled entirely with NA values. Here, the select() function looks across all of my columns, and filters out those where all of the values in the column are NA. At the same time, I remove the country_ext, as it is truly redundant.
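
    The naming trick on its own, for clarity:

    # paste0() vectorizes, building all of the column names up front.
    paste0("tris_latin_", 1:3)
    # -> "tris_latin_1" "tris_latin_2" "tris_latin_3"
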
    # Block 6
      pivot_longer(!c(id, provincia, region_ext, country),
                   names_to = "source_seq",
                   values_to = "name") %>% 
    • Pivot this (now) very wide list into one long list using pivot_longer(). Pivot everything except id, provincia, region_ext, and country. Save the column names into a new column called "source_seq" and the city names into "name".
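
    A toy sketch of that reshape, independent of the main script:

    # Each wide name column becomes one row per id in the long output.
    tibble(id = 1, tris_latin_1 = "roma", tris_latin_2 = "rome") %>% 
      pivot_longer(!id, names_to = "source_seq", values_to = "name")
    # -> two rows: (1, "tris_latin_1", "roma") and (1, "tris_latin_2", "rome")
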
    # Block 7
      mutate(name = str_trim(str_to_lower(name)),
             region_ext = str_trim(region_ext),
             name = if_else(name == "", NA, name),
             country = str_to_lower(country),
             name = str_replace_all(name, "\\)", "")) %>% 
      filter(!is.na(name)) %>% 
      select(-source_seq) %>% 
      distinct() %>% 
      relocate(name, .after = id)
    
    # End script
    • Apply several mutations to clean up some of the columns:
      • Trim the whitespace with str_trim() in name and region_ext (the separation process can generate a lot of it). I also converted the names to lowercase with str_to_lower() for standardization.
      • Apply an if_else() – if the name is an empty field (represented by “”), then replace it with an NA, so it can easily be filtered out later. Else, leave whatever is in place.
      • Convert country to lower case. This should make the country names more standardized – important for joining later.
      • Replace the stray right-parenthesis artifacts from the separation with empty strings, removing them.
    • Filter out any NA values in the name column.
    • Remove the source_seq column. It is not needed anymore.
    • Use distinct() to condense identical records into one, so each row is unique.
    • Finally, move the name column so it is right after the id – easier for me to read.

    TM data before: [screenshot]

    TM data after: [screenshot]

    So much better.

    The final step in all of this was using this list in a join against the no_matches records, to see if I could sweep in some of the TM sources where I did not have Pleiades. I also wanted to pull the TM IDs from the Pleiades data and put them into my set. Since this script was so long, I stored the results in an object (tris_data_long_clean) and saved it as an .rds file for safekeeping.

    # Save
    saveRDS(tris_data_long_clean, paste0(objects_directory,"tris_data_long_clean.rds"))
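
    To pull it back into a later session, readRDS() is the counterpart:

    # Reload the saved object later.
    tris_data_long_clean <- readRDS(paste0(objects_directory, "tris_data_long_clean.rds"))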