Month: February 2025

  • Finding Location Information with fuzzyjoin

Now that I had some good location data, I wanted to use my coordinates to join the Pleiades data and see what is in the immediate area of my locations. To do this, I used a package called fuzzyjoin, which has a function called geo_inner_join (along with geo_right_join, geo_left_join, etc.)* which allows you to join 2 sets of coordinates based on distance – either kilometers or miles. I decided that 2km (1.24 mi) would be a good place to start. The join took several seconds and produced a long data set with each matched Pleiades ID on its own line.

# Libraries used throughout (dplyr for the pipe and verbs, fuzzyjoin for the join).
library(dplyr)
library(fuzzyjoin)

# Read in object.
locations_master <- readRDS(paste0(objects_directory, "locations_reversed_joined.rds"))
    
    
    #attestations based on lat/long - this will take several seconds.
    locations_join <- locations_master %>% 
      filter(!is.na(lat)) %>% 
      rename("orig_lat" = lat,
             "orig_long" = longi) %>% 
      geo_inner_join(
        pleiades_from_json$pleiades_locations %>% 
          filter(!is.na(long)),
        by = c("orig_long" = "long",
               "orig_lat" = "lat"),
        unit = c("km"),
        max_dist = 2,
        distance_col = "distance_from_original"
        )
    • Filter out the NA values in lat (which should remove any incomplete coordinates).
    • Rename my existing lat/long columns so that I can easily identify which frame they belong to after I perform my join.
    • Perform the geo_inner_join.
      • This will look at the list pleiades_from_json$pleiades_locations and filter out any incomplete coordinates.
      • Join the 2 sets: my orig_long to Pleiades’ long, and my orig_lat to Pleiades’ lat.
      • unit = c("km") – use kilometers (km) for the units.
      • max_dist = 2 – include all matches within 2km of my coordinates.
• Put the distance value in a column called “distance_from_original”.
    • Save it all in an object called locations_join.
• * – the inner_join indicates that only records satisfying the join conditions will be kept.

I forgot that I didn’t have names paired with the coordinates I pulled from the JSON file. Good thing I have that object at the ready. I decided to use the title value to give me the basic name of the record associated with each Pleiades ID. The pleiades_full_json$`@graph`$names list has every single variation on a name in Pleiades, which is a little beyond what I needed (for the moment!). I made a set of IDs-to-titles and saved it to my pleiades_from_json_cleaned_tidied object to keep it handy. Then I joined it to locations_join.

    # As before:
    plei_ids <- pleiades_full_json$`@graph`$id
    
    # Grab titles.
    plei_titles <- pleiades_full_json$`@graph`$title
    
    # Pull id's and titles together.
    pleiades_titles <- tibble(
      id = plei_ids,
      plei_title = plei_titles
    ) 
    
    # Save these to the big list real quick.
    pleiades_from_json_cleaned_tidied[["pleiades_titles"]] <- pleiades_titles
    
    # Save the list.
    saveRDS(pleiades_from_json_cleaned_tidied, paste0(objects_directory, "pleiades_from_json_cleaned_tidied.rds"))
    
    # Join the titles to the joined set so we can see the names of the matched records.
    locations_full_join_w_titles <- locations_join %>% 
      left_join(
        pleiades_titles,
        by = join_by(id == id))
• Extract the titles list and store it in plei_titles.
    • Pull both sets together into one tibble called pleiades_titles.
    • Save this new set to the existing pleiades_from_json_cleaned_tidied list.
    • Join the pleiades_titles to locations_join using the id column in both data sets.

    Cool. Next I wanted to see which locations had the most attestations based on these coordinates.

    locations_full_join_w_titles %>% 
      group_by(
        ancientcity,
        country,
        orig_lat,
        orig_long
      ) %>% 
      summarize(n_plei_records = n_distinct(id)) %>% 
      arrange(desc(n_plei_records))
• Aggregate the data with summarize(), counting the distinct ids in each group with n_distinct(id). Sort with arrange(desc()) so the groups with the most IDs appear at the top.

There were some famous, well-attested sites that came up, but where was Rome? Or Constantinople? I decided to widen my geo_inner_join to 5km (3.11 miles), then aggregate.

    locations_5km <- locations_master %>% 
      filter(!is.na(lat)) %>% 
      rename("orig_lat" = lat,
             "orig_long" = longi) %>% 
      geo_inner_join(
        pleiades_from_json$pleiades_locations %>% 
          filter(!is.na(long)),
        by = c("orig_long" = "long",
               "orig_lat" = "lat"),
        unit = c("km"),
        max_dist = 5,
        distance_col = "distance_from_original"
      )
    
locations_5km %>% 
  group_by(
    ancientcity,
    country,
    orig_lat,
    orig_long
  ) %>% 
  summarize(n_plei_records = n_distinct(id)) %>% 
  arrange(desc(n_plei_records))

    How about 10km? (Note: this code is all run together and goes right to the summary.)

    locations_master %>% 
      filter(!is.na(lat)) %>% 
      rename("orig_lat" = lat,
             "orig_long" = longi) %>% 
      geo_inner_join(
        pleiades_from_json$pleiades_locations %>% 
          filter(!is.na(long)),
        by = c("orig_long" = "long",
               "orig_lat" = "lat"),
        unit = c("km"),
        max_dist = 10,
        distance_col = "distance_from_original"
      ) %>% 
      group_by(
        ancientcity,
        country,
        orig_lat,
        orig_long
      ) %>% 
      summarize(n_plei_records = n_distinct(id)) %>% 
      arrange(desc(n_plei_records))

More obscure cities with tons of records. It also took a little longer to perform the join. The wider the search, the more location overlap I was likely to have. I decided that 5km might be the highest tolerance to apply in my search. And, at some point, it might be interesting to check out which IDs my records had in common.

    Portion of the main Pleiades record for Dura with all associated sites plotted.
    All these references!

    Then, I wanted to check on the locations that did not produce any matches.

# Who did not have matches?
no_matches <- locations_master %>% 
  left_join(
    locations_5km %>% 
      group_by(locationID) %>% 
      summarize(n_plei_records = n_distinct(id),
                .groups = "drop"),
    by = join_by(locationID == locationID)) %>% 
  filter(is.na(n_plei_records))
• Take my original locations_master and left_join() it to the 5km data set. Since we just want a quick check of the number of records, inside that left_join(), group the records by locationID and then get the total number of distinct Pleiades IDs per locationID.
• Join by the locationID in each data set.
• filter(is.na(n_plei_records)) to focus on the records that did not have any IDs found for their coordinates.

    Finally, save the locations_5km and the locations_master objects for the next round of matches.

    saveRDS(locations_5km, paste0(objects_directory,"locations_5km.rds"))
    saveRDS(locations_master, paste0(objects_directory,"locations_master.rds"))
    saveRDS(no_matches, paste0(objects_directory,"no_matches.rds"))
    

    Robinson D (2025). fuzzyjoin: Join Tables Together on Inexact Matching. R package version 0.1.6, https://github.com/dgrtwo/fuzzyjoin.

  • Wrangling JSON Data

Once again, I return to Pleiades and my quest for excellent source data. When I first downloaded and explored their JSON data, it was confusing to me. Even with the flattening applied, the result was one massive list, divided into 2 main groups. Those lists had more lists, several layers deep in some sets. Even then, though, it was easy to see how the data lined up.

I clicked on the list to bring up RStudio’s View tab. This allowed me to collapse and expand the lists to get a sense of what was in there. If you want to keep it all in-console, use glimpse(), then use $ to select a list from the level below. You can chain $ for as many levels of data as there are in the list.
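A minimal sketch of both approaches, assuming the list names shown in the captions below (glimpse() comes from dplyr):

library(dplyr)

# Interactive viewer in RStudio - collapse/expand lists by clicking.
View(pleiades_full_json)

# Console alternative: glimpse one level, then drill down with $.
glimpse(pleiades_full_json$`@graph`)
glimpse(pleiades_full_json$`@graph`$names)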

RStudio – View(pleiades_full_json) – view of the file with the @graph list expanded.
glimpse(pleiades_full_json$`@graph`) – view of the same list. <list> types are shown, then individual structures with their dimensions.

What I really wanted was the “id”, the unique record ID that identifies every place in Pleiades. That should allow me to pair those IDs up to any of these data sets and maintain the fidelity of the information. I have always wanted the references, so I went for those first.

• Pull out both the id list and the references list from the full list, then convert them into their own objects called plei_ids and plei_refs, respectively.
• Pull the 2 objects together in one tibble. The id will pair up to the respective records in the references column. Because one id may refer to multiple references, a nested value is created in the new tibble for the references column; unnest(references) will expand these into their own rows (see the sketch below).
• Note: places with many references will therefore generate many new rows. If there are more lists, they will show up as they have previously, nested inside the columns. It looks like that is not the case here, since my URL field has been converted to <chr>:
    The data set with the above steps applied.
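A sketch of those steps, assuming the references live at pleiades_full_json$`@graph`$references (the object and column names here are mine):

library(dplyr)
library(tibble)
library(tidyr)

# Extract the parallel id and references lists.
plei_ids  <- pleiades_full_json$`@graph`$id
plei_refs <- pleiades_full_json$`@graph`$references

# Pair them up; unnest() expands each id's references onto their own rows.
pleiades_references <- tibble(
  id         = plei_ids,
  references = plei_refs
) %>% 
  unnest(references)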

    Did a little spot checking.

• select() only the id column and use distinct() just to make sure I am getting 100% unique ids. I then sample 3 of them at random using slice_sample(n = 3).
• filter() to only those chosen ids and see what URLs are tied to them.
• Check each of those ids on https://pleiades.stoa.org/ and make sure the same references are identified under the References section of the place’s page. I did this a few times to make sure. (A sketch follows below.)
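A hypothetical version of that spot check, reusing the pleiades_references object from the sketch above:

# Sample 3 unique ids at random.
sampled_ids <- pleiades_references %>% 
  select(id) %>% 
  distinct() %>% 
  slice_sample(n = 3)

# Pull the references tied to those ids, then compare them against each
# place's page on https://pleiades.stoa.org/.
pleiades_references %>% 
  filter(id %in% sampled_ids$id)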

    I thought this looked good! At some point, I want to use these references to do some more exploration. There are a ton of other database systems linked in here (like Trismegistos!) that would be really fun to poke at.

Next, I wanted to check out the Locations data. At first glance, the data looked bonkers – tons of nested tables. However, it’s really not that bad once you understand what’s in the set. Many of these nested tables were also available at the top level of the list, which lets you link the id directly to the value rather than trying to unnest the columns in the larger set – exactly as I did with the references set above. I was interested in the representative coordinates (latitude, longitude), which are stored in pleiades_full_json$`@graph`$reprPoint.

    • Store an object called plei_rep_locs with the data from pleiades_full_json$`@graph`$reprPoint.
    • The next steps flow together to make the final location set:
• Create a tibble called pleiades_locations_id_match that combines plei_ids and plei_rep_locs, the latter in a column called location_data (as we did before).
      • The location_data comes over as a double-valued list. If you unnest(), you will create a new record for each coordinate (latitude will be on one line, longitude under it). That is not helpful. Instead, we can use unnest_wider() to coerce those values into 2 columns.
        • names_sep = "_" tells unnest to use the column name as the base name for the new columns, and apply an underscore ( _ ) to separate the name and the column position of the set. Since I only have 2 coordinates, this will create location_data_1 and location_data_2.
      • Finally, rename the columns so you know for sure which coordinate is which.
    • Create an empty list(). Here mine is called pleiades_from_json_cleaned_tidied.
    • Add your sets to the list. The name inside the brackets [[ ... ]] will be the name inside your list. Name it something meaningful.
• Save as an .rds. Here, I have a predefined value for my objects directory, but this can be any place you want to save the file. Again, name the .rds something meaningful. (The whole sequence is sketched below.)
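A sketch of the whole sequence. One assumption worth verifying: reprPoint appears to follow the GeoJSON [longitude, latitude] ordering, which is what the rename below assumes.

library(dplyr)
library(tibble)
library(tidyr)

# Representative coordinates, one [long, lat] pair per record.
plei_rep_locs <- pleiades_full_json$`@graph`$reprPoint

pleiades_locations_id_match <- tibble(
  id            = plei_ids,
  location_data = plei_rep_locs
) %>% 
  # Spread each 2-value pair into location_data_1 and location_data_2.
  unnest_wider(location_data, names_sep = "_") %>% 
  rename(long = location_data_1,   # assumed ordering: longitude first
         lat  = location_data_2)

# Stash the set in a named list and save it.
pleiades_from_json_cleaned_tidied <- list()
pleiades_from_json_cleaned_tidied[["pleiades_locations"]] <- pleiades_locations_id_match
saveRDS(pleiades_from_json_cleaned_tidied,
        paste0(objects_directory, "pleiades_from_json_cleaned_tidied.rds"))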

  • Reading in JSON data with jsonlite

• Use read.csv() to read in the Trismegistos .csv file and save it as an object called trismegistos_all_export_geo.
• Use read_json() from the jsonlite package to read in the .json file. This was a huge file and took 10-15 minutes to finish loading. Save this as an object called pleiades_full_json (see the sketch below).
  • simplifyDataFrame = T puts lists containing only records (JSON objects) into a single data frame;
  • flatten = T flattens nested data frames into a single data frame where possible.
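A minimal sketch of those reads; the file names here are placeholders for wherever you saved the downloads:

library(jsonlite)

# Trismegistos geo export (.csv).
trismegistos_all_export_geo <- read.csv("trismegistos_all_export_geo.csv")

# Pleiades places dump (.json) - large, so expect a long load.
pleiades_full_json <- read_json(
  "pleiades-places-latest.json",
  simplifyDataFrame = TRUE,
  flatten = TRUE
)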
    • Roger Bagnall, et al. (eds.), Pleiades: A Gazetteer of Past Places, 2016, <http://pleiades.stoa.org/> [Accessed: February 17, 2016].
    • H. Verreth, A survey of toponyms in Egypt in the Graeco-Roman period (Trismegistos Online Publications, 2), Leuven: Trismegistos Online Publications, 1253 pp.
• Ooms J (2014). “The jsonlite Package: A Practical and Consistent Mapping Between JSON Data and R Objects.” arXiv:1403.2805 [stat.CO], https://arxiv.org/abs/1403.2805.

  • Using tidygeocoder’s reverse_geocode()

Way back when I was actively collecting my data, I was obsessed with making sure that the places associated with my entries both existed and were locatable. I spent a lot of time tracking down place names, alternative names, modern names, whatever was out there. Eventually, the list became very long, with tons of duplicate entries (like, did you know that a lot of early Christian events took place in Rome? Who knew!). I realized I had to organize this nonsense, and decided to build a MySQL database to store it all. I took all my city/country locations and consolidated them to one record per location, assigned a unique ID to each particular place, then created a table for it in the database. This database is now where all of my location data lives, and it holds the most up-to-date version.

I wanted to grab my latitudes (lat) and longitudes (longi) and feed them into the reverse_geocode() function, then apply some other useful arguments. I stored it all in an object called “rev_geo”. The whole thing looked like this:
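Here is a reconstruction from the description above (the original snippet was an image); my_locations stands in for the consolidated location table, and the method and full_results settings are my assumptions:

library(dplyr)
library(tidygeocoder)

rev_geo <- my_locations %>% 
  filter(!is.na(lat), !is.na(longi)) %>% 
  reverse_geocode(
    lat = lat,                # latitude column
    long = longi,             # longitude column
    method = "osm",           # assumed: the Nominatim (OSM) service
    address = found_address,  # name for the returned address column
    full_results = TRUE       # assumed: keep all returned fields for checking
  )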

When reverse_geocode() finished, I wanted to check the matches. I created a quick script to check my work and see which countries matched up.
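A hypothetical version of that check, assuming my table’s country column is country and the geocoder’s returned country landed in a column called geo_country:

# Count how many records agree on the country.
rev_geo %>% 
  mutate(country_match = (country == geo_country)) %>% 
  count(country_match)

# Eyeball the disagreements.
rev_geo %>% 
  filter(country != geo_country) %>% 
  select(city, country, geo_country)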

All of my mismatched entries were instances where the names were different but correct (e.g., Britain vs. United Kingdom), regions that used to stretch through many countries (e.g., Roman Mauretania), and cities that rest on borders. I thought it looked good and decided to pause. Since the geocode took a while, I wanted to save it so that I had it on hand for the next steps. I stashed it in my “objects” directory where I save all my .rds files.
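The save step would look like the others in these posts (the .rds name is my assumption):

saveRDS(rev_geo, paste0(objects_directory, "rev_geo.rds"))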

Cambon J, Hernangómez D, Belanger C, Possenriede D (2021). “tidygeocoder: An R package for geocoding.” Journal of Open Source Software, 6(65), 3544. https://doi.org/10.21105/joss.03544 (R package version 1.0.5).