Using Open Libary APIs

Using Open Libary APIs

May 18, 2025

Overview

I have a personal project that I want to use my phone to scan ISBN of my books to catalog my physical books. This article is not about that project, but how to use DuckDB and using Open Library.

Open Library

Open Library is part of the Internet Archive. The nice part is that they have some documentation for using their APIs and provide the means to do bulk downloads. I decided to download the bulk data since for now I don’t want to hammer their website until I figure out how I want to setup my program and run tests. Also my thinking is that I can have the existing data and only use their API when I don’t have the data.

APIs

The APIs I plan to make use would be searching by ISBN and by author. For example, Rogue protocol from the excellent Murderbot Dairies, have a ISBN of 9781250191786. This will return a JSON that will provide information of the book.

{
  "number_of_pages": 158,
  "description": {
    "type": "/type/text",
    "value": "\"Sci-Fi's favorite antisocial AI is back on a mission.  The case against the too-big-to-fail GrayChris Corporation is floundering, and more importantly, authorities are beginning to ask more questions about where Dr. Mensah's SecUnit is.  And Murderbot would rather those questions went away.  For good.\""
  },
  "isbn_10": [
    "1250191785"
  ],
  "series": [
    "The Murderbot Diaries"
  ],
  "key": "/books/OL26966631M",
  "authors": [
    {
      "key": "/authors/OL221294A"
    }
  "isbn_13": [
    "9781250191786"
  ],
}

This is a partial of what will be returned. Notice that authors have an ID instead of the author’s name. Using OL221294A and author API will then provide Martha’s information.

{
    "name": "Martha Wells",
  "bio": "Martha Wells (born September 1, 1964) is an American writer of speculative fiction. She has published a number of fantasy novels, young adult novels, media tie-ins, short stories, and nonfiction essays on fantasy and science fiction subjects. Her novels have been translated into twelve languages. Wells has won a Nebula Award, two Locus Awards, and two Hugo Awards.",
  "alternate_names": [
    "Wells, Martha",
    "Wells, M.",
    "M. Wells"
  ],
  "links": [
    {
      "title": "Official Website",
      "url": "https://marthawells.com",
      "type": {
        "key": "/type/link"
      }
    }
  ],
}

This is a partial of what will be returned. Now I have Martha’s information. For my purposes, this should be enough to catalog my books. While I don’t have a ton of books, I would prefer to have the data locally to reduce using their APIs and to make it easier to run a bunch of tests.

Bulk Download

Now that I have an idea of the data I want, I should be able to download the following:

> wget https://openlibrary.org/data/ol_dump_authors_latest.txt.gz
> wget https://openlibrary.org/data/ol_dump_editions_latest.txt.gz
> wget https://openlibrary.org/data/ol_dump_works_latest.txt.gz

For example, according author’s documentation it should have the following properties:

This is similar to using the API. Keep in mind that total compressed size for all the files will be in the ~13GB range. Since I plan on using DuckDB, this shouldn’t be a problem.

DuckDB

While in the documentation from Open Library they mention openlibrary-search to aid with using PostgreSQL and Python, I decided to use DuckDB to help me have a reason to use and learn DuckDB more. First step is to get an idea of how the data dump looks. For now I will continue to focus on authors.

D describe from read_csv('ol_dump_authors_latest.txt.gz');
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
 column_name  column_type   null      key    default   extra  
   varchar      varchar    varchar  varchar  varchar  varchar 
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
 column0      VARCHAR      YES      NULL     NULL     NULL    
 column1      VARCHAR      YES      NULL     NULL     NULL    
 column2      BIGINT       YES      NULL     NULL     NULL    
 column3      TIMESTAMP    YES      NULL     NULL     NULL    
 column4      VARCHAR      YES      NULL     NULL     NULL    
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘

Initially not really helpful.

D summarize from read_csv('ol_dump_authors_latest.txt.gz');
100% ▕████████████████████████████████████████████████████████████▏ 
┌─────────────┬─────────────┬──────────────────────┬──────────────────────┬───────────────┬───┬──────────────────────┬──────────────────────┬──────────┬─────────────────┐
 column_name  column_type          min                   max           approx_unique            q50                   q75            count    null_percentage 
   varchar      varchar          varchar               varchar             int64               varchar               varchar          int64     decimal(9,2)   
├─────────────┼─────────────┼──────────────────────┼──────────────────────┼───────────────┼───┼──────────────────────┼──────────────────────┼──────────┼─────────────────┤
 column0      VARCHAR      /type/author          /type/author                      1    NULL                  NULL                  14327506             0.00 
 column1      VARCHAR      /authors/OL10000001A  /authors/OL9A              15582734    NULL                  NULL                  14327506             0.00 
 column2      BIGINT       1                     691                              79    1                     1                     14327506             0.00 
 column3      TIMESTAMP    2008-04-01 03:28:5   2025-04-30 23:59:5        12590164    2020-07-01 12:02:4   2022-12-16 09:47:0   14327506             0.00 
 column4      VARCHAR      {"alternate_names"   {"works": [{"key":        12544761    NULL                  NULL                  14327506             0.00 
├─────────────┴─────────────┴──────────────────────┴──────────────────────┴───────────────┴───┴──────────────────────┴──────────────────────┴──────────┴─────────────────┤
 5 rows                                                                                                                                            12 columns (9 shown) 
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

This is a little bit better. We can see that column0 is just telling us that the data contains authors. column1 is the author’s ID which will be helpful to search for the author after searching the book by ISBN. column2 not really sure if it is useful since that column is an integer and only have 79 unique values even though it appears that we have 14 million rows. column3 is a timestamp and my be when the data was entered since the range is from 2008 to the end of April 2025. Note that I downloaded the data in May 2025. Finally not sure what column0 have though looking at the min and max columns that may be JSON data.

D select * from read_csv('ol_dump_authors_latest.txt.gz') limit 10;
┌──────────────┬──────────────────────┬─────────┬──────────────────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
   column0           column1         column2        column3                                                            column4                                                    
   varchar           varchar          int64        timestamp                                                           varchar                                                    
├──────────────┼──────────────────────┼─────────┼──────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
 /type/author  /authors/OL10000375A        1  2021-12-26 22:14:2   {"type": {"key": "/type/author"}, "name": "Dennis Kott", "key": "/authors/OL10000375A", "source_records": [  
 /type/author  /authors/OL10000593A        1  2021-12-26 22:41:0   {"type": {"key": "/type/author"}, "name": "Henry Frierson", "key": "/authors/OL10000593A", "source_records"  
 /type/author  /authors/OL10000780A        1  2021-12-26 23:16:3   {"type": {"key": "/type/author"}, "name": "Gabi Farkas", "key": "/authors/OL10000780A", "source_records": [  
 /type/author  /authors/OL10000964A        1  2021-12-26 23:47:4   {"type": {"key": "/type/author"}, "name": "Andols Herrick", "key": "/authors/OL10000964A", "source_records"  
 /type/author  /authors/OL10001395A        1  2021-12-27 00:57:1   {"type": {"key": "/type/author"}, "name": "Josh Bleill", "key": "/authors/OL10001395A", "source_records": [  
 /type/author  /authors/OL10001421A        1  2021-12-27 01:01:1   {"type": {"key": "/type/author"}, "name": "Alfonso Cano Farragute", "key": "/authors/OL10001421A", "source_…  │
│ /type/author │ /authors/OL10001453A │       1 │ 2021-12-27 01:02:5…  │ {"type": {"key": "/type/author"}, "name": "York County Heritage Trust", "key": "/authors/OL10001453A", "sou  
 /type/author  /authors/OL10001580A        1  2021-12-27 01:18:5   {"type": {"key": "/type/author"}, "name": "S. L. Ross", "key": "/authors/OL10001580A", "source_records": ["…  │
│ /type/author │ /authors/OL10001594A │       1 │ 2021-12-27 01:20:4…  │ {"type": {"key": "/type/author"}, "name": "Sheila S. Hatch", "key": "/authors/OL10001594A", "source_records  
 /type/author  /authors/OL10001656A        1  2021-12-27 01:33:0   {"type": {"key": "/type/author"}, "name": "Amend, Jr., William J. C.", "key": "/authors/OL10001656A", "sour…  │
├──────────────┴──────────────────────┴─────────┴──────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 10 rows                                                                                                                                                                    5 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Now we have a decent idea what we are looking at. Let’s verify that I can find Martha Wells.

D .columns
D select column0
        ,column1 as author_id
        ,column2
        ,column3 as created
        ,json_extract(column4, '$.name') as name
        ,json_extract(column4, '$.personal_name') as personal_name
        ,json_extract(column4, '$.alternate_names') as alternate_names
        ,json_extract(column4, '$.bio') as bio
        ,json_extract(column4, '$.location') as location
        ,json_extract(column4, '$.birth_date') as birth_date
        ,json_extract(column4, '$.death_date') as death_date
        ,json_extract(column4, '$.date') as date
        ,json_extract(column4, '$.links') as links
    from read_csv('ol_dump_authors_latest.txt.gz')
  where author_id ilike '%OL221294A%'
  limit 10;
100% ▕████████████████████████████████████████████████████████████▏ 
┌─────────────────┬───────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
     Column         Type                                                                             Row 1                                                                          
├─────────────────┼───────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
 column0          varchar                                                                                                                                              /type/author 
 author_id        varchar                                                                                                                                        /authors/OL221294A 
 column2          int64                                                                                                                                                          14 
 created          timestamp                                                                                                                              2025-03-15 03:32:28.650222 
 name             json                                                                                                                                               "Martha Wells" 
 personal_name    json                                                                                                                                               "Martha Wells" 
 alternate_names  json                                                                                                                     ["Wells, Martha","Wells, M.","M. Wells"] 
 bio              json        "Martha Wells (born September 1, 1964) is an American writer of speculative fiction. She has published a number of fantasy novels, young adult novel… │
│ location        │ json      │                                                                                                                                                   NULL │
│ birth_date      │ json      │                                                                                                                                     "1 September 1964" │
│ death_date      │ json      │                                                                                                                                                   NULL │
│ date            │ json      │                                                                                                                                                   NULL │
│ links           │ json      │  [{"title":"Official Website","url":"https://marthawells.com","type":{"key":"/type/link"}},{"title":"Official Blog","url":"https://marthawells.dreamw 
└─────────────────┴───────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Notice that I used .columns since I was pretty sure that I am was going to get one row, so I returned the data column-wise. Now the data is looking better. If I was going to save the data to a table, I would want to do some further cleaning.

Transform the Data

I should be able to convert some of the columns from JSON to strings. Also for the author ID, I don’t want to include /authors/. If I can do this while also not include column0, that would be great. I could then create a SQL script to either pull in all the data or I could only add the authors that I actually have their books.

D select column0
        ,column1[10:] as author_id
        ,column2
        ,column3 as created
        ,json_extract_string(column4, '$.name') as name
        ,json_extract_string(column4, '$.personal_name') as personal_name
        ,json_extract(column4, '$.alternate_names') as alternate_names
        ,json_extract_string(column4, '$.bio') as bio
        ,json_extract_string(column4, '$.location') as location
        ,json_extract_string(column4, '$.birth_date') as birth_date
        ,json_extract_string(column4, '$.death_date') as death_date
        ,json_extract_string(column4, '$.date') as date
        ,json_extract(column4, '$.links') as links
    from read_csv('ol_dump_authors_latest.txt.gz')
  where author_id ilike '%OL221294A%'
  limit 10;
100% ▕████████████████████████████████████████████████████████████▏ 
┌─────────────────┬───────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
     Column         Type                                                                             Row 1                                                                          
├─────────────────┼───────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
 column0          varchar                                                                                                                                              /type/author 
 author_id        varchar                                                                                                                                                 OL221294A 
 column2          int64                                                                                                                                                          14 
 created          timestamp                                                                                                                              2025-03-15 03:32:28.650222 
 name             varchar                                                                                                                                              Martha Wells 
 personal_name    varchar                                                                                                                                              Martha Wells 
 alternate_names  json                                                                                                                     ["Wells, Martha","Wells, M.","M. Wells"] 
 bio              varchar     Martha Wells (born September 1, 1964) is an American writer of speculative fiction. She has published a number of fantasy novels, young adult novels 
 location         varchar                                                                                                                                                      NULL 
 birth_date       varchar                                                                                                                                          1 September 1964 
 death_date       varchar                                                                                                                                                      NULL 
 date             varchar                                                                                                                                                      NULL 
 links            json        [{"title":"Official Website","url":"https://marthawells.com","type":{"key":"/type/link"}},{"title":"Official Blog","url":"https://marthawells.dreamw… │
└─────────────────┴───────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
D 

Much better. Using json_extract_string() and string slices, I am almost there. Now I want to get rid of column0 which I should be able to using a subquery.

D select  column1[10:] as author_id
          ,column3 as created
          ,json_extract_string(column4, '$.name') as name
          ,json_extract_string(column4, '$.personal_name') as personal_name
          ,json_extract(column4, '$.alternate_names') as alternate_names
          ,json_extract_string(column4, '$.bio') as bio
          ,json_extract_string(column4, '$.location') as location
          ,json_extract_string(column4, '$.birth_date') as birth_date
          ,json_extract_string(column4, '$.death_date') as death_date
          ,json_extract_string(column4, '$.date') as date
          ,json_extract(column4, '$.links') as links
      from (select *
      from read_csv('ol_dump_authors_latest.txt.gz')
       where column1 ilike '%OL221294A%'
       )
    limit 10;

100% ▕████████████████████████████████████████████████████████████▏ 
┌─────────────────┬───────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
     Column         Type                                                                             Row 1                                                                          
├─────────────────┼───────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
 author_id        varchar                                                                                                                                                 OL221294A 
 created          timestamp                                                                                                                              2025-03-15 03:32:28.650222 
 name             varchar                                                                                                                                              Martha Wells 
 personal_name    varchar                                                                                                                                              Martha Wells 
 alternate_names  json                                                                                                                     ["Wells, Martha","Wells, M.","M. Wells"] 
 bio              varchar     Martha Wells (born September 1, 1964) is an American writer of speculative fiction. She has published a number of fantasy novels, young adult novels 
 location         varchar                                                                                                                                                      NULL 
 birth_date       varchar                                                                                                                                          1 September 1964 
 death_date       varchar                                                                                                                                                      NULL 
 date             varchar                                                                                                                                                      NULL 
 links            json        [{"title":"Official Website","url":"https://marthawells.com","type":{"key":"/type/link"}},{"title":"Official Blog","url":"https://marthawells.dreamw… │
└─────────────────┴───────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
D

This looks like the data I will add to my database. Also it wasn’t too slow on my machine querying a compressed file.

Conclusion

In the past I would have used Pandas to pull in the data before dumping into PostgreSQL. Pandas can still be useful for this level of cleaning however due to the size of the data, would probably need to be careful loading the data into memory. It is also certainly possible to do this with PostgreSQL but PostgreSQL is data loading is much more fragile. While not in this article, it is possible to attach PostgreSQL to DuckDB, do the work in DuckDB and save the data to PostgreSQL.

This initial analysis helps me understand the data structure and how I would like to design my program for data ingestion. DuckDB can be an excellent alternative for data ingestion and cleaning.