Using Open Library APIs
Overview
I have a personal project where I want to use my phone to scan the ISBNs of my books and catalog my physical library. This article is not about that project, but about using Open Library data with DuckDB.
Open Library
Open Library is part of the Internet Archive. The nice part is that they document their APIs and also provide bulk downloads. I decided to download the bulk data since, for now, I don’t want to hammer their website while I figure out how I want to set up my program and run tests. My thinking is that I can work from the existing data and only call their API when I don’t have the data locally.
APIs
The APIs I plan to make use of are searching by ISBN and by author. For example, Rogue Protocol from the excellent Murderbot Diaries has an ISBN of 9781250191786. Looking it up returns JSON with information about the book.
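For a quick check from the command line, something like the following should work (the exact URL pattern is my reading of the Open Library docs; the ISBN route redirects to the edition record, hence the -L):
> curl -sL https://openlibrary.org/isbn/9781250191786.json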
{
  "number_of_pages": 158,
  "description": {
    "type": "/type/text",
    "value": "\"Sci-Fi's favorite antisocial AI is back on a mission. The case against the too-big-to-fail GrayChris Corporation is floundering, and more importantly, authorities are beginning to ask more questions about where Dr. Mensah's SecUnit is. And Murderbot would rather those questions went away. For good.\""
  },
  "isbn_10": [
    "1250191785"
  ],
  "series": [
    "The Murderbot Diaries"
  ],
  "key": "/books/OL26966631M",
  "authors": [
    {
      "key": "/authors/OL221294A"
    }
  ],
  "isbn_13": [
    "9781250191786"
  ]
}
This is a partial view of what will be returned. Notice that authors contains an ID instead of the author’s name. Using OL221294A with the author API will then return Martha’s information.
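Something along these lines should fetch it, assuming the same URL pattern:
> curl -s https://openlibrary.org/authors/OL221294A.json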
{
  "name": "Martha Wells",
  "bio": "Martha Wells (born September 1, 1964) is an American writer of speculative fiction. She has published a number of fantasy novels, young adult novels, media tie-ins, short stories, and nonfiction essays on fantasy and science fiction subjects. Her novels have been translated into twelve languages. Wells has won a Nebula Award, two Locus Awards, and two Hugo Awards.",
  "alternate_names": [
    "Wells, Martha",
    "Wells, M.",
    "M. Wells"
  ],
  "links": [
    {
      "title": "Official Website",
      "url": "https://marthawells.com",
      "type": {
        "key": "/type/link"
      }
    }
  ]
}
Again, this is a partial view of what will be returned, but now I have Martha’s information. For my purposes, this should be enough to catalog my books. While I don’t have a ton of books, I would prefer to have the data locally to reduce calls to their APIs and to make it easier to run a bunch of tests.
Bulk Download
Now that I have an idea of the data I want, I should be able to download the following:
> wget https://openlibrary.org/data/ol_dump_authors_latest.txt.gz
> wget https://openlibrary.org/data/ol_dump_editions_latest.txt.gz
> wget https://openlibrary.org/data/ol_dump_works_latest.txt.gz
For example, according to the documentation, an author record should have the following properties:
- name of type /type/string
- eastern_order of type /type/boolean
- personal_name of type /type/string
- enumeration of type /type/string
- title of type /type/string
- alternate_names[] of type /type/string
- uris[] of type /type/string
- bio of type /type/text
- location of type /type/string
- birth_date of type /type/string
- death_date of type /type/string
- date of type /type/string
- wikipedia of type /type/string
- links[] of type /type/link
This is similar to what the API returns. Keep in mind that the total compressed size of all the files will be in the ~13GB range. Since I plan on using DuckDB, this shouldn’t be a problem.
DuckDB
While the Open Library documentation mentions openlibrary-search to help load the dumps into PostgreSQL with Python, I decided to use DuckDB, partly as an excuse to use and learn DuckDB more. The first step is to get an idea of what the data dump looks like. For now I will continue to focus on authors.
D describe from read_csv('ol_dump_authors_latest.txt.gz');
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ column0 │ VARCHAR │ YES │ NULL │ NULL │ NULL │
│ column1 │ VARCHAR │ YES │ NULL │ NULL │ NULL │
│ column2 │ BIGINT │ YES │ NULL │ NULL │ NULL │
│ column3 │ TIMESTAMP │ YES │ NULL │ NULL │ NULL │
│ column4 │ VARCHAR │ YES │ NULL │ NULL │ NULL │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
Initially not really helpful.
D summarize from read_csv('ol_dump_authors_latest.txt.gz');
100% ▕████████████████████████████████████████████████████████████▏
┌─────────────┬─────────────┬──────────────────────┬──────────────────────┬───────────────┬───┬──────────────────────┬──────────────────────┬──────────┬─────────────────┐
│ column_name │ column_type │ min │ max │ approx_unique │ … │ q50 │ q75 │ count │ null_percentage │
│ varchar │ varchar │ varchar │ varchar │ int64 │ │ varchar │ varchar │ int64 │ decimal(9,2) │
├─────────────┼─────────────┼──────────────────────┼──────────────────────┼───────────────┼───┼──────────────────────┼──────────────────────┼──────────┼─────────────────┤
│ column0 │ VARCHAR │ /type/author │ /type/author │ 1 │ … │ NULL │ NULL │ 14327506 │ 0.00 │
│ column1 │ VARCHAR │ /authors/OL10000001A │ /authors/OL9A │ 15582734 │ … │ NULL │ NULL │ 14327506 │ 0.00 │
│ column2 │ BIGINT │ 1 │ 691 │ 79 │ … │ 1 │ 1 │ 14327506 │ 0.00 │
│ column3 │ TIMESTAMP │ 2008-04-01 03:28:5… │ 2025-04-30 23:59:5… │ 12590164 │ … │ 2020-07-01 12:02:4… │ 2022-12-16 09:47:0… │ 14327506 │ 0.00 │
│ column4 │ VARCHAR │ {"alternate_names"… │ {"works": [{"key":… │ 12544761 │ … │ NULL │ NULL │ 14327506 │ 0.00 │
├─────────────┴─────────────┴──────────────────────┴──────────────────────┴───────────────┴───┴──────────────────────┴──────────────────────┴──────────┴─────────────────┤
│ 5 rows 12 columns (9 shown) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
This is a little bit better. We can see that column0 just tells us that the data contains authors. column1 is the author’s ID, which will be helpful for looking up the author after searching for a book by ISBN. column2 is probably not that useful, since it is an integer with only 79 unique values even though there are roughly 14 million rows. column3 is a timestamp and may be when the record was entered, since the range runs from 2008 to the end of April 2025 (note that I downloaded the data in May 2025). Finally, I’m not sure what column4 holds, though looking at the min and max columns it appears to be JSON data.
D select * from read_csv('ol_dump_authors_latest.txt.gz') limit 10;
┌──────────────┬──────────────────────┬─────────┬──────────────────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ column0 │ column1 │ column2 │ column3 │ column4 │
│ varchar │ varchar │ int64 │ timestamp │ varchar │
├──────────────┼──────────────────────┼─────────┼──────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ /type/author │ /authors/OL10000375A │ 1 │ 2021-12-26 22:14:2… │ {"type": {"key": "/type/author"}, "name": "Dennis Kott", "key": "/authors/OL10000375A", "source_records": [… │
│ /type/author │ /authors/OL10000593A │ 1 │ 2021-12-26 22:41:0… │ {"type": {"key": "/type/author"}, "name": "Henry Frierson", "key": "/authors/OL10000593A", "source_records"… │
│ /type/author │ /authors/OL10000780A │ 1 │ 2021-12-26 23:16:3… │ {"type": {"key": "/type/author"}, "name": "Gabi Farkas", "key": "/authors/OL10000780A", "source_records": [… │
│ /type/author │ /authors/OL10000964A │ 1 │ 2021-12-26 23:47:4… │ {"type": {"key": "/type/author"}, "name": "Andols Herrick", "key": "/authors/OL10000964A", "source_records"… │
│ /type/author │ /authors/OL10001395A │ 1 │ 2021-12-27 00:57:1… │ {"type": {"key": "/type/author"}, "name": "Josh Bleill", "key": "/authors/OL10001395A", "source_records": [… │
│ /type/author │ /authors/OL10001421A │ 1 │ 2021-12-27 01:01:1… │ {"type": {"key": "/type/author"}, "name": "Alfonso Cano Farragute", "key": "/authors/OL10001421A", "source_… │
│ /type/author │ /authors/OL10001453A │ 1 │ 2021-12-27 01:02:5… │ {"type": {"key": "/type/author"}, "name": "York County Heritage Trust", "key": "/authors/OL10001453A", "sou… │
│ /type/author │ /authors/OL10001580A │ 1 │ 2021-12-27 01:18:5… │ {"type": {"key": "/type/author"}, "name": "S. L. Ross", "key": "/authors/OL10001580A", "source_records": ["… │
│ /type/author │ /authors/OL10001594A │ 1 │ 2021-12-27 01:20:4… │ {"type": {"key": "/type/author"}, "name": "Sheila S. Hatch", "key": "/authors/OL10001594A", "source_records… │
│ /type/author │ /authors/OL10001656A │ 1 │ 2021-12-27 01:33:0… │ {"type": {"key": "/type/author"}, "name": "Amend, Jr., William J. C.", "key": "/authors/OL10001656A", "sour… │
├──────────────┴──────────────────────┴─────────┴──────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 10 rows 5 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Now we have a decent idea of what we are looking at. Let’s verify that I can find Martha Wells.
D .columns
D select column0
,column1 as author_id
,column2
,column3 as created
,json_extract(column4, '$.name') as name
,json_extract(column4, '$.personal_name') as personal_name
,json_extract(column4, '$.alternate_names') as alternate_names
,json_extract(column4, '$.bio') as bio
,json_extract(column4, '$.location') as location
,json_extract(column4, '$.birth_date') as birth_date
,json_extract(column4, '$.death_date') as death_date
,json_extract(column4, '$.date') as date
,json_extract(column4, '$.links') as links
from read_csv('ol_dump_authors_latest.txt.gz')
where author_id ilike '%OL221294A%'
limit 10;
100% ▕████████████████████████████████████████████████████████████▏
┌─────────────────┬───────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Column │ Type │ Row 1 │
├─────────────────┼───────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ column0 │ varchar │ /type/author │
│ author_id │ varchar │ /authors/OL221294A │
│ column2 │ int64 │ 14 │
│ created │ timestamp │ 2025-03-15 03:32:28.650222 │
│ name │ json │ "Martha Wells" │
│ personal_name │ json │ "Martha Wells" │
│ alternate_names │ json │ ["Wells, Martha","Wells, M.","M. Wells"] │
│ bio │ json │ "Martha Wells (born September 1, 1964) is an American writer of speculative fiction. She has published a number of fantasy novels, young adult novel… │
│ location │ json │ NULL │
│ birth_date │ json │ "1 September 1964" │
│ death_date │ json │ NULL │
│ date │ json │ NULL │
│ links │ json │ [{"title":"Official Website","url":"https://marthawells.com","type":{"key":"/type/link"}},{"title":"Official Blog","url":"https://marthawells.dreamw… │
└─────────────────┴───────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Notice that I used .columns since I was pretty sure I was going to get one row, so I returned the data column-wise. Now the data is looking better. If I were going to save the data to a table, I would want to do some further cleaning.
Transform the Data
I should be able to convert some of the columns from JSON to strings. Also, for the author ID, I don’t want to include the /authors/ prefix. If I can do this while also dropping column0, that would be great. I could then create a SQL script to either pull in all the data, or only add the authors whose books I actually have.
D select column0
,column1[10:] as author_id
,column2
,column3 as created
,json_extract_string(column4, '$.name') as name
,json_extract_string(column4, '$.personal_name') as personal_name
,json_extract(column4, '$.alternate_names') as alternate_names
,json_extract_string(column4, '$.bio') as bio
,json_extract_string(column4, '$.location') as location
,json_extract_string(column4, '$.birth_date') as birth_date
,json_extract_string(column4, '$.death_date') as death_date
,json_extract_string(column4, '$.date') as date
,json_extract(column4, '$.links') as links
from read_csv('ol_dump_authors_latest.txt.gz')
where author_id ilike '%OL221294A%'
limit 10;
100% ▕████████████████████████████████████████████████████████████▏
┌─────────────────┬───────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Column │ Type │ Row 1 │
├─────────────────┼───────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ column0 │ varchar │ /type/author │
│ author_id │ varchar │ OL221294A │
│ column2 │ int64 │ 14 │
│ created │ timestamp │ 2025-03-15 03:32:28.650222 │
│ name │ varchar │ Martha Wells │
│ personal_name │ varchar │ Martha Wells │
│ alternate_names │ json │ ["Wells, Martha","Wells, M.","M. Wells"] │
│ bio │ varchar │ Martha Wells (born September 1, 1964) is an American writer of speculative fiction. She has published a number of fantasy novels, young adult novels… │
│ location │ varchar │ NULL │
│ birth_date │ varchar │ 1 September 1964 │
│ death_date │ varchar │ NULL │
│ date │ varchar │ NULL │
│ links │ json │ [{"title":"Official Website","url":"https://marthawells.com","type":{"key":"/type/link"}},{"title":"Official Blog","url":"https://marthawells.dreamw… │
└─────────────────┴───────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
D
Much better. Using json_extract_string() and string slices, I am almost there. Now I want to get rid of column0, which I should be able to do using a subquery.
D select column1[10:] as author_id
,column3 as created
,json_extract_string(column4, '$.name') as name
,json_extract_string(column4, '$.personal_name') as personal_name
,json_extract(column4, '$.alternate_names') as alternate_names
,json_extract_string(column4, '$.bio') as bio
,json_extract_string(column4, '$.location') as location
,json_extract_string(column4, '$.birth_date') as birth_date
,json_extract_string(column4, '$.death_date') as death_date
,json_extract_string(column4, '$.date') as date
,json_extract(column4, '$.links') as links
from (select *
from read_csv('ol_dump_authors_latest.txt.gz')
where column1 ilike '%OL221294A%'
)
limit 10;
100% ▕████████████████████████████████████████████████████████████▏
┌─────────────────┬───────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Column │ Type │ Row 1 │
├─────────────────┼───────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ author_id │ varchar │ OL221294A │
│ created │ timestamp │ 2025-03-15 03:32:28.650222 │
│ name │ varchar │ Martha Wells │
│ personal_name │ varchar │ Martha Wells │
│ alternate_names │ json │ ["Wells, Martha","Wells, M.","M. Wells"] │
│ bio │ varchar │ Martha Wells (born September 1, 1964) is an American writer of speculative fiction. She has published a number of fantasy novels, young adult novels… │
│ location │ varchar │ NULL │
│ birth_date │ varchar │ 1 September 1964 │
│ death_date │ varchar │ NULL │
│ date │ varchar │ NULL │
│ links │ json │ [{"title":"Official Website","url":"https://marthawells.com","type":{"key":"/type/link"}},{"title":"Official Blog","url":"https://marthawells.dreamw… │
└─────────────────┴───────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
D
This looks like the data I will add to my database. It also wasn’t too slow on my machine, even though it was querying a compressed file.
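If I decide to pull in all of the data, a minimal sketch of that script could look like the following. The table name authors is a placeholder of my choosing, and as written this loads all ~14 million authors; a where clause could restrict it to only the authors I care about.
D create table authors as
  select column1[10:] as author_id  -- strip the /authors/ prefix
        ,column3 as created
        ,json_extract_string(column4, '$.name') as name
        ,json_extract_string(column4, '$.personal_name') as personal_name
        ,json_extract(column4, '$.alternate_names') as alternate_names
        ,json_extract_string(column4, '$.bio') as bio
        ,json_extract(column4, '$.links') as links
  from read_csv('ol_dump_authors_latest.txt.gz');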
Conclusion
In the past I would have used Pandas to pull in the data before dumping it into PostgreSQL. Pandas can still be useful for this level of cleaning; however, due to the size of the data, I would probably need to be careful about loading it all into memory. It is also certainly possible to do this with PostgreSQL alone, but PostgreSQL’s data loading is much more fragile. While not covered in this article, it is possible to attach PostgreSQL to DuckDB, do the work in DuckDB, and save the results to PostgreSQL.
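For completeness, here is a rough sketch of that last idea using DuckDB’s postgres extension. The connection string is a placeholder, and this assumes an authors table like the one sketched above:
D install postgres;
D load postgres;
D attach 'dbname=books user=me host=localhost' as pg (type postgres);
D create table pg.authors as select * from authors;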
This initial analysis helps me understand the data structure and how I would like to design my program for data ingestion. DuckDB can be an excellent alternative for data ingestion and cleaning.