HathiTrust Usage Analytics & Metadata Analysis Notes

High-Level Steps

  • Scrape analytics
  • Process Analytics to extract volume IDs and usage counts
    • Fix the dollar sign barcode issue
  • Ingest Hathifiles into postgres database
  • Match IDs from analytics to current metadata in Hathifiles
  • Create visualizations of interesting facets

Process Notes

Scrape analytics

Using Pyganalytics: https://github.com/chrpr/pyganalytics

Scrape the analytics using something like the following (run it 4x, once for each quarter of the year, adjusting the date ranges in the command as needed):

for i in {4..7}; do time python analytics.py -o ~/PATH/TO/DATA/OUTPUT/uniqueEvents_201$i\_01_01-201$i\_12_31.csv -c ~/PATH/TO/CONFIG/hathi_events_config.yml -f 201$i-01-01 -l 201$i-12-31; done

The YAML config file for the HathiTrust pageturner analytics account is the following:

query:
    metrics: ga:uniqueEvents
    dimensions: ga:pagePath
    sort:
    filters:
profile: 26478114

Note: I adjusted the delimiters used in the Pyganalytics analytics.py script because page paths in Google Analytics contain everything under the sun, and I needed a field separator unique enough for pandas to recognize. This also means you have to use the Python parsing engine when reading the CSVs into pandas, but given that this is a one-time operation, I think doing less surgery on the analytics CSVs is worth the potential slowdown.
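
For reference, reading the scraped CSVs back in with that custom separator looks something like the commented-out queries() cell at the end of these notes (a sketch; adjust the path and delimiter to whatever you actually used):

import dask.dataframe as dd

#The multi-character '|~' separator is treated as a regex, which forces the slower Python parsing engine
df = dd.read_csv('./data/uniqueEvents_201*.csv', delimiter='\|\~', engine='python')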

I also edited the analytics.py script to run daily instead of weekly in order to try and capture more granular results. See notes on how to do that in the Pyganalytics readme.

Notes:

  • These tended to run 30-60 minutes per quarter on my machine, via a wireless connection
  • Broken into annual quarters just to keep file sizes more manageable and to limit data loss if the API errored out mid-file
  • This uses Google Analytics API v3, not the current v4, so it may break sooner or later. Set up an API key as described in the readme
  • There is a 10,000 API call per day limit per profile ID for the Google Analytics API; as a result, I had to run this over the course of 3 days.

Other notes

Also something to explore: there is a limited set of results with zero pageviews, even when limiting to event triggers. But since something is triggering the sending of these pages to analytics even with an event count of zero, I'm opting to include them for the moment.

Setup

In [2]:
!pip3 install sqlalchemy
!pip3 install rpy2
Requirement already satisfied: sqlalchemy in /usr/local/lib/python3.6/site-packages (1.2.7)
Requirement already satisfied: rpy2 in /usr/local/lib/python3.6/site-packages (2.9.3)
Requirement already satisfied: six in /usr/local/lib/python3.6/site-packages (from rpy2) (1.11.0)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.6/site-packages (from rpy2) (2.10)
Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.6/site-packages (from jinja2->rpy2) (1.0)
In [83]:
#For inline matplotlib plots
%matplotlib inline

#For data manipulation and analysis
import pandas as pd
import dask.dataframe as dd #USE `pip3 install 'dask[dataframe]'`
from dask.diagnostics import ProgressBar

#For plotting and dataviz
import matplotlib.pyplot as plt
import seaborn as sns 
sns.set_context("talk", font_scale=1.2)

#So we can use the better ggplot2 R packages to create some of the diagrams below
from rpy2.robjects import r, pandas2ri
pandas2ri.activate()

%load_ext rpy2.ipython
%R library(tidyverse)

#Dask progress bar so we can get a view into long-running processes
pbar = ProgressBar()
pbar.register()
The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython

Process Analytics to extract volume IDs and usage counts

In [5]:
def extract_ids(): 
    
    '''
    Uses Dask to extract HathiTrust IDs from the raw analytics logs and write them to a series of CSVs
    '''
 
    df = dd.read_csv('./data/HathiTrust_unique_events_201*.csv')
    
    #Mix of my regex experimentation and pattern supplied by Angelina Z at Hathi:
    pattern = '(?:id=|[a-z0-9]\/)([a-z][a-z0-9]{1,3}\.\$?[a-z0-9._:\/\-]+)'
    
    #Extract the ID matches into a new column   
    df['id'] = df['ga:pagePath'].str.extract(pattern, expand=False)
    
    #Limit to just rows with ID matches
    df = df[df['id'].notnull()]

    #Remove rows that have 'skin=crms'
    df = df[df['ga:pagePath'].str.contains('skin=crms') == False]
    
    #Remove some junk punctuation from the end of some IDs
    df['id'] = df['id'].str.rstrip('._#')
    
    #Rename 'ga:uniqueEvents' column to 'count'
    df = df.rename(columns={'ga:uniqueEvents': 'count'})
    
    #Write the results to csvs
    df.to_csv('./data/all_ids_*.csv', index=False)
    
#This takes roughly 5min to run

%time extract_ids()
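
As a quick sanity check on that pattern, here's what it pulls out of a couple of simplified, made-up pagePath values (the real paths are much messier):

import re

pattern = '(?:id=|[a-z0-9]\/)([a-z][a-z0-9]{1,3}\.\$?[a-z0-9._:\/\-]+)'

#Simplified example paths, just to illustrate the capture group
for path in ['/cgi/pt?id=mdp.39015054061430;view=1up;seq=7',
             '/cgi/pt?id=uc1.$b99721;view=1up;seq=1']:
    print(re.search(pattern, path).group(1))

#Should print mdp.39015054061430 and uc1.$b99721
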
In [14]:
def ids_count(): 

    '''
    Uses the output CSVs of the extract_ids() function to pair each volume ID with its total count
    (either unique or aggregate pageview counts, depending on how the analytics were scraped)
    and writes the result to a CSV
    '''
    
    #Read in all the files with the extracted IDs and their counts
    df = dd.read_csv('./data/all_ids_*.csv')
    
    #Group that data by volume identifier, and then record the sum total of all hit counts
    ids = df.groupby(by=['id'])['count'].sum()
    
    #Turn dask dataframe into pandas dataframe so we can use sort_values (not implemented yet in dask)
    ids = ids.compute()    
    
    #Name the index and the series, then export to a CSV
    ids.index.name = 'id'
    ids.name = 'count'
    ids = ids.sort_values(ascending=False).to_csv('./csv/all_counts_sorted.csv', header=True)

#This takes roughly 12min to run
%time ids_count()
[########################################] | 100% Completed |  3min  5.3s
[########################################] | 100% Completed |  3min  5.4s
[########################################] | 100% Completed |  3min  5.5s
[########################################] | 100% Completed |  0.1s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.3s
CPU times: user 2min 42s, sys: 21.9 s, total: 3min 4s
Wall time: 3min 39s
In [19]:
pd.options.mode.chained_assignment = None

def fix_dollar_sign_ids():

    '''
    Fix the UC '$' barcode issue (some UC volume IDs, e.g. uc1.$b99721, show up both with and
    without the dollar sign), and roll up the total counts so they're only counted
    as single items with all hits. There's some ugly pandas in here, could certainly
    be more efficient, but it's quick enough that I'm calling it done for now
    '''
    
    #Need numpy briefly to do the conditional check later using .where() method
    import numpy as np
    
    #Import the extracted IDs and counts
    df = pd.read_csv('./csv/all_counts_sorted.csv')
    
    #New dataframe of just the potentially affected dollar sign IDs
    dollars = df[df['id'].str.contains('\$')]

    #New column with the id version minus dollar sign
    dollars.loc[:,'fixed_id'] = dollars['id'].copy().str.replace('$','')
    
    #Merge with original data to get access to all counts we need to sum
    merged = df.merge(dollars, left_on='id', right_on='fixed_id', suffixes=['_df','_dollars'])

    #Sum the hit counts for the dollar sign IDs and the non dollar sign IDs
    merged.loc[:,'total'] = merged['count_df'] + merged['count_dollars']

    #Merge the totals with the original dataset again, but in a new df just to be careful
    df2 = df.merge(merged, how='outer', left_on='id', right_on='id_dollars')
    
    #Update the original 'count' column to the holistic total where needed
    df2.loc[:, 'count'] = np.where(df2['total'].notnull() == True, df2['total'], df['count'])
    
    #Remove the non-dollar sign IDs from the df, since the dollar-sign-id totals now reflect 
    #the holistic sum. Also removes the now extraneous extra columns
    df2 = df2[df2['id'].isin(df2['id_df']) == False].copy()[['id','count']]
    
    #Write it all out to a new csv
    df2.sort_values('count', ascending=False).to_csv('./csv/all_counts_sorted_dollar_fixed.csv')

#This takes like 35 seconds to run
%time fix_dollar_sign_ids()
CPU times: user 32.5 s, sys: 2.6 s, total: 35.1 s
Wall time: 39.7 s
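
To make the '$' issue concrete, here's a toy version of the roll-up (the ID forms are real, the counts made up). Unlike fix_dollar_sign_ids() it keeps the dollar-stripped form as the key, but it shows the idea:

#Toy illustration of the dollar-sign roll-up (made-up counts)
toy = pd.DataFrame({'id': ['uc1.$b99721', 'uc1.b99721', 'mdp.39015054061430'],
                    'count': [100, 25, 7]})
toy['fixed_id'] = toy['id'].map(lambda s: s.replace('$', ''))
print(toy.groupby('fixed_id', as_index=False)['count'].sum())
#The two uc1 rows collapse into a single row with a count of 125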

Postgres import of Hathifiles

In [4]:
#Set up the postgres connection

import sqlalchemy as sa

def connect(user, password, db, host='localhost', port=5432):
    '''Returns a connection and a metadata object'''
    # We connect with the help of the PostgreSQL URL
    url = 'postgresql://{}:{}@{}:{}/{}'
    url = url.format(user, password, host, port, db)

    # The return value of create_engine() is our connection object
    con = sa.create_engine(url, client_encoding='utf8')

    # We then bind the connection to MetaData() and reflect the existing tables
    # (reflect() returns None, so don't chain it off the constructor)
    meta = sa.MetaData(bind=con)
    meta.reflect()

    return con, meta

con, meta = connect('postgres', '', 'hathifiles')
/usr/local/lib/python3.6/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
  """)
In [21]:
#Import the Hathifiles into a postgres database


'''
This cell does require one step not included here, which is adding the header row 
to the txt file with the column names. I do this in an inelegant but fine way: 
- Copy row headers, including tab delimiters, to new file
- Append text of massive hathi text file (~4gb) to that new file 
    `cat hathi_full_20171101.txt >> headers_for_hathi_full_20171101.txt`
- Delete old file, and rename new file same as the old (but now includes column headers)

'''

#Struggled for a long time with this, but turns out the delimiter needs to be r'\t', not just '\t'
#hathi_data = pd.read_csv('./data/hathifiles/hathi_full_20180501.txt', engine='python', delimiter=r'\t', encoding='utf-8', chunksize=1000000)


def postgres_import():
    i = 0
    for chunk in hathi_data:
        chunk = chunk[['id','access', 'rights', 'hathitrust_record_number', 'enumeration_chronology', 'source', 'source_institution_record_number', 'oclc_numbers', 'isbns', 'issns', 'lccns', 'title', 'imprint', 'rights_determination_reason_code', 'date_of_last_update', 'government_document', 'publication_date', 'publication_place' ,'language', 'bibliographic_format', 'collection_code', 'content_provider_code', 'responsible_entity_code', 'digitization_agent_code']]
        try:
            chunk.to_sql('hathifiles', con, if_exists='append')
            print (i, chunk.index[0])
            i += 1
        except:
            print (chunk.index[0])

#This takes roughly 2 hours 10min to run on my macbook air
#%time postgres_import()
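
For what it's worth, the header-prepending step described in the docstring above can also be done from Python. A minimal sketch: the column names are the ones selected in postgres_import(), and I'm assuming they cover the Hathifile's fields in order -- check the Hathifiles documentation for the authoritative header before trusting this.

import shutil

#Column names as used in postgres_import() above; verify against the Hathifiles docs
cols = ['id', 'access', 'rights', 'hathitrust_record_number', 'enumeration_chronology',
        'source', 'source_institution_record_number', 'oclc_numbers', 'issns', 'isbns',
        'lccns', 'title', 'imprint', 'rights_determination_reason_code', 'date_of_last_update',
        'government_document', 'publication_date', 'publication_place', 'language',
        'bibliographic_format', 'collection_code', 'content_provider_code',
        'responsible_entity_code', 'digitization_agent_code']

#Write the tab-delimited header line, then stream the ~4gb Hathifile in after it
with open('headers_for_hathi_full_20171101.txt', 'wb') as out:
    out.write(('\t'.join(cols) + '\n').encode('utf-8'))
    with open('hathi_full_20171101.txt', 'rb') as src:
        shutil.copyfileobj(src, out)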

Match IDs from analytics to current metadata in Hathifiles

In [22]:
def get_access_and_date():

    gf = pd.read_csv('./csv/all_counts_sorted_dollar_fixed.csv')

    gf['access'] = ''
    gf['date'] = ''
    gf['title'] = ''
    gf['oclc'] = ''
#     gf['format'] = ''
#     gf['pub_place'] = ''
    print (len(gf))

    header = 'id,title,access,date,oclc,gov_doc,publication_place,language,bib_format'
    text_file = open("./csv/all_id_title_access_date_oclc_languages_formats.csv", "w")
    text_file.write(header+'\n')
    text_file.close()
    
    for i in range(0,len(gf),250000):
        xf = gf[i:(i+250000)]
        ids = []
        for index, row in xf.iterrows():
            ids.append(row['id'])
        x = pd.read_sql_query("select id, title, access, publication_date, oclc_numbers, government_document, publication_place, language, bibliographic_format \
                                from hathifiles where id in"+str(tuple(ids)), con=con)
        x.to_csv('./csv/all_id_title_access_date_oclc_languages_formats.csv', mode='a', encoding='utf-8', header=False, index=False)
        if i % 10000 == 0:
            print (i)

#This takes about 33min if the postgres table has an index on "id"            
#%time get_access_and_date()
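
The index mentioned in the timing note can be created through the same SQLAlchemy engine. A one-line sketch (the index name is my own choice; IF NOT EXISTS needs Postgres 9.5+):

#Create the "id" index the timing note above relies on
con.execute('CREATE INDEX IF NOT EXISTS hathifiles_id_idx ON hathifiles (id);')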

Analysis and viz steps

Some cells below are just checks to make sure outputs look roughly correct; others set up data for visualization or other manipulation.

In [5]:
#Read our various CSVs into dataframes so we can work with them 
counts = pd.read_csv('./csv/all_counts_sorted.csv')
d_counts = pd.read_csv('./csv/all_counts_sorted_dollar_fixed.csv')
full = pd.read_csv('./csv/all_id_title_access_date_oclc_languages_formats.csv')
In [24]:
# Gut check on dollar sign fix and missing volumes
cs = set(counts['id'])
ds = set(d_counts['id'])
fs = set(full['id'])

#Total number of dollar sign ids fixed
print("dollar sign ids fixed: %s" % (len(cs - ds)))


'''
To get a sense of whether we've missed things with the above data transformations,
this spits out the things that the parsing found as IDs but couldn't find in the Hathifiles.
Generally, there are a few hundred here: things that have been added to Hathi via ingest, and are therefore
available to analytics events, but haven't yet had their metadata added to the monthly Hathifile updates.
Gut checking, a delta of less than a thousand a month seems ok, and not something I'm super worried about 
in terms of skewing data
'''
diff = ds - fs

missing = []
def print_missing():
    for d in diff: 
        if "uc1" not in d:
            missing.append(d)
        else:
            pass
    print ("Total non-dollar sign IDs in data, but not found in hathifiles: %s" % (len(missing)))

    for item in missing:
        print (item)
    
#print_missing()
dollar sign ids fixed: 456
In [6]:
'''This merges the metadata extracted from the Hathifiles with the top counts from the analytics,
and spits out the list of the most viewed items in HathiTrust
But this could easily be tweaked to show top NYPL items, top items that were denied access (excellent ILL candidates),
top items published in a given country, etc. '''
counts_ids = full.merge(d_counts, on='id', suffixes=['_full','_counts'])
counts_ids = counts_ids.drop('Unnamed: 0', axis=1)
In [7]:
#Double check -- how many items had event triggers recorded in order to appear in the analytics, but had the count recorded as zero?
zeroes = counts_ids[counts_ids['count'] < 1]
print (len(zeroes))
0
In [86]:
#full = pd.read_csv('./csv/all_id_title_access_date_oclc.csv')

allow = full[full.access == 'allow']
deny = full[full.access == 'deny']

print ("%s total volumes that triggered analytics events in the collected data" % len(full))

print ("%s total open volumes have triggered analytics events in the collected data" % len(allow))

print ("%s total limited view volumes have triggered analytics events in the collected data" % len(deny))
4810938 total volumes that triggered analytics events in the collected data
3268810 total open volumes have triggered analytics events in the collected data
1542128 total limited view volumes have triggered analytics events in the collected data

Top title analysis

Top 25 titles in Hathi

In [107]:
#Top 25 titles in Hathi
counts_ids[['id','title','date','count']].sort_values('count', ascending=False).head(25)
Out[107]:
id title date count
98680 mdp.39015054061430 Quicksand, by Nella Larsen. 1928.0 127787.0
70341 mdp.39015011274175 The surnames of Scotland, their origin meaning... 1962.0 77211.0
61560 mdp.39015004111095 Godey's magazine. 1850.0 70049.0
57531 mdp.39015000566789 America is in the heart, a personal history, b... 1946.0 61899.0
166025 pst.000057937434 The human figure / by John H. Vanderpoel. 1907.0 61169.0
183186 uc1.32106007458745 History of wages in the United States from Col... 1934.0 61035.0
104139 mdp.39015064340733 Solid mensuration, by Willis F. Kern and James... 1934.0 53630.0
57749 mdp.39015000804453 Perfume and flavor materials of natural origin. 1960.0 53452.0
66525 mdp.39015008158415 Quintus Curtius [History of Alexander] with an... 1946.0 45034.0
176102 uc1.$b99721 The five laws of library science, by S. R. Ran... 1931.0 44317.0
93003 mdp.39015038069475 Return to life through contrology, by Joseph H... 1960.0 43525.0
111450 mdp.39015071886035 [Publications] 9999.0 38244.0
167555 pur1.32754077064610 Investigation of Korean-American relations : R... 1978.0 37209.0
64981 mdp.39015006749868 Modern California houses; case study houses, 1... 1962.0 35388.0
96448 mdp.39015048226941 The lesson of Japanese architecture. 1954.0 33520.0
236936 wu.89059402255 Roster of the Confederate soldiers of Georgia,... 9999.0 33184.0
73616 mdp.39015014103017 The book of a hundred hands. 1920.0 32199.0
57261 mdp.39015000379902 Propaganda technique in the World War [by] Har... 1938.0 32115.0
236938 wu.89059402289 Roster of the Confederate soldiers of Georgia,... 9999.0 30867.0
66524 mdp.39015008158407 Quintus Curtius [History of Alexander] with an... 1946.0 30823.0
218162 uiug.30112101024682 A short guide to New Zealand. 1943.0 30728.0
176728 uc1.31158001963940 Sears; [catalog] 1897.0 30299.0
236184 wu.89054395496 Architecture and furniture: Aalto. 1938.0 29711.0
195221 uc1.b4164139 Circuit analysis of A-C power systems; symmetr... 1950.0 29455.0
43993 inu.30000007109121 Pennsylvania German pioneers; a publication of... 1934.0 28781.0

Top 25 limited view titles in Hathi

In [109]:
#Top 25 limited view titles in Hathi
counts_ids[counts_ids.access == 'deny'][['id','title','date','count']].sort_values('count', ascending=False).head(25)
Out[109]:
id title date count
63936 mdp.39015005717635 Insularismo; ensayos de interpretación puertor... 1942.0 10496.0
45447 inu.30000103012815 Kasaita / na Maryam Kabir Mashi. 9999.0 7562.0
106335 mdp.39015066789838 Theogony ; and, Works and days / Hesiod ; tran... 2006.0 7176.0
46203 inu.30000124268446 Rufaida ko mufida? / na Hadiza Salisu Sharif. 9999.0 5947.0
245007 wu.89089669436 Elements of applied thermodynamics, by Robert ... 1951.0 4207.0
183515 uc1.32106010955992 NCRP report 9999.0 4119.0
198128 uc1.b4906221 The competent manager : a model for effective ... 1982.0 3905.0
95880 mdp.39015046422120 Nectar in a sieve, a novel. 1954.0 3838.0
179916 uc1.31822003845724 The Naval Institute guide to world naval weapo... 1989.0 3660.0
122620 mdp.39076006350719 The theory of spherical and ellipsoidal harmon... 1931.0 3374.0
69304 mdp.39015010576356 Objects of daily use, with over 1800 figures f... 1927.0 3290.0
69301 mdp.39015010574575 Catalogue of Alexandrian coins, 1933.0 2493.0
111334 mdp.39015071754159 The Michigan daily. 1969.0 2424.0
98163 mdp.39015052047589 American archival studies : readings in theory... 2000.0 2355.0
183962 uc1.32106017264836 Female pipings in Eden, by Ethel Smyth .. 1934.0 2178.0
238807 wu.89062880828 Launcelot Granger of Newbury, Mass., and Suffi... 1989.0 2161.0
237107 wu.89060437159 A genealogical memoir of the Lo-Lathrop family... 1971.0 2139.0
238137 wu.89062511985 A genealogy of the Van Winkle family : account... 1996.0 2081.0
240405 wu.89066083726 Genealogical and biographical notices of desce... 1996.0 2064.0
235650 wu.89047254495 The co-operative directory : a complete record... 1940.0 2063.0
45345 inu.30000095331652 Calendar of inquisitions miscellaneous, Chance... 9999.0 1950.0
240548 wu.89066140377 Genealogy of the Folsom family : a revised and... 9999.0 1921.0
45446 inu.30000103009852 Uwar miji ko kishiya? / na Zahra'u Baba Abdull... 9999.0 1891.0
241737 wu.89067471904 Erie County, New York obituaries, as found in ... 1992.0 1838.0
70805 mdp.39015011482067 Württembergisches Adels- und Wappenbuch / im A... 1975.0 1814.0

Top 25 Hathi volumes scanned from NYPL collections

In [111]:
#Top 25 Hathi volumes scanned from NYPL collections
counts_ids[counts_ids['id'].str.startswith('nyp') == True][['id','title','date','count']].sort_values('count', ascending=False).head(25)
Out[111]:
id title date count
152464 nyp.33433076064025 Miranda / by Grace Livingston Hill Lutz ... ; ... 1915.0 8362.0
147349 nyp.33433066397708 Illustrated catalogue of hand and power pumps,... 1903.0 6644.0
149765 nyp.33433069455859 Illustrated trade catalogue and price list : m... 1897.0 6603.0
142114 nyp.33433000335228 Glossarium ad scriptores mediae et infimae Lat... 1736.0 6486.0
153570 nyp.33433081675450 Godey's magazine. 1831.0 4961.0
150488 nyp.33433072182490 Art monograms and lettering, by J.M. Bergling,... 1912.0 4849.0
145673 nyp.33433023758695 Report of the Committee of Secrecy on the Bank... 1832.0 4578.0
155542 nyp.33433081893293 L'Egypte à L'Exposition universelle de 1867 /... 1867.0 4493.0
155125 nyp.33433081844692 A standard history of Stark County, Ohio : an ... 1916.0 4409.0
145650 nyp.33433023615366 Regimental colors of the German armies in the ... 1911.0 4354.0
142116 nyp.33433000335244 Glossarium ad scriptores mediae et infimae Lat... 1736.0 4051.0
143221 nyp.33433006773448 Home needlework magazine ... 1912.0 3955.0
144319 nyp.33433009488465 The law reports,. under the superintendence an... 1884.0 3805.0
142119 nyp.33433000335277 Glossarium ad scriptores mediae et infimae Lat... 1736.0 3734.0
156150 nyp.33433082137914 Wife no. 19, or the story of a life in bondage... 1875.0 3725.0
146016 nyp.33433037323635 Illustrated catalogue of Seth Thomas, New Have... 1878.0 3717.0
148004 nyp.33433066642897 American chess magazine. 1899.0 3683.0
160035 nyp.33433105621522 The Graphic : an illustrated weekly newspaper. 1874.0 3451.0
160461 nyp.33433112041938 A book of verses / by William Ernest Henley. 1888.0 3433.0
150364 nyp.33433071393981 Chinese pictures : notes on photographs made i... 1900.0 3388.0
158553 nyp.33433087540096 Hardware manufactured by P. & F. Corbin. 1905.0 3334.0
153583 nyp.33433081675583 Godey's magazine. 1850.0 3302.0
143022 nyp.33433006349736 A specimen of printing types, and ornaments, c... 1828.0 3297.0
151334 nyp.33433075730378 2835 Mayfair : a novel / by Frank Richardson. 1907.0 3290.0
154718 nyp.33433081810370 Colonial families of the Southern states of Am... 1911.0 3282.0

Publication Year Analysis

N.B. See the interactive dataviz for a larger and more useful view of this data

In [10]:
all_years = full[(full.date > 1799) & (full.date < 2018)].groupby('date')['id'].count()
allow_years = allow[(allow.date > 1799) & (allow.date < 2018)].groupby('date')['id'].count()
deny_years = deny[(deny.date > 1799) & (deny.date < 2018)].groupby('date')['id'].count()
In [46]:
#Plots publication date of volumes with analytics events, as full-view vs limited view volumes

#Figure 1

#all_years.plot()
allow_years.plot(figsize=(10,7))
deny_years.plot(figsize=(10,7))
plt.xlabel('Date of publication')
plt.ylabel('total number of volumes')
plt.title('"Full View" (blue) and "Limited View" (orange) clicked volumes in HathiTrust');
In [12]:
#This grabs all dates from the postgres DB to do some data analysis and histograms, etc.

all_dates = pd.read_sql_query("SELECT DISTINCT publication_date, count(publication_date) \
           from hathifiles GROUP BY publication_date ORDER BY publication_date ASC", con=con)
all_dates_a = pd.read_sql_query("SELECT DISTINCT publication_date, count(publication_date) \
           from hathifiles WHERE access = 'allow' GROUP BY publication_date ORDER BY publication_date ASC", con=con)
all_dates_d = pd.read_sql_query("SELECT DISTINCT publication_date, count(publication_date) \
           from hathifiles WHERE access = 'deny' GROUP BY publication_date ORDER BY publication_date ASC", con=con)
In [13]:
dates = all_dates[(all_dates.publication_date < 2018) & (all_dates.publication_date > 1799)]
dates_a = all_dates_a[(all_dates_a.publication_date < 2018) & (all_dates_a.publication_date > 1799)]
dates_d = all_dates_d[(all_dates_d.publication_date < 2018) & (all_dates_d.publication_date > 1799)]
In [28]:
dates.index = dates.publication_date
dates_a.index = dates_a.publication_date
dates_d.index = dates_d.publication_date
dates_a.loc[:,'accessed'] = allow_years

dates.loc[:,'full view'] = dates_a['count']
dates.loc[:,'limited view'] = dates_d['count']
dates.loc[:,'full view accessed'] = allow_years
dates.loc[:,'limited view attempted'] = deny_years

#dates_a
#dates

#To export all counts, full view counts, and limited view counts to a CSV for easier use later: 
#dates.to_csv('./csv/all_years_data.csv')
/usr/local/lib/python3.6/site-packages/pandas/core/indexing.py:543: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s

Utilization analysis

Below takes a look at the total number of openly available volumes, and what percentage have been accessed since mid-2013 (figure 2)

In [88]:
#Calculate raw utilization rate

util = dates['full view accessed'] / dates['count']
dates_a.loc[:, 'percent'] = dates_a['accessed'] / dates_a['count'] #assign on dates_a itself, not on a throwaway copy

print ("The overall average utilization rate for open volumes 1800-2017 is: {0:.2f}%\n".format(100*dates_a['percent'].mean()))
print ("The average utilization rate for open volumes 1800-1875 is: {0:.2f}%\n".format(100*dates_a[1800:1875]['percent'].mean()))
print ("The average utilization rate for open volumes 1876-1922 is: {0:.2f}%\n".format(100*dates_a[1876:1922]['percent'].mean()))
print ("The average utilization rate for open volumes 1923-1962 is: {0:.2f}%\n".format(100*dates_a[1923:1962]['percent'].mean()))
print ("The average utilization rate for open volumes 1963-2017 is: {0:.2f}%\n".format(100*dates_a[1963:2017]['percent'].mean()))
The overall average utilization rate for open volumes 1800-2017 is: 56.08%

The average utilization rate for open volumes 1800-1875 is: 60.90%
The average utilization rate for open volumes 1876-1922 is: 51.86%
The average utilization rate for open volumes 1923-1962 is: 61.47%
The average utilization rate for open volumes 1963-2017 is: 49.09%
In [89]:
#plot open volume utilization per year

#Figure 2

dates_a['percent'].plot(figsize=(10,7)).set_ylim([0,1])
plt.xlabel('Date of publication')
plt.ylabel('Percentage of open volumes accessed')
plt.title('"Full View" volumes utilization per year');
In [31]:
# #Plot publication year distribution of accessed volumes (blue, left) against utilization rate by year (red, right)
# import matplotlib.ticker as ticker

# fig, ax1 = plt.subplots(figsize=(10,7))
# ax1.plot(allow_years, color='b')
# ax1.set_ylabel('# of works', color='b')
# ax1.tick_params('y', colors='b')
# #ax1.set_yticks([0,10000])
# ax1.set_ylim([0,70000])

# ax2 = ax1.twinx()
# ax2.plot(util, color='r')
# ax2.set_ylabel('% utilized', color='r')
# ax2.tick_params('y', colors='r')
# ax2.set_ylim([0,1])
# ax2.yaxis.set_major_locator(ticker.MultipleLocator(1 / 7))

Rough usage weighting

While the above utilization plot (figure 2) takes a binary view of each volume in a given year -- either clicked on, or not clicked on -- the following plot (figure 3) shows the average views per item per year (ending at publication date 2015 to exclude some outliers that skew the picture because of the limited number of volumes and hits in the analytics).

Even more clearly than the figure above, this demonstrates that the materials from 1923 to 1963, corresponding to the CRMS review period, are disproportionately used, even after accounting for the variable number of volumes in a given year.

In [34]:
#Charting average view counts per volume per year

#Figure 3

avg_usage = counts_ids[(counts_ids.date > 1799) & (counts_ids.date < 2015)].groupby('date')['count'].mean()
#avg_usage_non_0 = counts_ids[(counts_ids.date > 1799) & (counts_ids.date < 2015) & (counts_ids['count'] > 0)].groupby('date')['count'].mean()
In [69]:
avg_usage.plot(figsize=(10,7))
#avg_usage_non_0.plot(figsize=(10,7))
plt.xlabel('Date of publication')
plt.ylabel('Average events per volume')
plt.title('Average usage per volume per year');
In [36]:
avg_usage_a = counts_ids[(counts_ids.date > 1799) & (counts_ids.date < 2015) & (counts_ids.access == 'allow')].groupby('date')['count'].mean()
avg_usage_d = counts_ids[(counts_ids.date > 1799) & (counts_ids.date < 2015) & (counts_ids.access == 'deny')].groupby('date')['count'].mean()

If we limit to only open volumes, plotting average usage shows even more distinct regions (figure 4):

In [87]:
#Figure 4

avg_usage_a.plot(figsize=(10,7))
plt.xlabel('Date of publication')
plt.ylabel('Average events per "Full View" volume')
plt.title('Average usage per "Full View" volume per year');

Publication date histogram

In [38]:
pdata = pd.DataFrame()

pdata.loc[:, 'all'] = dates['count'] 
#pdata.loc[:, 'accessed'] = allow_years
#pdata.loc[:, 'denied'] = deny_years

The cell below (figure 5) shows the publication date distribution of everything in the HathiTrust corpus.

In [90]:
#Figure 5

import seaborn as sns
pdata.plot(figsize=(11.5,7))
plt.xlabel('Date of publication')
plt.ylabel('Total number of volumes')
plt.title('All HathiTrust volumes by publication year');

For comparison, figure 6 is the distribution of all publications in WorldCat, as estimated by Brian F. Lavoie and Roger C. Schonfeld in a 2006 Journal of Electronic Publishing article titled "Books without Boundaries: A Brief Tour of the System-wide Print Book Collection".

Year distribution of all publications from WorldCat

For context, figure 7 shows the same data shown above in the Hathi publication date plot, but on the same scale as the OCLC data:

In [93]:
#Figure 7

pdata.plot(figsize=(11.5,7)).set_ylim(0, 700000)
plt.xlabel('Date of publication')
plt.ylabel('Total number of volumes')
plt.title('All HathiTrust volumes by publication year');

In contrast to that, figure 8 below shows the plot of volumes in HathiTrust that triggered an analytics event of some kind on the same scale.

In [94]:
#Figure 8

all_years.plot(figsize=(11.5,7)).set_ylim(0, 700000)
plt.xlabel('Date of publication')
plt.ylabel('Total number of volumes')
plt.title('All HathiTrust volumes with analytics events by publication year');

Ongoing publications (serials) analysis

Because items with a publication date of '9999' fall outside the range of the histograms above, it's useful to do a separate analysis of items with this very specific publication date, most of which are serials or other ongoing publications.

In [53]:
all_ongoing_a = pd.read_sql_query("SELECT publication_date, count(id) \
                    from hathifiles where publication_date = 9999 and access = 'allow' GROUP BY publication_date", con=con)
all_ongoing_d = pd.read_sql_query("SELECT publication_date, count(id) \
                    from hathifiles where publication_date = 9999 and access = 'deny' GROUP BY publication_date", con=con)
all_ongoing =   pd.read_sql_query("SELECT publication_date, count(id) \
                    from hathifiles where publication_date = 9999 GROUP BY publication_date", con=con)
In [54]:
#Ongoing volumes that triggered analytics events
ongoing_events = full[(full.date == 9999)].groupby('date')['id'].count()
ongoing_events_a = full[(full.date == 9999) & (full.access == 'allow')].groupby('date')['id'].count()
ongoing_events_d = full[(full.date == 9999) & (full.access == 'deny')].groupby('date')['id'].count()

print ("There are a total of %s volumes with a publication date of '9999' listed in the Hathifiles \n" % all_ongoing['count'].iloc[0])

print ("There are %s volumes with a publication date of '9999' that triggered analytics events" % ongoing_events.iloc[0])

print ("There are %s 'Full View' volumes with a publication date of '9999' that triggered analytics events" % ongoing_events_a.iloc[0])

print ("There are %s 'Limited View' volumes with a publication date of '9999' that triggered analytics events" % ongoing_events_d.iloc[0])
There are a total of 952214 volumes with a publication date of '9999' listed in the Hathifiles 

There are 154264 volumes with a publication date of '9999' that triggered analytics events
There are 69290 'Full View' volumes with a publication date of '9999' that triggered analytics events
There are 84974 'Limited View' volumes with a publication date of '9999' that triggered analytics events
In [95]:
#Figure 9

ongoing_plot = pd.DataFrame()

ongoing_plot.loc['full view', 'all possible volumes\n with "9999"'] = all_ongoing_a['count'].iloc[0]
ongoing_plot.loc['limited view', 'all possible volumes\n with "9999"'] = all_ongoing_d['count'].iloc[0]


ongoing_plot.loc['full view', 'Accessed volumes\n with "9999"'] = ongoing_events_a.iloc[0]
ongoing_plot.loc['limited view', 'Accessed volumes\n with "9999"'] = ongoing_events_d.iloc[0]

ongoing_plot.T.plot(kind='barh', color=['g','r'], stacked=True, legend=True, width=.4, linewidth=.4, figsize=(11.5,7))
plt.xlabel('Volumes')
plt.title('Volumes with 9999 for publication year');
In [59]:
from __future__ import division
# Determine ongoing serials utilization rate

print( "%s of all possible 9999s are open volumes" % all_ongoing_a['count'].iloc[0])

print( "%s of all open 9999 volumes have been accessed" % ongoing_events_a.iloc[0])

print("The overall utilization rate of open 9999s in HathiTrust is {0:.2f}%".format(100.*(int(ongoing_events_a.iloc[0]) / int(all_ongoing_a['count'].iloc[0]))))
177023 of all possible 9999s are open volumes
69290 of all open 9999 volumes have been accessed
The overall utilization rate of open 9999s in HathiTrust is 39.14%
In [112]:
#Top serials and ongoing publications with publication date of '9999'
print("Top Serials and other 9999 volumes:")
counts_ids[(counts_ids.date == 9999) & (counts_ids.access == 'allow') & (counts_ids['count'] > 0)].sort_values('count', ascending=False)[['id','title','date','count']].head(20)
Top Serials and other 9999 volumes:
Out[112]:
id title date count
111450 mdp.39015071886035 [Publications] 9999.0 38244.0
236936 wu.89059402255 Roster of the Confederate soldiers of Georgia,... 9999.0 33184.0
236938 wu.89059402289 Roster of the Confederate soldiers of Georgia,... 9999.0 30867.0
236937 wu.89059402263 Roster of the Confederate soldiers of Georgia,... 9999.0 25966.0
236940 wu.89059402313 Roster of the Confederate soldiers of Georgia,... 9999.0 24448.0
59460 mdp.39015002304221 Encyclopedia of American Quaker genealogy, by ... 9999.0 23178.0
236939 wu.89059402297 Roster of the Confederate soldiers of Georgia,... 9999.0 22978.0
111407 mdp.39015071884410 [Publications] 9999.0 19089.0
98058 mdp.39015051447657 Encyclopedia of American Quaker genealogy, by ... 9999.0 11257.0
111484 mdp.39015071888437 [Publications] 9999.0 11221.0
121935 mdp.39015095766716 Physical and biophysical foundations of pharma... 9999.0 10736.0
71128 mdp.39015011819037 The abridged compendium of American genealogy;... 9999.0 10528.0
70814 mdp.39015011488064 A lexicon of St. Thomas Aquinas based on the S... 9999.0 10446.0
238402 wu.89062853635 Bosworth genealogy; a history of the descendan... 9999.0 9023.0
111473 mdp.39015071887850 [Publications] 9999.0 8984.0
165127 pst.000023992122 Calendar of inquisitions miscellaneous, Chance... 9999.0 8819.0
111413 mdp.39015071884634 [Publications] 9999.0 8465.0
111441 mdp.39015071885722 [Publications] 9999.0 8418.0
131901 njp.32101067262574 Monumenta Ignatiana, ex autographis vel ex ant... 9999.0 7154.0
111393 mdp.39015071883651 [Publications] 9999.0 6924.0

Language analysis

In [61]:
#Grab the numbers for language of publication, as far as the Hathifiles know
%time all_languages = pd.read_sql_query("SELECT language, count(language), publication_date \
           from hathifiles GROUP BY language, publication_date ORDER BY publication_date ASC", con=con)
CPU times: user 45.6 ms, sys: 8.05 ms, total: 53.7 ms
Wall time: 1min 39s
In [62]:
#Limit to just 1800-2017, to match other data
all_languages = all_languages[(all_languages.publication_date < 2018) & (all_languages.publication_date > 1799)]
In [63]:
#Reshape to get counts by date
all_languages = all_languages.sort_values(['publication_date','count'])

used_languages = counts_ids[(counts_ids.date < 2018) & (counts_ids.date > 1799)][['id','access','date','language']]
used_m = used_languages.groupby(['language','date'])['id'].count()

used_mm = used_languages.groupby(['language'])['id'].count()
used_mm = used_mm.sort_values(ascending=False)
In [64]:
# Similar reshape for all languages
aa = all_languages.groupby('language')['count'].sum()
aa = aa.sort_values(ascending=False)
In [65]:
#If we just want to look at top 20 languages by holding, use ab
ab = aa[aa.index[0:20]]

If we compare the "publication language" for all the items in the Hathifiles against the languages of just the volumes used since 1/1/2014, we can get a sense of which languages dominate the collection. In addition, we can then get a sense of which languages are under-utilized or over-utilized, relative to their overall holdings. (Figures 10 and 11)

In [96]:
#Figure 10

#Plotting the top 20 languages in terms of overall holdings
ab.sort_values(ascending=True).plot(kind='barh',figsize=(16,14))
plt.xlabel('Number of volumes in a given language')
plt.ylabel('Language')
plt.title('Top 20 Held Languages');
In [97]:
#Figure 11

#Plotting the top 20 languages in terms of used volumes
used_mm[0:20].sort_values(ascending=True).plot(kind='barh',figsize=(16,14))
plt.xlabel('Number of volumes in a given language used')
plt.ylabel('Language')
plt.title('Top 20 Most Used Languages');
In [74]:
#Merge used and all volumes data, and prep for porting to R
lf1 = pd.DataFrame({'series':aa})
lf2 = pd.DataFrame({'series':used_mm})
#merged_langs = aa.merge(used_mm)
lf = lf1.merge(lf2, left_index=True, right_index=True, how='outer')
lf.rename(columns={'series_x': 'total_volumes', 'series_y': 'used_volumes'}, inplace=True)

lf.loc[:, 'percent_used'] = lf.used_volumes / lf.used_volumes.sum() 
lf.loc[:, 'percent_all'] = lf.total_volumes / lf.total_volumes.sum() 
lf.loc[:, 'language'] = lf.index


lf = lf.sort_values('total_volumes', ascending=False)
In [75]:
# Port the top 50 languages, by holdings, to an R dataframe
rlf = pandas2ri.py2ri(lf[0:50])

Language utilization

The following two charts (figures 12 and 13) look at how the percentage of overall holdings compares to the percentage of used volumes for a given language.

The first chart (figure 12) compares usage vs holdings as a scatter plot, and includes a line marking the 1:1 slope -- therefore, languages above this line are disproportionately used relative to their volume in the overall corpus, while languages below the line are relatively under-utilized. Meanwhile, distance from the line indicates the extent of the disparity, so for example Ancient Greek (closer to the lower left) has relatively few holdings, but is heavily used given the volume, whereas there are significant holdings in Russian, Chinese, and Japanese (all closer to the top right) that are seemingly under-utilized.

For both charts, only the top 50 languages by holding are considered. Both charts use log scales to make the trends more visible.

In [101]:
%%R -w 800 -h 600 -u px -i rlf

#Figure 12

# Plotting the top 50 languages in terms of overall holdings, comparing usage vs holdings by percent (log scale)
# N.B. This excludes English, but English can be included by commenting out the line indicated below

library(ggrepel)
library(scales)
library(reshape2)
set.seed(42)

rlf %>%
filter(language != 'eng') %>% ### COMMENT THIS LINE OUT TO ADD 'eng' BACK TO THE PLOT
ggplot(aes(x=percent_all, y=percent_used, color=language))+
geom_point(size=4)+
geom_abline(intercept = 0, slope = 1)+
geom_text_repel(aes(percent_all, percent_used, label = language), size=8)+
theme_minimal(base_size = 18) +
theme(legend.position="none")+
scale_x_log10(labels = percent)+
scale_y_log10(labels = percent)+
xlab('Percent of All Volumes') + ylab('Percent of Used Volumes')+
ggtitle('Usage vs. Holdings Comparison')
In [102]:
%%R -w 800 -h 1000 -u px

#Figure 13

rlf %>%
mutate(indicator = ifelse( percent_all < 0.9*percent_used, 'Most heavily used',
                          ifelse(percent_all > 2*percent_used,
                                 "Most relatively under-used",
                                 "Roughly average use"))) %>%
select(-total_volumes, -used_volumes) %>%
rename(., 'Percent of \nUsed Volumes' = percent_used, 'Percent of \nAll Volumes' = percent_all) %>%
melt(variable.name = "variable_percent",
     value.name = "value",
     id.vars = c("language", "indicator"))  ->
rlf_melted



rlf_melted %>%{
ggplot(., aes(x = as.factor(reorder(variable_percent, desc(variable_percent))), y = value, group = language, color = indicator, label = language)) + 
  geom_line(size = 2, alpha = 3/10) + 
  geom_point(size = 5, alpha = 10/10) + 
 geom_text_repel(aes(),
                 show_guide = F,
                 size=6,
         nudge_x      = 0.25,
         direction    = "y",
         hjust        = 0,
          subset(., indicator != 'Roughly average use' & variable_percent  != 'Percent of \nAll Volumes')
                )+
      geom_text_repel(aes(),
                      show_guide = F,
                      size=6,
          nudge_x      = -0.25,
          direction    = "y",
          hjust        = 1, 
        subset(., indicator != 'Roughly average use' & variable_percent != 'Percent of \nUsed Volumes' )
                     )+
  theme_minimal(base_size = 18) + 
  scale_color_manual(values=c("darkgreen","#ff0000", '#dddddd'))+
  scale_y_log10(labels = percent)+
    guides(color=guide_legend(title="Use Relative to \nOverall Holdings:"))+
    xlab("")+ylab('Percent')+
    ggtitle('Relative Over- and Under-Utilization')
}

Query analysis

In [52]:
# Fun for later: 
# #To extract search queries

# def queries(): 
#     #Note: this extract only the first match of the pattern in the source string, but for our purposes this should be both fine and correct
#     
#     df = dd.read_csv('./data/uniqueEvents_201*.csv', delimiter='\|\~', engine='python')
#     pattern = '(?P<Query>q1=[^\&\n\|]*|lookfor=[^\&\n\|]*)'
#     t = df['ga:pagePath'].str.extract(pattern, expand=True)
#     df_queries = t[t['Query'].notnull()]
#     df_queries.to_csv('./csv/queries.csv', index=False)
# # %time queries()
In [53]:
#With some more parsing for common bigrams and unigrams, and using this repo: https://github.com/amueller/word_cloud
#I was able to produce what I think is a pretty compelling version of a word cloud:
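#A rough sketch of that step (not the exact code used) with the word_cloud package, assuming
#queries() above was run first to produce ./csv/queries.csv; the output path is my own choice,
#and the bigram/unigram cleanup mentioned above isn't reproduced here
from wordcloud import WordCloud
from urllib.parse import unquote_plus

q = pd.read_csv('./csv/queries.csv')
terms = ' '.join(unquote_plus(s.split('=', 1)[1]) for s in q['Query'].dropna())
WordCloud(width=1200, height=800).generate(terms).to_file('./images/query_wordcloud.png')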


HathiTrust search query word cloud