HathiTrust Usage Analytics & Metadata Analysis Notes

High-Level Steps

  • Scrape analytics
  • Process Analytics to extract volume IDs and usage counts
    • Fix the dollar sign barcode issue
  • Ingest Hathifiles into postgres database
  • Match IDs from analytics to current metadata in Hathifiles
  • Create visualizations of interesting facets

Process Notes

Scrape analytics

Using Pyganalytics: https://github.com/chrpr/pyganalytics

Scrape the analytics using something like the following (run it 4x, once for each quarter of the year, adjusting the date ranges in the command as needed):

for i in {4..7}; do time python analytics.py -o ~/PATH/TO/DATA/OUTPUT/uniqueEvents_201$i\_01_01-201$i\_12_31.csv -c ~/PATH/TO/CONFIG/hathi_events_config.yml -f 201$i-01-01 -l 201$i-12-31; done

The YAML config file for the HathiTrust pageturner analytics account is the following:

query:
    metrics: ga:uniqueEvents
    dimensions: ga:pagePath
    sort:
    filters:
profile: 26478114

Note: I adjusted the delimiters used in the Pyganalytics analytics.py script because page paths in Google Analytics contain everything under the sun, and I needed a field separator unique enough for pandas to recognize. This also means you have to use the Python parsing engine when reading the CSVs into pandas, but given that this is a one-time operation, I think doing less surgery on the analytics CSVs is worth the potential slowdown.
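
For reference, reading the scraped CSVs back in with that custom separator looks something like the commented-out queries() cell at the end of these notes (a sketch; adjust the path and delimiter to whatever you actually used):

import dask.dataframe as dd

#The multi-character '|~' separator is treated as a regex, which forces the slower Python parsing engine
df = dd.read_csv('./data/uniqueEvents_201*.csv', delimiter='\|\~', engine='python')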

I also edited the analytics.py script to run daily instead of weekly in order to try and capture more granular results. See notes on how to do that in the Pyganalytics readme.

Notes:

  • These tended to run 30-60 minutes per quarter on my machine, via a wireless connection
  • Broken into annual quarters just to keep file sizes more manageable and to limit data loss if the API errored out mid-file
  • This uses Google Analytics API v3, not the current v4, so it may break sooner or later. Set up an API key as described in the readme
  • There is a 10,000 API call per day limit per profile ID for the Google Analytics API; as a result, I had to run this over the course of 3 days.

Other notes

Also something to explore: there is a limited set of results with zero pageviews, even when limiting to event triggers. But since something is triggering the sending of these pages to analytics even with an event count of zero, I'm opting to include them for the moment.

Setup

In [2]:
!pip3 install sqlalchemy
!pip3 install rpy2
Requirement already satisfied: sqlalchemy in /usr/local/lib/python3.6/site-packages (1.2.7)
Requirement already satisfied: rpy2 in /usr/local/lib/python3.6/site-packages (2.9.3)
Requirement already satisfied: six in /usr/local/lib/python3.6/site-packages (from rpy2) (1.11.0)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.6/site-packages (from rpy2) (2.10)
Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.6/site-packages (from jinja2->rpy2) (1.0)
In [83]:
#For inline matplotlib plots
%matplotlib inline

#For data manipulation and analysis
import pandas as pd
import dask.dataframe as dd #USE `pip3 install 'dask[dataframe]'`
from dask.diagnostics import ProgressBar

#For plotting and dataviz
import matplotlib.pyplot as plt
import seaborn as sns 
sns.set_context("talk", font_scale=1.2)

#So we can use the better ggplot2 R packages to create some of the diagrams below
from rpy2.robjects import r, pandas2ri
pandas2ri.activate()

%load_ext rpy2.ipython
%R library(tidyverse)

#Dask progress bar so we can get a view into long-running processes
pbar = ProgressBar()
pbar.register()
The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython

Process Analytics to extract volume IDs and usage counts

In [5]:
def extract_ids(): 
    
    '''
    Uses Dask to extract HathiTrust IDs from the raw analytics logs and write them to a series of CSVs
    '''
 
    df = dd.read_csv('./data/HathiTrust_unique_events_201*.csv')
    
    #Mix of my regex experimentation and pattern supplied by Angelina Z at Hathi:
    pattern = '(?:id=|[a-z0-9]\/)([a-z][a-z0-9]{1,3}\.\$?[a-z0-9._:\/\-]+)'
    
    #Extract the ID matches into a new column   
    df['id'] = df['ga:pagePath'].str.extract(pattern, expand=False)
    
    #Limit to just rows with ID matches
    df = df[df['id'].notnull()]

    #Remove rows that have 'skin=crms'
    df = df[df['ga:pagePath'].str.contains('skin=crms') == False]
    
    #Remove some junk punctuation from the end of some IDs
    df['id'] = df['id'].str.rstrip('._#')
    
    #Rename 'ga:uniqueEvents' column to 'count'
    df = df.rename(columns={'ga:uniqueEvents': 'count'})
    
    #Write the results to csvs
    df.to_csv('./data/all_ids_*.csv', index=False)
    
#This takes roughly 5min to run

%time extract_ids()
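
As a quick sanity check on that pattern, here's what it pulls out of a couple of simplified, made-up pagePath values (the real paths are much messier):

import re

pattern = '(?:id=|[a-z0-9]\/)([a-z][a-z0-9]{1,3}\.\$?[a-z0-9._:\/\-]+)'

#Simplified example paths, just to illustrate the capture group
for path in ['/cgi/pt?id=mdp.39015054061430;view=1up;seq=7',
             '/cgi/pt?id=uc1.$b99721;view=1up;seq=1']:
    print(re.search(pattern, path).group(1))

#Should print mdp.39015054061430 and uc1.$b99721
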
In [14]:
def ids_count(): 

    '''
    Uses the output CSVs of the extract_ids() function to pair each volume ID with its total count
    (either unique or aggregate pageview counts, depending on how the analytics were scraped)
    and writes the result to a CSV
    '''
    
    #Read in all the files with the extracted IDs and their counts
    df = dd.read_csv('./data/all_ids_*.csv')
    
    #Group that data by volume identifier, and then record the sum total of all hit counts
    ids = df.groupby(by=['id'])['count'].sum()
    
    #Turn dask dataframe into pandas dataframe so we can use sort_values (not implemented yet in dask)
    ids = ids.compute()    
    
    #Name the index and the series, then export to a CSV
    ids.index.name = 'id'
    ids.name = 'count'
    ids = ids.sort_values(ascending=False).to_csv('./csv/all_counts_sorted.csv', header=True)

#This takes roughly 12min to run
%time ids_count()
[########################################] | 100% Completed |  3min  5.3s
[########################################] | 100% Completed |  3min  5.4s
[########################################] | 100% Completed |  3min  5.5s
[########################################] | 100% Completed |  0.1s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.3s
CPU times: user 2min 42s, sys: 21.9 s, total: 3min 4s
Wall time: 3min 39s
In [19]:
pd.options.mode.chained_assignment = None

def fix_dollar_sign_ids():

    '''
    Fix the UC '$' barcode issue (some UC volume IDs, e.g. uc1.$b99721, show up both with and
    without the dollar sign), and roll up the total counts so they're only counted
    as single items with all hits. There's some ugly pandas in here, could certainly
    be more efficient, but it's quick enough that I'm calling it done for now
    '''
    
    #Need numpy briefly to do the conditional check later using .where() method
    import numpy as np
    
    #Import the extracted IDs and counts
    df = pd.read_csv('./csv/all_counts_sorted.csv')
    
    #New dataframe of just the potentially affected dollar sign IDs
    dollars = df[df['id'].str.contains('\$')]

    #New column with the id version minus dollar sign
    dollars.loc[:,'fixed_id'] = dollars['id'].copy().str.replace('$','')
    
    #Merge with original data to get access to all counts we need to sum
    merged = df.merge(dollars, left_on='id', right_on='fixed_id', suffixes=['_df','_dollars'])

    #Sum the hit counts for the dollar sign IDs and the non dollar sign IDs
    merged.loc[:,'total'] = merged['count_df'] + merged['count_dollars']

    #Merge the totals with the original dataset again, but in a new df just to be careful
    df2 = df.merge(merged, how='outer', left_on='id', right_on='id_dollars')
    
    #Update the original 'count' column to the holistic total where needed
    df2.loc[:, 'count'] = np.where(df2['total'].notnull() == True, df2['total'], df['count'])
    
    #Remove the non-dollar sign IDs from the df, since the dollar-sign-id totals now reflect 
    #the holistic sum. Also removes the now extraneous extra columns
    df2 = df2[df2['id'].isin(df2['id_df']) == False].copy()[['id','count']]
    
    #Write it all out to a new csv
    df2.sort_values('count', ascending=False).to_csv('./csv/all_counts_sorted_dollar_fixed.csv')

#This takes like 35 seconds to run
%time fix_dollar_sign_ids()
CPU times: user 32.5 s, sys: 2.6 s, total: 35.1 s
Wall time: 39.7 s
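
To make the '$' issue concrete, here's a toy version of the roll-up (the ID forms are real, the counts made up). Unlike fix_dollar_sign_ids() it keeps the dollar-stripped form as the key, but it shows the idea:

#Toy illustration of the dollar-sign roll-up (made-up counts)
toy = pd.DataFrame({'id': ['uc1.$b99721', 'uc1.b99721', 'mdp.39015054061430'],
                    'count': [100, 25, 7]})
toy['fixed_id'] = toy['id'].map(lambda s: s.replace('$', ''))
print(toy.groupby('fixed_id', as_index=False)['count'].sum())
#The two uc1 rows collapse into a single row with a count of 125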

Postgres import of Hathifiles

In [4]:
#Set up the postgres connection

import sqlalchemy as sa

def connect(user, password, db, host='localhost', port=5432):
    '''Returns a connection and a metadata object'''
    # We connect with the help of the PostgreSQL URL
    url = 'postgresql://{}:{}@{}:{}/{}'
    url = url.format(user, password, host, port, db)

    # The return value of create_engine() is our connection object
    con = sa.create_engine(url, client_encoding='utf8')

    # We then bind the connection to MetaData() and reflect the existing tables
    # (reflect() returns None, so don't chain it off the constructor)
    meta = sa.MetaData(bind=con)
    meta.reflect()

    return con, meta

con, meta = connect('postgres', '', 'hathifiles')
/usr/local/lib/python3.6/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
  """)
In [21]:
#Import the Hathifiles into a postgres database


'''
This cell does require one step not included here, which is adding the header row 
to the txt file with the column names. I do this in an inelegant but fine way: 
- Copy row headers, including tab delimiters, to new file
- Append text of massive hathi text file (~4gb) to that new file 
    `cat hathi_full_20171101.txt >> headers_for_hathi_full_20171101.txt`
- Delete old file, and rename new file same as the old (but now includes column headers)

'''

#Struggled for a long time with this, but turns out the delimiter needs to be r'\t', not just '\t'
#hathi_data = pd.read_csv('./data/hathifiles/hathi_full_20180501.txt', engine='python', delimiter=r'\t', encoding='utf-8', chunksize=1000000)


def postgres_import():
    i = 0
    for chunk in hathi_data:
        chunk = chunk[['id','access', 'rights', 'hathitrust_record_number', 'enumeration_chronology', 'source', 'source_institution_record_number', 'oclc_numbers', 'isbns', 'issns', 'lccns', 'title', 'imprint', 'rights_determination_reason_code', 'date_of_last_update', 'government_document', 'publication_date', 'publication_place' ,'language', 'bibliographic_format', 'collection_code', 'content_provider_code', 'responsible_entity_code', 'digitization_agent_code']]
        try:
            chunk.to_sql('hathifiles', con, if_exists='append')
            print (i, chunk.index[0])
            i += 1
        except:
            print (chunk.index[0])

#This takes roughly 2 hours 10min to run on my macbook air
#%time postgres_import()
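
For what it's worth, the header-prepending step described in the docstring above can also be done from Python. A minimal sketch: the column names are the ones selected in postgres_import(), and I'm assuming they cover the Hathifile's fields in order -- check the Hathifiles documentation for the authoritative header before trusting this.

import shutil

#Column names as used in postgres_import() above; verify against the Hathifiles docs
cols = ['id', 'access', 'rights', 'hathitrust_record_number', 'enumeration_chronology',
        'source', 'source_institution_record_number', 'oclc_numbers', 'issns', 'isbns',
        'lccns', 'title', 'imprint', 'rights_determination_reason_code', 'date_of_last_update',
        'government_document', 'publication_date', 'publication_place', 'language',
        'bibliographic_format', 'collection_code', 'content_provider_code',
        'responsible_entity_code', 'digitization_agent_code']

#Write the tab-delimited header line, then stream the ~4gb Hathifile in after it
with open('headers_for_hathi_full_20171101.txt', 'wb') as out:
    out.write(('\t'.join(cols) + '\n').encode('utf-8'))
    with open('hathi_full_20171101.txt', 'rb') as src:
        shutil.copyfileobj(src, out)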

Match IDs from analytics to current metadata in Hathifiles

In [22]:
def get_access_and_date():

    gf = pd.read_csv('./csv/all_counts_sorted_dollar_fixed.csv')

    gf['access'] = ''
    gf['date'] = ''
    gf['title'] = ''
    gf['oclc'] = ''
#     gf['format'] = ''
#     gf['pub_place'] = ''
    print (len(gf))

    header = 'id,title,access,date,oclc,gov_doc,publication_place,language,bib_format'
    text_file = open("./csv/all_id_title_access_date_oclc_languages_formats.csv", "w")
    text_file.write(header+'\n')
    text_file.close()
    
    for i in range(0,len(gf),250000):
        xf = gf[i:(i+250000)]
        ids = []
        for index, row in xf.iterrows():
            ids.append(row['id'])
        x = pd.read_sql_query("select id, title, access, publication_date, oclc_numbers, government_document, publication_place, language, bibliographic_format \
                                from hathifiles where id in"+str(tuple(ids)), con=con)
        x.to_csv('./csv/all_id_title_access_date_oclc_languages_formats.csv', mode='a', encoding='utf-8', header=False, index=False)
        if i % 10000 == 0:
            print (i)

#This takes about 33min if the postgres table has an index on "id"            
#%time get_access_and_date()
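
The index mentioned in the timing note can be created through the same SQLAlchemy engine. A one-line sketch (the index name is my own choice; IF NOT EXISTS needs Postgres 9.5+):

#Create the "id" index the timing note above relies on
con.execute('CREATE INDEX IF NOT EXISTS hathifiles_id_idx ON hathifiles (id);')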

Analysis and viz steps

Some cells below are just checks to make sure outputs look roughly correct; others set up data for visualization or other manipulation.

In [5]:
#Read our various CSVs into dataframes so we can work with them 
counts = pd.read_csv('./csv/all_counts_sorted.csv')
d_counts = pd.read_csv('./csv/all_counts_sorted_dollar_fixed.csv')
full = pd.read_csv('./csv/all_id_title_access_date_oclc_languages_formats.csv')
In [24]:
# Gut check on dollar sign fix and missing volumes
cs = set(counts['id'])
ds = set(d_counts['id'])
fs = set(full['id'])

#Total number of dollar sign ids fixed
print("dollar sign ids fixed: %s" % (len(cs - ds)))


'''
To get a sense of whether we've missed things with the above data transformations,
this spits out the things that the parsing found as IDs but couldn't find in the Hathifiles.
Generally, there are a few hundred here: things that have been added to Hathi via ingest, and are therefore
available to analytics events, but haven't yet had their metadata added to the monthly Hathifile updates.
Gut checking, a delta of less than a thousand a month seems ok, and not something I'm super worried about 
in terms of skewing data
'''
diff = ds - fs

missing = []
def print_missing():
    for d in diff: 
        if "uc1" not in d:
            missing.append(d)
        else:
            pass
    print ("Total non-dollar sign IDs in data, but not found in hathifiles: %s" % (len(missing)))

    for item in missing:
        print (item)
    
#print_missing()
dollar sign ids fixed: 456
In [6]:
'''This merges the metadata extracted from the Hathifiles with the top counts from the analytics,
and spits out the list of the most viewed items in HathiTrust
But this could easily be tweaked to show top NYPL items, top items that were denied access (excellent ILL candidates),
top items published in a given country, etc. '''
counts_ids = full.merge(d_counts, on='id', suffixes=['_full','_counts'])
counts_ids = counts_ids.drop('Unnamed: 0', axis=1)
In [7]:
#Double check -- how many items had event triggers recorded in order to appear in the analytics, but had the count recorded as zero?
zeroes = counts_ids[counts_ids['count'] < 1]
print (len(zeroes))
0
In [86]:
#full = pd.read_csv('./csv/all_id_title_access_date_oclc.csv')

allow = full[full.access == 'allow']
deny = full[full.access == 'deny']

print ("%s total volumes that triggered analytics events in the collected data" % len(full))

print ("%s total open volumes have triggered analytics events in the collected data" % len(allow))

print ("%s total limited view volumes have triggered analytics events in the collected data" % len(deny))
4810938 total volumes that triggered analytics events in the collected data
3268810 total open volumes have triggered analytics events in the collected data
1542128 total limited view volumes have triggered analytics events in the collected data

Top title analysis

Top 25 titles in Hathi

In [107]:
#Top 25 titles in Hathi
counts_ids[['id','title','date','count']].sort_values('count', ascending=False).head(25)
Out[107]:
id title date count
98680 mdp.39015054061430 Quicksand, by Nella Larsen. 1928.0 127787.0
70341 mdp.39015011274175 The surnames of Scotland, their origin meaning... 1962.0 77211.0
61560 mdp.39015004111095 Godey's magazine. 1850.0 70049.0
57531 mdp.39015000566789 America is in the heart, a personal history, b... 1946.0 61899.0
166025 pst.000057937434 The human figure / by John H. Vanderpoel. 1907.0 61169.0
183186 uc1.32106007458745 History of wages in the United States from Col... 1934.0 61035.0
104139 mdp.39015064340733 Solid mensuration, by Willis F. Kern and James... 1934.0 53630.0
57749 mdp.39015000804453 Perfume and flavor materials of natural origin. 1960.0 53452.0
66525 mdp.39015008158415 Quintus Curtius [History of Alexander] with an... 1946.0 45034.0
176102 uc1.$b99721 The five laws of library science, by S. R. Ran... 1931.0 44317.0
93003 mdp.39015038069475 Return to life through contrology, by Joseph H... 1960.0 43525.0
111450 mdp.39015071886035 [Publications] 9999.0 38244.0
167555 pur1.32754077064610 Investigation of Korean-American relations : R... 1978.0 37209.0
64981 mdp.39015006749868 Modern California houses; case study houses, 1... 1962.0 35388.0
96448 mdp.39015048226941 The lesson of Japanese architecture. 1954.0 33520.0
236936 wu.89059402255 Roster of the Confederate soldiers of Georgia,... 9999.0 33184.0
73616 mdp.39015014103017 The book of a hundred hands. 1920.0 32199.0
57261 mdp.39015000379902 Propaganda technique in the World War [by] Har... 1938.0 32115.0
236938 wu.89059402289 Roster of the Confederate soldiers of Georgia,... 9999.0 30867.0
66524 mdp.39015008158407 Quintus Curtius [History of Alexander] with an... 1946.0 30823.0
218162 uiug.30112101024682 A short guide to New Zealand. 1943.0 30728.0
176728 uc1.31158001963940 Sears; [catalog] 1897.0 30299.0
236184 wu.89054395496 Architecture and furniture: Aalto. 1938.0 29711.0
195221 uc1.b4164139 Circuit analysis of A-C power systems; symmetr... 1950.0 29455.0
43993 inu.30000007109121 Pennsylvania German pioneers; a publication of... 1934.0 28781.0

Top 25 limited view titles in Hathi

In [109]:
#Top 25 limited view titles in Hathi
counts_ids[counts_ids.access == 'deny'][['id','title','date','count']].sort_values('count', ascending=False).head(25)
Out[109]:
id title date count
63936 mdp.39015005717635 Insularismo; ensayos de interpretación puertor... 1942.0 10496.0
45447 inu.30000103012815 Kasaita / na Maryam Kabir Mashi. 9999.0 7562.0
106335 mdp.39015066789838 Theogony ; and, Works and days / Hesiod ; tran... 2006.0 7176.0
46203 inu.30000124268446 Rufaida ko mufida? / na Hadiza Salisu Sharif. 9999.0 5947.0
245007 wu.89089669436 Elements of applied thermodynamics, by Robert ... 1951.0 4207.0
183515 uc1.32106010955992 NCRP report 9999.0 4119.0
198128 uc1.b4906221 The competent manager : a model for effective ... 1982.0 3905.0
95880 mdp.39015046422120 Nectar in a sieve, a novel. 1954.0 3838.0
179916 uc1.31822003845724 The Naval Institute guide to world naval weapo... 1989.0 3660.0
122620 mdp.39076006350719 The theory of spherical and ellipsoidal harmon... 1931.0 3374.0
69304 mdp.39015010576356 Objects of daily use, with over 1800 figures f... 1927.0 3290.0
69301 mdp.39015010574575 Catalogue of Alexandrian coins, 1933.0 2493.0
111334 mdp.39015071754159 The Michigan daily. 1969.0 2424.0
98163 mdp.39015052047589 American archival studies : readings in theory... 2000.0 2355.0
183962 uc1.32106017264836 Female pipings in Eden, by Ethel Smyth .. 1934.0 2178.0
238807 wu.89062880828 Launcelot Granger of Newbury, Mass., and Suffi... 1989.0 2161.0
237107 wu.89060437159 A genealogical memoir of the Lo-Lathrop family... 1971.0 2139.0
238137 wu.89062511985 A genealogy of the Van Winkle family : account... 1996.0 2081.0
240405 wu.89066083726 Genealogical and biographical notices of desce... 1996.0 2064.0
235650 wu.89047254495 The co-operative directory : a complete record... 1940.0 2063.0
45345 inu.30000095331652 Calendar of inquisitions miscellaneous, Chance... 9999.0 1950.0
240548 wu.89066140377 Genealogy of the Folsom family : a revised and... 9999.0 1921.0
45446 inu.30000103009852 Uwar miji ko kishiya? / na Zahra'u Baba Abdull... 9999.0 1891.0
241737 wu.89067471904 Erie County, New York obituaries, as found in ... 1992.0 1838.0
70805 mdp.39015011482067 Württembergisches Adels- und Wappenbuch / im A... 1975.0 1814.0

Top 25 Hathi volumes scanned from NYPL collections

In [111]:
#Top 25 Hathi volumes scanned from NYPL collections
counts_ids[counts_ids['id'].str.startswith('nyp') == True][['id','title','date','count']].sort_values('count', ascending=False).head(25)
Out[111]:
id title date count
152464 nyp.33433076064025 Miranda / by Grace Livingston Hill Lutz ... ; ... 1915.0 8362.0
147349 nyp.33433066397708 Illustrated catalogue of hand and power pumps,... 1903.0 6644.0
149765 nyp.33433069455859 Illustrated trade catalogue and price list : m... 1897.0 6603.0
142114 nyp.33433000335228 Glossarium ad scriptores mediae et infimae Lat... 1736.0 6486.0
153570 nyp.33433081675450 Godey's magazine. 1831.0 4961.0
150488 nyp.33433072182490 Art monograms and lettering, by J.M. Bergling,... 1912.0 4849.0
145673 nyp.33433023758695 Report of the Committee of Secrecy on the Bank... 1832.0 4578.0
155542 nyp.33433081893293 L'Egypte à L'Exposition universelle de 1867 /... 1867.0 4493.0
155125 nyp.33433081844692 A standard history of Stark County, Ohio : an ... 1916.0 4409.0
145650 nyp.33433023615366 Regimental colors of the German armies in the ... 1911.0 4354.0
142116 nyp.33433000335244 Glossarium ad scriptores mediae et infimae Lat... 1736.0 4051.0
143221 nyp.33433006773448 Home needlework magazine ... 1912.0 3955.0
144319 nyp.33433009488465 The law reports,. under the superintendence an... 1884.0 3805.0
142119 nyp.33433000335277 Glossarium ad scriptores mediae et infimae Lat... 1736.0 3734.0
156150 nyp.33433082137914 Wife no. 19, or the story of a life in bondage... 1875.0 3725.0
146016 nyp.33433037323635 Illustrated catalogue of Seth Thomas, New Have... 1878.0 3717.0
148004 nyp.33433066642897 American chess magazine. 1899.0 3683.0
160035 nyp.33433105621522 The Graphic : an illustrated weekly newspaper. 1874.0 3451.0
160461 nyp.33433112041938 A book of verses / by William Ernest Henley. 1888.0 3433.0
150364 nyp.33433071393981 Chinese pictures : notes on photographs made i... 1900.0 3388.0
158553 nyp.33433087540096 Hardware manufactured by P. & F. Corbin. 1905.0 3334.0
153583 nyp.33433081675583 Godey's magazine. 1850.0 3302.0
143022 nyp.33433006349736 A specimen of printing types, and ornaments, c... 1828.0 3297.0
151334 nyp.33433075730378 2835 Mayfair : a novel / by Frank Richardson. 1907.0 3290.0
154718 nyp.33433081810370 Colonial families of the Southern states of Am... 1911.0 3282.0

Publication Year Analysis

N.B. See the interactive dataviz for a larger and more useful view of this data

In [10]:
all_years = full[(full.date > 1799) & (full.date < 2018)].groupby('date')['id'].count()
allow_years = allow[(allow.date > 1799) & (allow.date < 2018)].groupby('date')['id'].count()
deny_years = deny[(deny.date > 1799) & (deny.date < 2018)].groupby('date')['id'].count()
In [46]:
#Plots publication date of volumes with analytics events, as full-view vs limited view volumes

#Figure 1

#all_years.plot()
allow_years.plot(figsize=(10,7))
deny_years.plot(figsize=(10,7))
plt.xlabel('Date of publication')
plt.ylabel('total number of volumes')
plt.title('"Full View" (blue) and "Limited View" (orange) clicked volumes in HathiTrust');
In [12]:
#This grabs all dates from the postgres DB to do some data analysis and histograms, etc.

all_dates = pd.read_sql_query("SELECT DISTINCT publication_date, count(publication_date) \
           from hathifiles GROUP BY publication_date ORDER BY publication_date ASC", con=con)
all_dates_a = pd.read_sql_query("SELECT DISTINCT publication_date, count(publication_date) \
           from hathifiles WHERE access = 'allow' GROUP BY publication_date ORDER BY publication_date ASC", con=con)
all_dates_d = pd.read_sql_query("SELECT DISTINCT publication_date, count(publication_date) \
           from hathifiles WHERE access = 'deny' GROUP BY publication_date ORDER BY publication_date ASC", con=con)
In [13]:
dates = all_dates[(all_dates.publication_date < 2018) & (all_dates.publication_date > 1799)]
dates_a = all_dates_a[(all_dates_a.publication_date < 2018) & (all_dates_a.publication_date > 1799)]
dates_d = all_dates_d[(all_dates_d.publication_date < 2018) & (all_dates_d.publication_date > 1799)]
In [28]:
dates.index = dates.publication_date
dates_a.index = dates_a.publication_date
dates_d.index = dates_d.publication_date
dates_a.loc[:,'accessed'] = allow_years

dates.loc[:,'full view'] = dates_a['count']
dates.loc[:,'limited view'] = dates_d['count']
dates.loc[:,'full view accessed'] = allow_years
dates.loc[:,'limited view attempted'] = deny_years

#dates_a
#dates

#To export all counts, full view counts, and limited view counts to a CSV for easier use later: 
#dates.to_csv('./csv/all_years_data.csv')
/usr/local/lib/python3.6/site-packages/pandas/core/indexing.py:543: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s

Utilization analysis

Below takes a look at the total number of openly available volumes, and what percentage have been accessed since mid-2013 (figure 2)

In [88]:
#Calculate raw utilization rate

util = dates['full view accessed'] / dates['count']
dates_a.loc[:, 'percent'] = dates_a['accessed'] / dates_a['count'] #assign on dates_a itself, not on a throwaway copy

print ("The overall average utilization rate for open volumes 1800-2017 is: {0:.2f}%\n".format(100*dates_a['percent'].mean()))
print ("The average utilization rate for open volumes 1800-1875 is: {0:.2f}%\n".format(100*dates_a[1800:1875]['percent'].mean()))
print ("The average utilization rate for open volumes 1876-1922 is: {0:.2f}%\n".format(100*dates_a[1876:1922]['percent'].mean()))
print ("The average utilization rate for open volumes 1923-1962 is: {0:.2f}%\n".format(100*dates_a[1923:1962]['percent'].mean()))
print ("The average utilization rate for open volumes 1963-2017 is: {0:.2f}%\n".format(100*dates_a[1963:2017]['percent'].mean()))
The overall average utilization rate for open volumes 1800-2017 is: 56.08%

The average utilization rate for open volumes 1800-1875 is: 60.90%
The average utilization rate for open volumes 1876-1922 is: 51.86%
The average utilization rate for open volumes 1923-1962 is: 61.47%
The average utilization rate for open volumes 1963-2017 is: 49.09%
In [89]:
#plot open volume utilization per year

#Figure 2

dates_a['percent'].plot(figsize=(10,7)).set_ylim([0,1])
plt.xlabel('Date of publication')
plt.ylabel('Percentage of open volumes accessed')
plt.title('"Full View" volumes utilization per year');
In [31]:
# #Plot publication year distribution of accessed volumes (blue, left) against utilization rate by year (red, right)
# import matplotlib.ticker as ticker

# fig, ax1 = plt.subplots(figsize=(10,7))
# ax1.plot(allow_years, color='b')
# ax1.set_ylabel('# of works', color='b')
# ax1.tick_params('y', colors='b')
# #ax1.set_yticks([0,10000])
# ax1.set_ylim([0,70000])

# ax2 = ax1.twinx()
# ax2.plot(util, color='r')
# ax2.set_ylabel('% utilized', color='r')
# ax2.tick_params('y', colors='r')
# ax2.set_ylim([0,1])
# ax2.yaxis.set_major_locator(ticker.MultipleLocator(1 / 7))

Rough usage weighting

While the above utilization plot (figure 2) takes a binary view of each volume in a given year -- either clicked on, or not clicked on -- the following plot (figure 3) shows the average views per item per year (ending at publication date 2015 to exclude some outliers that skew the picture because of the limited number of volumes and hits in the analytics).

Even more clearly than the figure above, this demonstrates that the materials from 1923 to 1963, corresponding to the CRMS review period, are disproportionately used, even after accounting for the variable number of volumes in a given year.

In [34]:
#Charting average view counts per volume per year

#Figure 3

avg_usage = counts_ids[(counts_ids.date > 1799) & (counts_ids.date < 2015)].groupby('date')['count'].mean()
#avg_usage_non_0 = counts_ids[(counts_ids.date > 1799) & (counts_ids.date < 2015) & (counts_ids['count'] > 0)].groupby('date')['count'].mean()
In [69]:
avg_usage.plot(figsize=(10,7))
#avg_usage_non_0.plot(figsize=(10,7))
plt.xlabel('Date of publication')
plt.ylabel('Average events per volume')
plt.title('Average usage per volume per year');
In [36]:
avg_usage_a = counts_ids[(counts_ids.date > 1799) & (counts_ids.date < 2015) & (counts_ids.access == 'allow')].groupby('date')['count'].mean()
avg_usage_d = counts_ids[(counts_ids.date > 1799) & (counts_ids.date < 2015) & (counts_ids.access == 'deny')].groupby('date')['count'].mean()

If we limit to only open volumes, plotting average usage shows even more distinct regions (figure 4):

In [87]:
#Figure 4

avg_usage_a.plot(figsize=(10,7))
plt.xlabel('Date of publication')
plt.ylabel('Average events per "Full View" volume')
plt.title('Average usage per "Full View" volume per year');

Publication date histogram

In [38]:
pdata = pd.DataFrame()

pdata.loc[:, 'all'] = dates['count'] 
#pdata.loc[:, 'accessed'] = allow_years
#pdata.loc[:, 'denied'] = deny_years

The cell below (figure 5) shows the publication date distribution of everything in the HathiTrust corpus.

In [90]:
#Figure 5

import seaborn as sns
pdata.plot(figsize=(11.5,7))
plt.xlabel('Date of publication')
plt.ylabel('Total number of volumes')
plt.title('All HathiTrust volumes by publication year');

For comparison, figure 6 is the distribution of all publications in WorldCat, as estimated by Brian F. Lavoie and Roger C. Schonfeld in a 2006 Journal of Electronic Publishing article titled "Books without Boundaries: A Brief Tour of the System-wide Print Book Collection".

Year distribution of all publications from WorldCat

For context, figure 7 shows the same data shown above in the Hathi publication date plot, but on the same scale as the OCLC data:

In [93]:
#Figure 7

pdata.plot(figsize=(11.5,7)).set_ylim(0, 700000)
plt.xlabel('Date of publication')
plt.ylabel('Total number of volumes')
plt.title('All HathiTrust volumes by publication year');

In contrast to that, figure 8 below shows the plot of volumes in HathiTrust that triggered an analytics event of some kind on the same scale.

In [94]:
#Figure 8

all_years.plot(figsize=(11.5,7)).set_ylim(0, 700000)
plt.xlabel('Date of publication')
plt.ylabel('Total number of volumes')
plt.title('All HathiTrust volumes with analytics events by publication year');

Ongoing publications (serials) analysis

Because items with a publication date of '9999' fall outside the range of the histograms above, it's useful to do a separate analysis of items with this very specific publication date, most of which are serials or other ongoing publications.

In [53]:
all_ongoing_a = pd.read_sql_query("SELECT publication_date, count(id) \
                    from hathifiles where publication_date = 9999 and access = 'allow' GROUP BY publication_date", con=con)
all_ongoing_d = pd.read_sql_query("SELECT publication_date, count(id) \
                    from hathifiles where publication_date = 9999 and access = 'deny' GROUP BY publication_date", con=con)
all_ongoing =   pd.read_sql_query("SELECT publication_date, count(id) \
                    from hathifiles where publication_date = 9999 GROUP BY publication_date", con=con)
In [54]:
#Ongoing volumes that triggered analytics events
ongoing_events = full[(full.date == 9999)].groupby('date')['id'].count()
ongoing_events_a = full[(full.date == 9999) & (full.access == 'allow')].groupby('date')['id'].count()
ongoing_events_d = full[(full.date == 9999) & (full.access == 'deny')].groupby('date')['id'].count()

print ("There are a total of %s volumes with a publication date of '9999' listed in the Hathifiles \n" % all_ongoing['count'].iloc[0])

print ("There are %s volumes with a publication date of '9999' that triggered analytics events" % ongoing_events.iloc[0])

print ("There are %s 'Full View' volumes with a publication date of '9999' that triggered analytics events" % ongoing_events_a.iloc[0])

print ("There are %s 'Limited View' volumes with a publication date of '9999' that triggered analytics events" % ongoing_events_d.iloc[0])
There are a total of 952214 volumes with a publication date of '9999' listed in the Hathifiles 

There are 154264 volumes with a publication date of '9999' that triggered analytics events
There are 69290 'Full View' volumes with a publication date of '9999' that triggered analytics events
There are 84974 'Limited View' volumes with a publication date of '9999' that triggered analytics events
In [95]:
#Figure 9

ongoing_plot = pd.DataFrame()

ongoing_plot.loc['full view', 'all possible volumes\n with "9999"'] = all_ongoing_a['count'].iloc[0]
ongoing_plot.loc['limited view', 'all possible volumes\n with "9999"'] = all_ongoing_d['count'].iloc[0]


ongoing_plot.loc['full view', 'Accessed volumes\n with "9999"'] = ongoing_events_a.iloc[0]
ongoing_plot.loc['limited view', 'Accessed volumes\n with "9999"'] = ongoing_events_d.iloc[0]

ongoing_plot.T.plot(kind='barh', color=['g','r'], stacked=True, legend=True, width=.4, linewidth=.4, figsize=(11.5,7))
plt.xlabel('Volumes')
plt.title('Volumes with 9999 for publication year');
In [59]:
from __future__ import division
# Determine ongoing serials utilization rate

print( "%s of all possible 9999s are open volumes" % all_ongoing_a['count'].iloc[0])

print( "%s of all open 9999 volumes have been accessed" % ongoing_events_a.iloc[0])

print("The overall utilization rate of open 9999s in HathiTrust is {0:.2f}%".format(100.*(int(ongoing_events_a.iloc[0]) / int(all_ongoing_a['count'].iloc[0]))))
177023 of all possible 9999s are open volumes
69290 of all open 9999 volumes have been accessed
The overall utilization rate of open 9999s in HathiTrust is 39.14%
In [112]:
#Top serials and ongoing publications with publication date of '9999'
print("Top Serials and other 9999 volumes:")
counts_ids[(counts_ids.date == 9999) & (counts_ids.access == 'allow') & (counts_ids['count'] > 0)].sort_values('count', ascending=False)[['id','title','date','count']].head(20)
Top Serials and other 9999 volumes:
Out[112]:
id title date count
111450 mdp.39015071886035 [Publications] 9999.0 38244.0
236936 wu.89059402255 Roster of the Confederate soldiers of Georgia,... 9999.0 33184.0
236938 wu.89059402289 Roster of the Confederate soldiers of Georgia,... 9999.0 30867.0
236937 wu.89059402263 Roster of the Confederate soldiers of Georgia,... 9999.0 25966.0
236940 wu.89059402313 Roster of the Confederate soldiers of Georgia,... 9999.0 24448.0
59460 mdp.39015002304221 Encyclopedia of American Quaker genealogy, by ... 9999.0 23178.0
236939 wu.89059402297 Roster of the Confederate soldiers of Georgia,... 9999.0 22978.0
111407 mdp.39015071884410 [Publications] 9999.0 19089.0
98058 mdp.39015051447657 Encyclopedia of American Quaker genealogy, by ... 9999.0 11257.0
111484 mdp.39015071888437 [Publications] 9999.0 11221.0
121935 mdp.39015095766716 Physical and biophysical foundations of pharma... 9999.0 10736.0
71128 mdp.39015011819037 The abridged compendium of American genealogy;... 9999.0 10528.0
70814 mdp.39015011488064 A lexicon of St. Thomas Aquinas based on the S... 9999.0 10446.0
238402 wu.89062853635 Bosworth genealogy; a history of the descendan... 9999.0 9023.0
111473 mdp.39015071887850 [Publications] 9999.0 8984.0
165127 pst.000023992122 Calendar of inquisitions miscellaneous, Chance... 9999.0 8819.0
111413 mdp.39015071884634 [Publications] 9999.0 8465.0
111441 mdp.39015071885722 [Publications] 9999.0 8418.0
131901 njp.32101067262574 Monumenta Ignatiana, ex autographis vel ex ant... 9999.0 7154.0
111393 mdp.39015071883651 [Publications] 9999.0 6924.0

Language analysis

In [61]:
#Grab the numbers for language of publication, as far as the Hathifiles know
%time all_languages = pd.read_sql_query("SELECT language, count(language), publication_date \
           from hathifiles GROUP BY language, publication_date ORDER BY publication_date ASC", con=con)
CPU times: user 45.6 ms, sys: 8.05 ms, total: 53.7 ms
Wall time: 1min 39s
In [62]:
#Limit to just 1800-2017, to match other data
all_languages = all_languages[(all_languages.publication_date < 2018) & (all_languages.publication_date > 1799)]
In [63]:
#Reshape to get counts by date
all_languages = all_languages.sort_values(['publication_date','count'])

used_languages = counts_ids[(counts_ids.date < 2018) & (counts_ids.date > 1799)][['id','access','date','language']]
used_m = used_languages.groupby(['language','date'])['id'].count()

used_mm = used_languages.groupby(['language'])['id'].count()
used_mm = used_mm.sort_values(ascending=False)
In [64]:
# Similar reshape for all languages
aa = all_languages.groupby('language')['count'].sum()
aa = aa.sort_values(ascending=False)
In [65]:
#If we just want to look at top 20 languages by holding, use ab
ab = aa[aa.index[0:20]]

If we compare the "publication language" for all the items in the Hathifiles against the languages of just the volumes used since 1/1/2014, we can get a sense of which languages dominate the collection. In addition, we can then get a sense of which languages are under-utilized or over-utilized, relative to their overall holdings. (Figures 10 and 11)

In [96]:
#Figure 10

#Plotting the top 20 languages in terms of overall holdings
ab.sort_values(ascending=True).plot(kind='barh',figsize=(16,14))
plt.xlabel('Number of volumes in a given language')
plt.ylabel('Language')
plt.title('Top 20 Held Languages');
In [97]:
#Figure 11

#Plotting the top 20 languages in terms of used volumes
used_mm[0:20].sort_values(ascending=True).plot(kind='barh',figsize=(16,14))
plt.xlabel('Number of volumes in a given language used')
plt.ylabel('Language')
plt.title('Top 20 Most Used Languages');
In [74]:
#Merge used and all volumes data, and prep for porting to R
lf1 = pd.DataFrame({'series':aa})
lf2 = pd.DataFrame({'series':used_mm})
#merged_langs = aa.merge(used_mm)
lf = lf1.merge(lf2, left_index=True, right_index=True, how='outer')
lf.rename(columns={'series_x': 'total_volumes', 'series_y': 'used_volumes'}, inplace=True)

lf.loc[:, 'percent_used'] = lf.used_volumes / lf.used_volumes.sum() 
lf.loc[:, 'percent_all'] = lf.total_volumes / lf.total_volumes.sum() 
lf.loc[:, 'language'] = lf.index


lf = lf.sort_values('total_volumes', ascending=False)
In [75]:
# Port the top 50 languages, by holdings, to an R dataframe
rlf = pandas2ri.py2ri(lf[0:50])

Language utilization

The following two charts (figures 12 and 13) look at how the percentage of overall holdings compares to the percentage of used volumes for a given language.

The first chart (figure 12) compares usage vs holdings as a scatter plot, and includes a line marking the 1:1 slope -- therefore, languages above this line are disproportionately used relative to their volume in the overall corpus, while languages below the line are relatively under-utilized. Meanwhile, distance from the line indicates the extent of the disparity, so for example Ancient Greek (closer to the lower left) has relatively few holdings, but is heavily used given the volume, whereas there are significant holdings in Russian, Chinese, and Japanese (all closer to the top right) that are seemingly under-utilized.

For both charts, only the top 50 languages by holding are considered. Both charts use log scales to make the trends more visible.

In [101]:
%%R -w 800 -h 600 -u px -i rlf

#Figure 12

# Plotting the top 50 languages in terms of overall holdings, comparing usage vs holdings by percent (log scale)
# N.B. This excludes English, but English can be included by commenting out the line indicated below

library(ggrepel)
library(scales)
library(reshape2)
set.seed(42)

rlf %>%
filter(language != 'eng') %>% ### COMMENT THIS LINE OUT TO ADD 'eng' BACK TO THE PLOT
ggplot(aes(x=percent_all, y=percent_used, color=language))+
geom_point(size=4)+
geom_abline(intercept = 0, slope = 1)+
geom_text_repel(aes(percent_all, percent_used, label = language), size=8)+
theme_minimal(base_size = 18) +
theme(legend.position="none")+
scale_x_log10(labels = percent)+
scale_y_log10(labels = percent)+
xlab('Percent of All Volumes') + ylab('Percent of Used Volumes')+
ggtitle('Usage vs. Holdings Comparison')
In [102]:
%%R -w 800 -h 1000 -u px

#Figure 13

rlf %>%
mutate(indicator = ifelse( percent_all < 0.9*percent_used, 'Most heavily used',
                          ifelse(percent_all > 2*percent_used,
                                 "Most relatively under-used",
                                 "Roughly average use"))) %>%
select(-total_volumes, -used_volumes) %>%
rename(., 'Percent of \nUsed Volumes' = percent_used, 'Percent of \nAll Volumes' = percent_all) %>%
melt(variable.name = "variable_percent",
     value.name = "value",
     id.vars = c("language", "indicator"))  ->
rlf_melted



rlf_melted %>%{
ggplot(., aes(x = as.factor(reorder(variable_percent, desc(variable_percent))), y = value, group = language, color = indicator, label = language)) + 
  geom_line(size = 2, alpha = 3/10) + 
  geom_point(size = 5, alpha = 10/10) + 
 geom_text_repel(aes(),
                 show_guide = F,
                 size=6,
         nudge_x      = 0.25,
         direction    = "y",
         hjust        = 0,
          subset(., indicator != 'Roughly average use' & variable_percent  != 'Percent of \nAll Volumes')
                )+
      geom_text_repel(aes(),
                      show_guide = F,
                      size=6,
          nudge_x      = -0.25,
          direction    = "y",
          hjust        = 1, 
        subset(., indicator != 'Roughly average use' & variable_percent != 'Percent of \nUsed Volumes' )
                     )+
  theme_minimal(base_size = 18) + 
  scale_color_manual(values=c("darkgreen","#ff0000", '#dddddd'))+
  scale_y_log10(labels = percent)+
    guides(color=guide_legend(title="Use Relative to \nOverall Holdings:"))+
    xlab("")+ylab('Percent')+
    ggtitle('Relative Over- and Under-Utilization')
}

Query analysis

In [52]:
# Fun for later: 
# #To extract search queries

# def queries(): 
#     #Note: this extract only the first match of the pattern in the source string, but for our purposes this should be both fine and correct
#     
#     df = dd.read_csv('./data/uniqueEvents_201*.csv', delimiter='\|\~', engine='python')
#     pattern = '(?P<Query>q1=[^\&\n\|]*|lookfor=[^\&\n\|]*)'
#     t = df['ga:pagePath'].str.extract(pattern, expand=True)
#     df_queries = t[t['Query'].notnull()]
#     df_queries.to_csv('./csv/queries.csv', index=False)
# # %time queries()
In [53]:
#With some more parsing for common bigrams and unigrams, and using this repo: https://github.com/amueller/word_cloud
#I was able to produce what I think is a pretty compelling version of a word cloud:
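#A rough sketch of that step (not the exact code used) with the word_cloud package, assuming
#queries() above was run first to produce ./csv/queries.csv; the output path is my own choice,
#and the bigram/unigram cleanup mentioned above isn't reproduced here
from wordcloud import WordCloud
from urllib.parse import unquote_plus

q = pd.read_csv('./csv/queries.csv')
terms = ' '.join(unquote_plus(s.split('=', 1)[1]) for s in q['Query'].dropna())
WordCloud(width=1200, height=800).generate(terms).to_file('./images/query_wordcloud.png')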


HathiTrust search query word cloud