BibTeX Integration

After years of frustration keeping track of research papers, I have settled on a solution that, while perhaps not scalable, serves my needs. I archive papers in a directory that is backed up daily, and cross-reference them and their locations in a BibTeX file. One good reason to do this is the ease with which I can add citations to my work. I have found JabRef to be the best utility for this. It can automatically link the archived .pdf to its reference, which can be scraped from the arXiv or from the citation export utilities of most journals.

I use the LocalCopy plugin for JabRef to download and save papers automatically. Its default behaviour wasn't to my liking because it broke JabRef's autolinking; you can find a modified version of LocalCopy here.

A neat side effect is the ability to parse my global BibTeX file for my own publications and automatically insert them into my CV and webpage. I use Python for this, with the aid of Bibtex-py, a Python-based BibTeX parser. The link to it died sometime before September 2012, I think because the author released a new parser at bibliopy. The parser I use chokes on multi-line inputs and isn't actively maintained, so I had to make some changes to the source. It doesn't come with a redistribution license, so I won't put the modified code up at the moment. Email me if you're interested in what I did.
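As an illustration of the multi-line problem (this is not the modified parser mentioned above, only a small self-contained sketch), the following parser tracks brace depth instead of reading line by line, so a field value can span several lines. It assumes brace-delimited values with at most one level of nested braces; the function name `parse_bibtex` and the sample entries are made up for the example.

```python
import re

def parse_bibtex(text):
    """Parse BibTeX text into a list of dicts (a deliberately small sketch)."""
    entries = []
    # Each entry opens with @type{key,
    for match in re.finditer(r'@(\w+)\s*\{\s*([^,\s]+)\s*,', text):
        start = match.end()
        # Walk forward, counting braces, to find the entry's closing brace.
        depth, pos = 1, start
        while pos < len(text) and depth > 0:
            if text[pos] == '{':
                depth += 1
            elif text[pos] == '}':
                depth -= 1
            pos += 1
        body = text[start:pos - 1]
        fields = {'type': match.group(1).lower(), 'key': match.group(2)}
        # field = {value}; the value may span lines and hold one level of braces.
        for fm in re.finditer(r'(\w+)\s*=\s*\{((?:[^{}]|\{[^{}]*\})*)\}', body):
            # Collapse line breaks and runs of whitespace into single spaces.
            fields[fm.group(1).lower()] = ' '.join(fm.group(2).split())
        entries.append(fields)
    return entries

# Hypothetical sample data; the second author name is a placeholder.
SAMPLE = """
@article{Brenna2012,
  author = {W. G. Brenna and A. N. Other},
  title = {A title that
           spans two lines},
  journal = {J. Example},
  year = {2012}
}
@article{Other2011,
  author = {A. N. Other},
  title = {Something {E}lse},
  year = {2011}
}
"""

# Keep only the entries whose author list starts with my name.
mine = [e for e in parse_bibtex(SAMPLE)
        if e.get('author', '').startswith('W. G. Brenna')]
print([e['key'] for e in mine])
```

Quoted values, `@string` macros, and deeper brace nesting are not handled here; a maintained parser is the better choice when one is available.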

To dynamically link my resume, webpage, and BibTeX file, I wrote a small Python script. The features include:

A note: the LaTeX style file I currently use for my resume was written by Daniel Burrows and modified slightly by me.

The second phase of this project was to annotate research papers and easily access those notes. This solution is under heavy development at the moment, but because Xournal and Xournalpp use an open file format (see xournalpp for details on that project), it is possible to extract markup and organise it in a unified way. Because I use Vim, I have modified the working org-mode script (from the delta improvement blog) to use Vim's UTL package. Stay tuned for details on this.
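As a rough sketch of why the open format helps: a .xoj file is gzip-compressed XML (the annotation script further down opens it with gzip and an expat parser), so typed notes can be pulled out with the standard library alone. The element names `page` and `text` follow the ones that script handles. The `read_text_notes` helper is hypothetical, and it counts `page` elements rather than reading the `pageno` attribute of `background` as the script does.

```python
import gzip
import xml.etree.ElementTree as ET

def read_text_notes(xoj_path):
    """Return (page_number, text) pairs for every text box in a .xoj file."""
    # .xoj files are gzip-compressed XML, so decompress while parsing.
    with gzip.open(xoj_path, 'rb') as handle:
        root = ET.parse(handle).getroot()
    notes = []
    # Number the pages in document order, starting from 1.
    for page_number, page in enumerate(root.iter('page'), start=1):
        for text in page.iter('text'):
            notes.append((page_number, (text.text or '').strip()))
    return notes
```

Calling `read_text_notes('paper.pdf.xoj')` on an annotated file (the file name here is only an example) gives a list that can then be written out in whatever link format is wanted.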


#!/usr/bin/python
"""Bibtex.py: A utility to parse, sort, and extract BibTex reference data."""
__author__ = "Wilson Brenna"
import bibparse
import re

#NB this has been vastly improved since I put this on the web, but
#it's a large file and rather messy - contact me if you want it!

entries = bibparse.parse_bib('../papers/jabref')

filename1 = 'resumepubs.tex'
filename2 = '../public_html/htmlpubs.html'

f1 = open(filename1,'w')
f2 = open(filename2,'w')

j = 0
maxstring = {}
htmlstring = {}

# NB: the range stops one short, so the last parsed entry is never examined.
for i in range(len(entries)-1):
	entry = entries[i]
	if entry.data['author'].startswith("W. G. Brenna"):
		# LaTeX line for the resume:
		# \affiliation[author: ``title'' ]{journal \emph{volume} (year)}{}
		string1 = '\\affiliation[' + entry.data['author']
		string2 = ': ``' + entry.data['title'] + '\'\' ]'
		string3 = '{' + entry.data['journal'] + ' \\emph{' + entry.data['volume'] + '} (' + entry.data['year'] + ')}{}'
		# Key on the year plus a running counter so that keys sort by year and stay unique.
		key = entry.data['year'] + str(j)
		maxstring[key] = string1 + string2 + string3 + '\n'
		j = j + 1
		# HTML line for the webpage: author, <i>title</i>, journal <b>volume</b> (year)
		string4 = entry.data['author'] + ', <i>' + entry.data['title'] + '</i>, '
		string5 = entry.data['journal'] + ' <b>' + entry.data['volume'] + '</b> (' + entry.data['year'] + ')'

		# Link to the arXiv PDF when the entry carries an Oai2Identifier field.
		# re.search returns None when the field is absent, and calling group()
		# on None would crash, so check before using the match.
		oai2 = re.search(r'Oai2Identifier = \{(.+)\}', entry.export())
		if oai2 is not None:
			htmlstring[key] = string4 + string5 + ' <a href="http://arxiv.org/pdf/' + oai2.group(1) + '.pdf">arXiv:' + oai2.group(1) + '</a>\n'
		else:
			htmlstring[key] = string4 + string5 + '\n'


j = 1

for key in sorted(maxstring, reverse=True):
	f1.write(maxstring[key])
	f2.write('<p />[' + str(j) + '] ')
	f2.write(htmlstring[key])
	j = j + 1

f1.close()
f2.close()
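A note on the sort keys used above: each key is the year string with a running counter appended, so keys are compared character by character. Within one year, a counter of 10 therefore sorts below 2 and 9. The sketch below shows the effect and one possible alternative, keying on a (year, counter) tuple. The `newest_first` helper and the sample records are hypothetical and not part of the script.

```python
# String keys as built in the script: year + str(counter)
string_keys = ['2012' + str(j) for j in (2, 9, 10)]
string_order = sorted(string_keys, reverse=True)
# '201210' ends up last because '1' < '2' at the fifth character.

def newest_first(records):
    """Sort (year, counter, text) records by numeric year, then counter, descending."""
    return sorted(records, key=lambda rec: (int(rec[0]), rec[1]), reverse=True)

# Made-up records standing in for the formatted publication lines.
records = [('2012', 2, 'b'), ('2012', 9, 'c'), ('2012', 10, 'd'), ('2011', 0, 'a')]
tuple_order = [text for _, _, text in newest_first(records)]

print(string_order)
print(tuple_order)
```

With fewer than ten matching entries the string keys behave the same as the tuples, so this only matters once the publication list grows.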


#!/usr/bin/env python
license="""
Copyright (c) 2009, dalai@delta|improvement
All rights reserved.

Modified by Wilson 2013 wbrenna.ca
	-changed to output VIM-UTL format instead of org-mode.
	-altered the figure search to work with Xournalpp as well as Xournal colours.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

    * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
    * The name of the original author may not be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
"""

import xml.parsers.expat
import math
import gzip,tempfile
import os,sys,optparse
import atexit
from subprocess import Popen, PIPE

# Prepare the option parser
parser = optparse.OptionParser(usage="usage: %prog [options] input.pdf", version="ProcessPaper 0.2 by dalai@delta|improvement")
parser.add_option("-l", "--license", dest="license", default=False, action="store_true", help="print license information and exit")
parser.add_option("-i", "--image", dest="image", default=False, help="get everything as image (good for scanned documents)", action="store_true")
parser.add_option("", "--overwrite", dest="overwrite", default=False, help="overwrite old files", action="store_true")
parser.add_option("-s", "--store", dest="imagedir", action="store", type="string", default="./figs/", help="directory to store the figure files. Default is ./figs/")
parser.add_option("-o", "--ocr", dest="ocr", default=False, help="try to OCR the highlighted regions (implies -i)", action="store_true")
parser.add_option("-r", "--resolution", dest="resolution", type="int", default=300, help="PDF rasterization resolution for OCR in dpi. Default is 300dpi")
parser.add_option("-v", "--verbose", dest="verbose", default=False, help="print extra information on what is going on", action="store_true")
parser.add_option("-d", "--debug", dest="debug", default=False, help="print debugging info", action="store_true")

def checkstate(pcmd,out,err,process,debug):
	if debug:
		print >>sys.stderr, "DEBUG: Attempting to execute %s" %' '.join(pcmd)
	if process.returncode:
		print >>sys.stderr, "ERROR: %s" %err
		print >>sys.stderr, "ERROR: Process exited with return code %i; quitting." % process.returncode
		sys.exit(process.returncode)

def which(program):
	"""Checks if a named executable exists on the $PATH"""
	for path in os.environ["PATH"].split(os.pathsep):
		fname = os.path.join(path, program)
		if os.path.exists(fname) and os.access(fname, os.X_OK):
			return True
	return False

def start_element(name, attrs):
    global cdata,figure,pagenum,strokewidth,pageheight
    if name == "background":
        pagenum = int(float(attrs['pageno']))
    if name == "page":
        pageheight = int(float(attrs['height']))
    if name == "stroke":
        cdata = ''
        try:
            strokewidth = float(attrs['width'])
            # Black strokes mark figures. This works with Xournalpp as well as Xournal.
            if (attrs['color'] == "black") or (attrs['color'] == "#0000007f"):
                figure = True
            else:
                figure = False
        except (KeyError, ValueError):
            # The width is missing or is not a single number.
            print >>sys.stderr, 'Could not find bounding width. Probably this is handwritten annotation. Just putting in a link.'
            strokewidth = 0
    if name == "text":
        cdata = ''

def end_element(name):
    global cdata,pagenum,fignum,strokewidth,pageheight
    if name == "stroke": 
        strokedata=cdata.lstrip().rstrip().split(" ")
	# Calculate the bounding box
        lx = int(math.floor(float(min(strokedata[::2],key=float))-strokewidth/2.0))
        hx = int(math.floor(float(max(strokedata[::2],key=float))+strokewidth/2.0))
        ly = pageheight - int(math.floor(float(max(strokedata[1::2],key=float))+strokewidth/2.0))
        hy = pageheight - int(math.floor(float(min(strokedata[1::2],key=float))-strokewidth/2.0))
        bbox = "\'%i %i %i %i\'" %(lx,ly,hx,hy)
	# Get an appropriate image name
        if options.image or figure:
            fignum += 1
#Fix up the figures directory
            #imagename = os.path.dirname(absinput) + "/%s/%s.fg%03i.pdf" %(options.imagedir,citation,fignum)
            imagename = os.path.dirname(curdir) + "/%s/%s.fg%03i.pdf" %(options.imagedir,citation,fignum)
            #tiffname = os.path.dirname(absinput) + "/%s/%s.fg%03i.tif" %(options.imagedir,citation,fignum)
            tiffname = os.path.dirname(curdir) + "/%s/%s.fg%03i.tif" %(options.imagedir,citation,fignum)
            if os.path.exists(imagename) and not options.overwrite:
                print >>sys.stderr, "ERROR: \"%s\" exists. Will not overwrite." %imagename
                sys.exit(1)
        # Process PDF
        try:
            pcmd = ['pdftk', absinput, 'cat', str(pagenum), 'output',tfileA]
            proc_page = Popen(pcmd, stdout=PIPE, stderr=PIPE)
            out,err = proc_page.communicate()
            checkstate(pcmd,out,err,proc_page,options.debug)
            if options.image or figure:
                pcmd = ['pdfcrop','--bbox',bbox,tfileA,imagename]
            else:
                pcmd = ['pdfcrop','--bbox',bbox,tfileA,tfileB]
            proc_crop = Popen(' '.join(pcmd), stdout=PIPE, stderr=PIPE,shell=True)
            out,err = proc_crop.communicate()
            checkstate(pcmd,out,err,proc_crop,options.debug)
            if options.image or figure:
                if options.ocr and not figure:
                    #print '** [[pdf:%s@%i][%s|p%i]]' %(absinput,pagenum,citation,pagenum)
                    print '<URL:%s#%i>' %(absinput,pagenum)
                    # Rasterise the cropped PDF, then OCR the resulting TIFF
                    pcmd = ['convert','-density',str(options.resolution),imagename,tiffname]
                    proc_2tif = Popen(pcmd, stdout=PIPE, stderr=PIPE)
                    out,err = proc_2tif.communicate()
                    checkstate(pcmd,out,err,proc_2tif,options.debug)
                    pcmd = ['tesseract',tiffname,tfileOCR]
                    proc_text = Popen(pcmd, stdout=PIPE, stderr=PIPE)
                    out,err = proc_text.communicate()
                    checkstate(pcmd,out,err,proc_text,options.debug)
                    # tesseract writes its output to tfileOCR + '.txt'
                    ocrdatafile = open(tfileOCR+'.txt','r')
                    print ocrdatafile.read()
                    ocrdatafile.close()
                    print ' Original image <URL:%s> Figure %i' %(imagename,fignum)
                    #print ' Original image [[%s][Figure %i]]' %(imagename,fignum)
                    os.unlink(tiffname)
                else:
                    print ' Original image <URL:%s> Figure %i' %(imagename,fignum)
                    #print '** [[%s][%s|Figure %i]]' %(imagename,citation,fignum)
            else:
                pcmd = ['pdftotext','-nopgbrk',tfileB,'-']
                proc_text = Popen(pcmd, stdout=PIPE, stderr=PIPE)
                out,err = proc_text.communicate()
                checkstate(pcmd,out,err,proc_text,options.debug)
                #print '** [[pdf:%s@%i][%s|p%i]]' %(absinput,pagenum,citation,pagenum)
                print '<URL:%s#%i>\n %s, p%i' %(absinput,pagenum,citation,pagenum)
                print out
        except OSError:
            print >>sys.stderr,  "ERROR: OS Error. This shouldn't normally happen. Sorry."
            sys.exit(1)
    if name == "text":
        title,sep,note=cdata.partition('\n')
        #print '** [[pdf:%s@%i][%s|p%i|%s]]' %(absinput,pagenum,citation,pagenum,title)
        print '<URL:%s#%i>\n %s' %(absinput,pagenum,title)
        print note
    cdata = ''

def char_data(data):
    global cdata
    cdata += data

# HERE STARTS THE MAIN PROGRAM

(options, args) = parser.parse_args()

if options.license:
	print license
	sys.exit()

# Check if everything that is needed is actually installed
for program in 'pdftk', 'pdfcrop', 'pdftotext':
	if not which(program):
		print >>sys.stderr, "ERROR: %s is needed but doesn't seem to be installed. Either install %s or check your PATH." % (program, program)
		sys.exit(1)
if options.ocr:
	for program in 'tesseract', 'convert':
		if not which(program):
			print >>sys.stderr, "ERROR: I need %s to do OCR. Either install %s or check your PATH." % (program, program)
			sys.exit(1)

if len(args) < 1:
	print >>sys.stderr,  "ERROR: Missing input file"
	sys.exit(1)
absinput = os.path.abspath(args[0])
citation = os.path.splitext(os.path.basename(absinput))[0]
#Fix up the figures directory
curdir = os.getcwd() + '/'
#print >>sys.stderr, curdir

if not os.path.isdir(options.imagedir):
	if options.verbose:
		print >>sys.stderr, "INFO: The directory %s does not exist. Creating one now." %options.imagedir
	try:
		os.makedirs(options.imagedir)
	except OSError:
		print >>sys.stderr, "ERROR: Could not create directory %s. Exiting." %options.imagedir
		sys.exit(1)

# I need two temporary files
fa = tempfile.NamedTemporaryFile(delete=False)
fb = tempfile.NamedTemporaryFile(delete=False)
tfileA = fa.name
tfileB = fb.name
fa.close()
fb.close()

# If I'm doing OCR then make sure everything is grabbed as image,
# and create one more temporary file for the OCR output
if options.ocr:
	options.image = True
	focr = tempfile.NamedTemporaryFile(delete=False, suffix=".txt")
	tfileOCR = os.path.splitext(focr.name)[0]
	focr.close()
def cleanup():
	os.unlink(fa.name)
	os.unlink(fb.name)
	if options.ocr:
		os.unlink(focr.name)
atexit.register(cleanup)

# Some initial variables
cdata = ''
figure = False
pagenum = 0
fignum = 0
notenum = 0
strokewidth = 0
pageheight = 0

# Prepare the XML parser
p = xml.parsers.expat.ParserCreate()
p.StartElementHandler = start_element
p.EndElementHandler = end_element
p.CharacterDataHandler = char_data

if options.verbose:
	print >>sys.stderr, "INFO: Opening Xournal file."
myfile = gzip.open(absinput+'.xoj', 'rb')
#print "* [[%s][Notes from %s]]"%(absinput,citation)
print "<URL:%s> [[Notes from %s]]"%(absinput,citation)
if options.verbose:
	print >>sys.stderr, "INFO: Parsing XML."
p.ParseFile(myfile)
if options.verbose:
	print >>sys.stderr, "INFO: Done."
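The bounding-box arithmetic in `end_element` can be isolated as a small function, which makes it easier to check by hand. Xournal measures y downwards from the top of the page, whereas the box handed to pdfcrop is measured from the bottom, which is presumably why the script subtracts from the page height. The function name `stroke_bbox` and the sample numbers are made up for this sketch.

```python
import math

def stroke_bbox(stroke_data, stroke_width, page_height):
    """Return (lx, ly, hx, hy) for a stroke, mirroring the script's arithmetic.

    stroke_data is the stroke's character data: "x1 y1 x2 y2 ...".
    """
    coords = [float(value) for value in stroke_data.split()]
    xs, ys = coords[0::2], coords[1::2]
    half = stroke_width / 2.0
    # Pad the extremes by half the stroke width, then round down.
    lx = int(math.floor(min(xs) - half))
    hx = int(math.floor(max(xs) + half))
    # Flip the y axis: the largest Xournal y becomes the lowest edge of the box.
    ly = page_height - int(math.floor(max(ys) + half))
    hy = page_height - int(math.floor(min(ys) - half))
    return lx, ly, hx, hy

# A horizontal highlighter stroke, 10 units wide, on a 792-unit-high page.
print(stroke_bbox("100.0 200.0 300.0 200.0", 10.0, 792))
```

For the sample stroke this gives a box from (95, 587) to (305, 597): ten units tall, centred 200 units below the top of the page.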