BibTeX Integration
After years of frustration keeping track of research papers, I have arrived at a solution that,
while perhaps not scalable, serves my needs.
I archive papers in a directory that is backed up daily, and cross-reference
them and their locations in a BibTeX file.
There are good reasons to do this, one of which is the ease with which I can add citations to my work.
I have found JabRef to be the best utility for that.
It can auto-link the archived .pdf with the reference, which can be scraped from the arXiv
or from the citation export utilities of most journals.
I use the LocalCopy plugin for JabRef to download and save papers automatically. The default
behaviour wasn't to my liking because it broke JabRef's autolinking; you can
find a modified version of LocalCopy here.
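To make the cross-referencing concrete, here is a minimal Python 3 sketch that lists the PDFs linked from a BibTeX file and reports any that are missing from the archive. It assumes that each link is stored in a `file` field of the form `description:path:type`, with several links separated by semicolons; that layout is an assumption, so check it against your own .bib file. The sample entry, its title, and the function names are invented for illustration, and paths that themselves contain colons are not handled.

```python
import os
import re

# Hypothetical entry; the "description:path:type" layout of the file field
# is an assumption, not something taken from the original page.
SAMPLE = """@article{brenna2012,
  author = {W. G. Brenna},
  title = {An example title},
  file = {:papers/brenna2012.pdf:PDF}
}"""

FILE_FIELD = re.compile(r'file\s*=\s*\{([^}]*)\}', re.IGNORECASE)


def linked_files(bibtex_text):
    """Return the paths named in every file field of the given BibTeX text."""
    paths = []
    for field in FILE_FIELD.findall(bibtex_text):
        for item in field.split(';'):
            parts = item.split(':')
            if len(parts) >= 3:
                paths.append(parts[1])
    return paths


def missing_files(bibtex_text, root):
    """Return the linked paths that do not exist below the archive root."""
    return [path for path in linked_files(bibtex_text)
            if not os.path.exists(os.path.join(root, path))]
```

Running `missing_files` over the whole file after each backup would be one way to catch links that have gone stale.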
A neat side effect is the ability to parse my global BibTeX file for my own publications and automatically insert them into my CV and webpage.
I use Python for this, with the aid of Bibtex-py, a Python-based BibTeX parser.
That link died sometime before September 2012, I think because the author
released a new parser at bibliopy.
It chokes on multi-line inputs and isn't actively maintained, so I had to
make some changes to the source, but it doesn't come with a redistribution license
so I won't put the modified code up at the moment.
Email me if you're interested in what I did.
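For readers wondering what tolerating multi-line input can look like, the following Python 3 sketch reads brace-delimited fields whose values wrap over several lines. It is not the modification described above (that code is not published here) and it is not part of either parser; `parse_fields` and the sample entry are invented for illustration, and only `name = {value}` fields are handled.

```python
import re

# Invented sample entry with a title that wraps onto a second line.
SAMPLE = """@article{key2012,
  author = {W. G. Brenna},
  title = {A title that wraps
           over {Two} lines},
  year = {2012}
}"""

FIELD_START = re.compile(r'(\w+)\s*=\s*\{')


def parse_fields(entry_text):
    """Return {field: value} for brace-delimited fields, which may span lines."""
    fields = {}
    pos = 0
    while True:
        match = FIELD_START.search(entry_text, pos)
        if match is None:
            break
        # Walk forward until the opening brace of the value is balanced.
        depth = 1
        i = match.end()
        while i < len(entry_text) and depth > 0:
            if entry_text[i] == '{':
                depth += 1
            elif entry_text[i] == '}':
                depth -= 1
            i += 1
        if depth != 0:
            break  # unbalanced braces: stop rather than guess
        value = entry_text[match.end():i - 1]
        # Collapse newlines and runs of spaces into single spaces.
        fields[match.group(1).lower()] = ' '.join(value.split())
        pos = i
    return fields
```

Counting braces instead of reading line by line is what lets the value continue across line breaks and contain nested groups such as `{Two}`.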
To dynamically link my resume, webpage, and BibTeX file, I wrote a small Python script.
The script:
- orders (and labels) the references from most recent to oldest
- is easily modified to export only the X most recent articles
- automatically links open access identifiers to arXiv
- exports to both LaTeX and HTML
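The features listed above can be sketched in a few lines of Python 3. This is a simplified illustration that works on plain dictionaries, not the script itself: `format_publications`, the `limit` parameter, the `arxiv` key, and the sample entries are all invented here, while the LaTeX and HTML layouts follow the ones used in the listing further down.

```python
def format_publications(entries, author_prefix, limit=None):
    """Return (latex_lines, html_lines) for matching entries, newest first."""
    mine = [e for e in entries if e['author'].startswith(author_prefix)]
    # The sort is stable, so entries from the same year keep their file order.
    mine.sort(key=lambda e: int(e['year']), reverse=True)
    if limit is not None:
        mine = mine[:limit]  # keep only the most recent articles
    latex_lines = []
    html_lines = []
    for number, e in enumerate(mine, start=1):
        latex_lines.append(
            "\\affiliation[%s: ``%s'' ]{%s \\emph{%s} (%s)}{}"
            % (e['author'], e['title'], e['journal'], e['volume'], e['year']))
        html = '<p />[%d] %s, <i>%s</i>, %s <b>%s</b> (%s)' % (
            number, e['author'], e['title'], e['journal'], e['volume'], e['year'])
        arxiv_id = e.get('arxiv')
        if arxiv_id:
            html += ' <a href="http://arxiv.org/pdf/%s.pdf">arXiv:%s</a>' % (
                arxiv_id, arxiv_id)
        html_lines.append(html)
    return latex_lines, html_lines


# Invented sample data; the 'arxiv' key is a hypothetical stand-in for the
# open access identifier stored with the entry.
ENTRIES = [
    {'author': 'W. G. Brenna', 'title': 'Older paper', 'journal': 'Example J.',
     'volume': '1', 'year': '2011'},
    {'author': 'A. N. Other', 'title': 'Not mine', 'journal': 'Example J.',
     'volume': '3', 'year': '2013'},
    {'author': 'W. G. Brenna', 'title': 'Newer paper', 'journal': 'Example J.',
     'volume': '2', 'year': '2012', 'arxiv': '1234.5678'},
]
latex_lines, html_lines = format_publications(ENTRIES, 'W. G. Brenna')
```

Passing `limit=5` would keep only the five most recent articles, which is the kind of small modification the second bullet refers to.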
A note: the LaTeX style file I currently use for my resume was written by
Daniel Burrows and modified slightly by me.
The second phase of this project was to annotate research papers and access those notes easily.
This solution is under heavy development at the moment, but because of the open standard used by Xournal
and Xournalpp (see xournalpp for details on that project), it is possible to
extract markup and organise it in a unified way. Because I use Vim, I have modified the working orgmode
script (from the delta improvement blog) to use Vim's UTL package.
Stay tuned for details on this.
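The reason the markup is easy to extract is that a Xournal .xoj file is gzip-compressed XML. As a minimal Python 3 sketch, separate from the full script listed further down, the snippet below builds a tiny file in memory and lists its text notes and strokes. The sample document is invented, `read_annotations` is a hypothetical helper, and pages are numbered by their order in the file here, not by the `pageno` attribute of the background element.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Invented sample in the general shape of a Xournal file.
SAMPLE_XML = b"""<?xml version="1.0" standalone="no"?>
<xournal version="0.4.5">
<page width="612.00" height="792.00">
<background type="pdf" pageno="1"/>
<layer>
<stroke tool="highlighter" color="yellow" width="8.50">
100.0 200.0 300.0 200.0
</stroke>
<text font="Sans" size="12.00" x="50.0" y="60.0" color="black">Check this derivation</text>
</layer>
</page>
</xournal>
"""


def read_annotations(fileobj):
    """Return (page, kind, payload) tuples for text notes and strokes."""
    root = ET.parse(fileobj).getroot()
    notes = []
    for page_index, page in enumerate(root.iter('page'), start=1):
        for element in page.iter():
            if element.tag == 'text':
                notes.append((page_index, 'text', (element.text or '').strip()))
            elif element.tag == 'stroke':
                coords = [float(v) for v in (element.text or '').split()]
                # A stroke is a flat list of x y pairs; report the point count.
                notes.append((page_index, 'stroke', len(coords) // 2))
    return notes


# Compress the sample in memory, then read it back the way a .xoj file is read.
compressed = io.BytesIO(gzip.compress(SAMPLE_XML))
with gzip.open(compressed, 'rb') as handle:
    notes = read_annotations(handle)
```

For a real file, `gzip.open('paper.pdf.xoj', 'rb')` would take the place of the in-memory buffer; the file name is only an example.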
#!/usr/bin/python
"""Bibtex.py: A utility to parse, sort, and extract BibTeX reference data."""
__author__ = "Wilson Brenna"
import bibparse
import re
#NB this has been vastly improved since I put this on the web, but
#it's a large file and rather messy - contact me if you want it!
entries = bibparse.parse_bib('../papers/jabref')
filename1 = 'resumepubs.tex'
filename2 = '../public_html/htmlpubs.html'
f1 = open(filename1,'w')
f2 = open(filename2,'w')
j = 0
maxstring = {}
htmlstring = {}
#NB: range(len(entries)-1) leaves out the final parsed entry
for i in range(len(entries)-1):
    if (entries[i].data['author']).startswith("W. G. Brenna"):
        string1 = '\\affiliation[' + entries[i].data['author']
        string2 = ': ``' + entries[i].data['title'] + '\'\' ]'
        string3 = '{' + entries[i].data['journal'] + ' \\emph{' + entries[i].data['volume'] + '} (' + entries[i].data['year'] + ')}{}'
        #Key on year plus a counter so entries from the same year stay distinct
        maxstring[entries[i].data['year']+str(j)] = string1 + string2 + string3 + '\n'
        #Write out this string to tex file
        j = j + 1
        string4 = entries[i].data['author'] + ', <i>' + entries[i].data['title'] + '</i>, '
        string5 = entries[i].data['journal'] + ' <b>' + entries[i].data['volume'] + '</b> (' + entries[i].data['year'] + ')'
        oai2= re.search('Oai2Identifier = {(.+)}', entries[i].export())
        if oai2 is not None:
            htmlstring[entries[i].data['year']+str(j-1)] = string4 + string5 + ' <a href="http://arxiv.org/pdf/' + oai2.group(1) + '.pdf">arXiv:' + oai2.group(1) + '</a>\n'
        else:
            htmlstring[entries[i].data['year']+str(j-1)] = string4 + string5 + '\n'
        #Write out this string to html file -
        #the oai2 crashes if you search for it and it's not there, so you
        #need to do this manually without error handling
#Write both files from the most recent key to the oldest, numbering the html list
j = 1
for key in sorted(maxstring.iterkeys(),reverse=True):
    f1.write(maxstring[key])
    f2.write('<p />[' + str(j) + '] ')
    f2.write(htmlstring[key])
    j = j + 1
f1.close()
f2.close()
#!/usr/bin/env python
license="""
Copyright (c) 2009, dalai@delta|improvement
All rights reserved.
Modified by Wilson 2013 wbrenna.ca
-changed to output VIM-UTL format instead of org-mode.
-altered the figure search to work with Xournalpp as well as Xournal colours.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
* The name of the original author may not be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
"""
import xml.parsers.expat
import math
import gzip,tempfile
import os,sys,optparse
import atexit
from subprocess import Popen, PIPE
# Prepare the option parser
parser = optparse.OptionParser(usage="usage: %prog [options] input.pdf", version="ProcessPaper 0.2 by dalai@delta|improvement")
parser.add_option("-l", "--license", dest="license", default=False, action="store_true", help="print license information and exit")
parser.add_option("-i", "--image", dest="image", default=False, help="get everything as image (good for scanned documents)", action="store_true")
parser.add_option("", "--overwrite", dest="overwrite", default=False, help="overwrite old files", action="store_true")
parser.add_option("-s", "--store", dest="imagedir", action="store", type="string", default="./figs/", help="directory to store the figure files. Default is ./figs/")
parser.add_option("-o", "--ocr", dest="ocr", default=False, help="try to OCR the highlighted regions (implies -i)", action="store_true")
parser.add_option("-r", "--resolution", dest="resolution", type="int", default=300, help="PDF rasterization resolution for OCR in dpi. Default is 300dpi")
parser.add_option("-v", "--verbose", dest="verbose", default=False, help="print extra information on what is going on", action="store_true")
parser.add_option("-d", "--debug", dest="debug", default=False, help="print debugging info", action="store_true")
def checkstate(pcmd,out,err,process,debug):
    if debug:
        print >>sys.stderr, "DEBUG: Attempting to execute %s" %' '.join(pcmd)
    if process.returncode:
        print >>sys.stderr, "ERROR: %s" %err
        print >>sys.stderr, "ERROR: Process exited with return code %i; quitting." % process.returncode
        sys.exit(process.returncode)
def which(program):
    """Checks if a named executable exists on the $PATH"""
    for path in os.environ["PATH"].split(os.pathsep):
        fname = os.path.join(path, program)
        if os.path.exists(fname) and os.access(fname, os.X_OK):
            return True
    return False
def start_element(name, attrs):
    global cdata,figure,pagenum,strokewidth,pageheight
    if name == "background":
        pagenum = int(float(attrs['pageno']))
    if name == "page":
        pageheight = int(float(attrs['height']))
    if name == "stroke":
        cdata = ''
        #print >>sys.stderr, attrs['width']
        try:
            strokewidth = float(attrs['width'])
            #This now works with Xournalpp as well as Xournal.
            if (attrs['color'] == "black") or (attrs['color'] == "#0000007f"):
                #print attrs['color']
                figure = True
            else:
                #print >>sys.stderr, attrs['color']
                figure = False
        except (KeyError, ValueError):
            print >>sys.stderr, 'Could not find bounding width. Probably this is handwritten annotation. Just putting in a link.'
            #strokewidth = float(min(attrs['width']))
            strokewidth = 0
    if name == "text":
        cdata = ''
def end_element(name):
    global cdata,pagenum,fignum,strokewidth,pageheight
    if name == "stroke":
        strokedata=cdata.lstrip().rstrip().split(" ")
        # Calculate the bounding box
        lx = int(math.floor(float(min(strokedata[::2],key=float))-strokewidth/2.0))
        hx = int(math.floor(float(max(strokedata[::2],key=float))+strokewidth/2.0))
        ly = pageheight - int(math.floor(float(max(strokedata[1::2],key=float))+strokewidth/2.0))
        hy = pageheight - int(math.floor(float(min(strokedata[1::2],key=float))-strokewidth/2.0))
        bbox = "\'%i %i %i %i\'" %(lx,ly,hx,hy)
        # Get an appropriate image name
        if options.image or figure:
            fignum += 1
            #Fix up the figures directory
            #imagename = os.path.dirname(absinput) + "/%s/%s.fg%03i.pdf" %(options.imagedir,citation,fignum)
            imagename = os.path.dirname(curdir) + "/%s/%s.fg%03i.pdf" %(options.imagedir,citation,fignum)
            #tiffname = os.path.dirname(absinput) + "/%s/%s.fg%03i.tif" %(options.imagedir,citation,fignum)
            tiffname = os.path.dirname(curdir) + "/%s/%s.fg%03i.tif" %(options.imagedir,citation,fignum)
            if os.path.exists(imagename) and not options.overwrite:
                print >>sys.stderr, "ERROR: \"%s\" exists. Will not overwrite." %imagename
                sys.exit()
        # Process PDF
        try:
            pcmd = ['pdftk', absinput, 'cat', str(pagenum), 'output',tfileA]
            proc_page = Popen(pcmd, stdout=PIPE, stderr=PIPE)
            out,err = proc_page.communicate()
            checkstate(pcmd,out,err,proc_page,options.debug)
            if options.image or figure:
                pcmd = ['pdfcrop','--bbox',bbox,tfileA,imagename]
            else:
                pcmd = ['pdfcrop','--bbox',bbox,tfileA,tfileB]
            proc_crop = Popen(' '.join(pcmd), stdout=PIPE, stderr=PIPE,shell=True)
            out,err = proc_crop.communicate()
            checkstate(pcmd,out,err,proc_crop,options.debug)
            if options.image or figure:
                if options.ocr and not figure:
                    #print '** [[pdf:%s@%i][%s|p%i]]' %(absinput,pagenum,citation,pagenum)
                    print '<URL:%s#%i>' %(absinput,pagenum)
                    pcmd = ['convert','-density',str(options.resolution),imagename,tiffname]
                    proc_2tif = Popen(pcmd, stdout=PIPE, stderr=PIPE)
                    out,err = proc_2tif.communicate()
                    checkstate(pcmd,out,err,proc_2tif,options.debug)
                    pcmd = ['tesseract',tiffname,tfileOCR]
                    proc_text = Popen(pcmd, stdout=PIPE, stderr=PIPE)
                    out,err = proc_text.communicate()
                    checkstate(pcmd,out,err,proc_text,options.debug)
                    ocrdatafile = open(tfileOCR+'.txt','r')
                    print ocrdatafile.read()
                    print ' Original image <URL:%s> Figure %i' %(imagename,fignum)
                    #print ' Original image [[%s][Figure %i]]' %(imagename,fignum)
                    os.unlink(tiffname)
                else:
                    print ' Original image <URL:%s> Figure %i' %(imagename,fignum)
                    #print '** [[%s][%s|Figure %i]]' %(imagename,citation,fignum)
            else:
                pcmd = ['pdftotext','-nopgbrk',tfileB,'-']
                proc_text = Popen(pcmd, stdout=PIPE, stderr=PIPE)
                out,err = proc_text.communicate()
                checkstate(pcmd,out,err,proc_text,options.debug)
                #print '** [[pdf:%s@%i][%s|p%i]]' %(absinput,pagenum,citation,pagenum)
                print '<URL:%s#%i>\n %s, p%i' %(absinput,pagenum,citation,pagenum)
                print out
        except OSError:
            print >>sys.stderr, "ERROR: OS Error. This normally shouldn't have happened. Sorry."
            sys.exit()
    if name == "text":
        title,sep,note=cdata.partition('\n')
        #print '** [[pdf:%s@%i][%s|p%i|%s]]' %(absinput,pagenum,citation,pagenum,title)
        print '<URL:%s#%i>\n %s' %(absinput,pagenum,title)
        print note
    cdata = ''
def char_data(data):
    global cdata
    cdata += data
# HERE STARTS THE MAIN PROGRAM
(options, args) = parser.parse_args()
if options.license:
    print license
    sys.exit()
# Check if everything that is needed is actually installed
for program in 'pdftk', 'pdfcrop', 'pdftotext':
    if not which(program):
        print >>sys.stderr, "ERROR: %s is needed but doesn't seem to be installed. Either install %s or check your PATH." % (program, program)
        sys.exit(1)
if options.ocr:
    for program in 'tesseract', 'convert':
        if not which(program):
            print >>sys.stderr, "ERROR: I need %s to do OCR. Either install %s or check your PATH." % (program, program)
            sys.exit(1)
if len(args) < 1:
    print >>sys.stderr, "ERROR: Missing input file"
    sys.exit()
absinput = os.path.abspath(args[0])
citation = os.path.splitext(os.path.basename(absinput))[0]
#Fix up the figures directory
curdir = os.getcwd() + '/'
#print >>sys.stderr, curdir
if not os.path.isdir(options.imagedir):
    if options.verbose:
        print >>sys.stderr, "INFO: The directory %s does not exist. Creating one now." %options.imagedir
    try:
        os.makedirs(options.imagedir)
    except OSError:
        print >>sys.stderr, "ERROR: Could not create directory %s. Exiting." %options.imagedir
        sys.exit()
# I need two temporary files
fa = tempfile.NamedTemporaryFile(delete=False)
fb = tempfile.NamedTemporaryFile(delete=False)
tfileA = fa.name
tfileB = fb.name
fa.close()
fb.close()
# If I'm doing OCR then make sure everything is grabbed as image
if options.ocr:
    options.image = True
# One more if I'm doing OCR
if options.ocr:
    focr = tempfile.NamedTemporaryFile(delete=False, suffix=".txt")
    tfileOCR = os.path.splitext(focr.name)[0]
    focr.close()
def cleanup():
    os.unlink(fa.name)
    os.unlink(fb.name)
    if options.ocr:
        os.unlink(focr.name)
atexit.register(cleanup)
# Some initial variables
cdata = ''
figure = False
pagenum = 0
fignum = 0
notenum = 0
strokewidth = 0
pageheight = 0
# Prepare the XML parser
p = xml.parsers.expat.ParserCreate()
p.StartElementHandler = start_element
p.EndElementHandler = end_element
p.CharacterDataHandler = char_data
if options.verbose:
    print >>sys.stderr, "INFO: Opening Xournal file."
myfile = gzip.open(absinput+'.xoj', 'rb')
#print "* [[%s][Notes from %s]]"%(absinput,citation)
print "<URL:%s> [[Notes from %s]]"%(absinput,citation)
if options.verbose:
    print >>sys.stderr, "INFO: Parsing XML."
p.ParseFile(myfile)
if options.verbose:
    print >>sys.stderr, "INFO: Done."
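The bounding-box arithmetic in end_element is the least obvious step of the script, so here it is isolated as a small Python 3 function with a worked example. The name `stroke_bbox` and the sample numbers are invented for illustration; the arithmetic mirrors the listing above. Xournal measures y downwards from the top of the page while the PDF bounding box measures it upwards from the bottom, which is why the y values are subtracted from the page height and the minimum and maximum swap roles.

```python
import math


def stroke_bbox(coords, stroke_width, page_height):
    """Return (lx, ly, hx, hy) for a flat [x0, y0, x1, y1, ...] stroke.

    The box is padded by half the stroke width on every side and the
    y axis is flipped from top-down to bottom-up coordinates.
    """
    xs = coords[::2]
    ys = coords[1::2]
    half = stroke_width / 2.0
    lx = int(math.floor(min(xs) - half))
    hx = int(math.floor(max(xs) + half))
    ly = page_height - int(math.floor(max(ys) + half))
    hy = page_height - int(math.floor(min(ys) - half))
    return lx, ly, hx, hy


# A highlighter stroke 8.5 units wide on a page 792 units tall:
# x runs from 100 to 300 and y from 200 to 210, measured from the top.
bbox = stroke_bbox([100.0, 200.0, 300.0, 210.0], 8.5, 792)
print(bbox)  # (95, 578, 304, 597)
```

The four numbers are in the same order as the values the script passes to pdfcrop through its --bbox option.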