pdf annotation extractor

Created By: pseudometa, 16 Stars, Last Updated: 23/11/21 21:44:55

Love this? Please consider supporting its creator by starring or sponsoring this project on GitHub!

From the project's README:

PDF Annotation Extractor (Alfred Workflow)

An Alfred Workflow to extract Annotations as Markdown & insert Pandoc Citations as References. Outputs Annotations to Obsidian, Drafts, PDF, Markdown file, or simply the clipboard.

Automatically determines correct page numbers, merges highlights across page breaks, prepends a YAML header with bibliographic information, and adds a few more small quality-of-life features.

How to Use

  • Use the hotkey to trigger the Annotation Extraction of the frontmost document of Preview or PDF Expert. In case Finder is the frontmost app, the currently selected PDF file will be used.

Automatic Page Number Identification

The correct page numbers will automatically be determined from one of three sources and inserted into the references as Pandoc Citations, with descending priority:

  1. Your BibTeX-Library
  2. DOI found in the PDF
  3. Prompt to manually enter the page number.
     • Enter the true page number of your first PDF page. Example: if the first PDF page represents page number 104, you have to enter 104.
     • In case there is content before the actual text (e.g. a foreword or a table of contents), the first true page often occurs later in the PDF. In that case, you must enter a negative page number, reflecting the true page number the first PDF page would have. Example: Your PDF is a book which has a foreword and uses Roman numerals for it; true page number 1 is PDF page number 12. If you continued the numbering backwards (through 0), the first PDF page would have page number -10. So you enter the value -10 when prompted for a page number.
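
The arithmetic behind the manual entry can be sketched as follows. This is only an illustration, not the workflow's actual code: the function names first_page_value and true_page are hypothetical, and the numbering is assumed to pass through 0 when counted backwards, as in the example above.

```shell
# Hypothetical helpers illustrating the page-number arithmetic; not part of the workflow.

# Value to enter at the prompt, given that true page $2 is found on PDF page $1.
first_page_value() {
  local pdf_page=$1 true_page=$2
  echo $(( true_page - pdf_page + 1 ))
}

# True page number of PDF page $1, given the value $2 entered at the prompt.
true_page() {
  local pdf_page=$1 first_page=$2
  echo $(( first_page + pdf_page - 1 ))
}

first_page_value 12 1   # true page 1 sits on PDF page 12; prints -10
true_page 12 -10        # prints 1
```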

ℹ️ : This workflow only extracts free comments and highlights with comments. (Upcoming feature in 4.5 release: Underlines.)

Annotation Codes

Insert these special codes at the beginning of an annotation to invoke special actions on that annotation. (You can run the Alfred command acode to quickly display a cheat sheet showing all the following information.)

Highlights & Free Comments

  • + (highlights): Merges with the previous highlight and puts a "(…)" in between. Used for jumping sections on the same page. If jumping across pages, both pages will be included in the citation.
  • ++ (highlights): Merges with the previous highlight. Used for continuing a highlight on the next page. Both pages will be included in the citation.
  • ? foo (comments): Turns "foo" into an h6 and moves it up to the top. Used for introductory comments or questions ("Pseudo-Admonitions").
  • ## (highlights): Turns the highlight into a heading added at that location. The number of "#" determines the heading level.
  • ## foo (comments): Adds "foo" as a heading at that location. The number of "#" determines the heading level.
  • X (highlights): Turns the highlight into a task and moves it up.
  • X foo (comments): Turns "foo" into a task and moves it up.
  • !n foo (comments): Inserts the nth image taken with the image hotkey at the location of the comment, "n" being the number of the image, e.g. "!3" for the third image. "foo" will be added as image alt-text (image label). (The hotkey works only with Obsidian as output format.)
  • = (highlights): Adds the highlight as a keyword to the YAML frontmatter. Removes the highlight afterwards.
  • = foo (comments): Adds "foo" as a keyword to the YAML frontmatter. Removes the comment afterwards.
  • (upcoming) --- (free comments): Turns the comment into a Markdown hr (---).

ℹ️ Multi-line annotations currently work only in highlights, not yet in free comments.
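
To make the prefix rules above concrete, here is a rough sketch of how a comment's leading code could be mapped to an action. It is an assumption-laden illustration, not how the workflow parses annotations: the function name annotation_action and the action labels are made up, and the upcoming "---" code is left out.

```shell
# Hypothetical classifier: print the action that a comment's leading code would trigger.
annotation_action() {
  case $1 in
    '++'*)      echo "merge" ;;                # checked before the single "+"
    '+'*)       echo "merge-with-ellipsis" ;;
    '? '*)      echo "pseudo-admonition" ;;
    '#'*)       echo "heading" ;;
    'X'|'X '*)  echo "task" ;;                 # a lone "X", or "X" followed by text
    '!'[0-9]*)  echo "image" ;;
    '='*)       echo "keyword" ;;
    *)          echo "plain" ;;                # no code: ordinary annotation
  esac
}

annotation_action "## Methods"   # prints heading
annotation_action "!3 Figure"    # prints image
```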

Underlines (upcoming)

  • +: Merges with the previous underline and puts a "(…)" in between. Used for jumping sections on the same page. If jumping across pages, both pages will be included in the citation.
  • ++: Merges with the previous underline. Used for continuing an underline on the next page. Both pages will be included in the citation.

Extra Features

  • When using Obsidian, the wikilink is also copied to the clipboard.
  • With the output type set to Obsidian or Markdown, a YAML-Header with bibliographic information (author, title, citekey, year, keywords, etc.) is also prepended.
  • When manually entering the number of the first page, negative page numbers are accepted. This is useful for books and reports where there are some PDF pages before the first page, e.g. due to a preface.
  • Upcoming: Underlines result in a second output document.

Requirements & Installation

Requirements

  • Alfred (Mac only)
  • Alfred Powerpack (~30€)
  • References saved as BibTeX-Library (.bib)

Install the following Third-Party-Software

Don't be discouraged if you are not familiar with the Terminal. Just copy-paste the following code into your Terminal and press enter – there is nothing more you have to do. (It may take a moment to download and install everything.)

# Install Homebrew
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install Python3
brew install python3

# Install pip3
curl -s https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python3 get-pip.py
rm get-pip.py

# CLIs needed for Annotation Extraction
pip3 install pdfminer.six
pip3 install pdfannots
brew install pdfgrep
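
To confirm that the installation succeeded, you could check that the command-line tools are found on your PATH. A small sketch: the function name missing_tools is made up for this example, and the command names are assumed to match the packages installed above. If nothing is printed, all tools were found.

```shell
# Print every name from the argument list that is not found on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

missing_tools python3 pip3 pdfannots pdfgrep
```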

Download & Install the PDF Annotation Extractor Workflow

Define the Hotkey by double-clicking this field

Set BibTeX Library Path

  • Use the aconf command, select Set BibTeX Library, and then search for and select your .bib file.

Further steps only required for specific output types

  • Obsidian as Output: Use the aconf command, select Obsidian Destination, and then search for and select the folder.
  • PDF as Output Format: Install Pandoc and a PDF-Engine of your choice.
brew install pandoc
brew install wkhtmltopdf # can be changed to a pdf-engine of your choice

Configuration

Use the Alfred keyword aconf to configure this workflow. You can set:

  • The output format (PDF, Markdown, Clipboard, Drafts, or Obsidian). When selecting Markdown or Obsidian as output format, a YAML header with information from your BibTeX library will be prepended.
  • Whether citekeys should be entered manually or determined automatically via the filename. The latter requires a filename beginning with the citekey, followed by an underscore: [citekey]_[...].pdf. You can easily achieve such a filename pattern via the renaming rules of most reference managers, for example with the ZotFile plugin for Zotero.
  • The Obsidian destination (must be a folder in your vault).
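
The filename convention for automatic citekeys can be illustrated with a short sketch. This is not the workflow's own code: the function name citekey_from_filename and the sample filename are invented for the example, and it assumes that the citekey itself contains no underscore.

```shell
# Hypothetical helper: print the citekey of a PDF named [citekey]_[...].pdf.
citekey_from_filename() {
  name=$(basename "$1" .pdf)
  citekey=${name%%_*}   # everything before the first underscore
  # Fail if there is no underscore or nothing in front of it.
  [ "$citekey" != "$name" ] && [ -n "$citekey" ] || return 1
  echo "$citekey"
}

citekey_from_filename "/tmp/Smith2020_Some-Title.pdf"   # prints Smith2020
```

A file without the underscore-separated prefix makes the function return a non-zero status, which corresponds to the case where the citekey has to be entered manually.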

Troubleshooting

  • Upgrade to the newest version of pdfannots: pip3 install --upgrade pdfannots
  • This workflow won't work with annotations that are not actually saved in the PDF file. Some PDF readers, such as Skim, do not save annotations in the PDF itself by default, but you can tell those PDF readers to save the notes in the actual PDF.
  • The workflow sometimes does not work when the PDF contains bigger free-form annotations (e.g. from using a stylus on a tablet). Delete all annotations that are "image" or "free form" and the workflow should work again.
  • Do not use backticks (`) in any type of comment, as this will break the annotation extraction.
  • When the hotkey does not work in Preview, most likely the Alfred app does not have permission to access Preview. You can give Alfred permission in the macOS System Settings.

➡️ If you cannot resolve the problem, please open a GitHub issue. Be sure to include screenshots and/or a debugging log, as I will not be able to help you otherwise. You can get a debugging log by opening the workflow in Alfred preferences and pressing cmd + D. A small window will open which logs everything happening during the execution of the workflow. Use the malfunctioning part of the workflow once more, copy the content of the log window, and attach it as a text file.

Credits

Thanks

  • Thanks to Andrew Baumann for pdfannots, without which this Alfred Workflow would not be possible.
  • Thanks to @StPag from the Obsidian Discord Server for his ideas on annotation codes.

About the Author

This workflow has been created by @pseudo_meta (Twitter) aka Chris Grieser (rl). In my day job, I am a PhD student in sociology, studying the governance of the app economy. If you are interested in this subject, check out my academic homepage and get in touch.