Deduplicating References

Synthesis automatically de-duplicates references, and provides a workflow for manually verifying and editing duplicate designations.

How Synthesis De-Duplicates References

Synthesis automatically de-duplicates references imported into a project. It uses the following general rules to determine duplicates:

  • If two references have different publication years, they are never considered duplicates, regardless of how similar their titles, authors, journals, abstracts, etc. are.
  • References with matching titles (>= 95% similar) and either matching authors (>= 75% similar) or matching journals are considered duplicates.
  • References with matching abstracts (>= 90% similar) are considered duplicates.
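
The rules above can be sketched in a few lines of Python. This is only an illustration, not the Synthesis implementation: the Reference class, the similarity and is_duplicate functions, and the use of Python's difflib as the similarity measure are all assumptions. In particular, the sketch assumes that "matching journals" means the same journal name ignoring case, and that two empty abstracts do not count as a match.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Reference:
    # Hypothetical record layout, for illustration only.
    title: str
    authors: str
    journal: str
    year: int
    abstract: str = ""


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio (difflib is a stand-in measure)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip(),
                           autojunk=False).ratio()


def is_duplicate(a: Reference, b: Reference) -> bool:
    # Rule 1: different publication years are never duplicates.
    if a.year != b.year:
        return False
    # Rule 2: matching titles plus matching authors OR matching journals.
    if similarity(a.title, b.title) >= 0.95:
        if similarity(a.authors, b.authors) >= 0.75:
            return True
        # Assumption: journals "match" when equal ignoring case and spaces.
        if a.journal.lower().strip() == b.journal.lower().strip():
            return True
    # Rule 3: matching abstracts (assumption: empty abstracts never match).
    if a.abstract and b.abstract:
        return similarity(a.abstract, b.abstract) >= 0.90
    return False


first = Reference("Trends in coverage", "Smith J", "Am J Public Health", 2010)
same_paper = Reference("Trends in coverage", "Smith J", "Am J Public Health", 2010)
other_year = Reference("Trends in coverage", "Smith J", "Am J Public Health", 2011)

print(is_duplicate(first, same_paper))  # True
print(is_duplicate(first, other_year))  # False (years differ)
```

Checking the year first mirrors the first rule: no amount of similarity in the other fields can override a year mismatch.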

What Synthesis Does with Duplicate References

References imported into Synthesis are handled in the following ways:

  • Abstracts imported - non-duplicate references are imported into your Synthesis project.
  • Precise matches (skipped) - references that are 100% similar are skipped and not stored in Synthesis at all. To be considered a precise match, two references must meet the requirements above and come from the same source database (e.g. both abstracts came from PubMed). This lets you update the Synthesis project file from the same library database (e.g. PubMed) while excluding everything already imported from it.
  • Duplicate abstracts (imported) - references that appear to be the same (as described above) are stored in Synthesis. These references do not appear in Synthesis except as part of the duplicate reference count on the Stats tab.
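
The three outcomes above can be pictured as a small decision function. The sketch below is an assumption-laden illustration rather than the actual Synthesis logic: Record, ImportOutcome and classify_import are hypothetical names, "100% similar" is modelled as every field being equal, and the duplicate test is passed in as a function so that any rule set can be plugged in.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable


class ImportOutcome(Enum):
    IMPORTED = "abstract imported"
    SKIPPED = "precise match (skipped)"
    DUPLICATE = "duplicate abstract (imported)"


@dataclass(frozen=True)
class Record:
    # Hypothetical record layout; source is the originating database.
    title: str
    authors: str
    journal: str
    year: int
    abstract: str
    source: str


def classify_import(new: Record,
                    existing: Iterable[Record],
                    is_duplicate: Callable[[Record, Record], bool]) -> ImportOutcome:
    existing = list(existing)
    # Precise match: identical in every field, including the source database.
    if any(new == old for old in existing):
        return ImportOutcome.SKIPPED
    # Otherwise a duplicate is stored but only counted on the Stats tab.
    if any(is_duplicate(new, old) for old in existing):
        return ImportOutcome.DUPLICATE
    return ImportOutcome.IMPORTED


def same_title_and_year(a: Record, b: Record) -> bool:
    # Deliberately simple stand-in for the real duplicate rules.
    return a.year == b.year and a.title.lower() == b.title.lower()


project = []
pubmed = Record("Survey trends", "Smith J", "Lancet", 2012, "text", "PubMed")
embase = Record("Survey trends", "Smith J", "Lancet", 2012, "text", "Embase")

for record in (pubmed, embase, pubmed):
    outcome = classify_import(record, project, same_title_and_year)
    if outcome is not ImportOutcome.SKIPPED:
        project.append(record)  # skipped records are not stored at all
    print(record.source, "->", outcome.value)
```

In this run the first PubMed record is imported, the Embase copy is stored as a duplicate, and the repeated PubMed record is skipped because it is identical and comes from the same source.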

Manually Verifying Duplicates in Synthesis

You can view and verify duplicates in Synthesis by entering "Duplicates Mode" from the Project Settings Dialog. See "Duplicates Mode" for more information.

Background and Example

De-duplicating references is a tricky problem. Each bibliographic database (e.g. Embase, Web of Science, PubMed) seems to enter journal articles in its own way. The same article can therefore differ between databases: one record may contain a slight spelling mistake, one database may use a symbol such as "<=" where another writes "less than or equals to", or two otherwise identical abstracts may differ in a single detail such as a year ("we report on data from 1989").
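
A tiny example shows why exact comparison is not enough for such records. It uses Python's difflib as a stand-in similarity measure (an assumption, not necessarily what Synthesis uses) on two sentences that differ only in the year.

```python
from difflib import SequenceMatcher

text_a = "We report on data from 1989."
text_b = "We report on data from 1990."

# Exact comparison treats the two sentences as unrelated.
exact_match = text_a == text_b

# A similarity ratio (0..1) still recognises them as nearly identical.
fuzzy_score = SequenceMatcher(None, text_a, text_b, autojunk=False).ratio()

print(exact_match)          # False
print(fuzzy_score >= 0.90)  # True
```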

One of the goals for Synthesis is to support updating literature reviews. According to the Cochrane Collaboration, Cochrane Intervention reviews should be updated within two years, as "Systematic reviews that are not maintained may become out of date or misleading" (source: The Cochrane Collaboration. "The Cochrane Book Series". Published by John Wiley & Sons, Ltd. 2008).

For example, if we import a reference file from PubMed using the query "National Health Interview Survey" (extracted September 12, 2012), we get the following (the right column shows the information on the Stats tab after importing):

Figure: Initial reference import, showing abstracts imported and duplicates found (and imported)

Figure: Stats Tab after the first reference import

If you re-import the exact same reference file, you get the following (the right column shows the information on the Stats tab after importing):

Figure: Second reference import, showing discarded (i.e. identical) duplicates

Figure: Stats tab after second reference import - same # of unique papers, but increase in duplicates

Note: The second import takes longer because Synthesis needs to compare each reference being imported with all the abstracts already imported into the Synthesis project file.
In the second import, notice that the number of Unique Papers is unchanged, while Excluded Duplicates has increased.
Note: To be corrected in a future release of the software.