Home Newsletters Issue 5, October 2007 VUE as Front End Database

The Vemco Monitor provides customers, researchers and biologists with up-to-date information on
new fish tracking and monitoring products and research and development activities from Vemco

Issue 5, October 2007

VUE's Effectiveness as a Database Front End - A Personal Experience

By Doug Pincock, PhD

Introduction

For a number of years, many users have been looking for software that would better support application databases. The file-oriented organization1 of the legacy VR2PC software creates significant data issues, particularly when a large number of receivers are involved. These include, but are certainly not restricted to:

  • The only connection between the actual detections, the receiver on which they occurred and how that receiver was configured is in the header. This gets in the way of a smooth import (into Excel or Access, for example) and, unless you develop a careful strategy to maintain the connection, it can be easily lost.


  • For a variety of reasons it is inevitable that many detections will appear in more than one log file. Eliminating the resulting duplicates from a database is not straightforward.

The most noticeable aspect of Vemco User Environment (VUE) software is the movement towards a single user interface for all receivers and support for high speed wireless upload. However, I believe that the other major feature, the ability to "combine data from multiple receivers of varying types into a single integrated database", will be far more significant because of its impact on both data quality and the time and effort necessary to get data into an application database.

Objectives

While others have been putting VUE through its paces in the field, I wanted to learn something of its effectiveness as a database front end - and in particular when much of the data involved is from legacy VR2 log files instead of VRL files which one gets from receivers downloaded by VUE.

To help with this, the Pacific Ocean Shelf Project (POST) and Kintama Research agreed to provide temporary and confidential access to their receiver log files (VR2 and VR3) so that I would have data representative of what would be developed in a multi-user, multi-year study.

My objectives were to:

  • Learn something of the performance and limits of VUE as the detection database grows.


  • Use this large database of detections - many obtained in difficult acoustic areas - to prototype algorithms for isolating and eliminating false positives.


  • Demonstrate the ability to provide measures of detection efficiency of lines from the detection database.


  • Recommend next steps including useful enhancements for the next release of VUE.

Benefits to POST include the elimination of false positives and support for their efforts to, first, validate detections of transmitters belonging to Non-POST researchers and, second, provide a means to put these researchers in contact with POST subject of course to not releasing information on Tag ownership without the owner's consent.

Methodology Overview


Figure 1: Methodology Overview

Figure 1 shows how one might arrange a large project. Details of VUE software features and how it works can be found in the VUE manual.

For this work, the starting point was the collection of all VR2 Log Files gathered over the three year period from 2004 to 2007. With this in hand, the steps followed were:

  1. Import all VR2 Files into VUE, this immediately created a detection database (.VDB Format) as well as a complete directory of VRL files which become the project archive.


  2. In creating the VDB database, three vital functions are automatically performed:
    • Elimination of all duplicate detections.
    • Connection of detection with the receiver detecting it.
    • Assigning unambiguous Code Map designation to Tag IDs thus eliminating possible confusion from the legacy Sub Map A, B, C D terminology.

    Other aspects of data quality can also be addressed here including assignment of station names, correction for receiver time drift, etc.


  3. The VDB database can now be exported to any database (or other application) that supports Comma Separate Value (CSV) format.


  4. Once the detections are in a general purpose database one can easily merge other relevant study data - in this case spreadsheets containing:
    • Tag Data: Release date and location, life, ownership, species and stock, surgery technique, etc. etc.
    • Receiver Data: Location, Deployment Data, Recovery Date, etc.
    • Receiver Performance Information: This was automatically extracted from the imported logs and is available as a separate export from VUE.

Results Overview

Building the Database
The exercise of importing the logs (over 800 in all) and exporting them to Access took about an hour on a moderately powered PC (2.4 GHz Pentium 4 with 2 GBytes RAM) and resulted in a database of roughly 3.5 million unique detections - recall that all duplicates resulting from overlap in many log files are automatically eliminated at this point. Tag and Receiver information were quickly added by import of spreadsheets and appropriate joins.

I did some timing tests with importing various numbers of log files and the time taken seemed to increase roughly linearly with the number of detections involved - an indication that there should be no issue dealing with much larger databases than this.

Analyzing the Data
With the exception of false positive rejection discussed below, I was now in a position to write queries to answer typical questions. For the purpose of this work, typical query results included:

  • Time of arrival of each tag at each receiving line.


  • POST and Non POST Tags detected along with ownership and other information (where available).


  • Number of receivers each tag was detected by as it crossed each line (input into validation of detection efficiency of each line).

At this point detailed work has been completed only on 2004 data so all quantitative results that follow are based on this.

Acceptance Tests and False Positives
Before discussing suitable tests to filter out false positives (if any), an aside on false positives is in order.

Because of the nature of the medium for underwater acoustics, communication errors cannot be avoided regardless of the communication method used. With respect to detection of Tagged Fish, the key then is to find the right trade off between:

  • Aggressive signalling with higher transmission rate of codes accompanied by higher communication error rate.


  • Conservative signalling with a low transmission rate and low communication error rate.

The tradeoffs do not end there; one also has to take into account the probability of False Positives - i.e. a communication error which is not detected by the receiver's error checking algorithm. Regardless of the coding scheme used, False Positive probability can never be zero. Therefore, since the existence of False Positives would cast doubt on all detections, there needs to be a mechanism to filter these out2.

VEMCO coding is biased towards conservative signalling with an accompanying low rate of False Positives.

With VEMCO coding, False Positives can be created when propagation conditions or collisions cause an apparently valid sequence which is not rejected by the error checking algorithm (See VEMCO White Paper for details). Such occurrences are rare and the basis for their rejection is that the probability of two False Positives having the same code within a short time is virtually zero3.

The normal test for eliminating these False Positives has been to reject any Tag Code which is not seen twice within a period of time (usually measured in hours). How long does this period need to be? To show the sensitivity of this period, I ran various tests with the 2004 POST data.

Of course one cannot say with certainty that a detection of a particular tag is a false positive but more than one of these symptoms is a strong indication:

  • Long intervals between all detections or only a single detection.


  • Not accepted by any other receiver in the array.


  • Time of arrival and/or departure inconsistent with other information.

Using the above to identify false positives, I obtained the results shown in Table 1 where:

  • Detection is defined as the detection of a particular Tag at a particular receiver.


  • The total number of potential detections are all those where an ID was seen at least twice at the receiver in question.


  • Potential False Positives are those exhibiting at least two of the symptoms described above.
Minimum Interval
Detections Accepted
Detections Rejected
Potential False Positives
4 Minutes
2205
98
0
30 Minutes
2266
36
0
1 Hour
2269
34
1
12 Hours
2284
17
10
Table 1. Dependence of Number of Detections
(i.e. a particular tag at a particular receiver) accepted
with Minimum Interval between Detections requirement.

The results in Table 1 show that the use of an accepting interval greater than 30 minutes is risky for this data. Points to note from the table include:

  • The 12 hours interval is typical of what is used by many researchers (and has been recommended by VEMCO in the past). However, recent research has shown that this is not always strict enough since the mechanisms producing certain types of false positive are not uniformly random.


  • The parameters of an appropriate test to use are data-dependent - factors such as Delay of Tags used and probability of collisions at receivers being the key factors. This will need to be taken into account in any automated tool that might be developed.

Can We Accept More Detections?
The simple answer is yes. If we accept based on the 30 minute interval, we will have rejected 25 detections which are likely to be valid (i.e. 36 rejected in all of which 11 show symptoms of being false positives). Extraction of the valid detections needs to take into account more than just the timing between successive detections. This can be illustrated by an examination of the two additional detections when the minimum interval requirement is relaxed to 1 hour. Both occurred at the same receiver - one with a significant number of other tags present leading to lower detection rates for each tag and a higher probability of false positives.

  • One ID detected 7 times over a 45 day period with one pair 50 minutes apart and the others 1 to 30 days apart. This ID was seen by no other receiver4. This is almost certainly a false positive5.


  • One ID detected twice in 40 minutes and accepted by two other receivers on the same line at a consistent time and four receivers on other lines. This is undoubtedly a valid detection.

One can also consider accepting single detections under two conditions:

  1. Detection at roughly the same time by adjacent receiver(s).


  2. Detection by another receiver in the array before or after the candidate detection consistent with fish swimming speed.

Acceptance Test Summary
Table 2 summarizes acceptance testing results including acceptance of the additional detections as described above including qualifying single detections.

Acceptance Test Summary
2004
Total Transmissions Detected
360,817
  Accepted
360,662
  Likely False Positive
155 (0.04%)
Tag ID Receiver Combinations
2682
  Accepted
2560
  Not Accepted
122
Table 2. Acceptance Test Summary


Comments

Efficiency of POST Arrays and Vemco Coding Scheme

  1. With the reasonably strict acceptance criteria behind Table 2, only 0.05% of the detections were flagged as possible false positive. Even stricter criteria would cause this number to rise - but not by much.


  2. The POST lines appear to detect virtually 100% of the fish passing as evidenced by:
    • 96% of Tag ID/receiver combinations accepted were also accepted by adjacent receivers on the same line.
    • 100% of the POST IDs detected (even once) were detected by other receivers in the system.

Using VUE as a Platform

  1. VUE makes it easy to both merge VR2 log data with VR2W and to get all data - old and new - into the same data base. Users should seriously consider going this way rather than converting VR2W logs to the old format.


  2. Without going into details, the most time consuming aspects of this work were a result of not being careful enough during import of legacy VR2 files for things like:
    • Ensuring that all time zone offsets were correct.
    • Not assigning Receiver Station in cases where a given receiver was at more than one location.
    • Not assigning different identifiers to Tags with the same ID but deployed in different years6.

    VUE provides excellent support for all of this but you have to use it.

Database Software

I went into this with only a nodding acquaintance with database software; so if you are in the same position, you might benefit from what I learned:

  1. Use of database software is much faster than Excel (assuming that you can even fit within Excel's 64k row limit. Time and effort was wasted whenever I used Excel for anything other than formatting a final report.


  2. Microsoft Access is easy to get started with but has some less desirable aspects as one moves along. These include:
    • Things stop working as the database gets bigger. For example, I could not add an index to a table with much more than 1.5 million detections.
    • Access supports a very limited subset of SQL - and some of the missing features are essential for the queries one would like to make.

  3. I used MySQL for the analysis described here but expect to switch to SQLite.

Next Steps

This work has set the stage for some activities which we hope will benefit the user community. These include:

  1. Development of an Application Note which will guide users through a process to detect and reject any false positives in their data.


  2. Incorporation into VUE or a separate post processing tool of false positive rejection algorithms.


  3. Investigation of an on-line registry which will allow users to submit log files containing detections of tags unknown and to pair these with the owner (subject to maintaining all required confidentiality).

Footnotes
1 i.e. One file for each receiver download.

2 It might seem contradictory to say that False Positives must occur but then to say they can be subsequently filtered out. However, the receiver makes its decision based only on what is sees as the signal arrives. Post processing can look at the entire data set to look for False Positive symptoms in each Tag ID identified.

3 The Vemco White Paper gives an indication of the statistics of this. A more detailed analysis will be presented in a future document.

4 Interestingly, this ID does correspond to a POST ID pointing out the danger of assuming an ID is valid just because it is amongst the ones released.

5 An obvious question is how an ID can appear multiple times and still be a False Positive. Explanation of this is beyond the scope of this article and will be covered in a future Application Note.

6 This is largely a legacy issue as we are no longer reusing IDs for pingers.

 




Copyright © 2008 VEMCO (A division of AMIRIX Systems Inc.)