
BVProfiler Documentation

Requirements, Features, Limitations, Examples

Documentation is always a work in progress as it is only useful when users are actually able to benefit from it.

We have compiled here the information that we believe will be of practical use. If you find that something merits being included or discussed in more detail, contact us at info@sequencepublishing.com and we will adjust this document as appropriate.








1. What is BVProfiler?

BVProfiler ("Basic Vocabulary Profiler") is a small software utility designed to provide linguists, language researchers and educators with a simple-to-use, uncomplicated, and fully automated text profiler.



2. System Requirements

BVProfiler has been developed for the Windows 95/98/ME/2000/NT/XP operating systems.



2.1 UNICODE File Encoding

BVProfiler can handle files - wordlist(s) and text(s) - using either ASCII or UNICODE encodings, that is, with and without extended character sets. In this respect, BVProfiler is language neutral.

Not all languages can be profiled, however, because some languages do not employ orthographic markers to identify word boundaries. Read Section 3.2.1 for a discussion on the detection of orthographic word boundaries.

BVProfiler automatically detects whether a file is encoded in ASCII or in UNICODE and generates reports in the appropriate encoding.

Note that all files - wordlist(s) and text(s) - used in a given run of the Profiling Engine must use the same encoding (ASCII or UNICODE). Otherwise, a warning message will alert the user and profiling will be automatically aborted.
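The encoding check described above can be pictured with a minimal sketch. It is not BVProfiler's actual code: we assume here that a UNICODE file is recognized by a UTF-16 byte-order mark, and the names detect_encoding and check_same_encoding are hypothetical.

```python
import codecs


def detect_encoding(raw: bytes) -> str:
    """Classify the raw bytes of a file as 'UNICODE' or 'ASCII'.

    Assumption: a leading UTF-16 byte-order mark signals a UNICODE file;
    anything else is treated as ASCII.
    """
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "UNICODE"
    return "ASCII"


def check_same_encoding(files: dict[str, bytes]) -> str:
    """Return the encoding shared by all files, or raise if they are mixed."""
    encodings = {detect_encoding(raw) for raw in files.values()}
    if len(encodings) > 1:
        # Mirrors the documented behavior: warn and abort the run.
        raise ValueError("wordlists and texts must all use the same encoding")
    return encodings.pop() if encodings else "ASCII"
```

A run that mixes an ASCII wordlist with a UNICODE text would raise the error, which corresponds to the warning and automatic abort mentioned above.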



2.2 Performance Considerations

BVProfiler requires system resources (CPU and RAM) in direct proportion to the size of the wordlist(s) and text(s) used in a given run of the Profiling Engine. The number of off-list words is also a factor in RAM usage, as the Profiling Engine keeps track of these words as well as a fragment of the line where they appear in the corresponding text.

This document includes the step-by-step operation of an example case (Section 5). Peruse it for specific data regarding resource usage for a rather massive batch-job.



2.3 Installing

If the package was downloaded as a zip file, unzip the package where desired and simply run the program 'BVProfiler.exe' (if unzipped from the command line, supply the parameter -d).

If the package was downloaded as a setup executable file, run the setup program and follow the instructions.



2.4 Uninstalling

If the package was downloaded as a zip file, simply remove the directory where BVProfiler was installed.

If the package was downloaded as a setup executable file, run the uninstall utility provided or uninstall through 'Add/Remove Programs'.



3. Program Features

BVProfiler has been designed with simplicity of use in mind. Essentially, all that is required to profile text(s) is to load the file(s) containing the text(s), load the wordlist(s) to profile against, and press 'Run Profiler'. The entire process is fully automatic.



3.1 Graphical User Interface

Below is a screenshot of BVProfiler while working on the example case discussed in Section 5.

[Screenshot: BVProfiler main window]

A control by control description of the GUI (Graphical User Interface) follows:

- Button 'Add Files' under the label 'Word Lists': Brings up the standard File Open dialog to allow browsing and selection of wordlist files.

- Button 'Remove Selected' under the label 'Word Lists': Removes the currently selected entries from the wordlist listbox.

- Button 'Delete All' under the label 'Word Lists': Clears the contents of the wordlist listbox.

- Upper-right listbox (wordlists listbox): Displays the path and filename of wordlist files. The wordlist listbox allows direct drag'n'drop of files. A right-click context menu lets the user reorder the entries and offers the operations 'Remove Entry ...\', 'Remove Selected', and 'Delete All'.

- Button 'Add Files' under the label 'Files to Profile': Brings up the standard File Open dialog to allow browsing and selection of files containing the texts to profile.

- Button 'Remove Selected' under the label 'Files to Profile': Removes the currently selected entries from the Files-to-Profile listbox.

- Button 'Delete All' under the label 'Files to Profile': Clears the contents of the Files-to-Profile listbox.

- Lower listbox (Files-to-Profile listbox): Displays the path and filename of text files to profile. The Files-to-Profile listbox also allows direct drag'n'drop of files. A right-click context menu lets the user reorder the entries and offers the operations 'Remove Entry ...\', 'Remove Selected', and 'Delete All'.

- Button 'Run Profiler': Starts the Profiling Engine.

- Button 'Abort Profiler': Signals the Profiling Engine to safely interrupt the processing in the shortest time possible (usually within a fraction of a second).

- Progress Bar: Offers a real-time display of the percentage of analysis completed.

- 'Profiling Task' panel: Offers a real-time description of the current process undertaken by the Profiling Engine.

In addition to the above, there are a number of GUI components unrelated to profiling:

- The 'Basic Vocabulary Profiler' banner: Left-click, hold, and drag to move about and reposition BVProfiler on the desktop.

- The 5 buttons on the top-right corner: Exit, maximize/restore, minimize, send to the system tray, and (un)anchor BVProfiler as the topmost window.

- Button 'About': Opens a dialog with information about the program and authors. It also provides access to BVProfiler's license agreement.

- Button 'Help': Brings up the off-line version of this document.

- The 'SequencePublishing.com' banner: Clicking on it opens the default browser and navigates to the on-line home of BVProfiler.





3.2 Profiling

From the point of view of the user, profiling is fully automatic. For those interested in the internal operation of the Profiling Engine, the algorithm is straightforward and roughly unfolds as follows:

Step 1: If there are no wordlists, proceed to step 5.

Step 2: A wordlist file is loaded, incrementally populating the wordlist table.

Step 3: The wordlist table is sorted and indexed to optimize subsequent lookups.

Step 4: If there are additional wordlist files to process, go back to step 2, otherwise proceed to step 5.

Step 5: A text is parsed, isolating one word at a time. Each word is then made lowercase (if it wasn't already) and an attempt is made to match it to an entry in the wordlist table. Failing this, a match is searched for in the off-list word table. If none is found, the current word is appended to the off-list word table.

Step 6: The off-list word table is sorted and indexed to optimize subsequent lookups.

Step 7: A report file is generated for this text.

Step 8: If there are additional texts to process, go back to step 5, otherwise proceed to step 9.

Step 9: If there were two or more texts, a master report is generated.
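As an illustration only, steps 1 through 6 can be sketched for a single text as follows. This is not the Profiling Engine's source: the function name profile is hypothetical, a hash-based dictionary stands in for the sorted and indexed tables, the regular expression is a simplified word splitter, and lowercasing the wordlist entries is our assumption. Report generation (steps 7 to 9) is left out.

```python
import re


def profile(wordlists: dict[str, list[str]], text: str):
    """Return (listed, off_list) token counts per type for one text."""
    # Steps 2-4: load every wordlist into a single lookup table that maps
    # each type to the set of wordlists containing it.
    # With no wordlists (step 1), the table simply stays empty.
    table: dict[str, set[str]] = {}
    for name, words in wordlists.items():
        for word in words:
            table.setdefault(word.lower(), set()).add(name)

    listed: dict[str, int] = {}
    off_list: dict[str, int] = {}
    # Step 5: isolate one word at a time, lowercase it, and look it up.
    # Simplified splitter: runs of letters, keeping contractions together.
    for word in re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text):
        word = word.lower()
        target = listed if word in table else off_list
        target[word] = target.get(word, 0) + 1

    # Step 6: sort the off-list table.
    return listed, dict(sorted(off_list.items()))
```

For example, profiling "The cat saw the dog." against a wordlist containing "the" and "cat" counts two tokens of "the" and one of "cat" as listed, and files "dog" and "saw" as off-list types.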





3.2.1 Which Languages can be Profiled?

The key concept here is that of orthographic word boundaries. While in principle the Profiling Engine is language neutral, those languages that do not mark word boundaries are beyond the current capacity of the parsing algorithm employed.

Any language that uses no more than the following characters as word boundaries will be profiled correctly:

~ ` ! @ # $ % ^ & * ( ) _ - +  { [ } ] : ; ' < , > . ? / | \ " (blank) (tab)

Consequently, and although the UNICODE encoding provides language support for languages such as Chinese or Japanese, the behavior of the Profiling Engine on text in those languages is undefined.

Ideas and suggestions on the non-trivial issue of universal detection of orthographic word boundaries are welcome.
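To make the idea of orthographic word boundaries concrete, here is a naive splitter built from the character list above. It is a sketch under stated assumptions, not the engine's parser: the names BOUNDARIES and split_words are hypothetical, line ends are added as boundaries by us, and the sketch splits at every apostrophe, whereas the Profiling Engine treats contractions as single words (Section 4).

```python
import re

# The boundary characters listed above, plus blank and tab.
BOUNDARIES = "~`!@#$%^&*()_-+{[}]:;'<,>.?/|\\\" \t"

# Assumption: carriage returns and line feeds also separate words.
_SPLITTER = re.compile("[" + re.escape(BOUNDARIES) + "\r\n]+")


def split_words(text: str) -> list[str]:
    """Split text at runs of boundary characters, dropping empty pieces."""
    return [piece for piece in _SPLITTER.split(text) if piece]
```

With this splitter, "so-called (pass around)" yields the four words so, called, pass, and around, while a Chinese sentence written without spaces or punctuation comes back as one unbroken "word". This is the reason such languages cannot be profiled reliably.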



3.2.2 Digits and Numbers

The Profiling Engine considers a number such as '14,350.453' as a single numeral rather than 8 unrelated digits.
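One possible way to recognize such numerals is a regular expression. The pattern below is our assumption about what counts as a numeral (digits, optional comma-separated groups, optional decimal part); the engine's actual rules may differ, and the name find_numerals is hypothetical.

```python
import re

# Assumed numeral shape: digits, optional ",digits" groups, optional ".digits".
NUMERAL = re.compile(r"\d+(?:,\d+)*(?:\.\d+)?")


def find_numerals(text: str) -> list[str]:
    """Return each numeral found in text as a single item."""
    return NUMERAL.findall(text)
```

Applied to "It cost 14,350.453 in 2004.", this returns the two numerals '14,350.453' and '2004'; the sentence-final period is not absorbed because no digit follows it.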



3.2.3 Profiling without Wordlists

BVProfiler has also been designed to "profile" without wordlists. In this situation, the Profiling Engine generates straight frequency counts of the words (types and tokens) that make up a text or texts. Report files are modified accordingly and include an opening warning announcing this.
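The distinction between tokens (every occurrence of a word) and types (distinct words) can be shown with a short sketch. The function name frequency_counts and the letters-only word pattern are assumptions made for illustration.

```python
import re
from collections import Counter


def frequency_counts(text: str):
    """Return (token count, type count, frequency table) for a text."""
    # Simplified word pattern: runs of letters, lowercased.
    words = [word.lower() for word in re.findall(r"[^\W\d_]+", text)]
    counts = Counter(words)
    # Tokens are all occurrences; types are the distinct words.
    return len(words), len(counts), counts.most_common()
```

For the text "The cat and the hat" this gives 5 tokens and 4 types, with "the" heading the frequency table at 2 occurrences.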



3.3 Description of an Individual Report File

Regardless of whether there are one or several files to profile, BVProfiler generates individual report statistics for each text supplied. Each report is written to the directory of the file it describes and takes that file's name plus the extension '.bvp'.

A description of the report file by section is as follows:

- Section "FILES": Contains the name of the file containing the text profiled as well as a roster of the wordlist files (if any) employed.

- Section "OVERALL STATISTICS": Provides a table where each row details the percentage as well as the number of tokens and types by wordlist. It is important to note that if there are types appearing in several wordlists, an entry is generated for the relevant combination of wordlists sharing such types. Additionally, a row for off-list words is included if any are found. A row for digit numerals is also generated if any of these are found.

- Section "STATISTICS BY WORDLIST": This section is generated only if there are types appearing in more than one wordlist. The table shows a row per wordlist where percentages, token and type amounts reflect duplication of types across wordlists.

- Section "STATISTICS FOR THOSE TYPES APPEARING IN WORDLISTS": The table shows percentage and token amount per type as well as those wordlists where the type appears. This section is generated only if wordlists were employed.

- Section "ITEMIZATION OF THOSE TYPES APPEARING IN WORDLISTS": A list of types found to belong to wordlists. This section is generated only if wordlists were employed.

- Section "STATISTICS FOR OFF-LIST TYPES": The table shows percentage and token amount per type.

- Section "LOCATION OF OFF-LIST TOKENS": The table shows all off-list tokens, one token per row as well as the location (line and column) in the file profiled and the offending fragment where the token appears. The order of the rows reflects the order of occurrence in the file.
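The grouping used in "OVERALL STATISTICS", where types shared by several wordlists get a row of their own, might be computed along these lines. This is a sketch, not the report generator itself: the function name, the row labels, and the choice to express percentages as a share of all tokens are assumptions.

```python
from collections import defaultdict


def overall_statistics(token_counts: dict[str, int],
                       membership: dict[str, set[str]]) -> dict[str, dict]:
    """Group types by the combination of wordlists that contain them.

    token_counts maps each type to its number of tokens in the text;
    membership maps each listed type to the names of its wordlists.
    """
    total = sum(token_counts.values())
    if total == 0:
        return {}
    rows = defaultdict(lambda: {"tokens": 0, "types": 0})
    for word, count in token_counts.items():
        lists = membership.get(word)
        # One row per combination of wordlists, plus one for off-list words.
        key = " + ".join(sorted(lists)) if lists else "off-list"
        rows[key]["tokens"] += count
        rows[key]["types"] += 1
    # Assumption: the percentage is each row's share of all tokens.
    return {key: {**row, "percent": round(100 * row["tokens"] / total, 3)}
            for key, row in rows.items()}
```

A type found in both the GSL and the AWL would be counted under a combined "AWL + GSL" row rather than under either list alone.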





3.4 Description of a Master Report File

When there is more than one file to profile, a master report is automatically generated. The master report is written to the directory of the first entry in the 'Files to Profile' listbox. Its name consists of the tag '_BATCH' followed by the system date and time and the extension '.bvp'. An example would be '_BATCH_072504_153104.bvp', referring to a batch-job conducted on July 25, 2004 at approximately 15h 31m 04s.
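The naming scheme can be reproduced with a few lines. The MMDDYY_HHMMSS layout is inferred from the example filename, and both function names are hypothetical; individual reports follow the rule given in Section 3.3.

```python
from datetime import datetime
from pathlib import PureWindowsPath


def individual_report_path(text_file: str) -> str:
    """Section 3.3: same directory and name as the text, plus '.bvp'."""
    return text_file + ".bvp"


def master_report_path(first_file: str, when: datetime) -> str:
    """Place the master report next to the first file in the listbox."""
    directory = PureWindowsPath(first_file).parent
    # Layout inferred from the example: month, day, 2-digit year, then time.
    name = "_BATCH_" + when.strftime("%m%d%y_%H%M%S") + ".bvp"
    return str(directory / name)
```

Calling master_report_path with a first file in C:\texts and the timestamp July 25, 2004, 15:31:04 produces C:\texts\_BATCH_072504_153104.bvp.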

The structure of the master report is essentially the same as that of individual files with some modifications:

- Section "MASTER REPORT": This extra section contains data regarding the size of all files processed as well as a performance time breakdown by task (loading of lists, optimization of tables, profiling of each file, report generation).

- Section "FILES": Same as an individual report but containing the names and locations of all files profiled, the names and locations of all individual report files generated, and the wordlist files (if any) employed.

- Section "OVERALL STATISTICS": Same as an individual report but reflecting amounts and percentages across texts.

- Section "STATISTICS BY WORDLIST": Same as an individual report but reflecting amounts and percentages across texts.

- Section "STATISTICS FOR THOSE TYPES APPEARING IN WORDLISTS": Same as an individual report but reflecting amounts and percentages across texts.

- Section "ITEMIZATION OF THOSE TYPES APPEARING IN WORDLISTS": Same as an individual report but reflecting amounts and percentages across texts.

- Section "STATISTICS FOR OFF-LIST TYPES": Same as an individual report but with the inclusion of a reference to the text where the types were found.

- Section "LOCATION OF OFF-LIST TOKENS": This section is not included in the master report. Refer to individual reports to locate off-list tokens.





4. WordLists

A wordlist file contains a type (a word) per row. The Profiling Engine handles contractions as single words ("don't" is a single word and so is "couldn't") but does not support hyphenated types (no word will match "so-called") or compound types separated by spaces (no word will match "pass around"). Refer to Section 3.2.1 for an explanation regarding the detection of orthographic word boundaries.



4.1 WordLists Provided

The BVProfiler package (zip or executable setup) includes four wordlist files in the directory "\_WordLists":

- "_GSL.TXT": Essentially Bauman and Culligan's updated version of West's General Service List (2,284 words). The list is arranged from most to least frequent.

- "_AWL.TXT": Essentially Coxhead's Academic Word List (570 words). The list is arranged from most to least frequent.

- "_GSL_WF.TXT": Expands the GSL to include high-frequency word family members (7,764 words). The list is sorted alphabetically.

- "_AWL_WF.TXT": Expands the AWL to include high-frequency word family members (3,107 words). The list is sorted alphabetically.





4.2 Custom WordLists

Any collection of words can be compiled into a wordlist and can be used as long as it is stored in a file containing a single type per row.

Remember to save the file as ASCII encoding if the texts to profile are in ASCII and to save the file as UNICODE encoding if the texts to profile are in UNICODE.
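A custom wordlist can also be produced by a script. The sketch below writes one type per row; it assumes that "UNICODE" means UTF-16 with a byte-order mark, as was common on Windows, and it lowercases entries because the Profiling Engine lowercases the words of the text before matching. The function name write_wordlist is hypothetical.

```python
def write_wordlist(path: str, words: list[str], unicode_text: bool = False) -> list[str]:
    """Write a wordlist file with a single type per row; return the types."""
    # Assumption: UNICODE files are UTF-16 with a byte-order mark.
    encoding = "utf-16" if unicode_text else "ascii"
    # Drop blanks and duplicates; lowercase so entries match lowercased text.
    types = sorted({word.strip().lower() for word in words if word.strip()})
    # newline="\r\n" produces Windows line endings.
    with open(path, "w", encoding=encoding, newline="\r\n") as handle:
        for word in types:
            handle.write(word + "\n")
    return types
```

For instance, write_wordlist("numerals.txt", ["one", "twelve", "million", "first", "eleventh", "hundredth"]) creates an ASCII wordlist of number words that can be loaded alongside the lists provided.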



4.3 Numerals

As explained in Section 3.2.2, the Profiling Engine detects and accounts for digit numerals automatically.

Note, however, that no wordlist is provided that includes ordinals (first, eleventh, hundredth, and so on) or cardinals (one, twelve, million, and so on). These types will be classified as off-list types unless an appropriate custom wordlist file is included in the profiling.



5. Example Case and Step-by-Step Operation

This section describes the operation of a profiling session by means of a specific example.



5.1 Loading WordLists

The wordlists employed for this example are _GSL_WF.TXT and _AWL_WF.TXT, both included in the package.

To load these files, click on 'Add Files' under 'Word Lists', browse to the directory '\_WordLists', and select both files. The paths and filenames will appear in the wordlists listbox.

Optionally, these files can be dragged and dropped directly on the wordlists listbox.

Note that all wordlist files in the listbox will be used by the Profiling Engine.



5.2 Loading Texts to Profile

The texts to profile in this example are Lewis Carroll's "Alice In Wonderland", Jules Verne's "From the Earth to the Moon" (including the sequel "Round the Moon"), and Charles Darwin's "The Origin of Species", "The Descent Of Man", and "The Voyage of the Beagle".

These texts are not included in the BVProfiler package but can be freely downloaded from the Project Gutenberg website (http://www.gutenberg.net).

To load these files, click on 'Add Files' under 'Files to Profile' and browse to the directory where these files are located. The paths and filenames will appear in the Files-to-Profile listbox.

Optionally, these files can be dragged and dropped directly on the Files-to-Profile listbox.

Note that all texts in the listbox will be profiled by the Profiling Engine.



5.3 Running the Profiler

Click on 'Run Profiler'.



5.4 Understanding the Report Generated for each File

A report is generated for each file profiled.

Click on Darwin_The_Origin_of_the_Species.txt.bvp to view relevant excerpts from a sample report.

Profiling an entire book produces large reports mainly because each off-list token is listed accompanied by a fragment of the text where it occurs (the entire content of this report file adds up to 34,112 lines).

The text of Darwin's "The Origin of Species" contains 208,985 tokens belonging to 8,909 unique types. Approximately half of these types (4,640) belong to neither the word families of the GSL nor those of the AWL and are, thus, classified as off-list types. Proportionally speaking, however, the tokens of off-list types account for only 12.028% (25,137) of all tokens in the text. The 4,640 off-list types are listed in section "STATISTICS FOR OFF-LIST TYPES" and their corresponding 25,137 tokens are listed in section "LOCATION OF OFF-LIST TOKENS".
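The proportions quoted above follow directly from the counts in the report. The short check below uses only those figures; the helper name share is hypothetical.

```python
def share(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100 * part / whole


# Figures taken from the sample report discussed above.
off_list_token_share = share(25_137, 208_985)  # off-list tokens vs. all tokens
off_list_type_share = share(4_640, 8_909)      # off-list types vs. all types

print(f"{off_list_token_share:.3f}% of tokens are off-list")
print(f"{off_list_type_share:.1f}% of types are off-list")
```

The contrast is the point of the example: off-list words make up about half of the distinct types but only about an eighth of the running text.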



5.5 Understanding the Master Report Generated for all Files

As the session involved the profiling of five files, a master report was also generated.

Click on _BATCH_072904_185215.bvp to view relevant excerpts from a master report.

The master report file is also large (24,571 lines) although not as large as individual report files because it does not include a table with the locations and fragments of all off-list tokens in the texts.

The master report opens with the system date and time of the profiling session. Next come the number of bytes processed for the wordlist files in total (106,202 bytes) and for the text files in total (5,077,664 bytes). The 'Processing time breakdown...' reports that the entire job took some 3 and a half minutes (211.775 seconds) and provides a file-by-file breakdown.

The section 'OVERALL STATISTICS' shows that there is a cumulative total of 845,996 tokens belonging to 24,474 types in the five texts profiled. Although some 85% (721,421 tokens) of all tokens are in the wordlists supplied, the number of off-list types (17,217 types) amounts to 70% of all types found.

Note that the section 'STATISTICS FOR OFF-LIST TYPES' reports the text file where the types were found. For presentation purposes the filenames have been formatted in a column; the original report file contains a table entry per line.



5.6 System Resources Employed

The speed of profiling is a function of the CPU and RAM characteristics of the computer employed, the number of tokens/types in the wordlists, and the number of off-list types and tokens found in the texts.

The conditions for this example are as follows:

Hardware description:

- CPU: AMD Athlon 1200 MHz.
- RAM: 256 MB (PC133 SDRAM).

Software description:

- OS: Windows 2000 Professional.
- BVProfiler: Version 0.1.4

Data description:

- Total Types: 24,474
- Listed Types: 7,259
- Off-listed Types: 17,217
- Total Tokens: 845,996
- Listed Tokens: 728,650
- Off-listed Tokens: 117,346

Under these conditions, the Profiling Engine's RAM usage peaked at 49 MB while analyzing text containing a total of 845,996 words (not counting numerals) and creating in the process two tables, one with 7,259 entries and another with 17,217 entries.

The entire job took about 3 and a half minutes (211.75 seconds). The bulk of processing time was spent parsing through the texts, while report generation was comparatively fast. Both of these tasks, parsing and file writing, have minimal impact on RAM usage. Sorting, collapsing, and optimizing lookup access to these tables are carried out by efficient algorithms, so their impact on performance was minimal (~0.09% of total processing time).



6. Updates and Contact

The development phase of the BVProfiler project has ended as all objectives have been accomplished and no additional functionality is planned for this product. Updates will be made available only if enhancements can be found for the algorithms employed or bugs are reported.

Visit http://www.sequencepublishing.com for up-to-date information regarding BVProfiler.



We value feedback. Feel free to send us an email at info@sequencepublishing.com with comments and questions.




7. License Agreement

READ CAREFULLY ALL TERMS OF THIS LICENSE AGREEMENT BEFORE USING THIS SOFTWARE

IF YOU DO NOT AGREE WITH THE TERMS OF THIS LICENSE AGREEMENT, YOU ARE NOT AUTHORIZED TO USE THIS SOFTWARE

Permission is hereby granted to anyone to freely use BVProfiler (hereafter referred to as 'this software') for any purpose with the exception of including it in a product, in which case both permission and acknowledgment are required.

To the maximum extent permitted by applicable law, the software and documentation are provided "as is". Franc Morales and Leah Gilner (hereafter referred to as 'the authors') disclaim all other warranties and conditions, either express or implied, including, but not limited to, implied warranties of merchantability, fitness for a particular purpose, conformance with description, title and non-infringement of third party rights. In no event shall the authors be liable for any indirect, incidental, consequential, special or exemplary damages or lost profits whatsoever (including, without limitation, damages for loss of business profits, business interruption, loss of business information, or any other pecuniary loss) arising out of the use or inability to use the software product, even if the authors have been advised of the possibility of such damages.

The authors allow you to distribute this software if all of the following conditions are met: you are not charging any money for it, the distribution files are kept together and unmodified, and the authors' permission is obtained before distribution. You may give this software package to friends or colleagues, burn it onto cd-rom's (or other media) and upload it to free/shareware sites as long as the original package remains unmodified.

All rights to this software (including any images or text incorporated into this software) are owned by the authors. However, the word lists (AWL.TXT, AWL_WF.TXT, GSL.TXT, and GSL_WF.TXT) provided with the installation package are placed in the public domain.

You may not disassemble or reverse engineer any part of this software.

You may not rent or lease this software.

The authors are not required to make available technical support for this software. The authors may, from time to time, revise or update this software. In so doing, the authors incur no obligation to furnish such revision or updates to you.

This license agreement will immediately and automatically terminate without notice if you fail to comply with any one of the terms and conditions cited. Upon termination of this license agreement, you agree to promptly remove the software from your system.

BVProfiler.
Copyright 2001-2007 by Franc Morales and Leah Gilner.
All rights reserved.