|
I have written a module for statistical analysis of Latin-coded corpora and would like help choosing its name.
The top of the module (originally written in UTF-8) is:
# $Id: CStatiBR.pm,v 1.0 2007/06/12 09:17:36 rpfernandes Exp $
#Copyright (c) 2007 Rodrigo Panchiniak Fernandes. All rights reserved.
#
#
#
# This program is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
=head1 NAME
Text::CStatiBR - performs statistical analyses of corpora
=head1 SYNOPSIS
use Text::CStatiBR;
&Text::CStatiBR::CSTATIBR();
=head1 DESCRIPTION
Text::CStatiBR creates a seven-column CSV output file with one line per
token per text. The input is a corpus whose file names follow the pattern
'1 (1).txt', '1 (2).txt', ..., '1 (n).txt', that is, they match
1 \(([1-9]|[1-9][0-9]+)\)\.txt
The columns store the following statistical information:
(1) number of word forms in document d;
(2) number of tokens in d;
(3) ID number of d, i.e., n;
(4) frequency of term t in d;
(5) corpus frequency of t;
(6) document frequency of t (number of documents where t occurs at least once);
(7) t, the UTF-8 Latin-coded token string, delimited by /(impossible to publish here: utf8)
The main output file is named '1 (n + 5).txt' and is stored in the same directory as
the corpus, together with residual files for each input file, with .txu and .txv extensions.
This code was written under CAPES BEX-09323-5
=head2 Methods
Example:
#!/usr/bin/perl
use strict;
use Text::CStatiBR;
&Text::CStatiBR::CSTATIBR("5"); # 5 files are analyzed.
=over
=cut
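To make the seven columns described above easier to picture, here is a minimal sketch of how they could be computed. It is not the module's actual code: doc_id and corpus_rows are hypothetical names, the corpus is held in memory instead of being read from the directory, and splitting tokens on whitespace is an assumption, since the real delimiter is not shown above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: extract the document id n from a name like '1 (n).txt',
# using the pattern quoted in the description. Returns undef if no match.
sub doc_id {
    my ($name) = @_;
    return $name =~ /^1 \(([1-9]|[1-9][0-9]+)\)\.txt$/ ? $1 : undef;
}

# Hypothetical helper: build the seven-column rows for a corpus given as
# { file name => text }. Tokens are split on whitespace (an assumption).
sub corpus_rows {
    my ($corpus) = @_;
    my (%tf, %cf, %df, %tokens);
    for my $name (keys %$corpus) {
        my $id = doc_id($name);
        next unless defined $id;             # skip files outside the naming scheme
        $tokens{$id} = 0;
        for my $t (split ' ', $corpus->{$name}) {
            $df{$t}++ unless $tf{$id}{$t}++; # first occurrence of t in this document
            $cf{$t}++;                       # corpus frequency of t
            $tokens{$id}++;                  # number of tokens in this document
        }
    }
    my @rows;
    for my $id (sort { $a <=> $b } keys %tf) {
        my $forms = keys %{ $tf{$id} };      # number of distinct word forms in d
        for my $t (sort keys %{ $tf{$id} }) {
            push @rows,
              [ $forms, $tokens{$id}, $id, $tf{$id}{$t}, $cf{$t}, $df{$t}, $t ];
        }
    }
    return @rows;
}

# Toy corpus: two files that follow the naming scheme and one that does not.
my %corpus = (
    '1 (1).txt' => 'arma virumque cano arma',
    '1 (2).txt' => 'cano troiae',
    'notes.txt' => 'ignored',
);
print join(',', @$_), "\n" for corpus_rows(\%corpus);
```

With this toy corpus the first line printed is 3,4,1,2,2,1,arma: document 1 has 3 word forms and 4 tokens, and "arma" occurs twice in it, twice in the corpus, and in one document. The file 'notes.txt' is skipped because its name does not match the pattern.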
Thanks |