Java Answers Forum - java vectors question

Stephen Walsh

Posts: 1
Nickname: dumbass
Registered: Oct, 2002

java vectors question

Posted: Oct 7, 2002 11:14 AM

I am a student currently struggling through a Java subject at uni, and I was wondering if anyone could help me out with this question. I know nothing about vectors and not a whole lot about Java either. I can send the files that the program refers to to anyone who is interested in helping me out, even a little bit. If anyone could help in any way, I would greatly appreciate it.

P02learn.java
This program implements the main functions (learning and predicting) of an adaptive information agent. A user may possess many different initial interests (topics). This program will learn each interest separately. The resulting interest profile will be stored in F13.txt (a normalized prototype vector). All the possible users' interests (topics) are provided to you in F08.txt. The pre-defined user's judgment of whether a news article is relevant with respect to a particular topic is stored in F07.txt. Your program will:
1. accept a topic no. (m) from the command line;
2. select the corresponding topic (keywords and weights) from the matching records (records with the same topic no.) in F08.txt to build the initial prototype vector in memory (e.g. Java vector objects F08); refer to the file specification for the details of F08.txt;
3. read the subset of relevance judgment records from the F07.txt file into memory (e.g., Java vector objects F07); each judgment record in F07.txt pertains to a specific topic and a particular document ID. One only needs to read the F07 records with a matching topic no. (from the command line) and the user judgment field = '1' into the Java vector objects; refer to the file specification of F07.txt;
4. read each article (record) from the input file F05.txt;
5. compute the cosine similarity score between the news article (an F05 record) and the prototype vector representing the user's current interest (the F08 Java vector objects); if the cosine score is equal to or greater than a pre-defined system threshold, the news article is considered relevant by the agent system; otherwise non-relevant;
You need to run the whole system several times in order to identify a reasonable system threshold which maximises the F-measure;
6. use the document ID from an F05 record to search (using Java's built-in binary search) the F07 Java vector object; if a matching element is found, the corresponding news article (F05 record) is judged relevant by the user; otherwise it is non-relevant;
7. for each news article (F05 record), create an F11 record to store the cosine score, system judgment, and user judgment (inferred from the F07 Java vector objects); refer to the file specification for the details of F11.txt;
8. based on whether the incoming article is relevant or not, update/expand the prototype vector representing the user's current interest (i.e., the F08 Java vector objects) using the Rocchio method; normalize the prototype vector after each update. Since your prototype vector is updated after every (n) documents are read, your agent system becomes adaptive in predicting forthcoming articles;
9. at the end of program execution (i.e., after reading all the F05 records), write the F08 Java vector objects to the user profile file F13.txt; the first record must be: TOPIC m, where m is a topic no. (refer to the file description of F13.txt); each record in F13.txt contains a term with a NON-ZERO POSITIVE TFIDF weight. If a term has zero weight, don't write its record to F13.txt. The records in F13.txt are written in descending order of normalised TFIDF weights;
10. at the end of program execution, also send messages to the console:
no. of documents read (F05.txt);
no. of records written to F11.txt;
no. of terms held in F13.txt;
no. of known relevant documents (F07.txt);
no. of known non-relevant documents (F07.txt);
no. of relevant documents predicted by system;
no. of non-relevant documents predicted by system;
the elapsed time in hours/minutes/seconds of program execution.
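
Steps 5 and 8 above can be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not the assignment's reference solution: the class name PrototypeSketch, the use of a Map<String, Double> in place of the "Java vector objects", the per-document form of the Rocchio update, the clipping of negative weights to zero, and the coefficients alpha, beta and gamma are all hypothetical, since the assignment text does not specify them.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class; names and representation are assumptions.
class PrototypeSketch {

    /** Cosine similarity: dot product divided by the product of the Euclidean norms. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an all-zero vector is treated as matching nothing
        }
        return dot / (normA * normB);
    }

    /** Euclidean length of a term-weight vector. */
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    /**
     * Rocchio-style update for a single document (an assumed simplification):
     * relevant:     new = alpha * prototype + beta  * doc
     * non-relevant: new = alpha * prototype - gamma * doc
     * Negative weights are clipped to zero (assumption), then the result is normalized.
     */
    static Map<String, Double> rocchio(Map<String, Double> prototype, Map<String, Double> doc,
                                       boolean relevant, double alpha, double beta, double gamma) {
        Map<String, Double> updated = new HashMap<>();
        for (Map.Entry<String, Double> e : prototype.entrySet()) {
            updated.put(e.getKey(), alpha * e.getValue());
        }
        double factor = relevant ? beta : -gamma;
        for (Map.Entry<String, Double> e : doc.entrySet()) {
            updated.merge(e.getKey(), factor * e.getValue(), Double::sum);
        }
        updated.replaceAll((term, w) -> Math.max(0.0, w));
        return normalize(updated);
    }

    /** Scales the vector to unit Euclidean length; an all-zero vector stays all-zero. */
    static Map<String, Double> normalize(Map<String, Double> v) {
        double n = norm(v);
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Double> e : v.entrySet()) {
            out.put(e.getKey(), n == 0.0 ? 0.0 : e.getValue() / n);
        }
        return out;
    }
}
```

To honour the "every n documents" rule, one option would be to collect the documents of each batch and apply the update once per batch instead of once per document; which variant the assignment intends is not stated.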

My test script will invoke your program in this way (all files are assumed in the current directory):
java P02learn m n F05xyz.txt F07xyz.txt F08xyz.txt F11xyz.txt F13xyz.txt

The first parameter (m) is the topic no. for learning. The second parameter (n) defines the frequency of updating the prototype vector (e.g., after every n documents are read). If there is an existing F13xyz.txt in the current directory, your program will completely overwrite the content of this file; otherwise a new F13xyz.txt file will be created by your program.
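
As a rough illustration of the invocation above, the seven parameters could be read like this. The class name ArgsSketch and its field names are made up for the example; the range check on m relies on the 1-135 topic-id range given in the file specifications.

```java
// Hypothetical holder for the seven command-line parameters.
class ArgsSketch {
    final int topic;        // m: topic no. to learn
    final int updateEvery;  // n: update the prototype after every n documents
    final String f05, f07, f08, f11, f13;

    ArgsSketch(String[] args) {
        if (args.length != 7) {
            throw new IllegalArgumentException(
                "usage: java P02learn m n F05 F07 F08 F11 F13");
        }
        topic = Integer.parseInt(args[0]);
        updateEvery = Integer.parseInt(args[1]);
        if (topic < 1 || topic > 135) {
            throw new IllegalArgumentException("topic no. must be in 1-135");
        }
        if (updateEvery < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        f05 = args[2];
        f07 = args[3];
        f08 = args[4];
        f11 = args[5];
        f13 = args[6];
    }
}
```

In P02learn's main method this would be used as new ArgsSketch(args), with the error message printed to the console if an exception is thrown.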

Here is what the files are meant to look like:

F05.txt - The TFIDF vector file for the Reuters news articles. The record format is similar to that of F02.txt, but the weight of each term is normalized TFIDF instead of TF.
Record Format:
document-id     characters
term            characters
TFIDF weight    numeric (in the interval [0,1], 5 decimal places)

Example:
R2100 java 0.84321 agent 0.31111 nasdaq 0.21000 apple 0.11334
R2101 bill 0.64323 japan 0.61211 car 0.31111 star 0.20101 phone 0.12345
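
Going by the two example records, an F05 line looks like a document ID followed by term/weight pairs separated by whitespace. A minimal parsing sketch under that assumption (the class name F05Record is hypothetical, and the real file may use a different separator):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory form of one F05 record.
class F05Record {
    final String docId;
    final Map<String, Double> weights = new LinkedHashMap<>();

    F05Record(String line) {
        String[] tokens = line.trim().split("\\s+");
        // Expect the document ID plus complete term/weight pairs, i.e. an odd token count.
        if (tokens[0].isEmpty() || tokens.length % 2 == 0) {
            throw new IllegalArgumentException("malformed F05 record: " + line);
        }
        docId = tokens[0];
        for (int i = 1; i < tokens.length; i += 2) {
            weights.put(tokens[i], Double.parseDouble(tokens[i + 1]));
        }
    }
}
```

The weights map can then be passed directly to whatever routine computes the cosine score against the prototype vector.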

F07.txt - The relevance judgment file. If a topic id + document id is not found in this file, it implies that the particular news article is not relevant with respect to the given topic. The file is available from ITB263's OLT site.

Record Format:
topic-id        integer (1-135)
document-id     characters
user judgment   integer ('1' - relevant, '0' - non-relevant)

Example:
135 R1100 1
135 R1101 1
1 R2161 1
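
Steps 3 and 6 of the task (load the relevant document IDs for one topic, then look each F05 document ID up with Java's built-in binary search) might be sketched as below. The class and method names are hypothetical, and the lines are passed in as a List<String> so the example stays independent of how the file is read. Collections.binarySearch only gives correct answers on a sorted list, so the Vector is sorted once after loading.

```java
import java.util.Collections;
import java.util.List;
import java.util.Vector;

// Hypothetical helper for the F07 relevance judgments.
class F07Sketch {

    /** Keeps the document IDs judged relevant ('1') for the given topic, sorted ascending. */
    static Vector<String> relevantIds(List<String> lines, int topic) {
        Vector<String> ids = new Vector<>();
        for (String line : lines) {
            String[] t = line.trim().split("\\s+");
            if (t.length != 3) {
                continue; // skip blank or malformed lines
            }
            if (Integer.parseInt(t[0]) == topic && t[2].equals("1")) {
                ids.add(t[1]);
            }
        }
        Collections.sort(ids); // required before binarySearch
        return ids;
    }

    /** True if the user judged this document relevant, i.e. the ID is found. */
    static boolean isRelevant(Vector<String> sortedIds, String docId) {
        return Collections.binarySearch(sortedIds, docId) >= 0;
    }
}
```

With the three example records above and topic 135, the Vector holds R1100 and R1101, so R1101 is judged relevant and R2161 is not.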

F08.txt - The initial query (topic) file; each record corresponds to a user's specific interest. The file is available from ITB263's OLT site.

Record Format:
topic-id        integer (1-135)
weight          numeric (with up to five decimal places)
term            characters

Example:
1 1.0 acq
1 0.6 xxx
1 0.5 yyy
2 1.0 alum
3 1.0 austdlr
4 1.0 austral
133 1.0 wpi
134 1.0 yen
135 1.0 zinc
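
Building the initial prototype vector (step 2) then comes down to keeping the F08 records whose topic-id matches m. A small sketch, assuming whitespace-separated fields in the order shown above; the class name F08Sketch and the Map representation are assumptions, and a Vector of term/weight objects would work just as well:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical loader for the initial prototype vector of one topic.
class F08Sketch {

    /** Maps each term of the given topic to its initial weight. */
    static Map<String, Double> initialPrototype(List<String> lines, int topic) {
        Map<String, Double> prototype = new HashMap<>();
        for (String line : lines) {
            String[] t = line.trim().split("\\s+");
            if (t.length != 3) {
                continue; // skip blank or malformed lines
            }
            // Field order in F08: topic-id, weight, term.
            if (Integer.parseInt(t[0]) == topic) {
                prototype.put(t[2], Double.parseDouble(t[1]));
            }
        }
        return prototype;
    }
}
```

For topic 1 in the example this yields acq, xxx and yyy with weights 1.0, 0.6 and 0.5. Whether these initial weights should be normalized before the first cosine computation is not stated in the assignment.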

F11.txt - The result file; each record corresponds to a news article and contains both the system's prediction and the user's relevance judgment. For the program P03learn.java, the cosine-score field is blank.

Record Format:
document-id       characters
system judgment   integer ('1' - relevant, '0' - non-relevant)
user judgment     integer ('1' - relevant, '0' - non-relevant)
cosine-score      numeric (five decimal places)

Example:
R2100 1 0 0.50009
R2101 1 1 0.24321
R2102 0 0 0.00001
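
An F11 record and the F-measure used for tuning the threshold could be produced along these lines. This is a sketch only: the names are hypothetical, the assignment does not say which F-measure is meant, so the balanced F1 (harmonic mean of precision and recall) is assumed here, and any threshold value passed in is purely illustrative.

```java
import java.util.Locale;

// Hypothetical helper for F11 output and evaluation.
class F11Sketch {

    /** Formats one record as: document-id, system judgment, user judgment, cosine score. */
    static String record(String docId, double cosine, double threshold, boolean userRelevant) {
        int system = cosine >= threshold ? 1 : 0; // relevant if score >= threshold
        int user = userRelevant ? 1 : 0;
        // Locale.ROOT keeps the decimal point a '.' regardless of the machine's locale.
        return String.format(Locale.ROOT, "%s %d %d %.5f", docId, system, user, cosine);
    }

    /**
     * Balanced F-measure (assumed): 2PR / (P + R), which simplifies to
     * 2*TP / (2*TP + FP + FN). Returns 0 when nothing is relevant or predicted relevant.
     */
    static double f1(int truePos, int falsePos, int falseNeg) {
        int denom = 2 * truePos + falsePos + falseNeg;
        return denom == 0 ? 0.0 : (2.0 * truePos) / denom;
    }
}
```

Here a true positive is a record with both judgments 1, a false positive has system 1 and user 0, and a false negative has system 0 and user 1. Re-running with different thresholds and comparing the resulting F1 values is one way to pick the threshold.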

F13.txt - The user profile file; it contains the terms and weights describing the user's topical interest. These terms and weights are learnt by your program. The first record is a header record and the remaining records are the term records. The term records are in descending order of normalised TFIDF weights.

Header Record:
TOPIC n (n is an integer for the topic-id)
Record Format (remaining records):
term                      characters
normalized TFIDF weight   numeric (normalized in the interval [0,1], 5 decimal places)

Example:
TOPIC 3
program 0.74321
agent 0.61000
software 0.31000
os 0.11334
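
Producing the F13 records from the final prototype vector (step 9) might look like the following. The class and method names are hypothetical and the Map representation is an assumption; terms whose weight is not strictly positive are skipped and the rest are sorted by descending weight, as the specification asks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper that renders the user profile as F13 lines.
class F13Sketch {

    static List<String> profileLines(int topic, Map<String, Double> prototype) {
        List<Map.Entry<String, Double>> terms = new ArrayList<>();
        for (Map.Entry<String, Double> e : prototype.entrySet()) {
            if (e.getValue() > 0.0) { // only non-zero positive weights are written
                terms.add(e);
            }
        }
        // Descending order of weight.
        terms.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));

        List<String> lines = new ArrayList<>();
        lines.add("TOPIC " + topic); // header record
        for (Map.Entry<String, Double> e : terms) {
            lines.add(String.format(Locale.ROOT, "%s %.5f", e.getKey(), e.getValue()));
        }
        return lines;
    }
}
```

The lines can be written with java.nio.file.Files.write(Paths.get(fileName), lines), which by default creates the file or truncates an existing one, matching the overwrite requirement. Note that a very small positive weight would still print as 0.00000 at five decimal places; the assignment does not say how to treat that case.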
