Topic: Is my understanding of what Bill Brogden says in his book correct?
maha anna
bartender
posted April 10, 2000 04:43 PM             
Hello all,
This info from Bill Brogden's Exam Cram book bothers me. He says that in readUTF(..)/writeUTF(..), the first 2 bytes written are the length of the input String. But as per the Java 2 API, it is the total number of bytes calculated for ALL chars in the String after applying the UTF-8 conversion rules, which may be different from the length of the input String. I also verified this with the following program. Only what the API says happens.
So my question is: does the following info from Bill's book mean the length of the input String (the number of chars) or not? If it does, then it is NOT CORRECT. I checked his errata page. It is not listed there.

I know this post is too long. Please take your own time to verify this. I will be relieved.
regds
maha anna

From Java API
If a character c is in the range \u0001 through \u007f, it is represented by one byte.
If a character c is \u0000 or is in the range \u0080 through \u07ff, then it is represented by two bytes.
If a character c is in the range \u0800 through \uffff, then it is represented by three bytes.
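The three rules above can be turned into a small counting routine. This is only a sketch: the class name UtfLengthSketch and the method encodedLength are made up for illustration and are not part of the Java API. It counts the bytes that follow the 2-byte prefix, going purely by the ranges quoted from the API.

```java
class UtfLengthSketch {
    // Hypothetical helper: bytes needed for the chars of s under the
    // rules quoted above, NOT counting the 2-byte length prefix.
    static int encodedLength(String s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007f) {
                bytes += 1;              // plain 7-bit range
            } else if (c <= 0x07ff) {
                bytes += 2;              // 0x0000, or 0x0080 through 0x07ff
            } else {
                bytes += 3;              // 0x0800 through 0xffff
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(encodedLength("\u0001\u007f"));             // 2
        System.out.println(encodedLength("\u0000"));                   // 2
        System.out.println(encodedLength("\u0080\u07ff\u06ff\u05ff")); // 8
        System.out.println(encodedLength("\u0800\uffff\uefff\udfff")); // 12
    }
}
```

Only the first String has a byte count equal to its number of chars; the other three differ, which is the whole point of the question.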
/*
From Bill Brogden Java 2 Exam Cram book /page 287 /paragraph 4 and 5
void writeUTF(String s)
This method specified in the DataOutput interface writes the 'LENGTH OF
THE STRING'
as a 2 byte unsigned integer, then the bytes encoding the
chars
String readUTF()
This method specified in DataInput interface, first reads 2 bytes to get
the COUNT OF THE NUMBER OF CHARACTERS TO FOLLOW , then reads and interprets
the stream until that many characters have been decoded and placed in the
String.
*/


import java.io.*;
class test {
public static void main(String[] args) throws Exception{

FileOutputStream fos = new FileOutputStream("text.dat");
DataOutputStream dos = new DataOutputStream(fos);

// Enable exactly one writeUTF call per run. The code below reads back
// only the first string in the file, and with none enabled readUTF()
// fails with an EOFException on the empty file.

// dos.writeUTF("\u0001\u007f"); // prints 2 (2*1)
/* As per Bill, it must be 2.
Here the length of the input String equals the total bytes
calculated using UTF-8, so both are the SAME.
*/

// dos.writeUTF("\u0000");
// prints 2 (1*2). But as per Bill Brogden it MUST BE 1

// dos.writeUTF("\u0080\u07ff\u06ff\u05ff");
// prints 8 (4*2). But as per Bill Brogden it MUST BE 4

dos.writeUTF("\u0800\uffff\uefff\udfff");
// prints 12 (4*3). But as per Bill Brogden it MUST BE 4
dos.close();
fos.close();

FileInputStream fis = new FileInputStream("text.dat");
DataInputStream dis = new DataInputStream(fis);
String utfStr = dis.readUTF();
System.out.println(utfStr);
dis.close();
fis.close();

fis = new FileInputStream("text.dat");
dis = new DataInputStream(fis);
int totLength = dis.readUnsignedShort();
System.out.println("1st 2 bytes written="+totLength);
dis.close();
}

}

[This message has been edited by maha anna (edited April 11, 2000).]

William Brogden
greenhorn
posted April 11, 2000 09:05 AM             
Good call! The book is in error. Here is the documentation from the source for writeUTF. The first two bytes give the number of
bytes in the translated stream.

/**
* Writes a string to the underlying output stream using UTF-8
* encoding in a machine-independent manner.
*
* First, two bytes are written to the output stream as if by the
* writeShort method giving the number of bytes to
* follow. This value is the number of bytes actually written out,
* not the length of the string. Following the length, each character
* of the string is output, in sequence, using the UTF-8 encoding
* for the character. If no exception is thrown, the counter
* written is incremented by the total number of
* bytes written to the output stream. This will be at least two
* plus the length of str, and at most two plus
* thrice the length of str.
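
The bounds in this documentation can be checked in memory, without a file. The following is a minimal sketch, assuming nothing beyond java.io; the class name WriteUtfPrefixSketch and the method measure are hypothetical names chosen for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class WriteUtfPrefixSketch {
    // Hypothetical helper: returns {2-byte prefix value, total bytes written}
    // for a single writeUTF call on an in-memory stream.
    static int[] measure(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream dos = new DataOutputStream(buf);
            dos.writeUTF(s);
            dos.flush();
            byte[] raw = buf.toByteArray();
            DataInputStream dis = new DataInputStream(new ByteArrayInputStream(raw));
            int prefix = dis.readUnsignedShort(); // same read as in the test program
            return new int[] { prefix, raw.length };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String s = "\u0080\u07ff\u06ff\u05ff";
        int[] m = measure(s);
        // prints chars=4 prefix=8 total=10
        System.out.println("chars=" + s.length() + " prefix=" + m[0] + " total=" + m[1]);
    }
}
```

For this String the total of 10 bytes lies between 2 + 4 and 2 + 3*4, as the documentation says, and the prefix is the byte count 8, not the char count 4.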


maha anna
bartender
posted April 11, 2000 09:25 AM             
Thank you, Mr. William Brogden.
I didn't expect your reply to my email to be so quick.
regds
maha anna

[This message has been edited by maha anna (edited April 16, 2000).]

Kondal Rao
greenhorn
posted April 17, 2000 11:16 AM         
Hi Maha,

I read your post. I have been reading this Bill Brogden "Exam Cram" book. It is written in an excellent way to prepare for the exam (at least in my opinion). From your answers in this forum, I can see you take a lot of interest in answering questions. I appreciate your effort in educating the Java community. Having said that, here are my comments.

In your post you mentioned paragraphs 4 and 5, but you left out paragraph 3, which is given below. When you explain a concept in a book, you do not qualify every word completely in all of its occurrences, as in a legal document. That would defeat the purpose of explaining the concept and make it hard to follow. Similarly, we cannot take an individual sentence from a book and say that it is wrong while forgetting the context of the sentence. When you read paragraphs 3, 4 and 5 in order, you do not come away with the understanding that the value returned by the length() method of the String object is what gets written to the output file as the string length. At least I did not get that understanding.

"A single character may end up encoded in one, two, or three bytes. Because there is no direct equivalence between the number of characters encoded with UTF-8 in a file and number of bytes, the reading and writing methods use a special format."

So I would not necessarily call this an error in the book.

Regards,
Kondal.

maha anna
bartender
posted April 17, 2000 11:52 AM             
Kondal Rao,
Thanks for your response. As you mentioned, I took a lot of care before posting this message. I read that chapter over and over again to check whether what I think the author thinks is correct or not. I never take a meaning from a single line or, in fact, a single paragraph. I read the paragraph in context with all the paragraphs before and after it.

What I thought the author thought was this.
What he says in the previous paragraph is correct: a single char may end up encoded in 1, 2, or 3 bytes, and there is no direct equivalence between the number of chars encoded with UTF-8 in a file and the number of bytes. Having said that, he misapplies the encoding scheme. He knows a char may end up encoded in at most 3 bytes, but he slips in applying that. A String, say "Kondal Rao", may take up to 10*3 bytes. This is true and I agree. But the author also says that the String is written to the file under UTF-8 in this format:
[no.ofcharsinString][encoded_K][encoded_o][encoded_n]......[encoded_o] .
Where I thought the author's understanding was wrong is the first 2 bytes written, which say how much to read. The author says it is the number of chars in the String 'Kondal Rao', which here is 10. But what I say is that it is NOT necessarily 10. It is the total number of bytes calculated after applying the UTF-8 conversion, which may be K (1 byte) + o (maybe 2 bytes) + n (maybe 2 bytes) + d (maybe 2 bytes) ...
Adding them all gives n bytes, and the result may be at most 10*3 = 30 (because of the max of 3 bytes per char in UTF-8).

So the book is in fact in error (or was, if he has since put it in his errata). Also note that I wrote this post NOT from the standpoint of what I think is correct, but of what the author really thinks is correct. If my understanding of what he thinks was right, then the author is WRONG, which turned out to be true.

Also note that I never pick on anybody just because I want to say something. I also didn't purposely mention only those paragraphs and leave out the previous one. In fact I did post the previous paragraph initially, and after seeing the looooooooong post I edited it out.
Feel free to reply to Maha's response.

regds
maha anna

[This message has been edited by maha anna (edited April 17, 2000).]

Kondal Rao
greenhorn
posted April 17, 2000 01:00 PM         
Maha,

After reading your recent post, I went through the "Streams and Characters" section of chapter 14. After reading it, I could never come to the conclusion that when the String "Kondal Rao" is written to the output file, the first two bytes (the UTF-8 encoded string length) must be its character count of 10.

Instead I also came to the same conclusion as you that the following will be written to a file using UTF-8 encoding:

[length in bytes after encoding in UTF-8] [encoded_K][encoded_o][encoded_n]......[encoded_o].
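
To see this layout for a concrete String, here is a rough sketch that dumps what writeUTF produces. The class name UtfLayoutSketch, the method layout, and the accented spelling "Kondal R\u00e3o" are all assumptions made up for the example; only DataOutputStream.writeUTF is the real API.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class UtfLayoutSketch {
    // Hypothetical helper: renders writeUTF output as
    // "[prefix] b1 b2 ..." with the encoded bytes in hex.
    static String layout(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);
            byte[] raw = buf.toByteArray();
            // The first two bytes form an unsigned big-endian length.
            int prefix = ((raw[0] & 0xff) << 8) | (raw[1] & 0xff);
            StringBuilder sb = new StringBuilder("[" + prefix + "]");
            for (int i = 2; i < raw.length; i++) {
                sb.append(String.format(" %02x", raw[i] & 0xff));
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // All ten chars are in the one-byte range, so the prefix is 10.
        System.out.println(layout("Kondal Rao"));
        // Made-up variant with one two-byte char: still 10 chars, prefix 11.
        System.out.println(layout("Kondal R\u00e3o"));
    }
}
```

For a String made only of chars in the one-byte range, the byte count and the char count coincide, which may be why the wording in the book is easy to read past. The two values differ as soon as one char falls outside that range.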

Anyway, let us not argue over whether the rose is red. The important issue here is: are we clear in our concepts? The answer is "Yes" in this case, so let us move on.

Regards,
Kondal.

maha anna
bartender
posted April 17, 2000 01:51 PM             
Kondal Rao,
It is here. From the author's book. Note the red color font.
void writeUTF(String s)
This method specified in the DataOutput interface writes the 'LENGTH OF THE STRING' as a 2 byte unsigned integer, then the bytes encoding the chars

Anyway I am happy most of the readers didn't notice this and at the same time I wonder if they understood correctly.

OK, I won't reply to your post again.
regds
maha anna
