Java, Unicode, and output to console

chomsky · Apr 23, 2007

So I'm doing some Arabic processing using Java, and I'm relying on extracting some Arabic that's encoded in UTF-8 from an XML document. However, when I print the extracted text with System.out, all I see is question marks. I can process the same file in Ruby and have it print the Arabic to the console, and I was under the impression that, because the encoding is specified in the XML, the DOM parser I'm using would take care of any encoding issues. The character counts seem to line up, so I'm not sure why the encoding is being lost. Does anyone know the cause of this?
Here's some of the XML (if the browser doesn't garble it):
<entity_attributes>
<name NAME="الأمريكية">
<charseq START="1533" END="1541">الأمريكية</charseq>
</name>
<name NAME="واشنطن">
<charseq START="1509" END="1514">واشنطن</charseq>
</name>
</entity_attributes>
</entity>

This is what some output looks like (not specifically for the above snippet):
Entity Type: TYPE="PER"
Head: START = 1042 END = 1050 ????? ???
Extent: START = 1042 END = 1050 ????? ???

generelz · Apr 23, 2007

What JAXP implementation are you using to parse the XML (Xerces, etc.)? Does the XML have a proper prologue declaration? Could you possibly post some of the code you are using to extract the strings? It would really help to get a complete XML document and some complete code, the smallest possible piece which exhibits the problem and that will make it very easy to debug/test.
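
A quick way to answer the first two questions from code is sketched below. It prints the concrete JAXP factory class the JVM picked and the encoding the parser read from the XML declaration. The class and method names (ParserInfo, factoryName, declaredEncoding) are made up for illustration and are not part of the code in this thread; Document.getXmlEncoding() assumes a DOM Level 3 parser, which Java 1.5 and later should provide.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper class, not part of the original code.
class ParserInfo {

    // Class name of the JAXP factory implementation chosen by the JVM.
    static String factoryName() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    // Encoding named in the XML declaration, as reported by the parser.
    static String declaredEncoding(byte[] xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(xml));
        return doc.getXmlEncoding();
    }

    public static void main(String[] args) throws Exception {
        // Tiny stand-in document; a real check would read the actual file's bytes.
        byte[] xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><a>x</a>".getBytes("UTF-8");
        System.out.println("Factory: " + factoryName());
        System.out.println("Declared encoding: " + declaredEncoding(xml));
    }
}
```

If the declared encoding comes back as expected, the parser is probably reading the file correctly and the problem is more likely on the output side.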

chomsky · Apr 23, 2007

For legal reasons I can't post the whole XML file, but here's the start of it. I noticed the encoding declaration is in an unusual spot, so I tried the standard one but still had no luck.

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE source_file SYSTEM "apf.v5.1.1.dtd">
<source_file URI="NTV20001002.1530.0534.sgm" SOURCE="broadcast news" TYPE="text" AUTHOR="LDC" ENCODING="UTF-8">
<document DOCID="NTV20001002.1530.0534">
<entity ID="NTV20001002.1530.0534-E1" TYPE="PER" SUBTYPE="Group" CLASS="SPC">
<entity_mention ID="NTV20001002.1530.0534-E1-2" TYPE="NOM" LDCTYPE="NOM">
<extent>
<charseq START="371" END="395">قبائل جزين جنوب شرق لبنان</charseq>
</extent>
<head>
<charseq START="371" END="375">قبائل</charseq>
</head>
</entity_mention>

And here's what I have so far in terms of parsing it:
(I'm working from some generic Java Almanac examples, so
the comments are still a bit quirky)

Code:

import java.io.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import javax.xml.parsers.*;

public class NEParser{

	public static  void main(String[] args){
		Document doc = parseXmlFile(args[0], false);
		System.out.println("DOM Document constructed");
		visit(doc, 0);
		System.out.println("اللبناني");
	}



	public static Document parseXmlFile(String filename, boolean validating) {
		try {
			// Create a builder factory
			DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
			factory.setValidating(validating);

			// Create the builder and parse the file
			Document doc = factory.newDocumentBuilder().parse(new File(filename));
			return doc;
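		} catch (SAXException e) {
			// A parsing error occurred; the xml input is not valid
			e.printStackTrace();
		} catch (ParserConfigurationException e) {
			// The parser could not be configured as requested
			e.printStackTrace();
		} catch (IOException e) {
			// The file could not be read
			e.printStackTrace();
		}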
		return null;
	}




	// This method visits all the nodes in a DOM tree
	public static void visit(Node node, int level) {
		// Process node

		// If there are any children, visit each one
		NodeList list = node.getChildNodes();
		for (int i=0; i<list.getLength(); i++) {
			// Get child node
			Node childNode = list.item(i);
			if(childNode.getNodeName().equals("entity_mention")){
				NamedNodeMap nnm = node.getAttributes();
				if(nnm != null){
					int len = nnm.getLength() ;
					Attr attr;
					attr = (Attr)nnm.getNamedItem("TYPE");
					//	System.out.println(attr);
					EntityMention em = parseMention(childNode, attr + "");
					System.out.println(em);
				}
			}
			else{
				// Visit child node
				visit(childNode, level+1);
			}
		}
	}

	public static EntityMention parseMention(Node node, String type){
		//System.out.println(type);
		CharSeq head = null;
		CharSeq extent = null;

		NodeList list = node.getChildNodes();
		for (int i=0; i<list.getLength(); i++) {
			// Get child node
			Node childNode = list.item(i);
			if(childNode.getNodeName().equals("head")){
				head = parseHead(childNode);
			}
			else if(childNode.getNodeName().equals("extent")){
				extent = parseHead(childNode);
			}
		}
		return new EntityMention(head, extent, type);
	}


	public static CharSeq parseHead(Node node){
		CharSeq cs = null;
		NodeList list = node.getChildNodes();
		for (int i=0; i<list.getLength(); i++) {
			// Get child node
			Node childNode = list.item(i);

			//	System.out.println(childNode.getNodeName());
			if(childNode.getNodeName().equals("charseq")){
				cs = parseCharSeq(childNode);
			}
		}
		return cs;
	}

	public static CharSeq parseCharSeq(Node node){
		CharSeq cs = null;
		NamedNodeMap nnm = node.getAttributes();
		if(nnm != null){
			int len = nnm.getLength() ;
			Attr start = (Attr)nnm.getNamedItem("START");
			Attr end   = (Attr)nnm.getNamedItem("END");
			//	System.out.println(node);
			NodeList list = node.getChildNodes();
			for (int i=0; i<list.getLength(); i++) {
				// Get child node
				Node childNode = list.item(i);
				String arabicText = childNode.getNodeValue();
				int startInt = Integer.parseInt(start.getValue());
				int endInt = Integer.parseInt(end.getValue());
				//	System.out.println(start +"\t" + end + "\t" + type + "\t" + arabicHead);
				cs = new CharSeq(arabicText, startInt, endInt);
			}
		}
		return cs;
	}

	}

generelz · Apr 23, 2007

Could you reformat your post with [ code ] tags please?

Have you tried setting the file.encoding system property by passing a -Dfile.encoding=UTF-8 argument to the JVM?

mikeblas · Apr 23, 2007

chomsky said:
So I'm doing some arabic processing using java, and I'm relying on extracting some arabic that's encoded in UTF-8 from an xml document. However, when I go to print (system.out) the text that I've extracted, all I see is question marks.

Java is available for many platforms. Upon which one are you observing this behaviour?

Java is available for many platforms. Which ones are you targeting with your executable?

Is the console some window made available by your tools, or the console window your OS provides?

chomsky · Apr 23, 2007

Nah, haven't tried that yet

chomsky · Apr 23, 2007

This is Java 1.5 on OS X (10.4), for use on that same system. The console is my "terminal", my stdout.

chomsky · Apr 23, 2007

I'm not sure what -Dfile.encoding does per se, but setting the encoding property from inside the code with
System.setProperty("file.encoding", "UTF-8");
didn't fix it.
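
One possible explanation, which nothing in this thread confirms: System.out is a PrintStream that was created with the JVM's default charset at startup, so changing the property later does not re-create it. A PrintStream writes '?' for any character its charset cannot represent, which would match the output above. The sketch below reproduces that with an in-memory stream and then shows one way to wrap stdout in an explicitly UTF-8 PrintStream. The class and method names (ConsoleEncodingDemo, roundTrip) are hypothetical, and the Arabic word is written with Unicode escapes so the source file's own encoding doesn't matter.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.nio.charset.Charset;

// Hypothetical demo class, not part of the original code.
class ConsoleEncodingDemo {

    // Prints text through a PrintStream that encodes with the given charset,
    // then decodes the resulting bytes with that same charset.
    static String roundTrip(String text, String charsetName) throws Exception {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink, true, charsetName);
        out.print(text);
        out.flush();
        return new String(sink.toByteArray(), charsetName);
    }

    public static void main(String[] args) throws Exception {
        // The five-letter word from the <head> element quoted earlier in the thread.
        String arabic = "\u0642\u0628\u0627\u0626\u0644";

        System.out.println("Default charset: " + Charset.defaultCharset());

        // US-ASCII cannot represent Arabic, so every character becomes '?'.
        System.out.println("Via US-ASCII: " + roundTrip(arabic, "US-ASCII"));

        // Wrapping stdout with an explicit encoding bypasses the default charset.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(arabic);
    }
}
```

Whether the last line shows readable Arabic still depends on the terminal being set to UTF-8.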

generelz · Apr 24, 2007

My two guesses, since I can't compile your code:

1.) Your CharSeq class, if it is doing any manual byte manipulation, may be mangling the characters. If it is working only with Java String and char primitives, it should be OK.

2.) The way the DocumentBuilder is reading in the file is not correct. I would suggest constructing a UTF-8 capable InputStream and passing the InputStream to the factory rather than just passing it the File object. My guess is it is using the platform default encoding for the file, or doing some sort of auto detection which may not be sufficient.

Perhaps something like this:

Code:

Document doc = factory.newDocumentBuilder().parse(new InputSource(new InputStreamReader(new FileInputStream(filename), "UTF-8")));

Edit:

This may be a little cleaner

Code:

InputSource inputSource = new InputSource(new FileReader(filename));
inputSource.setEncoding("UTF-8");
Document doc = factory.newDocumentBuilder().parse(inputSource);
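
One caveat on the FileReader variant: the SAX InputSource documentation says setEncoding has no effect once a character stream is supplied, and FileReader decodes with the platform default charset. A variant that hands the parser raw bytes, so that the encoding hint can actually apply, might look like the sketch below. The class and method names (ByteStreamParseDemo, firstCharseqText) are invented for illustration, and the sample document is a cut-down stand-in for the real file.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original code.
class ByteStreamParseDemo {

    // Parses XML from a byte stream and returns the text of the first <charseq>.
    // With a byte stream the parser does the decoding itself, so the encoding
    // set on the InputSource is taken into account.
    static String firstCharseqText(InputStream bytes, String encoding) throws Exception {
        InputSource source = new InputSource(bytes);
        source.setEncoding(encoding);
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(source);
        return doc.getElementsByTagName("charseq").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Arabic written as Unicode escapes so this source file stays ASCII-only.
        String word = "\u0642\u0628\u0627\u0626\u0644";
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<head><charseq START=\"371\" END=\"375\">" + word + "</charseq></head>";

        // For a real file, a java.io.FileInputStream would replace this stream.
        InputStream in = new ByteArrayInputStream(xml.getBytes("UTF-8"));
        String text = firstCharseqText(in, "UTF-8");

        System.out.println("Length: " + text.length());
        System.out.println("Matches original: " + text.equals(word));
    }
}
```

The check compares strings instead of printing the Arabic, so the result does not depend on how the console is configured.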

chomsky · Apr 24, 2007

I didn't include the two other classes because they're just getters/setters (and I didn't think anyone would be nice enough to go as far as trying to compile it), but I'll put them at the end of this post. Anyway, I was thinking along the same lines, that it was something in the way I was reading in the file, but so far no changes to it, including the line you posted, have worked. It's quite irritating, as it "just works" in Ruby, and I don't know where this is going wrong. Even if I could just convert the Java Strings to other String objects with the proper encoding, that would be fine, but so far I haven't stumbled upon the right combination of input/output encoding. I'm still going.

Code:

public class CharSeq{
	private String text;
	private int start;
	private int end;
	
	
	public CharSeq(String text, int start, int end){
		this.text = text;
		this.start = start;
		this.end = end;
	}
	
	public String getText(){
		return text;
	}
	
	public int getStart(){
		return start;
	}
	
	public int getEnd(){
		return end;
	}
	
	public String toString(){
		return "START = " + start + "\tEND = " + end + "\t" + text;
	}
	
}






public class EntityMention{
	private CharSeq head;
	private CharSeq extent;
	private String type;
	
	public EntityMention(CharSeq head, CharSeq extent, String type){
		this.head = head;
		this.extent = extent;
		this.type = type;
	}
	
	public CharSeq getHead(){
		return head;
	}
	
	public CharSeq getExtent(){
		return extent;
	}
	
	public String getType(){
		return type;
	}

	public String toString(){
		return "Entity Type: " + type + "\nHead: " + head + "\nExtent: " + extent + "\n";
	}
}
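
On the idea of converting Strings to another encoding: a Java String holds UTF-16 code units and carries no encoding of its own, so encodings only come into play where bytes are read or written. One way to tell whether the parsed text is already correct, independent of what the console can display, is to dump the code points as plain ASCII. This is a minimal sketch; StringDump and toCodePoints are hypothetical names.

```java
// Hypothetical helper class, not part of the original code.
class StringDump {

    // Renders each UTF-16 code unit as U+XXXX, separated by spaces, so the
    // content of a String can be inspected on any console.
    static String toCodePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // First two letters of the <head> text quoted earlier, as Unicode escapes.
        System.out.println(toCodePoints("\u0642\u0628")); // prints "U+0642 U+0628"
    }
}
```

Calling toCodePoints(getText()) on a CharSeq would show values in the Arabic block (U+0600 to U+06FF) if parsing went well; values of U+003F would mean the text was already replaced with question marks before it reached the String.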

chomsky · Apr 24, 2007

Oh my, that last edit you posted seems to have done the trick. I still have to investigate a bit, but thanks! This might be case closed.
