I'm doing some Arabic text processing in Java, which relies on extracting UTF-8 encoded Arabic from an XML document. However, when I print the extracted text with System.out, all I see is question marks. I can process the same file in Ruby and have it print the Arabic to the console. I was under the impression that, because the encoding is specified in the XML, the DOM parser I'm using would take care of any encoding issues. The character counts seem to line up, so I'm not sure where the encoding gets lost. Does anyone know the cause of this?
Here's some of the XML (if the browser doesn't garble it):
<entity_attributes>
<name NAME="الأمريكية">
<charseq START="1533" END="1541">الأمريكية</charseq>
</name>
<name NAME="واشنطن">
<charseq START="1509" END="1514">واشنطن</charseq>
</name>
</entity_attributes>
</entity>
This is what some output looks like (not specifically for the above snippet):
Entity Type: TYPE="PER"
Head: START = 1042 END = 1050 ????? ???
Extent: START = 1042 END = 1050 ????? ???
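For reference, here is a minimal sketch of roughly what the extraction looks like. It is not my actual code: the class name `ArabicDomCheck` and both helper methods are made up for illustration, and the XML is a cut-down version of the snippet above written with Unicode escapes so the source file's own encoding doesn't matter. It dumps the code points of the extracted string (to see whether the DOM decoded it correctly), then prints the same string through plain `System.out` and through a `PrintStream` explicitly set to UTF-8, on the assumption that the console output encoding, rather than the parser, might be where the question marks come from.

```java
import java.io.ByteArrayInputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class ArabicDomCheck {

    // Hypothetical helper: parse XML bytes with DOM and return the text
    // of the first <charseq> element.
    public static String firstCharseq(byte[] xmlBytes) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(xmlBytes));
            NodeList nodes = doc.getElementsByTagName("charseq");
            return nodes.item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Hypothetical helper: render a string as "U+XXXX U+XXXX ..." so the
    // decoded characters can be inspected independently of the console.
    public static String toCodePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        // The escapes below spell the Arabic word from the second <name>
        // element in the snippet above (six characters).
        String word = "\u0648\u0627\u0634\u0646\u0637\u0646";
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<name NAME=\"" + word + "\">"
            + "<charseq START=\"1509\" END=\"1514\">" + word + "</charseq>"
            + "</name>";

        String text = firstCharseq(xml.getBytes(StandardCharsets.UTF_8));

        // ASCII-only output: shows what the parser actually produced.
        System.out.println(toCodePoints(text));

        // Uses whatever encoding System.out was set up with.
        System.out.println(text);

        // Same string, but encoded explicitly as UTF-8 on the way out.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(text);
    }
}
```

If the code-point line shows values in the Arabic block (U+06xx) while the plain `System.out` line still shows question marks, that would suggest the parsed string is fine and the loss happens at output time.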