Kulyok Posted July 16, 2014
It seems that the latest beta of BG1 NPC has the best and most convenient code for handling tra files and converting them to UTF-8 for EE installations. I was wondering if you folks would take a few minutes to write a short tutorial that we could adapt for other mods? Basically, which files to take and which code to steal.
jastey Posted July 16, 2014
Not having to support two sets of tra files is preferable, especially if the mod is still in progress (not to mention that maintaining two sets of tra files in different encodings is a pain in the ***). I am very much interested in this as well.
AstroBryGuy Posted July 17, 2014
Sounds like a good idea. That code was written by Isaya, so I'll see if Isaya wants to write it. If not, I understand it well enough to write something up.
Isaya Posted July 17, 2014
Actually, Kulyok, the code in Xan Friendship is not really different; it's mostly a copy of the code from BG1 NPC. I may even use it as a base to explain how to proceed. I'll see about writing a short guide in the next few weeks.
In a discussion on the same topic on the WeiDU forum, Wisp provided a template for those who prefer shipping both sets of files in the mod. I'd suggest his way for most mods, notably those with only a few tra files, as it's easier to implement and does not require checking the operating system and including a script that depends on it. For that case, I could suggest an equivalent script, to be run before packing the mod, that regenerates the UTF-8 set of files from the original ones.
Kulyok (Author) Posted July 18, 2014
Thing is, we don't have UTF-8 files for most of our translations, and I personally don't think I should make them myself, because I'm not familiar with most of the languages and can't quality-control them (what does this symbol mean? that one? is it good or is it gibberish?). That's why I was hoping your method would help us out by automatically creating the desired files in UTF-8.
Jarno Mikkola Posted July 18, 2014
Quote (Kulyok): Thing is, we don't have utf-8 files for most of our translations...
And you can't just take a file and then edit it with Notepad++ to make it into a "utf-8 without BOM" encoded file?
argent77 Posted July 18, 2014
Quote (Kulyok): Thing is, we don't have utf-8 files for most of our translations...
Quote (Jarno Mikkola): And you can't just take a file and then edit it with Notepad++ to make it into a "utf-8 without BOM" encoded file?
I think the key problem is determining the right character set of the source files. Different localizations of the games require specific encodings (in some cases they are very exotic and not part of any official standard, e.g. for Polish or Chinese). Even Notepad++ can't detect them automatically in many cases.
Isaya Posted July 18, 2014
Quote (Kulyok): Thing is, we don't have utf-8 files for most of our translations, and I personally don't think I should make them myself, because I'm not familiar with most of the languages and can't quality-control(what does this symbol mean? that one? is it good or is it gibberish?) That's why I was hoping your method would help us out by automatically creating the desired files in utf8.
Actually, the process to obtain the UTF-8 files is the same as during installation, except that you have to do it for each language and provide the proper original encoding name. The question of which encoding BG II (not EE) used for a given language is the same whether you convert the files before packing the mod or during installation. At some point, you need to determine the initial encoding. I know them for western European languages (English, French, Spanish, German and Italian all use CP1252, Brazilian Portuguese probably too), Polish (CP1250, probably Czech too) and Russian (CP1251), but I'm not able to provide a complete list. Because of the way BG II handled fonts, making the game available in Polish or in Cyrillic required specific font files. I have no clue which encoding was used to make the Chinese or Korean versions, for instance (they are available in BGT). Not all languages are available in BGEE yet; for instance, Russian is only available in the latest beta version (I should try Xan with it).
Quote (Jarno Mikkola): And you can't just take a file and then edit it with Notepad++ to make it into a "utf-8 without BOM" encoded-file ?
In my experience (just tried now with Notepad++ 6.51, not the latest but not old either), this doesn't work except for the encoding used by Windows on your system. When I type special characters in the default encoding for my country (CP1252), then tell Notepad++ to convert to UTF-8 without BOM, it works.
However, if I open a file from the Polish translation, encoded in CP1250, each byte is still displayed as the character that has the same 8-bit code in CP1252. So I see things like superscript 1, 2 or 3, or the inverted question mark (¿) of Spanish, instead of the proper Polish characters. These wrong characters remain after conversion to UTF-8. As argent77 said, Notepad++ has no way of guessing the encoding of 8-bit characters.
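The effect described above can be reproduced outside Notepad++. Here is a minimal Python sketch (Python is not part of any mod's code, and the sample string is made up for illustration): it stores three Polish letters as CP1250 bytes, then decodes them once with the wrong code page and once with the right one.

```python
# Illustration only: the sample text is hypothetical, not taken from any tra file.
polish_text = "ąłż"

# Bytes as they would be stored in a CP1250-encoded tra file.
raw = polish_text.encode("cp1250")

# Wrong guess: reading the same bytes as CP1252 gives different characters.
misread = raw.decode("cp1252")

# The mistake survives conversion to UTF-8: the wrong characters are just re-encoded.
broken_utf8 = misread.encode("utf-8")

# Right guess: decode as CP1250 first, then encode.
# Python's "utf-8" codec writes no BOM ("utf-8-sig" would add one).
good_utf8 = raw.decode("cp1250").encode("utf-8")

print(repr(misread))             # '¹³¿'
print(broken_utf8 == good_utf8)  # False
```

The point is that the conversion to UTF-8 itself is trivial; the only hard part is knowing which code page to decode with.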
Kulyok (Author) Posted July 19, 2014
Mostly it's the first four, so CP1252 it probably is, plus Polish (CP1250), plus Russian.
Wisp Posted July 19, 2014
A thing you need to watch out for is that some translators have mixed several (typically two) character encodings in the same tra file. This was typically done in the olden days, when the text added to dialog.tlk needed to be encoded in, for example, CP1252, while the text displayed during the installation needed to be encoded in the corresponding MS-DOS code page, and the modder had put the two kinds of strings into the same TRA file. This has since become much less common, with translators instead choosing to go with English for the latter text, or simply dropping the diacritical marks from it, but it is still fairly common in the wild (I've personally unscrewed several mods, out of the handful I've done charset work on).
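To make the mixed-encoding problem concrete, here is a small Python sketch. The two strings are hypothetical, and CP850 is only an assumption standing in for "the corresponding MS-DOS code page" of a western European translation; a real mod may have used a different one.

```python
# Hypothetical strings; CP850 is an assumed MS-DOS code page, for illustration only.
dialog_bytes = "Enchanté".encode("cp1252")  # string destined for dialog.tlk
setup_bytes = "Installé".encode("cp850")    # string shown by the installer console


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode raw bytes with the given code page and re-encode as UTF-8 (no BOM)."""
    return raw.decode(source_encoding).encode("utf-8")


# Treating the whole file as CP1252 damages the installer string...
wrong = setup_bytes.decode("cp1252")

# ...so each kind of string has to be converted with its own code page.
fixed_dialog = to_utf8(dialog_bytes, "cp1252").decode("utf-8")
fixed_setup = to_utf8(setup_bytes, "cp850").decode("utf-8")

print(wrong == "Installé")        # False
print(fixed_dialog, fixed_setup)  # Enchanté Installé
```

In practice someone still has to work out which strings in the file belong to which group; the sketch only shows why a single-encoding conversion of such a file cannot be right.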
argent77 Posted July 19, 2014
I've used the following list of character sets as reference when adding the Charset feature to NearInfinity (no liability assumed):
English: CP1252
French: CP1252
German: CP1252
Italian: CP1252
Spanish: CP1252
Polish: BG1: ISO-IR-179 (a supplement of ISO-8859-13), BG2: CP1250
Czech: CP1250
Russian: CP1251
Japanese: Shift JIS, may also be (known as) CP932
Simplified Chinese: CP936
Traditional Chinese: ? (most likely some proprietary charset available in Chinese Windows versions only)
Korean: CP949
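As a rough sketch of how a list like the one above could drive a pre-packaging conversion, here is a Python version. This is not the WeiDU code used by BG1 NPC; the language keys, the function name and the file layout are all hypothetical. Only the BG2-era code pages are included, and BG1 Polish and Traditional Chinese are left out because the list gives no usable codec name for them.

```python
from pathlib import Path

# Python codec names for the code pages listed above (BG2-era files only).
# The dictionary keys are hypothetical language folder names.
LANGUAGE_CODECS = {
    "english": "cp1252",
    "french": "cp1252",
    "german": "cp1252",
    "italian": "cp1252",
    "spanish": "cp1252",
    "polish": "cp1250",
    "czech": "cp1250",
    "russian": "cp1251",
    "japanese": "cp932",
    "schinese": "cp936",
    "korean": "cp949",
}


def convert_tra_to_utf8(src: Path, dst: Path, language: str) -> None:
    """Re-encode one tra file from its legacy code page to UTF-8 without BOM."""
    codec = LANGUAGE_CODECS[language]      # KeyError for an unlisted language
    text = src.read_bytes().decode(codec)  # raises UnicodeDecodeError on invalid bytes
    dst.parent.mkdir(parents=True, exist_ok=True)
    # write_bytes avoids newline translation; "utf-8" adds no BOM.
    dst.write_bytes(text.encode("utf-8"))
```

A call might look like `convert_tra_to_utf8(Path("tra/russian/setup.tra"), Path("tra_utf8/russian/setup.tra"), "russian")`, with both paths being made-up examples. A decode error is a useful hint that the assumed code page is wrong, but a clean decode does not prove it is right.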
Kulyok (Author) Posted July 19, 2014
Quote (Wisp): A thing you need to watch out for is that some translators have mixed several (typically two) character encodings in the same tra file. This was typically done in the olden days, when the text added to dialog.tlk needed to be encoded in, for example, CP1252, while the text displayed during the installation needed to be encoded in the corresponding MS-DOS code page and the modder had put the two kinds of strings into the same TRA file. Later this has become much less common, with translators instead choosing to go with English for the latter text, or simply dropping the diacritical marks from it, but it is still fairly common in the wild (I've personally unscrewed several mods, out of the handful I've done charset work on).
That happened with old Russian mods, yeah. I'm watching for it in mine (thankfully easy): all those MS-DOS lines went straight to English.
This topic is now archived and is closed to further replies.