Detect records with encoded characters in the file

Status
Not open for further replies.

amkipnis

Programmer
Apr 15, 2003
21
US
Greetings,
I have a data file whose records contain encoded characters (format unknown). I need to write a script that finds those records, moves them to another directory, and substitutes question marks for the encoded characters.
Here is what I have written so far. For some reason, records containing white space, tildes (~), or dashes (-) are also moved along with the records that contain encoded characters.
I appreciate any help in advance!
P.S. The data below is just a segment of the 5K-record file. The other records do not start with "ALL_LISTS..".
I am running the script on a Linux server: Linux sles1 2.6.16.60-0.21-bigsmp
===================
ALL_LISTS.dat
===================
ALL_LISTS_1004_UC001 Reserva CP 2424-07 Conta Pré-aberta - solução Employer e CP 2427-07 Conta Pré-aberta Losango.doc
ALL_LISTS_1005_UC004 Efetivação da Abertura CP 2424-07 Conta Pré-Aberta Solução Employer.doc
ALL_LISTS_1498_UC001 Reserva CP 2424-07 Conta Pré-aberta - solução Employer e CP 2427-07 Conta Pré-aberta Losango.doc
ALL_LISTS_1499_UC004 Efetivação da Abertura CP 2424-07 Conta Pré-Aberta Solução Employer.doc
ALL_LISTS_1500_UC001 Reserva CP 2424-07 Conta Pré-aberta - solução Employer e CP 2427-07 Conta Pré-aberta Losango.doc
ALL_LISTS_1930_UC001 Reserva CP 2424-07 Conta Pré-aberta - solução Employer e CP 2427-07 Conta Pré-aberta Losango.doc
ALL_LISTS_2072_UC01_Aplicação_CDB_DI - Supl DTS 10.03.2008
ALL_LISTS_2073_UC02_Resgates_CDB_DI_- Supl DTS10.03.2008
ALL_LISTS_2074_UC03_Agendamentos_CDB_DI_ - Supl DTS10.03.2008
ALL_LISTS_2091_UC001 Reserva CP 2424-07 Conta Pré-aberta - solução Employer e CP 2427-07 Conta Pré-aberta Losango.doc
ALL_LISTS_2105_UC001 Reserva CP 2424-07 Conta Pré-aberta - solução Employer e CP 2427-07 Conta Pré-aberta Losango.doc
ALL_LISTS_2293_UC001 Reserva CP 2424-07 Conta Pré-aberta - solução Employer e CP 2427-07 Conta Pré-aberta Losango.doc
ALL_LISTS_2307_UC001 Reserva CP 2424-07 Conta Pré-aberta - solução Employer e CP 2427-07 Conta Pré-aberta Losango.doc
ALL_LISTS_2596_UC01_nova politica da tsy cdb di.doc
ALL_LISTS_2922_Testes.xls
ALL_LISTS_3314_TAP_Cobrança de Encargos - Exclusão Cheques HSBC.doc
ALL_LISTS_3367_contrato 6090766690 teste liq. antec
ALL_LISTS_3488_UC001_Contencioso_ ECC_Rev.doc
ALL_LISTS_348_TAP-CP1089-Conta Losango-Tarifa e Capitalização_V1.0.doc
ALL_LISTS_348_TAP-CP1089-Conta Losango-Tarifa e Capitalização_V1.1.doc
ALL_LISTS_3641_fci0066.Termo de Adesão_INSS_17.02.2010.pdf
ALL_LISTS_3642_U104 - Demonstrar Taxa Plena na Contratação_INSS_HLS_V10.doc
ALL_LISTS_3642_U105 - Demonstrar Taxa Plena na Contratação_Extra Money_ HLS_V10.doc
ALL_LISTS_3642_U106 - Demonstrar Taxa Plena na Contratação_Consignação_HLS_V10.doc
ALL_LISTS_3642_U110 - Imprimir Termo Termo de Adesão em Proposta em Andamento_HLS_V12.doc
ALL_LISTS_3642_U111 - Imprimir Termo de Adesão em Impressão de Documentos_HLS_V12.doc
ALL_LISTS_3642_U112 - Consultar Taxa Plena do Contrato_HLS_V10.doc
ALL_LISTS_3643_U101 - Demonstrar Taxa Plena na Contratação Beneficiário INSS_CNB_V12.doc
ALL_LISTS_3643_U102 - Demonstrar Taxa Plena na Contratação_Funcionário Privado_CNB_V11.doc
ALL_LISTS_3643_U103 - Demonstrar Taxa Plena na Contratação_Funcionário Público_CNB_V11.doc
ALL_LISTS_3644_U101 - Demonstrar Taxa Plena na Contratação Beneficiário INSS_CNB_V12.doc
ALL_LISTS_3644_U102 - Demonstrar Taxa Plena na Contratação_Funcionário Privado_CNB_V11.doc
ALL_LISTS_3644_U103 - Demonstrar Taxa Plena na Contratação_Funcionário Público_CNB_V11.doc
ALL_LISTS_3647_UC001_ECC_Priorizar_crédito_Integrada_IR.doc
ALL_LISTS_404_RS_Do_CP_0446_07_Saq_Dep_Cta_Losango_Corresp_Bancário.doc
ALL_LISTS_404_TAP_CP_0446_07_Saq_Dep_Cta_Losango_Corresp_Bancário.doc
ALL_LISTS_404_UC_Do_CP_0446_07_Depósito.doc
ALL_LISTS_404_UC_Do_CP_0446_07_Saque.doc
ALL_LISTS_437_TAP_Encargos Excesso Limite.doc
ALL_LISTS_447_Requeriments Specification-V.1.0 - Extinção de Isenção Acumulada.doc
ALL_LISTS_447_Test Activity Plan - Extinção da Isenção Acumulada Inv.+ Neg..doc
ALL_LISTS_447_UC 001 - Pacote para mapear clientes contemplados com a isenção acum.doc
ALL_LISTS_447_UC 002 - Aplicação de melhor Percentual.doc
ALL_LISTS_480_Test Activity Plan - TAC Cheque Especial.doc
ALL_LISTS_704_TC007 - Crédito Parcelado .xls
ALL_LISTS_704_UC007 - Crédito Parcelado v1_4.doc
ALL_LISTS_712_RS_Do_CP_1434_06_Criar_Rel_ctas_Paralisadas.doc
ALL_LISTS_712_TAP_CP_1434_06_Criar_Rel_Ctas_Paralisadas.doc
ALL_LISTS_712_UC_Do_CP_1434_06_Criar_Rel_Ctas_Paralisadas.doc
ALL_LISTS_720_TAP - Infra-estrutura GRI - Cheques HSBC.doc
ALL_LISTS_723_TAP_Nova Ordem de Uso de Recursos - GRI.doc
ALL_LISTS_724_TAP_Nova Ordem de Uso de Recursos - GRI.doc
ALL_LISTS_727_TAP_Nova Ordem de Uso de Recursos - GRI.doc
ALL_LISTS_749_TAP_Campo de Desconto Para Cheque Especial.doc
ALL_LISTS_762_TAP_Cobrança de Encargos - Exclusão Cheques HSBC.doc
ALL_LISTS_772_RS_CP1227-2007_Repasse TAC HSPL.doc
ALL_LISTS_772_Test Activity Plan - UAT - Parte 1.doc
ALL_LISTS_772_Test Activity Plan - UAT - Parte 2.doc
ALL_LISTS_772_UC001 - Criar novo código de tarifa.doc
ALL_LISTS_781_Test Activity Plan - Alteração Movimentadores Conta Corrente.doc
ALL_LISTS_794_TAP_do_CP_0298_07_Posição Semestral de Estoque de Talonários.doc
ALL_LISTS_806_TAP_do_CP_1241_06_Retirar Trava Que Impede Encerramento de Poup Tarifada.doc
ALL_LISTS_810_TAP_do_CP_1640_07_Geração de Dados para Inclusão da Poupança na CNV.doc
ALL_LISTS_816_TAP_do_CP_0181_07_Alteração Tipo Poupança 14 para 39.doc
ALL_LISTS_928_TAP_do_CP_0298_07_Posição Semestral de Estoque de Talonários.doc
ALL_LISTS_978_TAP_Rentabilidade Fundos_No Meu_ HSBC.doc
ALL_LISTS_978_UC001_CP_0731_2007_Rentabilidade_Fundos_IB .doc
ALL_LISTS_998_UC001 Reserva CP 2424-07 Conta Pré-aberta - solução Employer e CP 2427-07 Conta Pré-aberta Losango.doc
===========================================================
here is my script
===========================================================
#!/bin/bash
rootdir=/root/alex/26775_MinorProjectPFS_2008/
scrdir=${rootdir}/scripts/
indir=${rootdir}copy_attach/
boutdir=${rootdir}captured/
goutdir=${rootdir}toretain/
#flist=${scrdir}filelist.dat
flist=${scrdir}ALL_LISTS.dat
# Move filenames in current directory containing bad characters to another directory.
#for filename in $flist
while read filename
do
badname=`echo "$filename" | egrep "[^a-zA-Z0-9_\t\!\~\.\/\-\_\s]"`
goodname=`echo "$filename" | egrep -v "[^a-zA-Z0-9_\/ \/\!\~\t\.\/\-\_\s]"`
cp -p ${indir}"${badname}" $boutdir
cp -p ${indir}"${goodname}" $goutdir
done<$flist
exit
 
Hi

You mean characters such as "é" and "çã" in "Conta Pré-aberta - solução Employer"?

It is Unicode. Why not just convert it to another encoding, so the names still read "Conta Pré-aberta - solução Employer"? Isn't that better than question marks?

You can do it, for example, with recode:
Code:
recode utf8.. < ALL_LISTS.dat
Note that recode is GNU software. On Unix systems you will probably have to use something else.

Feherke.
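
Where recode is not installed, iconv may be able to do a similar job. The following is only a sketch: the function names is_utf8 and from_utf8 are made up for illustration, and ISO-8859-1 is an assumed target encoding, since the thread does not say what the data really uses. The first function checks whether the file decodes cleanly as UTF-8 at all; the second converts it.

```shell
#!/bin/bash
# Sketch only: both function names are hypothetical.

# Succeeds only if every byte sequence in the file decodes as UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Convert a UTF-8 file to a single-byte encoding on standard output.
# ISO-8859-1 is an assumed default target; pass another name as $2.
from_utf8() {
    iconv -f UTF-8 -t "${2:-ISO-8859-1}" "$1"
}
```

It could be used as: is_utf8 ALL_LISTS.dat && from_utf8 ALL_LISTS.dat > ALL_LISTS-latin1.dat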
 
Hi Feherke,
I appreciate your suggestion. I found a utility called convmv, but I am not sure how to use it properly.
I tried the following command (based on the man page for convmv), but it did not update the file names.
From the directory containing the encoded file names, I issued:
convmv -f utf8 -t ascii . --notest
========================================================
convmv --help
convmv 1.09 - converts filenames from one encoding to another
Copyright (C) 2003-2005 Bjoern JACKE <bjoern@j3e.de>

This program comes with ABSOLUTELY NO WARRANTY; it may be copied or modified
under the terms of the GNU General Public License version 2 as published by
the Free Software Foundation.

USAGE: convmv [options] FILE(S)
-f enc encoding *from* which should be converted
-t enc encoding *to* which should be converted
-r recursively go through directories
-i interactive mode (ask for each action)
--nfc target files will be normalization form C for UTF-8 (Linux etc.)
--nfd target files will be normalization form D for UTF-8 (OS X etc.)
--qfrom be quiet about the "from" of a rename (if it screws up your terminal e.g.)
--qto be quiet about the "to" of a rename (if it screws up your terminal e.g.)
--exec c execute command instead of rename (use #1 and #2 and see man page)
--list list all available encodings
--lowmem keep memory footprint low (see man page)
--nosmart ignore if files already seem to be UTF-8 and convert if posible
--notest actually do rename the files
--replace will replace files if they are equal
--unescape convert%20ugly%20escape%20sequences
--upper turn to upper case
--lower turn to lower case
--help print this help
 
Hi

Generally the options should come first and the parameters after. No idea if this is the case for convmv too, but I would try:
Code:
convmv -f utf8 -t ascii --notest .
And is "utf8" listed like that in convmv --list? I would try "UTF-8" instead.

Feherke.
 
Thank you, Feherke. But it does not change the names with either the UTF-8 or the utf8 option.
Here is the entire list of supported encodings:
7bit-jis
AdobeStandardEncoding
AdobeSymbol
AdobeZdingbat
ascii
ascii-ctrl
big5-eten
big5-hkscs
cp1006
cp1026
cp1047
cp1250
cp1251
cp1252
cp1253
cp1254
cp1255
cp1256
cp1257
cp1258
cp37
cp424
cp437
cp500
cp737
cp775
cp850
cp852
cp855
cp856
cp857
cp860
cp861
cp862
cp863
cp864
cp865
cp866
cp869
cp874
cp875
cp932
cp936
cp949
cp950
dingbats
euc-cn
euc-jp
euc-kr
gb12345-raw
gb2312-raw
gsm0338
hp-roman8
hz
iso-2022-jp
iso-2022-jp-1
iso-2022-kr
iso-8859-1
iso-8859-10
iso-8859-11
iso-8859-13
iso-8859-14
iso-8859-15
iso-8859-16
iso-8859-2
iso-8859-3
iso-8859-4
iso-8859-5
iso-8859-6
iso-8859-7
iso-8859-8
iso-8859-9
iso-ir-165
jis0201-raw
jis0208-raw
jis0212-raw
johab
koi8-f
koi8-r
koi8-u
ksc5601-raw
MacArabic
MacCentralEurRoman
MacChineseSimp
MacChineseTrad
MacCroatian
MacCyrillic
MacDingbats
MacFarsi
MacGreek
MacHebrew
MacIcelandic
MacJapanese
MacKorean
MacRoman
MacRomanian
MacRumanian
MacSami
MacSymbol
MacThai
MacTurkish
MacUkrainian
MIME-B
MIME-Header
MIME-Header-ISO_2022_JP
MIME-Q
nextstep
null
posix-bc
shiftjis
symbol
UCS-2BE
UCS-2LE
UTF-16
UTF-16BE
UTF-16LE
UTF-32
UTF-32BE
UTF-32LE
UTF-7
utf-8-strict
utf8
viscii
=====================================
In any case, looking at my original script, what would I need to modify so that files containing dashes, tildes, and white space are not captured? Those are considered valid files. I only want to capture files whose names contain encoded characters (anything that is not alphabetic or numeric, ignoring spaces, "~", "-", and "_").
Thank you.
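
As far as I can tell, the original bracket expression misfires because a backslash is an ordinary character inside [...] for grep. So "\s" adds a backslash and the letter "s" rather than white space, and the dash in "\-\" is read as a range from "\" to "\", so neither the space nor the dash is ever allowed. A minimal sketch of a corrected per-record test follows; is_clean and bad_class are hypothetical names, and the allowed set is taken from the question.

```shell
#!/bin/bash
# Sketch only: is_clean and bad_class are hypothetical names.
# No backslashes inside the brackets: the tab comes from $'...' quoting,
# and the dash sits last so it cannot form a range.
bad_class=$'[^a-zA-Z0-9_\t !~./-]'

# Succeeds when the name contains only allowed characters.
is_clean() {
    # LC_ALL=C makes a-z mean the ASCII letters only, whatever the locale.
    ! printf '%s\n' "$1" | LC_ALL=C grep -a -q -E "$bad_class"
}
```

Inside the original loop this could replace the two egrep calls: if is_clean "$filename"; then copy to $goutdir; else copy to $boutdir; fi.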
 
Hi

I would change the approach, because running grep on the records one by one is slower than running it once over the whole file:
Code:
# $'...' turns \t into a real tab; inside [...] grep reads \t as a backslash and a t.
egrep -x $'[a-zA-Z0-9_\t!~./ -]+' ALL_LISTS.dat > ALL_LISTS-good.dat

egrep -xv $'[a-zA-Z0-9_\t!~./ -]+' ALL_LISTS.dat > ALL_LISTS-bad.dat
Then you have to replace those characters only in the "bad" file:
Code:
tr -c 'a-zA-Z0-9_\t!~./ \n-' '?' < ALL_LISTS-bad.dat > ALL_LISTS-bad-question.dat
Regarding your script, to copy and rename the files separately:
Bash:
# Names made up only of allowed characters are copied unchanged.
egrep -x $'[a-zA-Z0-9_\t!~./ -]+' ALL_LISTS.dat | \
while IFS= read -r goodname; do
  cp -p "${indir}${goodname}" "$goutdir"
done

# Other names are copied with every disallowed byte replaced by "?".
# \n is kept in the tr set so the newline added by <<< is not turned into "?".
egrep -xv $'[a-zA-Z0-9_\t!~./ -]+' ALL_LISTS.dat | \
while IFS= read -r badname; do
  cp -p "${indir}${badname}" "$boutdir/$( tr -c 'a-zA-Z0-9_\t!~./ \n-' '?' <<< "$badname" )"
done

Feherke.
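
One thing to be aware of with the tr step: tr works on bytes, not characters, so a character stored as two bytes (as "é" is in UTF-8) comes out as two question marks. A small sketch that makes this visible; mask_name is a hypothetical helper name, and the allowed set is the one used in this thread.

```shell
#!/bin/bash
# Sketch only: mask_name is a hypothetical helper name.
# tr takes a bare character list (no surrounding [...]); \t and \n are
# interpreted by tr itself, and the dash is last so it stays literal.
mask_name() {
    printf '%s\n' "$1" | LC_ALL=C tr -c 'a-zA-Z0-9_\t !~./\n-' '?'
}
```

With a UTF-8 name, mask_name "Pré-aberta" gives Pr??-aberta. If one question mark per character is wanted, the data would need to be converted to a single-byte encoding first.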
 