Unix Snippets for Handling Duplicate Keys in Delimited Input Files

Recently I had a problem where input data files for an interface contained duplicate keys that caused the interface to fail. Sometimes the complete lines were duplicated, but in other cases only the key fields were duplicated with differing non-key attribute values. I wanted, firstly, to report on the problem lines, and, secondly, to clean up the files as far as possible. For lines that are fully duplicated, it's easy enough to deduplicate, and the following command does that, producing a sorted file with unique lines:

sort -u input.dat > output.dat

The unix uniq command leaves the lines in a file in the original order, but only deduplicates adjacent sets of lines.
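The adjacency point is easy to demonstrate with a throwaway example (file name illustrative):

```shell
# uniq removes only adjacent duplicates; sort -u removes them all,
# at the cost of reordering the lines
printf 'a\nb\na\n' > /tmp/uniq_demo.dat

uniq /tmp/uniq_demo.dat     # prints a, b, a - the duplicate lines are not adjacent
sort -u /tmp/uniq_demo.dat  # prints a, b
```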

It's more difficult when keys are duplicated but non-key fields differ and we do not know which line to retain. In this case, we wanted to just delete those lines and report them to the source system for correction.

I wrote some snippets of unix code to report on duplicate-key lines and clean up the files, using unix utilities such as sort, awk, grep. This article lists the code snippets with outputs for a test data file.

Update, 22 December 2014: I added a section 'Using awk Hash Variables for Group Counting'.

Test Scenarios
The first test data file has two leading key fields, followed by four attribute fields, with pipe delimiters. There is one pair of complete duplicates, a set of three lines having the same key but varying attributes, and two unique lines. Because we will be using grep for pattern matching, an obvious potential bug would be to match strings that are not in the first two positions, so I have included one of the duplicate key pairs as fields 4 and 5 to trap that if it occurs.

test_dups_in_1_2.dat with the 2 leading fields key

k91|k92|a1|k41|k42|a
k11|k12|a1|k41|k42|b
k41|k42|a4x|k41|k42|c
k21|k22|a2|k41|k42|d
k41|k42|a4|k41|k42|c
k41|k42|a4|k41|k42|c
k11|k12|a1|k41|k42|b

Here are the functions applied to the initial test set:

  • List the distinct keys that have duplicates
  • List all lines for duplicate keys
  • Strip all lines with duplicate keys from the file, returning unsorted lines
  • Strip the inconsistent duplicates from the file, returning unique sorted lines

The solutions rely on sorting the lines, taking advantage of the fact that the key fields are leading. If the key fields are not leading, a little more work is required, and I repeated the fourth function for this case, after swapping fields 2 and 3, so that the key fields are 1 and 3.

  • Strip the inconsistent duplicates from the file, returning unique sorted lines - key fields not leading

List the distinct keys that have duplicates
Unix Code

sort $INFILE1_2 | awk -F"|" '
$1$2 == last_key && !same_key {print $1"|"$2; same_key=1}
$1$2 != last_key {same_key=0; last_key=$1$2}'

Output

k11|k12
k41|k42

List all lines for duplicate keys
Unix Code

sort $INFILE1_2 | awk -F"|" ' 
$1$2 == last_key && !same_key {print last_line; same_key=1} 
$1$2 == last_key {print $0; same_key=1} 
$1$2 != last_key {same_key=0; last_key=$1$2; last_line=$0}'

Output

k11|k12|a1|k41|k42|b
k11|k12|a1|k41|k42|b
k41|k42|a4|k41|k42|c
k41|k42|a4|k41|k42|c
k41|k42|a4x|k41|k42|c

I think this should work for any number of key fields as long as they are all at the start, if you replace $1$2 by the relevant field positions. One caveat: concatenating fields without a separator could in principle conflate keys such as ab|c and a|bc, so comparing with a separator included, e.g. $1"|"$2, is safer. If the key fields are not all at the start then you have to pre/post-process as I show later.
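For instance, a hypothetical three-field-key variant of the first snippet (sample data made up for illustration) just extends the comparisons and the printed key:

```shell
# Same duplicate-key detection, but with fields 1-3 as the key
printf 'k1|k2|k3|x\nk1|k2|k3|y\nk9|k8|k7|z\n' | sort | awk -F"|" '
$1$2$3 == last_key && !same_key {print $1"|"$2"|"$3; same_key=1}
$1$2$3 != last_key {same_key=0; last_key=$1$2$3}'
# prints: k1|k2|k3
```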

Strip all lines with duplicate keys from the file, returning unsorted lines
This code creates a copy of the input file with the lines for the duplicate keys stripped out, using the previous awk code to generate a list that we then loop over doing a "grep -v" on the copied file.

Unix Code

dup_list=`sort $INFILE1_2 | awk -F"|" '
$1$2 == last_key && !same_key {print $1"|"$2; same_key=1}
$1$2 != last_key {same_key=0; last_key=$1$2}'`
cp $INFILE1_2 $OUTFILE1_2; for i in $dup_list; do grep -v "^$i" $OUTFILE1_2 > $WRKFILE1; mv $WRKFILE1 $OUTFILE1_2; done
cat $OUTFILE1_2

Output

k91|k92|a1|k41|k42|a
k21|k22|a2|k41|k42|d

Strip the inconsistent duplicates from the file, returning unique sorted lines
Here is a variation on the above, where we keep one line for each key whose duplicate lines are identical in all attributes, but drop the duplicates where the attributes vary and so cannot be cleaned. We do this by applying the "-u" flag to the initial sort, and using that output also as the basis for the stripping step.

Unix Code

dup_list=`sort -u $INFILE1_2 | awk -F"|" '
$1$2 == last_key && !same_key {print $1"|"$2; same_key=1}
$1$2 != last_key {same_key=0; last_key=$1$2}'`
sort -u $INFILE1_2 > $OUTFILE1_2; for i in $dup_list; do grep -v "^$i" $OUTFILE1_2 > $WRKFILE1; mv $WRKFILE1 $OUTFILE1_2; done
cat $OUTFILE1_2

Output

k11|k12|a1|k41|k42|b
k21|k22|a2|k41|k42|d
k91|k92|a1|k41|k42|a

Test Scenario with Non-leading Key Fields
A simple awk command creates a copy of the original test data file, but with fields 2 and 3 swapped.

Unix Code

awk -F"|" '{print $1"|"$3"|"$2"|"$4"|"$5"|"$6}' $INFILE1_2 > $INFILE1_3

test_dups_in_1_3.dat with fields 1 and 3 key

k91|a1|k92|k41|k42|a
k11|a1|k12|k41|k42|b
k41|a4x|k42|k41|k42|c
k21|a2|k22|k41|k42|d
k41|a4|k42|k41|k42|c
k41|a4|k42|k41|k42|c
k11|a1|k12|k41|k42|b

Strip the inconsistent duplicates from the file, returning unique sorted lines - key fields not leading
This variation caters for non-leading key fields by first swapping the fields so that the key fields now lead, applying similar processing to the previous case, then swapping the fields back.

Unix Code

awk -F"|" ' {print $1"|"$3"|"$2"|"$4"|"$5"|"$6}' $INFILE1_3 | sort -u > $WRKFILE1
dup_list=`awk -F"|" '
$1$2 == last_key && !same_key {print $1"|"$2; same_key=1}
$1$2 != last_key {same_key=0; last_key=$1$2}' $WRKFILE1`
for i in $dup_list; do grep -v "^$i" $WRKFILE1 > $WRKFILE2; mv $WRKFILE2 $WRKFILE1; done
awk -F"|" '  {print $1"|"$3"|"$2"|"$4"|"$5"|"$6}' $WRKFILE1 > $OUTFILE1_3

Output

k11|a1|k12|k41|k42|b
k21|a2|k22|k41|k42|d
k91|a1|k92|k41|k42|a

Test script and output file attached.

Unix Snippets for Duplicate Keys

Using awk Hash Variables for Group Counting
Here is a script that counts the number of instances of each key, and also the number of distinct lines within each key. Two hash arrays are used to do the counting within awk; in the expected use case the input file may be quite large but the number of duplicated keys, which is the array cardinality, should be small. The script goes on to reconcile the number of lines in the cleaned output file with the input file and the counts in the duplicates file.

I added some extra lines to the input file and pass the file name as a parameter to the script.

if [ $# != 1 ] ; then
	echo "Usage: $0 [Data file]"
	exit 1
fi
echo Running $0 $1...
INFILE=$1
OUTFILE_DPK=$INFILE.dpk # duplicate keys with: number of lines; number of distinct lines
OUTFILE_CLN=$INFILE.cln # cleaned file - inconsistent duplicate keys removed, and rest unique

WRKFILE=/tmp/temp1.txt

# List the distinct keys that have duplicates with: number of lines; number of distinct lines

sort $INFILE | awk -F"|" '
$1$2 == last_key {num[$1"|"$2]++; if (last_line != $0) diff[$1"|"$2]++}
$1$2 != last_key {last_key=$1$2}
{last_line=$0}
# the arrays count only lines after the first in each group, so pre-increment to include it
END {for (key in num) print key, ++num[key], ++diff[key]}' > $OUTFILE_DPK

# Strip the inconsistent duplicates from the file, returning sorted lines

sort -u $INFILE > $OUTFILE_CLN
dup_list=$(awk -F"|" '
$1$2 == last_key && !same_key {print $1"|"$2; same_key=1} 
$1$2 != last_key {same_key=0; last_key=$1$2}' $OUTFILE_CLN)
for i in $dup_list; do grep -v "^$i" $OUTFILE_CLN > $WRKFILE; mv $WRKFILE $OUTFILE_CLN; done

function line_print {
	echo '***' # quotes needed, else * is glob-expanded
	echo $1
}
function line_count {
	line_print "$1" # quotes needed, else only the first word is passed
	wc -l $2
	retval=$(wc -l $2|cut -d" " -f1)
}

line_print 'Line counts...'

line_count "Lines in input data file..." $INFILE; n_inp=$retval
line_count "Lines in duplicate keys file..." $OUTFILE_DPK; n_dup=$retval
line_count "Lines in cleaned data file (unique records, and inconsistent duplicates removed)..." $OUTFILE_CLN; n_cln=$retval

# awk sums the second column of duplicate keys file to get total lines associated
n_dup_lines=$(awk '{sum += $2} END {print sum}' $OUTFILE_DPK)

# grep counts the number of keys that have 1 as last character preceded by space, being the number of keys with only 1 distinct line
n_dup_keys_cons=$(grep -c " 1$" $OUTFILE_DPK)

line_print 'Input file...'
cat $INFILE

line_print 'Duplicated keys with number of lines and number of distinct lines within key...'
cat $OUTFILE_DPK

line_print 'Cleaned file...'
cat $OUTFILE_CLN

line_print 'Reconciliation...'

echo Number of lines in cleaned file, which is $n_cln, should equal the following computed value...
echo "(Input lines - duplicate key lines + consistent duplicate keys) = " $n_inp - $n_dup_lines + $n_dup_keys_cons = $((n_inp-n_dup_lines+n_dup_keys_cons))

Output from the script

Running ./Dup_Keys.ksh test_dups_in_1_2.dat...
***
Line counts...
***
Lines in input data file...
15 test_dups_in_1_2.dat
***
Lines in duplicate keys file...
3 test_dups_in_1_2.dat.dpk
***
Lines in cleaned data file (unique records, and inconsistent duplicates removed)...
4 test_dups_in_1_2.dat.cln
***
Input file...
k91|k92|a1|k41|k42|a
k11|k12|a1|k41|k42|b
k41|k42|a4x|k41|k42|c
k21|k22|a2|k41|k42|d
k41|k42|a4|k41|k42|c
k41|k42|a4|k41|k42|c
k41|k42|a4|k41|k42|c
k41|k42|a4|k41|k42|c
k41|k42|a4|k41|k42|c
k41|k42|a4|k41|k42|z
k41|k42|a4|k41|k42|c
k51|k52|a4|k41|k42|e
k51|k52|a4|k41|k42|e
k51|k52|a4|k41|k42|e
k11|k12|a1|k41|k42|b
***
Duplicated keys with number of lines and number of distinct lines within key...
k11|k12 2 1
k51|k52 3 1
k41|k42 8 3
***
Cleaned file...
k11|k12|a1|k41|k42|b
k21|k22|a2|k41|k42|d
k51|k52|a4|k41|k42|e
k91|k92|a1|k41|k42|a
***
Reconciliation...
Number of lines in cleaned file, which is 4, should equal the following computed value...
(Input lines - duplicate key lines + consistent duplicate keys) =  15 - 13 + 2 = 4






A Generic Unix Script for Uploading eBusiness Concurrent Programs

I have posted a couple of articles recently on XML Publisher report development within Oracle eBusiness applications (A Design Pattern for Oracle eBusiness Audit Trail Reports with XML Publisher and Design Patterns for Database Reports with XML Publisher and Email Bursting). These reports are examples of one of several types of batch program, or concurrent program, that can be defined within Oracle eBusiness.

Oracle eBusiness uses a number of metadata tables to store information on the programs, and in release 11.5 Oracle introduced a Unix utility called FNDLOAD to download the metadata to formatted text files to allow them to be uploaded via the same utility into downstream environments. At that time batch reports were generally developed in the Oracle Reports tool and typically there might only be two text files (known as LDT files after their extension), for the program and for associated value sets, and maybe one for request groups (which control access to the reports). The executable report file, the RDF, would just be copied to the target environment server directory. I wrote a set of wrapper scripts for the utility to streamline its use and to deal with a number of issues, including handling of audit fields, with a structure consisting of a separate pair of scripts for download and upload of each type of LDT file. I published these on Scribd in July 2009.

In working more recently with Oracle's successor reporting tool, XML Publisher, I found that the number of objects involved in installation has increased substantially. As well as the LDT files there are also now XML configuration files and RTF (usually) templates, uploaded via a Java utility, XDOLoader. Installation also involves at least one PL/SQL package. For example the email version of my model reports (see the second link above) had 11 configuration files. For this reason, I decided to create a single script that would copy, upload and install all objects required for a concurrent program of any type, and I describe the script in this article.

Here is my original Scribd article, describing the issues mentioned, and with the original individual upload and download scripts.


The new script is here XX_Install_XMLCP_ksh, and here is the MD120 from the model article that uses it.


Directory Structure

The script operates on an input TAR file containing all the necessary files. The TAR file is called prog.tar after the program short name prog, and contains a directory also called prog with all the installation files. The installer uses relative paths and assumes that the TAR file, with the installer, is placed in a directory below the custom top directory, say $XX_TOP/rel. All loader files are copied to $XX_TOP/import and SQL files to $XX_TOP/install/sql.
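A minimal sketch of that layout, with "prog" standing in for a real program short name and /tmp standing in for $XX_TOP/rel (file names here are illustrative only):

```shell
cd /tmp && rm -rf prog prog.tar

# Build a stand-in prog.tar containing the expected prog/ directory
mkdir -p prog
touch prog/prog.ldt prog/prog.rtf prog/prog.sql   # illustrative contents
tar -cf prog.tar prog && rm -rf prog

# The installer's first step is effectively this extraction
tar -xf prog.tar
ls prog   # the installation files reappear under prog/
```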

After installation, a new directory will remain with all the installation files for reference, $XX_TOP/rel/prog.

Program Structure

The internal call structure of the Unix script is shown below.
Install CP - CSD


After extracting the TAR file, the script has local subroutines that can be divided into four categories, as above:

  1. Preliminaries - parameter processing, file moving and validation
  2. FNDLOAD uploads - uploading the LDT files
  3. XDOLoader uploads - uploading the data and any bursting XML files, and all layout templates present
  4. SQL installations - installing any SQL script present, and the package spec and body files
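The flow can be sketched with stub subroutines (the names below are illustrative, not the script's actual subroutine names):

```shell
# Illustrative skeleton only: each stub stands in for a group of subroutines
preliminaries()     { echo "1. parameters, file moving, validation"; }
fndload_uploads()   { echo "2. LDT files via FNDLOAD"; }
xdoloader_uploads() { echo "3. XML data/bursting files and templates via XDOLoader"; }
sql_installs()      { echo "4. SQL scripts, package spec and body"; }

preliminaries && fndload_uploads && xdoloader_uploads && sql_installs
```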

The following table gives a few notes on the main program and preliminary subroutines.
Preliminaries Summary
Prelims Table
The following table gives a few notes on the upload and SQL subroutines.
Upload and SQL Summary
Install Uploads Table
The upload subroutines all have SQL queries for verification, and here is a sample log file from the script: XX_ERPXMLCP_EM_log

The remainder of the article lists the queries with diagrams and examples of output.

Upload Subroutines

upload_ag

Validation Query

SELECT app_g.application_short_name "App G", fag.group_name "Group",
      fag.description "Description",
      app_t.application_short_name "App T", ftb.table_name "Table",
      fcl.column_name "Column"
  FROM fnd_audit_groups fag
  JOIN fnd_application app_g
    ON app_g.application_id = fag.application_id
  JOIN fnd_audit_tables fat
    ON fat.audit_group_app_id = fag.application_id
   AND fat.audit_group_id = fag.audit_group_id
  JOIN fnd_application app_t
    ON app_t.application_id = fat.table_app_id
  JOIN fnd_tables ftb
    ON ftb.application_id = fat.table_app_id
   AND ftb.table_id = fat.table_id
  JOIN fnd_audit_columns fac
    ON fac.table_app_id = fat.table_app_id
   AND fac.table_id = fat.table_id
  JOIN fnd_columns fcl
    ON fcl.application_id = fac.table_app_id
   AND fcl.table_id = fac.table_id
   AND fcl.column_id = fac.column_id
 WHERE fag.last_update_date     = To_Date ('$sysdate', 'YYYY/MM/DD')
   AND fac.schema_id            = 900
ORDER BY app_g.application_short_name, fag.group_name, 
         app_t.application_short_name, ftb.table_name,
         fcl.column_name;

QSD

Install CP - AG - 1

Example Output
No example available here, but the headings are:
"App G", "Group", "Description", "App T", "Table", "Column"

upload_ms

Validation Query

SELECT mes.message_name "Name", mes.message_text "Text"
  FROM fnd_new_messages mes
 WHERE mes.last_update_date     = To_Date ('$sysdate', 'YYYY/MM/DD')
 ORDER BY 1;

QSD

Install CP - Message

Example Output

Name                     Text
--------------------- ---------------------------------------------
XX_ERPXMLCP_EM_NONXML Non-XML Concurrent Program &PROGAPP

upload_vs

Validation Query

SELECT fvs.flex_value_set_name "Value Set", Count(fvl.flex_value_set_id) "Values"
  FROM fnd_flex_value_sets fvs, fnd_flex_values fvl
 WHERE fvs.last_update_date     = To_Date ('$sysdate', 'YYYY/MM/DD')
   AND fvl.flex_value_set_id(+) = fvs.flex_value_set_id
 GROUP BY fvs.flex_value_set_name;

QSD

Install CP - VS

Example Output

Value Set             Values
--------------------- ------
XX_PROGS                   0
XX_APPNAME_ID              0

upload_cp

Validation Query

SELECT prg.user_concurrent_program_name || ': ' || prg.concurrent_program_name "Program",
       fcu.column_seq_num || ': ' || fcu.end_user_column_name "Parameter"
  FROM fnd_concurrent_programs_vl               prg
  LEFT JOIN fnd_descr_flex_column_usages      fcu
    ON fcu.descriptive_flexfield_name         = '\$SRS\$.' || prg.concurrent_program_name
   AND fcu.descriptive_flex_context_code      = 'Global Data Elements'
 WHERE prg.concurrent_program_name              = '$cp_name'
 ORDER BY 1, 2;

QSD

Install CP - CP

Example Output

Program                                    Parameter
------------------------------------------ ------------------------
XX Example XML CP (Email): XX_ERPXMLCP_EM  100: Cc Email
                                           10: Application
                                           20: From Program
                                           30: To Program
                                           40: From Date
                                           50: To Date
                                           60: From Parameter Count
                                           70: To Parameter Count
                                           80: Override Email
                                           90: From Email

upload_rga

Validation Query

SELECT rgp.request_group_name "Request Group",
       app.application_short_name "App"
  FROM fnd_concurrent_programs          cpr
  JOIN fnd_request_group_units          rgu
    ON rgu.unit_application_id          = cpr.application_id
   AND rgu.request_unit_id              = cpr.concurrent_program_id
  JOIN fnd_request_groups               rgp
    ON rgp.application_id               = rgu.application_id
   AND rgp.request_group_id             = rgu.request_group_id
  JOIN fnd_application                  app
    ON app.application_id               = rgp.application_id
 WHERE cpr.concurrent_program_name      = '$cp_name'
 ORDER BY 1;

QSD

Install CP - RGA

Example Output

Request Group                  App
------------------------------ ----------
System Administrator Reports   FND

upload_dd

Validation Query

SELECT xdd.data_source_code "Code", xtm.default_language "Lang", xtm.default_territory "Terr"
  FROM xdo_ds_definitions_b xdd
  LEFT JOIN xdo_templates_b xtm
    ON xtm.application_short_name       = xdd.application_short_name
   AND xtm.data_source_code             = xdd.data_source_code
 WHERE xdd.data_source_code             = '$cp_name'
 ORDER BY 1, 2, 3;

QSD

Install CP - DD

Example Output

Code                 Lang Terr
-------------------- ---- ----
XX_ERPXMLCP_EM       en   US

upload_all_temps

Validation Query

SELECT xdd.data_source_code "Code", 
        xlb_d.file_name "Data Template",
        xlb_b.file_name "Bursting File",
        xtm.template_code "Template",
        xlb.language "Lang", 
        xlb.territory "Terr",
        xlb.file_name || 
               CASE
               WHEN xlb.language = xtm.default_language AND
                    xlb.territory = xtm.default_territory
               THEN '*' END "File"
  FROM xdo_ds_definitions_b             xdd
  LEFT JOIN xdo_lobs                    xlb_d
    ON xlb_d.application_short_name     = xdd.application_short_name
   AND xlb_d.lob_code                   = xdd.data_source_code
   AND xlb_d.lob_type                   = 'DATA_TEMPLATE'
  LEFT JOIN xdo_lobs                    xlb_b
    ON xlb_b.application_short_name     = xdd.application_short_name
   AND xlb_b.lob_code                   = xdd.data_source_code
   AND xlb_b.lob_type                   = 'BURSTING_FILE'
  LEFT JOIN xdo_templates_b             xtm
    ON xtm.application_short_name       = xdd.application_short_name
   AND xtm.data_source_code             = xdd.data_source_code
  LEFT JOIN xdo_lobs                    xlb
    ON xlb.application_short_name       = xtm.application_short_name
   AND xlb.lob_code                     = xtm.template_code
   AND xlb.lob_type                     LIKE 'TEMPLATE%'
 WHERE xdd.data_source_code             = '$cp_name'
   AND xdd.application_short_name       = '$app'
 ORDER BY 1, 2, 3, 4;

QSD

Install CP - Template

Example Output

Code            Data Template        Bursting File          Template             Lang Terr File
--------------- -------------------- ---------------------- -------------------- ---- ---- -------------------------
XX_ERPXMLCP_EM  XX_ERPXMLCP_EM.xml   XX_ERPXMLCP_EM_BUR.xml XX_ERPXMLCP_EM       en   US   XX_ERPXMLCP_EM.xsl*
                                                                                           XX_ERPXMLCP_EM.rtf*
                                                            XX_ERPXMLCP_EM_XML   en   US   XX_ERPXMLCP_EM_XML.xsl*
                                                                                           XX_ERPXMLCP_EM_XML.rtf*