
How To Get Unique Values Set From A Repeating Values List

I need to parse a large log file (flat file) which contains two columns of values (column A, column B). Values in both columns repeat. For each unique value in column A, I need to find the set of unique values from column B that appear alongside it.
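For illustration, here is a minimal sketch (not from any of the answers below) of the desired grouping, assuming whitespace-separated columns and case-insensitive column A keys, using Python's collections.defaultdict:

from collections import defaultdict

sample = """xxxA 2
xxxA 1
xxxB 2
XXXC 3
XXXA 3
xxxD 4"""

groups = defaultdict(set)      # column A value -> set of unique column B values
for line in sample.splitlines():
    a, b = line.split()
    groups[a.lower()].add(b)   # lower-case the key; the set drops duplicates

for a in sorted(groups):
    print("%s - %s" % (a, ",".join(sorted(groups[a]))))

With a real (large) log file you would iterate over the open file handle line by line instead of loading the whole file into memory.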

Solution 1:

Perl 'one-liner', indented/expanded out so that everything fits in the window:

$ perl -lane '
      # count each column-B value seen for a given column-A value
      $hash{ $F[0] }{ $F[1] }++;
      END {
          # after the last input line, print each column-A key with its unique column-B values
          for my $columnA ( keys %hash ) {
              print $columnA, " - ", join( ",", keys %{ $hash{$columnA} } );
          }
      }
  '

Explanation will follow if I see a concerted attempt on the part of the original poster.

Solution 2:

I would use a Python dictionary where the dictionary keys are the column A values and the dictionary values are Python's built-in set type holding the column B values.

def parse_the_file():
    lower = str.lower
    split = str.split
    with open('f.txt') as f:
        d = {}
        lines = f.read().split('\n')
        for A, B in [split(l) for l in lines if l]:
            try:
                d[lower(A)].add(B)
            except KeyError:
                d[lower(A)] = set([B])   # set([B]) stores the whole string; set(B) would split it into characters

        for a in d:
            print "%s - %s" % (a, ",".join(list(d[a])))

if __name__ == "__main__":
    parse_the_file()

The advantage of using a dictionary is that you'll have a single dictionary key per column A value. The advantage of using a set is that you'll have a unique set of column B values.

Efficiency notes:

  • Using try/except is more efficient than an if/else check for the first occurrence of a key, since the exception is only raised the first time a key appears (see the micro-benchmark sketch after the example output below).
  • Binding the str functions to local names outside the loop is more efficient than looking them up inside the loop.
  • Depending on the proportion of new A values vs. repeated A values in the file, you may want to assign a = lower(A) once before the try/except statement instead of calling lower(A) twice.
  • I used a function because accessing local variables is more efficient in Python than accessing global variables.
  • Some of these performance tips are from here.

Testing the code above on your input example yields:

xxxd - 4
xxxa - 1,3,2
xxxb - 2
xxxc - 3
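To back up the first efficiency note, here is a small, hypothetical micro-benchmark sketch (not part of the original answer) comparing try/except against an if/else membership check for this access pattern; absolute numbers depend on your machine and on how often keys repeat:

import timeit

setup = "keys = ['xxxa', 'xxxb', 'xxxc', 'xxxd'] * 1000"

stmt_try = """
d = {}
for k in keys:
    try:
        d[k].add(1)
    except KeyError:
        d[k] = set([1])
"""

stmt_if = """
d = {}
for k in keys:
    if k in d:
        d[k].add(1)
    else:
        d[k] = set([1])
"""

# with mostly repeated keys, the try/except version skips the extra membership test
print(timeit.timeit(stmt_try, setup, number=1000))
print(timeit.timeit(stmt_if, setup, number=1000))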

Solution 3:

You can use this simple multimap:

class MultiMap(object):
    def __init__(self):
        self.values = {}   # per-instance dict, so instances do not share state

    def __getitem__(self, index):
        return self.values[index]

    def __setitem__(self, index, value):
        if index not in self.values:
            self.values[index] = []
        self.values[index].append(value)

    def __repr__(self):
        return repr(self.values)

See it in action: http://codepad.org/xOOrlbnf
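As a quick usage sketch (assuming the same sample data used in the other answers, and lower-casing column A so the keys group together):

data = """xxxA 2
xxxA 1
xxxB 2
XXXC 3
XXXA 3
xxxD 4"""

m = MultiMap()
for line in data.splitlines():
    a, b = line.split()
    m[a.lower()] = b   # __setitem__ appends, so repeated keys accumulate values

print(m)               # e.g. {'xxxa': ['2', '1', '3'], 'xxxb': ['2'], 'xxxc': ['3'], 'xxxd': ['4']}

Note that the values are kept in a list, so duplicated column B entries are not removed; for a truly unique set you would store a set instead of a list and use add instead of append.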

Solution 4:

Simple Perl version:

#!/usr/bin/perl
use strict;
use warnings;

my (%v, @row);

foreach (<DATA>) {
        chomp;
        $_ = lc($_);
        @row = split(/\s+/, $_);
        push( @{ $v{$row[0]} }, $row[1]);
} 

foreach (sort keys %v) {
        print "$_ - ", join( ", ", @{ $v{$_} } ), "\n";
}

__DATA__
xxxA 2
xxxA 1
xxxB 2
XXXC 3
XXXA 3
xxxD 4

I did not focus on variable names. From the example I see the values are not case sensitive.

Solution 5:


f = """xxxA 2
xxxA 1
xxxB 2
XXXC 3
XXXA 3
xxxD 4"""


d = {}

for line in f.split("\n"):
    key, val = line.lower().split()
    try:
        d[key].append(val)        
    except KeyError:
        d[key] = [val]


print d
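This collects the column B values into lists, so duplicates are kept and the result is printed as a raw dict. If you want the unique, formatted output the question asks for, a small follow-up sketch (not part of the original answer) could be:

for key in sorted(d):
    print("%s - %s" % (key, ",".join(sorted(set(d[key])))))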

