Best way to compare arrays?

TheBuzzer

I am trying to figure out a fast way to compare multiple arrays that will give me back what each array does not have.


So if the first array is {2,3,5}
Second array is {1,3,5}
Third array is {2,5}

First array will give back 1
Second array will give back 2
Third array will give back 1,3


Is this even possible in a fast way? Or do I need to just compare each array to every other array?
 
Depending on your language, you can use a predefined library or just walk through each element with loops. I'm sure there's a faster way, but I'm no programming whiz.

I'm gonna take a leap and guess you're talking PHP, in which case array_diff() is your friend.
 
One approach would be to create an array of all unique values across each of the individual arrays. Then check each individual array against the master lookup array to see what value(s) were not found.
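A rough sketch of that approach (in Java, just as an example; the class and variable names are purely illustrative):

import java.util.ArrayList;
import java.util.List;

public class MasterLookup {
    public static void main(String[] args) {
        int[][] arrays = { {2, 3, 5}, {1, 3, 5}, {2, 5} };

        // master list of every distinct value that appears in any array
        List<Integer> master = new ArrayList<>();
        for (int[] arr : arrays)
            for (int v : arr)
                if (!master.contains(v))
                    master.add(v);

        // for each array, report the master values it does not contain
        for (int[] arr : arrays) {
            List<Integer> missing = new ArrayList<>();
            for (int v : master) {
                boolean found = false;
                for (int x : arr)
                    if (x == v) { found = true; break; }
                if (!found)
                    missing.add(v);
            }
            System.out.println(missing);   // [1], then [2], then [3, 1]
        }
    }
}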
 
Most languages should have built-in set operations. Take the union of all the input sets, then the difference between that and the array you're comparing.

If you have to do it by hand the most efficient simple way is probably to sort both arrays (or just keep them sorted in the first place), then walk the set of all values alongside the input array; any missing values get added to the output set. If the arrays are small it might be faster to just do a linear search for all values.
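A minimal sketch of that sorted walk (in Java, just for illustration), assuming the set of all values and the input are sorted int arrays and the names are hypothetical:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SortedWalk {
    // values in allValues that are missing from input; both arrays sorted ascending
    static List<Integer> missing(int[] allValues, int[] input) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < allValues.length) {
            if (j < input.length && input[j] == allValues[i]) {
                i++; j++;                  // present in both, move on
            } else if (j < input.length && input[j] < allValues[i]) {
                j++;                       // input value not in allValues, skip it
            } else {
                out.add(allValues[i++]);   // value missing from the input array
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] all = {1, 2, 3, 5};          // union of every array, kept sorted
        int[] third = {2, 5};
        Arrays.sort(third);                // or just keep it sorted in the first place
        System.out.println(missing(all, third));   // [1, 3]
    }
}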
 
Well, I think I would just have to loop over everything in the thing I am doing.

I can't just do a normal array compare because it isn't really just an array but a group of objects.
 
What lang?

If you're comparing an array of objects, can you convert the objects to a string? or to integers? If that answer is no, then do you know what all the possible objects are? You could use a loop with booleans in that case.
 
I am using Java. The arrays are specialized arrays that contain strings I want to compare.
 
Then you would probably want to create a nested for loop:

for (int i = 0; i < array1.length; i++) {
    for (int n = 0; n < array2.length; n++) {
        // compare each element of array2 to array1[i]
    }
}

When all the elements in array2 have been compared to array1[i], i is incremented to the next element and all the elements in array2 are compared to that one. A simple if statement could be used to keep track of whether or not there is a match:

if (array1[i].equals(array2[n])) { arrayMatch[x] = array2[n]; x++; }

Show us what you have so far so we can point you in the proper direction. I don't want to give too much away, but still give you proper tips.
 
If you use a type that implements the Set interface (such as HashSet), or copy your array into one, these operations are trivial and implemented for you. You just need to make sure that Object.equals of your objects does what it should. Then use Set.removeAll(Collection) to implement the logic you're describing.
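For example, a minimal sketch assuming the elements are Strings (the class and variable names here are just illustrative):

import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class SetDifference {
    public static void main(String[] args) {
        String[][] arrays = { {"2", "3", "5"}, {"1", "3", "5"}, {"2", "5"} };

        // union of everything that appears in any array
        Set<String> union = new LinkedHashSet<>();
        for (String[] arr : arrays)
            union.addAll(Arrays.asList(arr));

        // what each array "does not have": the union minus its own elements
        for (String[] arr : arrays) {
            Set<String> missing = new LinkedHashSet<>(union);
            missing.removeAll(Arrays.asList(arr));
            System.out.println(missing);   // [1], then [2], then [3, 1]
        }
    }
}

For your own object types this only works if equals() (and hashCode(), for hash-based sets) are overridden consistently.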
 
I am trying to figure out a fast way to compare multiple arrays that will give me back what each array does not have.


So if the first array is {2,3,5}
Second array is {1,3,5}
Third array is {2,5}

First array will give back 1
Second array will give back 2
Third array will give back 1,3
I can't figure out what you mean by "does not have". None of the arrays include the number 4, so why is it not listed in what you "give back"? None of them have zero, or six, either. Your subject line says that you want to compare arrays; a comparison requires at least two things to compare. But you're only comparing one array, somehow.
 
I can't figure out what you mean by "does not have". None of the arrays include the number 4, so why is it not listed in what you "give back"? None of them have zero, or six, either. Your subject line says that you want to compare arrays; a comparison requires at least two things to compare. But you're only comparing one array, somehow.

I think what he wants is the difference between the union and one of the inputs. So if he has {2,3,4} and {5,6}, it's the difference between {2,3,4} and {2,3,4,5,6} he's interested in.
 
Put them in an ArrayList so you can use all the methods in the Collection interface.
 
I think what he wants is the difference between the union and one of the inputs. So if he has {2,3,4} and {5,6}, it's the difference between {2,3,4} and {2,3,4,5,6} he's interested in.

Yeah, could be; and that would make the "get back" sets he lists consistent. Maybe when he clarifies his request, we can provide some help for him.
 
If you don't have sets you'd probably be better off using a dictionary/hash table rather than worrying about sorting and iteratively searching the arrays.
 
Depends on the input. If they're a dense range, then a bit-array is hard to beat for density and efficiency.
 
If you want fast and efficient, don't do nested loops. They're going to give you a horrible O(n^2) runtime. Like many people suggested before me, using a Set is very nice, as Java Set operations are very quick. Similarly, you can just use a hash table, since lookups take constant time, and you'll end up with something like O(n) runtime.
 
Depends on the input. If they're a dense range, then a bit-array is hard to beat for density and efficiency.

Note for the OP: A bit array is where you use the bitwise representation of a number to store whether it is present. For example, if you have the numbers 0, 1, 5, and 30, you could represent all of those with a single 32-bit number (int) equivalent to 0100 0000 0000 0000 0000 0000 0010 0011 = 1073741859

It's efficient because with a few ints you can cover a relatively large range of numbers without a huge amount of memory allocation needed.

Storage in such a setup is rather easy: to pick the "bin" a particular number falls in (which int stores the bit), divide by 32. To pick which bit within that bin, shift 1 left by the remainder (the number mod 32)...
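A rough sketch of that bookkeeping in Java, assuming non-negative int values (names are illustrative); java.util.BitSet does the same thing for you if you'd rather not manage the bits by hand:

public class BitArrayDemo {
    public static void main(String[] args) {
        int[] bins = new int[4];                  // 4 ints cover the values 0..127

        // mark a value as present: pick the bin with value / 32,
        // then set the bit at position value % 32
        int value = 30;
        bins[value / 32] |= 1 << (value % 32);

        // test whether a value is present
        boolean present = (bins[value / 32] & (1 << (value % 32))) != 0;
        System.out.println(present);              // true
    }
}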
 
Hash tables only take constant time if there are no collisions.
 
What does this mean?

I think he's saying that most built-in hashtable libraries use pretty good hashing functions that do a reasonable job at minimizing the occurrence of collisions for most data sets of reasonable size.
 
Even so, in any modern programming language, the hash code is going to be good enough that expected time should be constant.
No hash function is perfect, and even with a theoretically perfect hash function, given enough keys, the hash will have collisions.
 
I mean yes, they aren't perfect hash functions, but expected time is still constant.

Hash functions are tailored so that in the average case they produce a hashcode that doesn't collide with another object's hashcode. Not only does this help the constant-time case, but I don't think many hash table implementations use a linked list for collisions, so even in the worst case you would still beat O(n).

Also for perfect hash functions, you know the set of keys, and you can easily set the size of the hash table beforehand.

But on average for a real-life hash table, you get Theta(1) performance.
 
I mean yes, they aren't perfect hash functions, but expected time is still constant.

Hash functions are tailored so that in the average case they produce a hashcode that doesn't collide with another object's hashcode. Not only does this help the constant-time case, but I don't think many hash table implementations use a linked list for collisions, so even in the worst case you would still beat O(n).

Also for perfect hash functions, you know the set of keys, and you can easily set the size of the hash table beforehand.

But on average for a real-life hash table, you get Theta(1) performance.

Okay. Here is a string hash table. Please give me the constant-time hash algorithm for an arbitrary-length string.

Hint: it doesn't exist. Any data type like a string (for instance, a variable-length array) suffers from the same problem. The only way to achieve a constant-time hash for such structures is to bound the input key-size.

Additionally, it doesn't really matter what the structure at each hashed location is, you cannot expect a constant-time hash for a reasonably large set of keys. When you start getting collisions, you need to increase the table size and perhaps even dynamically change the hash function. In the best case, you might AMORTIZE to O(1), but when you actually need to re-size your table and move your data, it's going to be horridly slow.

So, I'm really not understanding your arguments. In all but the simplest cases, you cannot reasonably expect a constant-time hash lookup.
 
@mikeblas: this is pretty OT, but how else would you have a realistic perfect hash function without knowing the keyset? Having a perfect hash function means you know all of the input beforehand.

Say, for example we have 100 objects that we have to hash, and we know we will only have these unique 100 objects, ever. We make a perfect hash function by just giving these objects an index 1, 2, ... 100. Then we know we can make the hash table to be of size 100.

@nameless: I think you're confusing what I said. My post was in reference to lookups, which are constant time. Preprocessing and hashing are NOT constant time; they are polynomial.
 
@nameless: I think you're confusing what I said. My post was in reference to lookups, which are constant time. Preprocessing and hashing are NOT constant time; they are polynomial.

What is polynomial?? The hash should be LINEAR with respect to the key length, or else you are using the wrong indexing structure.

Additionally, it's not quite fair to consider string hashing (or variable-length array hashing) in two stages: "preprocessing" and lookup. You CANNOT lookup without "preprocessing." To lookup without running the hash function is to magically know the location of the value based on the key without doing any work. Otherwise, yes you are correct. If you know where the value is located in memory, then it is constant theoretic time to access it.

An example: a hash table contains the keys {"Fred", "Fran", "Franklin"} mapped to the values {"Smith", "Jones", "Jackson"}, respectively. Later on, a new user accesses our program for the first time. He enters a key value, say it's "disease". How can I tell in constant time if "disease" is in the hash table? I can't. The time this will take depends on the length of "disease" (or a similar value that may be much longer or much shorter). If the hash computation is expensive for each sub-element of the key ('d', 'i', 's', 'e', 'a', 's', 'e'), then this makes a big difference. In the case of a string hash, this is usually going to be a constant-time operation, but you need to do it N times, N = length of string. So before I can access the exact location of the value in constant time, I need to compute something that is at least linear time. This obviously assumes a perfect hash, which doesn't exist. For a sufficient number of keys, there WILL be collisions.
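To make that concrete, here is roughly the shape of a typical polynomial string hash (the same general form as Java's String.hashCode, shown here only as an illustration):

public class StringHashCost {
    // one multiply-add per character, so hashing "disease" takes 7 steps:
    // the cost grows linearly with the key length
    static int hash(String key) {
        int h = 0;
        for (int i = 0; i < key.length(); i++)
            h = 31 * h + key.charAt(i);
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("disease"));
    }
}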

Perhaps it would be better for you to lay out your assumptions rather than for us to keep disagreeing with/correcting you on what may very well just be a poor statement of the exact circumstances you are operating under.
 
You don't get constant time on average. The time you get will depend on collisions and how your hash resolves them. Constant time is your lower bound; depending on the collision resolution mechanism, the upper bound of a probe into the hash might be a lot worse. What hash algorithm implements inserts in polynomial time?

The assumptions are already laid out here in the thread. The funny thing is, we don't know most of them: we don't know the cardinality of the sets, we don't know the number of sets, and we're not even sure what "compare" means. We've asked TheBuzzer some of those questions, but he hasn't responded.
 
What is polynomial?? The hash should be LINEAR with respect to the key length, or else you are using the wrong indexing structure.

Additionally, it's not quite fair to consider string hashing (or variable-length array hashing) in two stages: "preprocessing" and lookup. You CANNOT lookup without "preprocessing." To lookup without running the hash function is to magically know the location of the value based on the key without doing any work. Otherwise, yes you are correct. If you know where the value is located in memory, then it is constant theoretic time to access it.

Linear functions are polynomial with degree 1. Why isn't it fair? What if the OP's objects are already hashed on some server? What if preprocessing time has a runtime that is quicker than the wanted algorithm's runtime? It's important to distinguish the runtimes of preprocessing and lookup so that you can use each effectively.

Also in your example, the keys would already be hashed as part of the preprocessing.

Perhaps it would be better for you to lay out your assumptions rather than for us to keep disagreeing with/correcting you on what may very well just be a poor statement of the exact circumstances you are operating under.

If you look at my original post, I only stated that hash table lookups are constant time. Preprocessing takes linear time, and if the OP wants to compare these two arrays, which is going to take linear time anyway, using the hash table is more efficient versus using nested loops.
 
Linear functions are polynomial with degree 1. Why isn't it fair?

Ok. You are correct; you just confused me by abandoning convention.

What if the OP's objects are already hashed on some server? What if preprocessing time has a runtime that is quicker than the wanted algorithm's runtime? It's important to distinguish the runtimes of preprocessing and lookup so that you can use each effectively.

I gave you the example of strings. You said that "in the average case" that hash tables provided constant-time lookup. I explained to you that there is no such guarantee.

Additionally, if the objects are already hashed on some server, what is the use-case? If they are already hashed, then you could have a pointer or reference to the exact spot in memory where the object is located, and another structure would be more appropriate. The point of a hash table is to provide an index with quick retrieval. Thus, I count your "preprocessing" and lookup together.

Regardless, I understand your point; it just makes little sense to me.

Also in your example, the keys would already be hashed as part of the preprocessing.

In my example, the original keys would be hashed when inserted, yes.

When you want to test set membership for a BRAND-NEW string that comes in to the program (like "disease"), then you need to hash that new string because you CANNOT store every permutation of the alphabet, already hashed in memory. Every new string you test for membership in the hash table must be hashed with a linear-time hash. What are you not understanding?

If you look at my original post, I only stated that hash table lookups are constant time. Preprocessing takes linear time, and if the OP wants to compare these two arrays, which is going to take linear time anyway, using the hash table is more efficient versus using nested loops.

More efficient than running the loop over the entire list of entries? Sure. So is a binary search if your string values are stored in an array. I wouldn't consider using this indexing structure for anything but simple dictionaries and short strings, though.

However, you could index such items in a suffix tree or Aho-Corasick Trie, and therefore GUARANTEE a search time linear to the length of the input string (and memory consumption would be bounded either linearly or polynomially, depending upon your algorithm). If performance is this important for you, why would you use a linear-time hash, coupled with a possible super-logarithmic lookup, not to mention the horrendous memory waste you get by necessarily using a large portion of memory for hashing so that you don't have too many collisions?

There are a number of domains where your suggested data structure (using the same membership testing scenario the OP asked about) is not optimal, or even desired. I will name you three: computational biology, computer forensics, and digital signal processing.

If you are unfamiliar with seminal data structures pertinent to those topics, please look to Dan Gusfield's book, "Algorithms on Strings, Trees, and Sequences" out of Cambridge. It's a fascinating book about string processing, and any array-based data structure would work with these algorithms.
 
And constants are polynomial functions with degree zero.

First derivatives of first order polynomials are also constants.

And this line of thinking is intriguing; I wish to subscribe to its newsletter.
 