I am trying to count the number of matching elements between two large Powershell arrays/collections generated by capturing Active Directory (“AD”) data.
However, it is very slow.
Concept
Note: feel free to skip ahead to the Application section.
Current concept
I am looking to count the number of matching instances of
- all elements of a subset of array A (all elements of A are unique, and ordered)
- any element of array B (may contain duplicates).
Suppose array A contains [-30, -2, 1, 3, 4, 6, 20, 50] and the subset of interest is all positive numbers, and array B contains [50, 6, 80, 10, -2, 30, 6, 6, 50]. Then the correct count would be 2 (6 & 50 appear in both arrays; -2 is ignored).
Note that in the real implementation (see below):
- A and B both contain strings (not numbers), and are large; and
- A and B may be arrays, or they may be
Collections
of user objects produced byGet-ADUser
.
Alternative concept
Alternatively, I could easily create a new array a = [1, 3, 4, 6, 20, 50] comprising the subset of A of interest and extract an ordered list of unique entries from B to form b = [-2, 6, 10, 30, 50, 80].
Then I just need to add up the number of elements appearing in both a and b (correct answer is still 2).
Basic description of algorithm
Array A has roughly 450,000 string elements, of which roughly 400,000 are of interest.
Array B has roughly 20,000 string elements, comprising roughly 19,000 unique values.
Currently I iterate through all elements of array A, I use if
to check whether the element is in the subset of interest, and, if so, then I test whether the element appears in array B using the -contains
operator.
Problem with algorithm
This takes over an hour to run.
Possibilities to speed up algorithm
I wondered whether it would be faster to iterate through unique elements of array B (in other words, the elements of array b), and test whether they appear in array A (or perhaps array a). But I haven’t tried it, because logically it seems as though it should take roughly the same amount of time.
Alternatively, rather than using basic arrays, is there some cleverer object like an OrderedList
class that would be faster to search?
If there’s no OrderedList
data type, then would it be practical to split the data into subarrays myself? For example, I could split array b into beta_1 = [-2, 6, 10] and beta_2 = [30, 50, 80]. Thus, if the selected element of array A has a value less than 20 I will look for it in array beta_1, and otherwise in array beta_2.
In the real problem I’m dealing with strings, and arrays with thousands of elements, so perhaps one subarray per initial (capitalised) letter of the alphabet (beta_A, beta_B, beta_C, …, beta_Z) would be apt. I wouldn’t want to have to hard code a switch statement to choose one of 26 subarrays, so — if necessary — I’d rather index into a collection of arrays: beta[0] rather than beta_A, through to beta[25] rather than beta_Z. And I would anticipate using the code value of each capital letter to select the appropriate one.
Application
Scenario
Some people are recorded in AD in both of the organisational units: Permanent
and Visitors
. But with different details recorded in each OU (so I cannot match on telephone, address, email, etc.). Nevertheless, as each pair of such records relate to a single individual person, it should be reasonably informative to match on full name. (In reality a small fraction may change their name, or provide a different version of their name.) Full name is stored in the Description
attribute (aside from a scattering of data entry errors).
The ultimate intention is to calculate statistics on such multiple records that exist for certain individuals.
Example:
$visitors.Description
= [Daniel Dingo, Betty Butternut, Adam Aardvark, Betty Butternut, Cathy Cabbage, Enid Endive]$permanents.Description
= [Enid, Enid Endive, Betty Butternut, Cathy Cabbage, Fred Flamingo, Cathy Cabbage]
I parse $visitors.Description
to tally frequency, by which it is ordered (descending):
$descriptions
= [(Betty Butternut, 2); (Adam Aardvark, 1); (Cathy Cabbage, 1); (Daniel Dingo, 1); (Enid Endive, 1)]
I skip Betty Butternut (appears more than once in $descriptions
), and then check whether the other $numUnique
=4 names (Adam Aardvark, Cathy Cabbage, Daniel Dingo, or Enid Endive) appear in $permanents.Description
.
The number of matches in this example would be $numMatching
=2 (corresponding to Cathy Cabbage and Enid Endive); it doesn’t matter if an element of $descriptions
appears in $permanents.Description
more than once, that’s still just counted as one match. This is ultimately reported as 2 out of 4 being matched.
Current code
Here’s a trimmed & anonymised version of my actual code.
# Acquire active visitors' data.
$aGroup = "Visitors"
$sb = "OU="+ $aGroup +",OU=Active,OU=Acme-Users,DC=acmeauth,DC=acme,DC=com"
$visitors = Get-ADUser -Filter * -Properties Description -SearchBase $sb | select Name, Description
# Sort Description parameter for visitors by frequency.
$descriptions = $visitors | Group-Object {$_.Description} -NoElement | Sort-Object -Property Count -Descending
# Acquire active permanent staff data.
$aGroup = "Permanent"
$sb = "OU="+ $aGroup +",OU=Active,OU=Acme-Users,DC=acmeauth,DC=acme,DC=com"
$permanents = Get-ADUser -Filter * -Properties Description -SearchBase $sb | select Name, Description
# Iterate through _non-duplicated_ visitor records to find matching permanent staff Description parameters and tally.
$numUnique = 0
$numMatching = 0
$maxIts = $descriptions.count
for ($index = 0; $index -lt $maxIts; $index++)
{
$person = $descriptions[$index]
if ($person.count -eq 1) {
# This person has a unique Description among the (active) visitor cohort.
$numUnique++
if ($permanents.Description -contains $person.name) {
# This person has a matching Description among the (active) permanent staff cohort.
$numMatching++
}
}
}
In the real application:
$visitors.Description
has roughly 450,000 string elements, of which roughly 400,000 are of interest;$permanents.Description
has roughly 20,000 string elements, comprising roughly 19,000 unique values; and$numMatching
is about 5000.
Future code
It turns out that the Description
field has data entry errors for a scattering of records (e.g. “Enid” in the example under Scenario), so I am planning to modify the code to instead use a concatenation of GivenName
and Surname
parameters (with a space character in between), assigned to arrays as simple strings using foreach
.
In a nutshell
How can I optimise this code?
In particular, how can I get it to run faster?
Possible improvements
Tinkering with algorithm
Implement one of more of the conceptual modifications described under Alternative concept and Possibilities to speed up algorithm.
Parallel computation
Maybe a simple way would be to get the most computationally demanding & repetitive part(s) of the code to run in parallel??
I haven’t looked into this in detail, but so far it’s a little confusing: for Powershell 5.1 a post by js2010 on SO indicates a few alternatives, including “workflows” with a foreach -parallel
command. However, Microsoft says “There is currently no parallel support for the foreach
keyword.” Maybe the latter statement was because the support was removed, so it’s not “current” support??
And I don’t know for sure whether it’s feasible for this application to run in parallel, or (if so) which of the various alternatives would be best (easiest to implement would be my priority, if performance is similar), and whether it’d help much.
Other ideas
Anything in this SO thread relevant? (Excluding the alternative to use something outside of Powershell.)
System
There is a reasonably powerful workstation that this job can run on.
- Windows 10 Pro for Workstations.
- Powershell 5.1.
I do not want to put any unnecessary load on the AD server.
Specific to your example:
$visitor = ConvertFrom-Csv @'
Description
Daniel Dingo
Betty Butternut
Adam Aardvark
Betty Butternut
Cathy Cabbage
Enid Endive
'@
$Permanents = ConvertFrom-Csv @'
Description
Enid
Enid Endive
Betty Butternut
Cathy Cabbage
Fred Flamingo
Cathy Cabbage
'@
I am not sure why you remove the duplicate descriptions completely from the $Visitors
(I find this a weird constrain), anyways:
$VisitorHash = @{}
foreach ($Description in $Visitor.Description) { $VisitorHash[$Description]++ }
$VisitorDescription = $VisitorHash.GetEnumerator().where{ $_.Value -eq 1 }.Name
And not from the $Permanents
(Cathy Cabbage
) ???
$PermanentsHash = @{}
foreach ($Description in $Permanents.Description) { $PermanentsHash[$Description]++ }
$PermanentsDescription = $PermanentsHash.GetEnumerator().where{ $_.Value -eq 1 }.Name
$PermanentsDescription = $Permanents.Description
Than you might simply use a where
method or even a [Linq.Enumerable]::Intersect
(See also Comparing two arrays & get the values which are not common):
[Linq.Enumerable]::Intersect([string[]]$VisitorDescription, [string[]]$PermanentsDescription)
Cathy Cabbage
Enid Endive
Does the
description
attribute contain the exact name, eg. “John Doe”, or is it a substring search, eg. “This is the description for John Doe”I guess you simply want to join the list (using a hash table, see: what’s the best way to join two tables into one?) and get the unique count of a specific property. Something like:
($visitors | Join-Object $permanents -on Description | Select-Object -Unique Description).Count
@mathias-r-jessen : in principle
Description
contains the entire name, and nothing else. As noted, however, in reality there are some data entry errors: the ones I’ve noticed involveDescription
containing only a single given name (e.g. “John”). (There’s also a separateDisplayName
attribute that would prefix a title of address, as in “Mr. John Doe” / “Dr. John Doe” / “Rev. John Doe” / …, which I am not inclined to use, if I can avoid it.)@DIV Okay, can you update your post to talk a little more about the actual task at hand? You’ve provided a very comprehensive description (which is great), but it’s all in the abstract – knowing what data you’re comparing actually helps here 🙂 Why does permanent staff have visitor names in their account descriptions? Can each staff only have 1 guest at a time? etc.
@iRon : thanks for the suggestion. I’ve had a look at the examples in your code but I’m still not sure whether that’s what I’m trying to do. Did you have a look at the Concept section, and would your proposal produce the same outcome? If you’re confident it matches my use case, then could you please explain a little more how precisely to implement it and what’s happening when
join
ing the lists?Show 6 more comments