NEWS
diyar 0.5.1.9000
New features
- unpack_sub_criteria()
- flatten_list()
Changes
- group_stats now controls which slots are updated for the pid and
  epid S4 objects. [Verify].
- attr_eval() has been removed. Instead, use
  unpack_sub_criteria(x, part = 'attribute') to extract the attributes
  of a sub_criteria before manipulating them as needed.
- New argument (stepwise_method) in links(). It replaces shrink and
  expand.
  - "expand_with_priority" maps to expand == TRUE & shrink == FALSE.
  - "ordered_only" maps to expand == FALSE & shrink == FALSE.
  - "shrink_to_last_match" maps to shrink == TRUE.
  Please use stepwise_method moving forward. shrink and expand will be
  removed later.
- bys_func upgrades. Less memory.
- reverse_number_line() upgrades. Less memory.
- make_pairs() upgrades. Less memory.
- overlap() upgrades. Less memory.
- eval_sub_criteria() upgrades. Less memory.
- combi() is now a wrapper function for data.table::frankv.
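The mapping between the deprecated shrink and expand flags and the new stepwise_method values listed above can be sketched as a small lookup. This is only a migration aid under stated assumptions: to_stepwise_method() is a hypothetical helper, not a diyar function, and it encodes nothing beyond the three rules in this entry.

```r
# Hypothetical helper (not part of diyar): translate the deprecated
# `shrink`/`expand` flags into the matching `stepwise_method` value.
to_stepwise_method <- function(shrink, expand) {
  stopifnot(is.logical(shrink), length(shrink) == 1L, !is.na(shrink),
            is.logical(expand), length(expand) == 1L, !is.na(expand))
  if (shrink) {
    # shrink == TRUE maps to "shrink_to_last_match" whatever `expand` is
    "shrink_to_last_match"
  } else if (expand) {
    "expand_with_priority"
  } else {
    "ordered_only"
  }
}

to_stepwise_method(shrink = FALSE, expand = TRUE)
#> [1] "expand_with_priority"
to_stepwise_method(shrink = FALSE, expand = FALSE)
#> [1] "ordered_only"
to_stepwise_method(shrink = TRUE, expand = FALSE)
#> [1] "shrink_to_last_match"
```

The returned string is intended to be passed on as the stepwise_method argument of links() in place of the two old flags.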
Bug fixes
- The result of expand_number_line(point = 'start', ...) was incorrect
  for descending number lines. Corrected.
diyar 0.5.1 (2023-11-12)
New features
Changes
Bug fixes
- links() - Incorrect results in some situations. Resolved.
- links_af_probabilistic() - Failed in some situations. Resolved.
diyar 0.5.0 (2023-11-05)
New features
- New option ("semi") for the batched argument in links(). All
  matches are compared against the record-set in the next iteration.
  Therefore, the number of record-pairs increases exponentially as new
  matches are found. This means fewer record-pairs (memory usage) but a
  longer run time compared to the "no" option. Conversely, it leads to
  more record-pairs (memory usage) but a shorter run time compared to
  the "yes" option.
- New argument (batched) in episodes().
- New argument (split) in episodes(). Split the analysis in N-splits
  of strata. This leads to fewer record-pairs (and memory usage) but a
  longer run time.
- New argument (decode) in as.data.frame.pid(), as.data.frame.epid()
  and as.data.frame.pane().
- New function - episodes_af_shift(). A more vectorised approach to
  episodes() based on epidm::group_time().
- New function - links_wf_episodes(). Implementation of episodes()
  using links().
Changes
- Optimised episodes() and links(). Each iteration now uses less time
  and memory.
- link_id slot in pid objects is now a list.
- links() - records with missing values in a sub_criteria are now
  skipped at the corresponding iteration.
- Updated argument in links() - recursive. This now takes any of three
  options [c("linked", "unlinked", "none")].
  [c("linked", "unlinked")] collectively were previously [TRUE],
  while ["none"] was previously [FALSE].
- as.epids() now calls make_episodes().
- The default value for the window argument in partitions() is now
  NULL.
- as.data.frame() and as.data.list() now only create elements/fields
  from non-empty fields.
- id and gid slots in number_line objects are now integer(0) by
  default.
- episode_group(), record_group() and range_match_legacy() have been
  removed.
- ["recursive"] episodes from episodes() are now presented as
  ["rolling"] episodes with reference_event = "all_records" i.e.
  Old syntax ~ episodes(..., episode_type = "recursive")
  New syntax ~ episodes(..., episode_type = "rolling", reference_event = "all_records")
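For code written against the old logical recursive argument of links(), the change described above amounts to a simple translation. A minimal sketch, assuming a hypothetical helper named to_recursive_option() (not part of diyar):

```r
# Hypothetical helper (not part of diyar): translate the old logical
# `recursive` value into the character options accepted since 0.5.0.
to_recursive_option <- function(recursive) {
  valid <- c("linked", "unlinked", "none")
  if (is.character(recursive)) {
    # Already in the new form; only check that the values are known
    stopifnot(all(recursive %in% valid))
    return(recursive)
  }
  stopifnot(is.logical(recursive), length(recursive) == 1L, !is.na(recursive))
  # TRUE used to cover both "linked" and "unlinked"; FALSE meant "none"
  if (recursive) c("linked", "unlinked") else "none"
}

to_recursive_option(TRUE)
#> [1] "linked"   "unlinked"
to_recursive_option(FALSE)
#> [1] "none"
```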
Bug fixes
- When recursive was TRUE, links() ended prematurely and therefore
  missed some matches. Resolved.
- recurrence_sub_criteria in episodes() was not implemented correctly
  and led to incorrect linkage results in some instances. Resolved.
- overlap_method() - logical tests recycled incorrectly. Resolved.
- check_links argument - Option "g" was implemented as option "l".
  Resolved.
- make_pairs_wf_source() - Created incorrect pairs. Resolved.
- case_sub_criteria and recurrence_sub_criteria in episodes() led to
  incorrect results. Resolved.
diyar 0.4.2 (2022-12-20)
New features
- New argument in merge_ids() - shrink and expand.
- New S3 method for class ‘d_report’ - plot.
- New S3 method for class ‘sub_criteria’ - format.
- New function - true(). Predefined logical test for use with
  sub_criteria().
- New function - false(). Predefined logical test for use with
  sub_criteria().
- New argument in links() - batched. Specify if all record pairs are
  created or compared at once ("no") or in batches ("yes").
- New argument in links() - repeats_allowed. Specify if record-pairs
  with duplicate elements should be created.
- New argument in links() - permutations_allowed. Specify if
  permutations of the same record-pair should be created.
- New argument in links() - ignore_same_source. Specify if
  record-pairs from different datasets should be created.
- New argument in eval_sub_criteria() - depth. First order of
  recursion.
- New functions - sets() and make_sets(). Create permutations of
  record-sets.
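To illustrate what repeats_allowed and permutations_allowed control, the base R sketch below builds record-pairs by position and filters them. It is an illustration only: sketch_pairs(), its defaults and its column names are assumptions, and the pairs created by links() or make_pairs() may be ordered or labelled differently.

```r
# Illustrative base R sketch (not diyar's implementation).
# The defaults below belong to this sketch, not to links().
sketch_pairs <- function(n, repeats_allowed = TRUE, permutations_allowed = TRUE) {
  # Start from every ordered pair of record positions
  pairs <- expand.grid(x_pos = seq_len(n), y_pos = seq_len(n))
  if (!repeats_allowed) {
    # Drop pairs made of the same record twice, e.g. (1, 1)
    pairs <- pairs[pairs$x_pos != pairs$y_pos, ]
  }
  if (!permutations_allowed) {
    # Keep one ordering of each pair, e.g. (1, 2) but not (2, 1)
    pairs <- pairs[pairs$x_pos <= pairs$y_pos, ]
  }
  rownames(pairs) <- NULL
  pairs
}

nrow(sketch_pairs(3))
#> [1] 9
nrow(sketch_pairs(3, repeats_allowed = FALSE))
#> [1] 6
nrow(sketch_pairs(3, permutations_allowed = FALSE))
#> [1] 6
nrow(sketch_pairs(3, repeats_allowed = FALSE, permutations_allowed = FALSE))
#> [1] 3
```

Switching both options off leaves the fewest pairs, which is why these arguments matter for memory usage on large datasets.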
Changes
- links() - When shrink is TRUE, records in a record-group must meet
  every listed match criteria and sub_criteria. For example, if pid_cri
  is 3, then the record must have matched another on the first three
  match criteria.
- links() - pid@iteration now tracks when a record was dealt with
  instead of when it was assigned to a record-group. For example, a
  record can be closed (matched or not matched) at iteration 1 but
  assigned to a record-group at iteration 5.
- make_pairs() - x.* and y.* values in the output are now swapped.
- sub_criteria can now export any data created by match_func. To do
  this, match_func must export a list, where the first element is a
  logical object. See an example below.
library(diyar)
val <- rep(month.abb[1:5], 2); val
#> [1] "Jan" "Feb" "Mar" "Apr" "May" "Jan" "Feb" "Mar" "Apr" "May"
match_and_export <- function(x, y){
  output <- list(x == y,
                 data.frame(x_val = x, y_val = y, is_match = x == y))
  return(output)
}
sub.cri.1 <- sub_criteria(
  val, match_funcs = list(match.export = match_and_export)
)
format(sub.cri.1, show_levels = TRUE)
#> logical_test-{
#> Lv.0.1-match.export(Jan,Feb,Mar ...)
#> }
eval_sub_criteria(sub.cri.1)
#> $logical_test
#> [1] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
#>
#> $mf.0.1
#> $mf.0.1[[1]]
#>    x_val y_val is_match
#> 1    Jan   Jan     TRUE
#> 2    Feb   Jan    FALSE
#> 3    Mar   Jan    FALSE
#> 4    Apr   Jan    FALSE
#> 5    May   Jan    FALSE
#> 6    Jan   Jan     TRUE
#> 7    Feb   Jan    FALSE
#> 8    Mar   Jan    FALSE
#> 9    Apr   Jan    FALSE
#> 10   May   Jan    FALSE
- links() can now export any data created within a sub_criteria. To do
  this, the sub_criteria must be created as described above. See an
  example below.
val <- 1:5
diff_one_and_export <- function(x, y){
  diff <- x - y
  is_match <- diff <= 1
  output <- list(is_match,
                 data.frame(x_val = x, y_val = y, diff = diff, is_match = is_match))
  return(output)
}
sub.cri.2 <- sub_criteria(
  val, match_funcs = list(diff.export = diff_one_and_export)
)
links(
  criteria = "place_holder",
  sub_criteria = list("cr1" = sub.cri.2))
#> $pid
#> [1] "P.1 (CRI 001)" "P.1 (CRI 001)" "P.3 (CRI 001)" "P.3 (CRI 001)"
#> [5] "P.5 (Skipped)"
#>
#> $export
#> $export$cri.1
#> $export$cri.1$iteration.1
#> $export$cri.1$iteration.1$mf.0.1
#> $export$cri.1$iteration.1$mf.0.1[[1]]
#>   x_val y_val diff is_match
#> 1     1     1    0     TRUE
#> 2     2     1    1     TRUE
#> 3     3     1    2    FALSE
#> 4     4     1    3    FALSE
#> 5     5     1    4    FALSE
#>
#>
#>
#> $export$cri.1$iteration.2
#> $export$cri.1$iteration.2$mf.0.1
#> $export$cri.1$iteration.2$mf.0.1[[1]]
#>   x_val y_val diff is_match
#> 1     3     3    0     TRUE
#> 2     4     3    1     TRUE
#> 3     5     3    2    FALSE
Bug fixes
- summary.epid() - Incorrect count for ‘by episode type’. Resolved.
- episodes() - Incorrect results in some instances with skip_order.
  Resolved.
- make_ids() - Did not capture all records that should be in a
  record-group when matches are recursive. Resolved.
- make_pairs() - Incorrect record-pairs in some instances. Resolved.
- eval_sub_criteria() - When the output of match_func was length one,
  it was not recycled. Resolved.
- reverse_number_line() - Incorrect results in some instances.
  Resolved.
- links() - Incorrect iteration (pids slot) for non-matches. Resolved.
- links() and episodes() - Timing for each iteration was incorrect.
  Resolved.
diyar 0.4.1 (2021-12-05)
New features
- New function - overlap_method_names(). Overlap methods for the
  corresponding overlap method codes.
- Memory usage added to *with_report options for display.
Changes
- "chain" overlap method split into "x_chain_y" and "y_chain_x".
  "chain" will continue to be supported as a keyword for "x_chain_y"
  OR "y_chain_x" methods.
- "across" overlap method split into "x_across_y" and "y_across_x".
  "across" will continue to be supported as a keyword for "x_across_y"
  OR "y_across_x" methods.
- "inbetween" overlap method split into "x_inbetween_y" and
  "y_inbetween_x". "inbetween" will continue to be supported as a
  keyword for "x_inbetween_y" OR "y_inbetween_x" methods.
- Optimised overlaps().
- Some overlap method codes have changed. Please review any previously
  specified codes with overlap_method_names().
Bug fixes
- make_batch_pairs() (internal) created invalid record pairs.
  Resolved.
diyar 0.4.0 (2021-11-30)
New features
- New function - reframe(). Modify the attributes of a sub_criteria
  object.
- New function - link_records(). Record linkage by creating all record
  pairs as opposed to batches as with links().
- New function - make_pairs(). Create every combination of
  record-pairs for a given dataset.
- New function - make_pairs_wf_source(). Create record-pairs from
  different sources only.
- New function - make_ids(). Convert an edge list to a group
  identifier.
- New function - merge_ids(). Merge two group identifiers.
- New function - attrs(). Pass a set of attributes to one instance of
  match_funcs or equal_funcs.
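As an illustration of what converting an edge list to a group identifier involves, the base R sketch below labels every record with the smallest record position it is connected to. It is not diyar's implementation of make_ids(): sketch_group_ids() and its argument names are assumptions made for this example.

```r
# Illustrative base R sketch (not diyar's implementation): label the
# connected components of an edge list with a group identifier.
sketch_group_ids <- function(x_pos, y_pos, id_length = max(x_pos, y_pos)) {
  stopifnot(length(x_pos) == length(y_pos))
  group <- seq_len(id_length)      # every record starts in its own group
  repeat {
    # Smallest label currently held at either end of each edge
    lowest <- pmin(group[x_pos], group[y_pos])
    updated <- group
    # Pull both ends of every edge down to that label
    for (i in seq_along(x_pos)) {
      updated[x_pos[i]] <- min(updated[x_pos[i]], lowest[i])
      updated[y_pos[i]] <- min(updated[y_pos[i]], lowest[i])
    }
    if (identical(updated, group)) break   # no label changed: finished
    group <- updated
  }
  group
}

# Edges 1-2 and 2-3 chain into one group, 4-5 form another, 6 is alone
sketch_group_ids(x_pos = c(1, 2, 4), y_pos = c(2, 3, 5), id_length = 6)
#> [1] 1 1 1 4 4 6
```

Records 1 and 3 never appear in the same pair, yet they share a group because both are linked to record 2. The loop repeats until no label changes so that such chains are fully resolved.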
Changes
- Optimised episodes_wf_splits().
- Optimised episodes() and links(). Reduced processing times.
- Three new options for the display argument: "progress_with_report",
  "stats_with_report" and "none_with_report". Creates a d_report; a
  status of the analysis over its run time.
- eval_sub_criteria(). Record-pairs are no longer created in the
  function. Therefore, index_record and sn arguments have been replaced
  with x_pos and y_pos.
- link_records() and links_wf_probabilistic(). The cmp_threshold
  argument has been renamed to attr_threshold.
- show_labels argument in schema(). Two new options - "wind_nm" and
  "length" to replace "length_label".
Bug fixes
- Incorrect wind_id list in episodes(..., data_link = "XX"). Resolved.
- Incorrect link_id in links(..., recursive = TRUE). Resolved.
- iteration not recorded in some situations with episodes(). Resolved.
- skip_order ends an open episode. Resolved.
- NA in dist_wind_index and dist_epid_index when sn is supplied.
  Resolved.
- overlap_method_codes() - overlap method codes not recycled properly.
  Resolved.
diyar 0.3.1 (2021-08-09)
New features
- New function - delink(). Unlink identifiers.
- New function - episodes_wf_splits(). Wrapper function of episodes().
  Better optimised for handling datasets with many duplicate records.
- New function - combi(). Numeric codes for unique combination of
  vectors.
- New function - attr_eval(). Recursive evaluation of a function on
  each attribute of a sub_criteria.
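The idea behind numeric codes for unique combinations of vectors, as described for combi() above, can be sketched in base R. This is an assumption-laden illustration rather than diyar's implementation: sketch_combi() is a hypothetical name, and the actual codes returned by combi() may be numbered differently.

```r
# Illustrative base R sketch (not diyar's implementation): give every
# unique combination of values across vectors its own integer code.
sketch_combi <- function(...) {
  vectors <- list(...)
  # All vectors must describe the same records
  stopifnot(length(unique(lengths(vectors))) == 1L)
  # Build one key per position; "\r" is assumed not to occur in the data
  key <- do.call(paste, c(vectors, sep = "\r"))
  # Code each key by the order in which it first appears
  match(key, unique(key))
}

sketch_combi(c("a", "a", "b", "a"), c(1, 1, 1, 2))
#> [1] 1 1 2 3
```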
Changes
- Two new case_nm values - Case_CR and Recurrence_CR which are Case
  and Recurrence without a sub-criteria match.
Bug fixes
- Corrected length arrows in schema.epid.
- Corrected outcome of eval_sub_criteria with 1 result.
diyar 0.3.0 (2021-04-25)
New features
- New function - links_wf_probabilistic(). Probabilistic record
  linkage.
- New function - partitions(). Split events into sections in time.
- New function - schema(). Plot schema diagrams for pid, epid, pane
  and number_line objects.
- New functions - encode() and decode(). Encode and decode slot values
  to minimise memory usage.
- New argument in episodes() - case_sub_criteria and
  recurrence_sub_criteria. Additional matching conditions for temporal
  links.
- New argument in episodes() - case_length_total and
  recurrence_length_total. Number of temporal links required for a
  window/episode.
- New argument in links() - recursive. Control if matches can spawn
  new matches.
- New argument in links() - check_duplicates. Control the checking of
  logical tests on duplicate values. If FALSE, results are recycled for
  the duplicates.
- as.data.frame and as.list S3 methods for the pid, number_line, epid
  and pane objects.
- New option for episode_type in episodes() - “recursive”. For
  recursive episodes where every linked event can be used as a
  subsequent index event.
- recurrence_from_last renamed to reference_event and given two new
  options.
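The recycling described for check_duplicates above can be pictured as running a logical test once per distinct value and copying each result to the duplicates. A minimal base R sketch, assuming a hypothetical helper sketch_recycled_test() that is not part of diyar and says nothing about how links() does this internally:

```r
# Illustrative base R sketch (not diyar's implementation): evaluate a
# test once per distinct value and recycle the result for duplicates.
sketch_recycled_test <- function(x, test) {
  distinct <- unique(x)
  # One evaluation per distinct value, then map back to every position
  test(distinct)[match(x, distinct)]
}

sketch_recycled_test(c(5, 12, 5, 12, 7), function(v) v > 10)
#> [1] FALSE  TRUE FALSE  TRUE FALSE
```

The saving grows with the share of duplicated values, since the test runs on three values here instead of five.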
Changes
- episodes() and links(). Speed improvements.
- Default time zone for an epid_interval or pane_interval with POSIXct
  objects is now “GMT”.
- number_line_sequence() - splits number_line objects. Also available
  as a seq method.
- epid_total, pid_total and pane_total slots are populated by default.
  No need to use group_stats to get these.
- to_df() - Removed. Use as.data.frame() instead.
- to_s4() - Now an internal function. It’s no longer exported.
- compress_number_line() - Now an internal function. It’s no longer
  exported. Use episodes() instead.
- sub_criteria() - produces a sub_criteria object. Nested “AND” and
  “OR” conditions are now possible.
- case_overlap_methods, recurrence_overlap_methods and overlap_methods
  now take integer codes for different combinations of overlap methods.
  See overlap_methods$options for the full list. character inputs are
  still supported.
Bug fixes
- "Single-record" was wrong in links summary output. Resolved.
diyar 0.2.0 (2020-09-17)
New features
- Better support for Inf in number_line objects.
- Can now use multiple case_length or recurrence_length for the same
  event.
- Can now use multiple overlap_methods for the corresponding
  case_length and recurrence_length.
- New function links() to replace record_group().
- New function sub_criteria(). The new way of supplying a sub_criteria
  in links().
- New functions exact_match(), range_match() and range_match_legacy().
  Predefined logical tests for use with sub_criteria(). User-defined
  tests can also be used. See ?sub_criteria.
- New function custom_sort() for nested sorting.
- New function epid_lengths() to show the required case_length or
  recurrence_length for an analysis. Useful in confirming the required
  case_length or recurrence_length for episode tracking.
- New function epid_windows(). Shows the period a date will overlap
  with given a particular case_length or recurrence_length. Useful in
  confirming the required case_length or recurrence_length for episode
  tracking.
- New argument - strata in links(). Useful for stratified data
  linkage. As in stratified episode tracking, a record with a missing
  strata (NA_character_) is skipped from data linkage.
- New argument - data_links in links(). Unlink record groups that do
  not include records from certain data sources.
- New convenience functions
  - listr(). Format atomic vectors as a written list.
  - combns(). An extension of combn to generate permutations not
    ordinarily captured by combn.
- New iteration slot for pid and epid objects.
- New overlap_method - reverse()
Changes
- number_line() - l and r must have the same length or be of length 1.
- episodes() - case_nm differentiates between duplicates of "Case"
  ("Duplicate_C") and "Recurrent" events ("Duplicate_R").
- Strata and episode-level options for most arguments. This gives
  greater flexibility within the same instance of episodes().
  - Episode-level - The behaviour for each episode is determined by the
    corresponding option for its index event ("Case").
    - episode_type - simultaneously track both "fixed" and "rolling"
      episodes.
    - skip_if_b4_lengths - simultaneously track episodes where events
      before a cut-off range are both skipped and not skipped.
    - episode_unit - simultaneously track episodes by different units
      of time.
    - case_for_recurrence - simultaneously track "rolling" episodes
      with and without an additional case window for recurrent events.
    - recurrence_from_last - simultaneously track "rolling" episodes
      with reference windows calculated from the first and last event
      of the previous window.
  - Strata-level - The behaviour for each episode is determined by the
    corresponding option for its strata. Options must be the same in
    each strata.
    - from_last - simultaneously track episodes in both directions of
      time - past to present and present to past.
    - episodes_max - simultaneously track different number of episodes
      within the dataset.
- include_overlap_method - "overlap" and "none" will not be combined
  with other methods.
  - "overlap" - mutually inclusive with the other methods, so their
    inclusion is not necessary.
  - "none" - mutually exclusive and prioritised over the other methods
    (including "none"), so their inclusion is not necessary.
- Events can now have missing cut-off points (NA_real_) or periods
  (number_line(NA_real_, NA_real_)) for case_length and
  recurrence_length. This ensures that the event does not become an
  index case; however, it can still be part of a different episode. For
  reference, an event with a missing strata (NA_character_) ensures
  that the event does not become an index case nor part of any episode.
Bug fixes
- fixed_episodes, rolling_episodes and episode_group -
  include_index_period didn’t work in certain situations. Corrected.
- fixed_episodes, rolling_episodes and episode_group -
  dist_from_wind was wrong in certain situations. Corrected.
diyar 0.1.0 (2020-06-13)
New features
- New argument in record_group() - strata. Perform record linkage
  separately within subsets of a dataset.
- New argument in overlap(), compress_number_line(), fixed_episodes(),
  rolling_episodes() and episode_group() - overlap_methods and methods.
  Replaces overlap_method and method respectively. Use different sets
  of methods within the same dataset when grouping episodes or
  collapsing number_line objects. overlap_method and method only permit
  1 method per dataset.
- New slot in epid objects - win_nm. Shows the type of window each
  event belongs to i.e. case or recurrence window.
- New slot in epid objects - win_id. Unique ID for each window. The ID
  is the sn of the reference event for each window.
- Format of epid objects updated to reflect this.
- New slot in epid objects - dist_from_wind. Shows the duration of each
  event from its window’s reference event.
- New slot in epid objects - dist_from_epid. Shows the duration of each
  event from its episode’s reference event.
- New argument in episode_group() and rolling_episodes() -
  recurrence_from_last. Determine if reference events should be the
  first or last event from the previous window.
- New argument in episode_group() and rolling_episodes() -
  case_for_recurrence. Determine if recurrent events should have their
  own case windows or not.
- New argument in episode_group(), fixed_episodes() and
  rolling_episodes() - data_links. Unlink episodes that do not include
  records from certain data_source(s).
- episode_group(), fixed_episodes() and rolling_episodes() -
  case_length and recurrence_length arguments. You can now use a range
  (number_line object).
- New argument in episode_group(), fixed_episodes() and
  rolling_episodes() - include_index_period. If TRUE, overlaps with the
  index event or period are grouped together even if they are outside
  the cut-off range (case_length or recurrence_length).
- New slot in pid objects - link_id. Shows the record (sn slot) to
  which every record in the dataset has matched.
- New function - invert_number_line(). Invert the left and/or right
  points to the opposite end of the number line.
- New accessor functions - left_point(x)<-, right_point(x)<-,
  start_point(x)<- and end_point(x)<-
Changes
- overlap() renamed to overlaps(). overlap() is now a convenience
  overlap_method to capture ANY kind of overlap.
- "none" is another convenience overlap_method for NO kind of overlap.
- expand_number_line() - new options for point; "left" and "right".
- compress_number_line() - compressed number_line object inherits the
  direction of the widest number_line among the overlapping group of
  number_line objects.
- overlap_methods - have been changed such that each pair of
  number_line objects can only overlap in one way. E.g.
  - "chain" and "aligns_end" used to be possible but this is now
    considered a "chain" overlap only.
  - "aligns_start" and "aligns_end" used to be possible but this is
    now considered an "exact" overlap.
- number_line_sequence() - Output is now a list.
- number_line_sequence() - now works across multiple number_line
  objects.
- to_df() - can now change number_line objects to data.frames. to_s4()
  can do the reverse.
- epid objects are the default outputs for fixed_episodes(),
  rolling_episodes() and episode_group().
- pid objects are the default outputs for record_group().
- In episode grouping, the case_nm for events that were skipped due to
  rolls_max or episodes_max is now "Skipped".
- In episode_group() and record_group(), sn can be negative numbers
  but must still be unique.
- Optimised episode_group() and record_group(). Runs just a little bit
  faster …
- Relaxed the requirement for x and y to have the same lengths in
  overlap functions.
  - The behaviour of overlap functions will now be the same as that of
    standard R logical tests.
- episode_group - case_length and recurrence_length arguments. Now
  accept negative numbers.
  - negative “lengths” will collapse two periods into one, if the
    second one is within some days before the end_point() of the first
    period.
  - if the “lengths” are larger than the number_line_width(), both will
    be collapsed if the second one is within some days (or any other
    episode_unit) before the start_point() of the first period.
- cheat sheet updated
Bug fixes
- Recurrence was not checked if the initial case event had no
  duplicates. Resolved
- case_nm wasn’t right for rolling episodes. Resolved
diyar 0.0.3 (2019-12-08)
Changes
- #7 - episode_group(), fixed_episodes() and rolling_episodes() -
  optimized to take less time when working with large datasets
- episode_group(), fixed_episodes() and rolling_episodes() - date
  argument now supports numeric values
- compress_number_line() - the output (gid slot) is now a group
  identifier just like in epid objects (epid_interval)
diyar 0.0.2 (2019-11-11)
New feature
- pid S4 object class for results of record_group(). This will replace
  the current default (data.frame) in the next major release
- epid S4 object class for results of episode_group(),
  fixed_episodes() and rolling_episodes(). This will replace the
  current default (data.frame) in the next release
- to_s4() and to_s4 argument in record_group(), episode_group(),
  fixed_episodes() and rolling_episodes(). Changes their output from a
  data.frame (current default) to epid or pid objects
- to_df() changes epid or pid objects to a data.frame
- deduplicate argument from fixed_episodes() and rolling_episodes()
  added to episode_group()
Changes
- fixed_episodes() and rolling_episodes() are now wrapper functions of
  episode_group(). Functionality remains the same but now includes all
  arguments available to episode_group()
- Changed the output of fixed_episodes() and rolling_episodes() from
  number_line to data.frame, pending the change to epid objects
- pid_cri column returned in record_group is now numeric. 0 indicates
  no match.
- columns can now be used as criteria multiple times in record_group()
- #6 - number_line objects can now be used as a criteria in
  record_group()
Bug fixes
- #3 - Resolved a bug with episode_unit in episode_group()
- #4 - Resolved a bug with bi_direction in episode_group()
diyar 0.0.1 (2019-10-06)
Features
- fixed_episodes() and rolling_episodes() - Group records into fixed
  or rolling episodes of events or period of events.
- episode_group() - A more comprehensive implementation of
  fixed_episodes() and rolling_episodes(), with additional features
  such as user defined case assignment.
- record_group() - Multistage deterministic linkage that addresses
  missing data.
- number_line S4 object.
  - Used to represent a range of numeric values to match using
    record_group()
  - Used to represent a period in time to be grouped using
    fixed_episodes(), rolling_episodes() and episode_group()
  - Used as the returned output of fixed_episodes() and
    rolling_episodes()