dfPower Studio · DMS · Data Jobs · DQ Schemes · Match Rules

From SAS DataFlux to modern data quality

MigryX parses every dfPower Studio and DMS job file — standardize, parse, match, encode, validate, and profile operations — and converts them to idiomatic Python, Snowflake UDFs, Databricks PySpark pipelines, and dbt tests. All DQ logic. Zero rewrites.

Python · Snowflake · Databricks · dbt · PySpark
DataFlux → Modern DQ
dfPower Studio .dfm files → Python pandas pipelines
Standardize Schemes → usaddress / nameparser
Match / Cluster Rules → py-recordlinkage / dedupe
Encode (Phonetic) → phonetics library
Profile & Validate Rules → Great Expectations
DQPARSE / DQSTANDARDIZE → Snowflake Python UDFs
Process Jobs (Orchestration) → Airflow / Databricks Workflows
Parser Engine

Everything MigryX reads and converts

A purpose-built parser ingests every DataFlux and DMS artifact — from .dfm job files and DQ scheme definitions to SAS code calling DQPARSE() — and emits production-ready modern equivalents.

DataFlux Sources
  • dfPower Studio Jobs (.dfm files)
  • DMS Data Jobs (data flow canvases)
  • Process Jobs (orchestration chains)
  • Real-time Services (Web Service nodes)
  • Standardize Schemes (address, name, date, phone, custom)
  • Parse Schemes (name/address field splitting, token extraction, pattern recognition)
  • Match / Cluster Rules (deterministic + probabilistic match keys)
  • Encode (Phonetic) Schemes (Soundex, NYSIIS, Metaphone, Double Metaphone)
  • Profile Jobs (pattern analysis, completeness, cardinality)
  • Validate Rules (regex, reference data, domain & range checks)
  • DQ Repository (locales, schemes, reference tables)
  • SAS DQ Functions (DQPARSE, DQSTANDARDIZE, DQMATCH, DQGENDER, DQCASE, DQTOKENIZE, DQSCHEME)
  • Reference Data Tables (locale-specific: US, UK, Canada, Germany…)
  • Job Chains & Schedules
Modern Targets
  • Python (pandas + open-source DQ libraries)
  • py-recordlinkage (deterministic + probabilistic matching)
  • dedupe (unsupervised clustering & entity resolution)
  • usaddress (address parsing & standardization)
  • nameparser (name parsing: title, first, middle, last, suffix)
  • phonetics (Soundex, NYSIIS, Metaphone, Double Metaphone)
  • Great Expectations (validation suites & data profiling)
  • ydata-profiling (statistical profiling reports)
  • Cerberus (schema validation)
  • Snowflake Python UDFs / JS UDFs
  • Databricks PySpark + Delta Lake DQ
  • dbt Tests & dbt-expectations
  • Apache Spark custom DQ transformations
  • Airflow / Databricks Workflows (orchestration)
Methodology

Three phases from DataFlux to production

A structured, parser-driven approach that inventories every artifact, converts each DQ operation class-by-class, then validates output parity before cutover.

1. Analyze

Full inventory and complexity profiling of every DataFlux artifact before any output code is generated.

  • Inventory all .dfm job files and DMS job canvases
  • Classify DQ operations: standardize, parse, match, encode, profile, validate
  • Extract scheme references and locale dependencies
  • Map DQ Repository: locales, reference tables, custom schemes
  • Profile complexity of match rules (key count, blocking strategy, threshold analysis)
  • Identify SAS DQ function calls in embedded SAS code (DQPARSE, DQSTANDARDIZE, DQMATCH, DQGENDER, DQCASE, DQTOKENIZE, DQSCHEME)
  • Detect Real-time Service endpoints and job chain dependencies
  • Generate migration complexity scorecard per job
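
The SAS DQ function inventory step can be approximated with a simple scan. The following is a minimal sketch, not MigryX's parser: it counts calls to the functions listed above in a string of SAS code. The names `DQ_CALL` and `count_dq_calls` and the sample DATA step are illustrative assumptions, and the scan does not strip comments or string literals.

```python
import re
from collections import Counter

# DQ function names taken from the inventory list above.
DQ_FUNCTIONS = ("DQPARSE", "DQSTANDARDIZE", "DQMATCH", "DQGENDER",
                "DQCASE", "DQTOKENIZE", "DQSCHEME")

# A function name followed by "(". SAS is case-insensitive, so the scan is too.
DQ_CALL = re.compile(r"\b(" + "|".join(DQ_FUNCTIONS) + r")\s*\(", re.IGNORECASE)


def count_dq_calls(sas_source: str) -> Counter:
    """Count DQ function calls in SAS code (comments are not stripped)."""
    return Counter(m.group(1).upper() for m in DQ_CALL.finditer(sas_source))


# Illustrative DATA step, not taken from a real job.
sample = """
data clean;
  set raw;
  std_name = dqStandardize(name, 'Name', 'ENUSA');
  mkey     = dqMatch(std_name, 'Name', 85, 'ENUSA');
  parsed   = dqParse(address, 'Address', 'ENUSA');
  mkey2    = dqMatch(address, 'Address', 85, 'ENUSA');
run;
"""

calls = count_dq_calls(sample)
print(calls.most_common())
```

Per-function counts like these could feed a per-job complexity scorecard, for example by weighting DQMATCH calls more heavily than DQCASE calls.
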
2. Convert

Operation-class-aware code generation preserving all DQ logic with idiomatic open-source equivalents.

  • Standardize schemes → Python normalization pipelines (usaddress, nameparser, dateutil, custom regex)
  • Parse schemes → regex patterns + NLP parsers (spaCy, nameparser, usaddress)
  • Match rules (deterministic) → py-recordlinkage exact-key comparisons
  • Match rules (probabilistic/fuzzy) → py-recordlinkage / dedupe configurations
  • Encode schemes → phonetics library (Soundex, NYSIIS, Metaphone)
  • Validate rules → Great Expectations expectation suites
  • Profile jobs → ydata-profiling reports + GE profiling
  • SAS DQ functions → Snowflake Python UDFs or equivalent Python calls
  • Process Jobs → Airflow DAGs / Databricks Workflow JSON
  • Real-time Services → FastAPI endpoints wrapping DQ functions
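
To illustrate the "standardize scheme → Python normalization pipeline" row, a scheme can be treated as a lookup table applied token by token. This is a hedged sketch using only the standard library: the scheme contents and the function name `standardize_address` are hypothetical, and a production conversion would more likely build on usaddress for tokenization.

```python
import re

# Hypothetical scheme: raw token (lowercased) -> standard form.
# Not taken from a real DataFlux scheme.
STREET_SUFFIX_SCHEME = {
    "street": "St", "st": "St",
    "avenue": "Ave", "ave": "Ave",
    "road": "Rd", "rd": "Rd",
}


def standardize_address(raw: str) -> str:
    """Apply the scheme per token; title-case other alphabetic tokens."""
    tokens = re.sub(r"[.,]", " ", raw).split()  # drop periods and commas
    out = []
    for tok in tokens:
        if tok.lower() in STREET_SUFFIX_SCHEME:
            out.append(STREET_SUFFIX_SCHEME[tok.lower()])
        elif tok.isalpha():
            out.append(tok.title())
        else:
            out.append(tok)  # house numbers and mixed tokens pass through
    return " ".join(out)


print(standardize_address("123 main STREET."))
print(standardize_address("45 Oak avenue, Apt 2"))
```

A token-level lookup like this is ambiguous for values such as "St" in "St Louis", which is one reason a context-aware parser is usually placed in front of the scheme.
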
3. Validate

Side-by-side output comparison across a representative data sample before decommissioning DataFlux.

  • Compare DQ output samples: original DataFlux output vs. migrated code output
  • Match rate parity testing (precision, recall, F1 on matched/unmatched record sets)
  • Standardization output comparison (field-by-field diff on address, name, date outputs)
  • Encode key equivalence testing (phonetic code output comparison)
  • Validation rule coverage audit (every original rule represented in GE suite)
  • Profile metric parity (completeness %, pattern distribution, cardinality)
  • End-to-end job timing benchmarks
  • Sign-off report with diff summary per job
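
Match rate parity can be scored by treating the original DataFlux matched pairs as the baseline and the migrated output as the prediction. The sketch below assumes both sides can be exported as pairs of record IDs; the function names and sample pairs are illustrative.

```python
def pair_set(pairs):
    """Normalize record-ID pairs so (a, b) and (b, a) compare equal."""
    return {tuple(sorted(p)) for p in pairs}


def match_parity(baseline_pairs, migrated_pairs):
    """Precision, recall, and F1 of migrated matches against the baseline."""
    base, new = pair_set(baseline_pairs), pair_set(migrated_pairs)
    true_pos = len(base & new)
    precision = true_pos / len(new) if new else 1.0
    recall = true_pos / len(base) if base else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative pairs: three agree, one is missed, one is extra.
baseline = [(1, 2), (3, 4), (5, 6), (7, 8)]
migrated = [(2, 1), (3, 4), (5, 6), (9, 10)]

parity = match_parity(baseline, migrated)
print(parity)
```
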
Capabilities

What MigryX handles for DataFlux

📄

DataFlux Job Parsing (DFM Format)

Structural parser for .dfm binary/XML job files used by dfPower Studio and DMS. Reads nodes, edges, scheme references, locale bindings, and job metadata with full fidelity before any conversion step begins.
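
As a rough picture of what "reads nodes, edges, scheme references, locale bindings" could mean for an XML-encoded job, the sketch below walks a small document with the standard library. The element and attribute names (`job`, `node`, `edge`, `scheme`, `locale`) are invented for illustration and are not the actual .dfm schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical job layout, NOT the real .dfm format.
SAMPLE_JOB = """<job name="customer_cleanse">
  <node id="n1" type="source"/>
  <node id="n2" type="standardize" scheme="Address" locale="ENUSA"/>
  <node id="n3" type="match" scheme="Name" locale="ENUSA"/>
  <edge from="n1" to="n2"/>
  <edge from="n2" to="n3"/>
</job>"""


def read_job(xml_text: str) -> dict:
    """Collect node attributes and directed edges from a job document."""
    root = ET.fromstring(xml_text)
    nodes = {n.get("id"): dict(n.attrib) for n in root.iter("node")}
    edges = [(e.get("from"), e.get("to")) for e in root.iter("edge")]
    return {"name": root.get("name"), "nodes": nodes, "edges": edges}


job = read_job(SAMPLE_JOB)
# Scheme and locale bindings per DQ node.
bindings = {nid: (n["scheme"], n["locale"])
            for nid, n in job["nodes"].items() if "scheme" in n}
print(job["edges"], bindings)
```
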

Standardize Scheme Migration

Converts address standardization (USPS CASS-style), name parsing & standardization, date/phone/fax formatting, and custom standardization schemes to Python normalization pipelines using usaddress, nameparser, and regex equivalents.

🔗

Match Rule Conversion (Deterministic + Fuzzy)

Translates DataFlux match keys, blocking rules, frequency analysis tables, and probabilistic thresholds into py-recordlinkage comparison vectors or dedupe training configurations — preserving precision and recall targets.
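
The blocking-plus-threshold idea can be shown without any third-party dependency. In this sketch `difflib.SequenceMatcher` stands in for the string comparators a py-recordlinkage or dedupe configuration would provide; the records, the `zip` blocking key, and the 0.85 threshold are assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Illustrative records keyed by record ID.
records = {
    1: {"name": "Jon Smith", "zip": "27513"},
    2: {"name": "John Smith", "zip": "27513"},
    3: {"name": "Jane Doe", "zip": "27513"},
    4: {"name": "John Smith", "zip": "10001"},
}


def candidate_pairs(recs, block_key):
    """Blocking: only records sharing the block key are compared."""
    blocks = defaultdict(list)
    for rid, rec in recs.items():
        blocks[rec[block_key]].append(rid)
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)


def fuzzy_matches(recs, block_key="zip", field="name", threshold=0.85):
    """Return (id_a, id_b, score) for pairs at or above the threshold."""
    matches = []
    for a, b in candidate_pairs(recs, block_key):
        score = SequenceMatcher(
            None, recs[a][field].lower(), recs[b][field].lower()
        ).ratio()
        if score >= threshold:
            matches.append((a, b, round(score, 3)))
    return matches


result = fuzzy_matches(records)
print(result)
```

Record 4 has the same name as record 2 but is never compared with it, because the two sit in different blocks. That trade-off between comparison cost and missed matches is what a blocking strategy review has to weigh.
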

🔨

Parse Scheme to Python

Reverse-engineers DataFlux parse scheme logic — field splitting, token extraction, pattern recognition — into equivalent Python regular expressions, spaCy NLP rules, and structured parser calls (nameparser, usaddress).
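
Field splitting and token extraction often reduce to a regular expression with named groups. The pattern below is a minimal sketch for North American phone numbers; it is an illustrative assumption, not a translation of any specific DataFlux parse scheme.

```python
import re

# Named groups play the role of parse-scheme output tokens.
PHONE = re.compile(r"""
    ^\s*(?:\+?1[\s.-]*)?                      # optional country code
    \(?(?P<area>\d{3})\)?[\s.-]*              # area code, optional parens
    (?P<exchange>\d{3})[\s.-]*                # exchange
    (?P<line>\d{4})                           # line number
    (?:\s*(?:x|ext\.?)\s*(?P<ext>\d+))?\s*$   # optional extension
""", re.VERBOSE | re.IGNORECASE)


def parse_phone(raw: str):
    """Split a phone string into tokens, or return None if it does not fit."""
    m = PHONE.match(raw)
    return m.groupdict() if m else None


print(parse_phone("(919) 555-0142 ext. 7"))
print(parse_phone("1-800-555-0199"))
print(parse_phone("not a phone"))
```
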

📊

Profile & Validate Migration

Maps DataFlux Profile job configurations to ydata-profiling and Great Expectations profiling runs. Converts validate rules (regex, reference lookup, domain/range) into GE Expectation Suites and dbt-expectations tests.
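
The three profile metrics named in this guide (completeness, cardinality, pattern distribution) are simple to compute by hand, which makes them a useful cross-check on whatever ydata-profiling or Great Expectations reports. A standard-library sketch with hypothetical function names and sample data:

```python
import re
from collections import Counter


def char_pattern(value: str) -> str:
    """Collapse a value to its shape: digits -> 9, letters -> A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))


def profile_column(values):
    """Completeness %, distinct count, and pattern frequencies."""
    non_null = [v for v in values if v not in (None, "")]
    return {
        "completeness_pct": 100.0 * len(non_null) / len(values) if values else 0.0,
        "cardinality": len(set(non_null)),
        "patterns": Counter(char_pattern(v) for v in non_null),
    }


# Illustrative postal-code column with one null and one empty string.
zips = ["27513", "10001", None, "27513", "K1A 0B1", ""]
profile = profile_column(zips)
print(profile)
```
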

🌎

DQ Repository Translation

Exports DQ Repository artifacts (locale-specific schemes for US, UK, Canada, and Germany, reference tables, and custom phonetic encoding schemes) into portable Python dictionaries, CSV lookup tables, and Snowflake staging tables.
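
A scheme held as a Python dictionary can be written out as a CSV lookup table and read back without loss. The sketch below shows one possible layout; the column names (`locale`, `raw_value`, `standard_value`), the locale code, and the scheme contents are assumptions for illustration.

```python
import csv
import io


def scheme_to_csv(scheme: dict, locale: str) -> str:
    """Serialize a raw -> standard mapping as a CSV lookup table."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["locale", "raw_value", "standard_value"])
    for raw, std in sorted(scheme.items()):
        writer.writerow([locale, raw, std])
    return buf.getvalue()


def csv_to_scheme(text: str) -> dict:
    """Load the lookup table back into a dictionary."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["raw_value"]: row["standard_value"] for row in reader}


# Hypothetical scheme contents.
state_scheme = {"N. Carolina": "NC", "North Carolina": "NC", "New York": "NY"}
table = scheme_to_csv(state_scheme, "ENUSA")
print(table)
```

The same CSV text could be loaded into a staging table, keeping the Python and warehouse copies of a scheme in sync from one exported file.
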

Conversion Map

DataFlux operation to modern equivalent

DataFlux Operation | Artifact / Format | Python / Open-Source Target | Cloud Target
Standardize: Address | Standardize Scheme (US/UK/CA locale) | usaddress + custom normalizer | Snowflake Python UDF
Standardize: Name | Name standardization scheme | nameparser HumanName | Snowflake Python UDF
Standardize: Date / Phone | Date / phone formatting scheme | dateutil, phonenumbers | Snowflake JS UDF
Parse: Name / Address | Parse Scheme (.dfm node) | nameparser, usaddress | Databricks UDF
Parse: Custom tokens | Custom parse scheme patterns | re + spaCy ruler | Snowflake Python UDF
Match: Deterministic | Exact match keys | py-recordlinkage Compare.exact() | dbt test / Snowflake SQL
Match: Probabilistic | Fuzzy match rules + thresholds | py-recordlinkage / dedupe | Databricks PySpark ML
Encode: Phonetic | Soundex, NYSIIS, Metaphone schemes | phonetics library | Snowflake JS UDF (soundex)
Profile | Profile job nodes | ydata-profiling + Great Expectations | Databricks profiling notebook
Validate: Regex | Validate rule (pattern match) | Great Expectations expect_column_values_to_match_regex | dbt-expectations
Validate: Reference Lookup | Reference data table lookup | Great Expectations expect_column_values_to_be_in_set | dbt test / Snowflake constraint
DQPARSE() | SAS DQ function in DATA step | nameparser / usaddress | Snowflake Python UDF
DQSTANDARDIZE() | SAS DQ function in DATA step | Custom normalizer Python function | Snowflake Python UDF
DQMATCH() | SAS DQ function in DATA step | py-recordlinkage match score | Databricks PySpark UDF
Process Job (Orchestration) | Job chain / schedule / event trigger | Apache Airflow DAG (Python) | Databricks Workflow JSON
Real-time Service | Web Service node (.dfm) | FastAPI endpoint wrapping DQ functions | AWS Lambda / Azure Function
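
For the phonetic encode row, it helps to see what a Soundex key is. The function below is a pure-Python sketch of American Soundex, written as a stand-in so it runs without dependencies; it is not the phonetics library's implementation, and DataFlux encode schemes may differ in edge cases, which is what encode key equivalence testing is meant to catch.

```python
# Consonant groups used by American Soundex.
_GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(name: str) -> str:
    """Return the 4-character American Soundex key, or "" for no letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0]
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate equal codes
        code = _CODES.get(c, "")  # vowels map to "" and reset prev
        if code and code != prev:
            key += code
        prev = code
    return (key + "000")[:4]


for name in ("Robert", "Rupert", "Ashcraft", "Tymczak", "Pfister"):
    print(name, soundex(name))
```
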
Source Artifacts

Every DataFlux artifact MigryX ingests

dfPower Studio Jobs (.dfm) · DMS Data Jobs · Process Jobs (Orchestration) · Real-time Service definitions · Standardize Schemes · Parse Schemes · Match / Cluster Rules · Encode (Phonetic) Schemes · Profile Job configs · Validate Rules · DQ Repository (locales) · Reference Data Tables · SAS DQ Function: DQPARSE · SAS DQ Function: DQSTANDARDIZE · SAS DQ Function: DQMATCH · SAS DQ Function: DQGENDER · SAS DQ Function: DQCASE · SAS DQ Function: DQTOKENIZE · SAS DQ Function: DQSCHEME · Job Chains & Schedules · ODBC / JDBC connectors · SAS datasets (.sas7bdat) · Flat files & delimited files · XML data sources · Locale configs (US / UK / CA / DE) · Frequency analysis tables · Blocking strategy configs · Threshold / weight tables
Migration Targets

Modern platforms where your DQ logic lands

Python 3.x (pandas) · py-recordlinkage · dedupe · usaddress · nameparser · phonetics (Soundex / NYSIIS) · Great Expectations · ydata-profiling · Cerberus · Snowflake Python UDFs · Snowflake JS UDFs · Databricks (PySpark) · Delta Lake DQ expectations · dbt Tests · dbt-expectations · Apache Spark (custom transforms) · Apache Airflow DAGs · Databricks Workflows · FastAPI (Real-time DQ services) · AWS Lambda · Azure Functions

DataFlux Product / Concept | MigryX Migration Scope | Primary Target | Secondary Target
dfPower Studio | All .dfm job files, DQ nodes, scheme bindings | Python + Great Expectations | Snowflake UDFs
SAS Data Management Studio (DMS) | Data Jobs, Process Jobs, job canvas metadata | Python pipelines + Airflow | Databricks Workflows
SAS Data Quality Server | DQ schemes, locales, reference tables | Python + open-source DQ libs | Snowflake Python UDFs
DataFlux DMP | End-to-end job orchestration, schedules | Airflow DAGs | Databricks Workflows
Real-time Services | Web service endpoint definitions, DQ functions | FastAPI microservices | AWS Lambda
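
Job chains translate to DAG-based orchestrators because both describe "run after" dependencies. The sketch below derives a valid execution order with the standard library's `graphlib` (Python 3.9+); the job names and dependencies are hypothetical, and in Airflow or Databricks Workflows the same dependencies would be declared between tasks instead of being sorted by hand.

```python
from graphlib import TopologicalSorter

# Hypothetical job chain: each job maps to the set of jobs it depends on.
chain = {
    "profile": {"load"},
    "standardize": {"load"},
    "match": {"standardize"},
    "report": {"profile", "match"},
}

# static_order() yields every job after all of its dependencies.
order = list(TopologicalSorter(chain).static_order())
print(order)
```
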